CN110750685A

CN110750685A - Data warehousing method and device

Info

Publication number: CN110750685A
Application number: CN201910936699.5A
Authority: CN
Inventors: 王静云; 郭景赞; 田元兵; 乔自知; 郭省力; 李德屹; 李京辉; 金雨超
Original assignee: China United Network Communications Group Co Ltd
Current assignee: China United Network Communications Group Co Ltd
Priority date: 2019-09-29
Filing date: 2019-09-29
Publication date: 2020-02-04
Anticipated expiration: 2039-09-29
Also published as: CN110750685B

Abstract

The embodiment of the invention discloses a data warehousing method and device, and relates to the field of computers. By adopting a multi-dimensional hierarchical clustering algorithm, network data can be warehoused automatically and the operation and maintenance cost is reduced. The method comprises the following steps: acquiring a network data sample of a user terminal, and extracting a column of data with the fewest categories from the network data sample as a feature sample; performing similarity calculation on the feature sample to generate a similarity matrix of the feature sample, and classifying the data in the feature sample according to the similarity matrix to generate a new data group; and inputting the new data group into the corresponding position of the database table according to the preset sequence of the hierarchical clustering tree. The embodiment of the invention is applied to a network system.

Description

Data warehousing method and device

Technical Field

The embodiment of the invention relates to the field of computers, in particular to a data storage method and device.

Background

With the development of 5G, the sheer volume and type complexity of network data create an urgent need for an intelligent data analysis and processing system. For the raw data reported by each vendor, processing the parsed data and loading it into the database requires a large amount of labor and time in order to adapt to different types of network data that change in real time. Current data warehousing requires manual configuration of the library table structure (the table name, and the fields and field types of the table); however, changes to the database table name, table fields, field types, field order and the like can cause warehousing to fail. Given the diversity of network data types and the real-time requirements, existing network data warehousing methods consume a large amount of labor and time to construct the database table structure, have low automation capability, and cannot meet the requirements of intelligent operation and maintenance.

Disclosure of Invention

The embodiment of the invention provides a data storage method and device, which can realize automatic storage of network data by adopting a multi-dimensional hierarchical clustering algorithm and reduce operation and maintenance cost.

In order to achieve the purpose, the invention adopts the following technical scheme:

In a first aspect, a data warehousing method is provided, the method comprising: acquiring a network data sample of a user terminal, and extracting a column of data with the fewest categories from the network data sample as a feature sample; performing similarity calculation on the feature sample to generate a similarity matrix of the feature sample, and classifying the data in the feature sample according to the similarity matrix to generate a new data group; and inputting the new data group into the corresponding position of the database table according to the preset sequence of the hierarchical clustering tree.

First, a network data sample of a user terminal is acquired, and the column of data with the fewest categories is extracted from it as a feature sample, so that the network data is more standardized when it is subsequently stored in the database. Next, similarity calculation is performed on the feature sample to generate its similarity matrix, and the data in the feature sample is classified according to the similarity matrix to generate a new data group, which improves the automation of network data classification. Finally, the new data group is input into the corresponding position of the database table according to the preset sequence of the hierarchical clustering tree. This solves the prior-art problem that constructing the database table structure consumes a large amount of labor and time, and reduces the operation and maintenance cost of network data warehousing.

In a second aspect, a data warehousing apparatus is provided, comprising: an acquisition unit, configured to acquire a network data sample of a user terminal; and a processing unit, configured to extract a column of data with the fewest categories from the network data sample acquired by the acquisition unit as a feature sample; the processing unit being further configured to perform similarity calculation on the feature sample to generate a similarity matrix of the feature sample, and to classify the data in the feature sample according to the similarity matrix to generate a new data group; and the processing unit being further configured to input the new data group into the corresponding position of the database table according to the preset sequence of the hierarchical clustering tree.

It can be understood that, the data warehousing device provided above is configured to execute the method corresponding to the first aspect provided above, and therefore, the beneficial effects that can be achieved by the data warehousing device may refer to the beneficial effects of the method corresponding to the first aspect above and the beneficial effects of the corresponding schemes in the following detailed description, which are not described herein again.

In a third aspect, a data warehousing device is provided, where the data warehousing device structurally includes a processor and a memory, the memory is coupled to the processor and is used to store necessary program instructions and data of the data warehousing device, and the processor is used to execute the program instructions stored in the memory, so that the data warehousing device executes the method of the first aspect.

In a fourth aspect, a computer storage medium is provided, in which computer program code is stored, which, when run on a data-warehousing apparatus, causes the data-warehousing apparatus to perform the method of the first aspect.

In a fifth aspect, a computer program product is provided that stores computer software instructions which, when run on a data warehousing device, cause the data warehousing device to execute the method of the first aspect.

Drawings

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Fig. 1 is a schematic flow chart of a data warehousing method according to an embodiment of the present invention;

Fig. 2 is a schematic structural diagram of a data warehousing apparatus according to an embodiment of the present invention;

Fig. 3 is a schematic structural diagram of another data warehousing device according to an embodiment of the present invention;

Fig. 4 is a schematic structural diagram of another data warehousing device according to an embodiment of the present invention.

Detailed Description

It should be noted that, in the embodiments of the present invention, words such as "exemplary" or "for example" are used to indicate examples, illustrations or explanations. Any embodiment or design described as "exemplary" or "for example" in the embodiments of the present invention should not be construed as preferred or advantageous over other embodiments or designs. Rather, words such as "exemplary" or "for example" are intended to present related concepts in a concrete fashion.

It should be noted that, in the embodiments of the present invention, "of", "relevant" and "corresponding" may sometimes be used interchangeably; when the difference is not emphasized, their intended meanings are the same.

For the convenience of clearly describing the technical solutions of the embodiments of the present invention, in the embodiments of the present invention, the words "first", "second", and the like are used for distinguishing the same items or similar items with basically the same functions and actions, and those skilled in the art can understand that the words "first", "second", and the like are not limited in number or execution order.

According to years of research by internet data centers, the global volume of network data roughly doubles every two years; the volume generated each year grows exponentially, at a rate consistent with Moore's law, and the total global volume of network data is estimated to reach 35 ZB in 2020. How to effectively acquire, load, analyze and process this massive network data has become an important link in, and the foundation of, big data applications. Existing network data warehousing methods consume a large amount of labor and time to construct the database table structure, so their automation capability is low and they cannot meet the requirements of intelligent operation and maintenance.

To solve the problems in the prior art, referring to Fig. 1, an embodiment of the present invention provides a data warehousing method, including:

101. Acquire a network data sample of the user terminal, and extract a column of data with the fewest categories from the network data sample as a feature sample.

Acquiring the network data sample of the user terminal specifically includes: collecting network data of the user terminal, and parsing the network data into a list format of field key key_m and feature value value_mn to generate the network data sample, where m denotes a column and n denotes a row.

It should be noted that the network data of the user terminal at least includes one or more of the following items: user terminal network configuration data, access network data, core network data, and the like. The key or value may be in the form of a numeric value type or may relate to various types of strings (e.g., chinese or english, etc.).

Illustratively, the data acquisition module parses the network data files of the user terminal and produces a unified list format whose data type is the key-value pair; this key-value list, [key_m-value_mn], is passed to the data cleaning module as the data features of the network data sample. Here m denotes a column and n denotes a row (n may also denote a category), and the processed network data files may be of types such as xml, csv and txt. The key-value list [keys-values] is a list composed of multiple key-value pairs; a sample (here the first, n = 1) is expressed mathematically as: [key₁-value₁₁, key₂-value₂₁, key₃-value₃₁, …, key_m-value_m1]. Each key (field key) represents a data feature of one dimension, and value is the value corresponding to that field key. The network data samples are shown in formula (1-1), in which each row represents an input sample and each column represents a one-dimensional feature sample.
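The parsing step described above can be sketched as follows. This is only a minimal illustration, assuming csv input with a header row; the function name `parse_to_key_value_list` and the sample text are hypothetical and do not come from the embodiment, which also covers xml and txt files.

```python
import csv
import io


def parse_to_key_value_list(csv_text):
    """Parse CSV text into samples; each sample is a list of (key, value) pairs."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # One sample per row; each column header acts as a field key.
    return [list(row.items()) for row in reader]


# Hypothetical raw input using two of the fields from the worked example.
raw = (
    "beginTime,duration\n"
    "2019-06-18T03:00:00+08:00,PTS900\n"
    "2019-06-18T03:00:00+08:00,PTS3600\n"
)
samples = parse_to_key_value_list(raw)
```

Each element of `samples` then corresponds to one row of the key-value list, with one (key, value) pair per column.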

Optionally, because the collected network data may contain erroneous, invalid or otherwise unusable feature values, the method further includes, after collecting the network data of the user terminal: cleaning repeated and invalid feature values in the network data by using a machine learning algorithm.

102. Perform similarity calculation on the feature sample to generate a similarity matrix of the feature sample, and classify the data in the feature sample according to the similarity matrix to generate a new data group.

Performing similarity calculation on the feature sample to generate the similarity matrix specifically includes:

performing similarity calculation on the feature sample according to the following formulas (1-2) and (1-3) to generate the similarity matrix of the feature sample:

D(x,x) = D(y,y) = 0    (1-3);

where x and y represent different data in the feature sample, respectively.
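As a rough illustration of step 102, the sketch below builds a pairwise matrix and groups values whose mutual distance is zero. Formula (1-2) is not reproduced in this text, so `mismatch_distance` is an assumed stand-in (0 for identical values, 1 otherwise) chosen only because it satisfies D(x,x) = 0 from formula (1-3); all function names are hypothetical.

```python
def mismatch_distance(x, y):
    """Assumed stand-in for formula (1-2): 0 for identical values, 1 otherwise."""
    return 0 if x == y else 1


def similarity_matrix(values, distance=mismatch_distance):
    """Return the matrix D with D[i][j] = distance(values[i], values[j])."""
    return [[distance(x, y) for y in values] for x in values]


def group_by_zero_distance(matrix):
    """Assign each index to the first group whose representative is at distance 0."""
    groups = []  # each group is a list of indices into the value list
    for i in range(len(matrix)):
        for group in groups:
            if matrix[group[0]][i] == 0:
                group.append(i)
                break
        else:
            groups.append([i])
    return groups


values = ["a", "b", "a"]
D = similarity_matrix(values)
groups = group_by_zero_distance(D)
```

Any other distance function with a zero diagonal could be passed in through the `distance` parameter without changing the grouping code.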

103. Input the new data group into the corresponding position of the database table according to the preset sequence of the hierarchical clustering tree.

It should be noted that the preset sequence of the hierarchical clustering tree may be to input the network data in the hierarchical clustering tree into the database table according to the hierarchical structure of the tree (for example, layer by layer from top to bottom). Specifically, the network data is input into the library table whose table name, table fields and field data types were constructed by hierarchical clustering.
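One way to read "from top to bottom" is a level-by-level walk of the tree. The sketch below assumes that reading; the `ClusterNode` class, the `root` node and the table name `T04` are hypothetical and used purely for illustration.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ClusterNode:
    """Hypothetical node of a hierarchical clustering tree."""
    table_name: str
    children: list = field(default_factory=list)


def top_down_order(root):
    """Yield nodes level by level, from the root towards the leaves."""
    queue = deque([root])
    while queue:
        node = queue.popleft()
        yield node
        queue.extend(node.children)


root = ClusterNode("root", children=[
    ClusterNode("T03", children=[ClusterNode("T03_PTS900"), ClusterNode("T03_PTS3600")]),
    ClusterNode("T04"),
])
order = [node.table_name for node in top_down_order(root)]
```

With this ordering, every parent table is visited before any of the tables derived from it.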

For better understanding, the above method is illustrated below by way of example.

The collected network data of the user terminal is shown in Table 1.

TABLE 1

Parsing the network data in Table 1 generates the key_m-value_mn list format shown in Table 2.

TABLE 2

Sample number
1	beginTime:2019-06-18T03:00:00+08:00	duration:PTS900
2	beginTime:2019-06-18T03:00:00+08:00	duration:PTS3600
3	beginTime:2019-06-18T04:00:00+08:00	duration:PTS1200
4	beginTime:2019-06-18T03:00:00+08:00	duration:PTS900
5	beginTime:2019-06-18T04:00:00+08:00	duration:PTS900
6	beginTime:2019-06-18T04:00:00+08:00	duration:PTS3600

Since the values value_n corresponding to the key beginTime fall into two categories and the values value_n corresponding to the key duration fall into three categories, beginTime is the column with the fewest categories. Taking key = beginTime as an example, the similarity matrix between the corresponding values is computed; the values value_n are shown in Table 3:

TABLE 3

value₁	2019-06-18T03:00:00+08:00
value₂	2019-06-18T03:00:00+08:00
value₃	2019-06-18T04:00:00+08:00
value₄	2019-06-18T03:00:00+08:00
value₅	2019-06-18T04:00:00+08:00
value₆	2019-06-18T04:00:00+08:00
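The choice of beginTime as the feature sample can be checked with a short sketch that counts the distinct values in each column of Table 2. The function names are hypothetical, and the samples are represented as dictionaries for convenience.

```python
# Samples 1 to 6 from Table 2, one dictionary per row.
samples = [
    {"beginTime": "2019-06-18T03:00:00+08:00", "duration": "PTS900"},
    {"beginTime": "2019-06-18T03:00:00+08:00", "duration": "PTS3600"},
    {"beginTime": "2019-06-18T04:00:00+08:00", "duration": "PTS1200"},
    {"beginTime": "2019-06-18T03:00:00+08:00", "duration": "PTS900"},
    {"beginTime": "2019-06-18T04:00:00+08:00", "duration": "PTS900"},
    {"beginTime": "2019-06-18T04:00:00+08:00", "duration": "PTS3600"},
]


def category_counts(rows):
    """Count the distinct values observed under each key."""
    return {key: len({row[key] for row in rows}) for key in rows[0]}


def pick_feature_key(rows):
    """Return the key whose column has the fewest distinct values."""
    counts = category_counts(rows)
    return min(counts, key=counts.get)


counts = category_counts(samples)
feature_key = pick_feature_key(samples)
feature_sample = [row[feature_key] for row in samples]
```

Here `feature_sample` holds the six values listed in Table 3, in sample order.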

Similarity calculation is performed on the data in table 3 according to formulas (1-2) and (1-3), and a matrix shown as follows is generated:

Then, the new data groups generated according to the matrix comprise two groups in total: one group is value₁, value₂, value₄; the other group is value₃, value₅, value₆. Suppose the preset sequence of the hierarchical clustering tree is to input the network data into the database table from top to bottom. A library table named T03 is then created from the table name, the table fields and the data types of the fields, where T03 is the table name, beginTime is the table field, and character string is the data type of the beginTime field. Expressed in English: Tablename: T03; fields: { beginTime: string }, where Tablename is the table name and fields contains the fields of the table and the corresponding field type information. The network data corresponding to value₁, value₂ and value₄ is then input into the library table named T03, which contains a beginTime field whose data type is character string; value₃, value₅ and value₆ are handled in the same way. It should be noted that the above analysis covers only beginTime-value_n, so the library table contains only the network data corresponding to beginTime (e.g., library table T03 contains only the network data corresponding to beginTime 2019-06-18T03:00:00+08:00).
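Creating the library table T03 and loading the first group can be sketched as below. The embodiment does not name a database engine, so the use of SQLite (through Python's standard `sqlite3` module), the mapping of the string type to TEXT, and the helper names are all assumptions made for illustration.

```python
import sqlite3


def create_table(conn, table_name, fields):
    """Create a table from a name and a {field: SQL type} mapping."""
    # Identifiers are interpolated directly, so they must come from trusted code.
    columns = ", ".join(f'"{name}" {sql_type}' for name, sql_type in fields.items())
    conn.execute(f'CREATE TABLE IF NOT EXISTS "{table_name}" ({columns})')


def insert_rows(conn, table_name, rows):
    """Insert dictionary rows; all rows are assumed to share the same keys."""
    names = list(rows[0].keys())
    column_list = ", ".join(f'"{name}"' for name in names)
    placeholders = ", ".join("?" for _ in names)
    conn.executemany(
        f'INSERT INTO "{table_name}" ({column_list}) VALUES ({placeholders})',
        [tuple(row[name] for name in names) for row in rows],
    )


conn = sqlite3.connect(":memory:")
create_table(conn, "T03", {"beginTime": "TEXT"})
# The group value1, value2, value4 from Table 3 shares a single beginTime.
group = [{"beginTime": "2019-06-18T03:00:00+08:00"}] * 3
insert_rows(conn, "T03", group)
stored = conn.execute('SELECT "beginTime" FROM "T03"').fetchall()
```

Deriving the table structure from a field-to-type mapping means that no table has to be configured by hand before the data arrives.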

More specifically, take value₁, value₂ and value₄, which were found to belong to the same category when the value_n corresponding to beginTime was analyzed. Analyzing them further according to the above steps, in combination with the network data corresponding to duration in Table 1, divides value₁, value₂ and value₄ into two groups: one group is value₁ and value₄; the other is value₂. A library table named T03_PTS900 is therefore constructed, containing the beginTime and duration fields, both of data type character string. Expressed in English: Tablename: T03_PTS900; fields: { beginTime: string, duration: string }, where "_" may be replaced with another connecting character. The network data corresponding to value₁ and value₄ is input into this library table. Because duration-value_n is analyzed on top of the analysis of beginTime-value_n, the library table contains both the network data corresponding to beginTime and the network data corresponding to duration (e.g., library table T03_PTS900 contains the network data corresponding to beginTime 2019-06-18T03:00:00+08:00 and duration PTS900).
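The two-level split can be sketched as a small recursive grouping over the samples of Table 2. The text does not state how the name T03 is derived, so the `label` rule below ("T" plus the hour of beginTime) is an assumption; the `id` key, the function names and the names starting with T04 are likewise hypothetical.

```python
from collections import defaultdict

# Table 2 samples; the "id" key is added here only to track sample numbers.
SAMPLES = [
    {"id": 1, "beginTime": "2019-06-18T03:00:00+08:00", "duration": "PTS900"},
    {"id": 2, "beginTime": "2019-06-18T03:00:00+08:00", "duration": "PTS3600"},
    {"id": 3, "beginTime": "2019-06-18T04:00:00+08:00", "duration": "PTS1200"},
    {"id": 4, "beginTime": "2019-06-18T03:00:00+08:00", "duration": "PTS900"},
    {"id": 5, "beginTime": "2019-06-18T04:00:00+08:00", "duration": "PTS900"},
    {"id": 6, "beginTime": "2019-06-18T04:00:00+08:00", "duration": "PTS3600"},
]


def label(key, value):
    """Assumed naming rule: 'T' + hour for beginTime, the raw value otherwise."""
    if key == "beginTime":
        return "T" + value[11:13]
    return value


def build_tables(samples, keys, prefix=""):
    """Split on each key in turn and return {table name: sample ids}."""
    if not keys:
        return {prefix: [sample["id"] for sample in samples]}
    key, rest = keys[0], keys[1:]
    groups = defaultdict(list)
    for sample in samples:
        groups[sample[key]].append(sample)
    tables = {}
    for value, group in groups.items():
        part = label(key, value)
        name = f"{prefix}_{part}" if prefix else part
        tables.update(build_tables(group, rest, name))
    return tables


tables = build_tables(SAMPLES, ["beginTime", "duration"])
```

The keys are passed in order of increasing category count, so each deeper level of the recursion corresponds to one more field in the table name.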

In addition, after a row of data with the least category is extracted from the network data sample and used as the feature sample, and the step 101-.

First, a network data sample of a user terminal is acquired, and the column of data with the fewest categories is extracted from it as a feature sample, so that the network data is more standardized when it is subsequently stored in the database. Next, similarity calculation is performed on the feature sample to generate its similarity matrix, and the data in the feature sample is classified according to the similarity matrix to generate a new data group, which improves the automation of network data classification. Finally, the new data group is input into the corresponding position of the database table according to the preset sequence of the hierarchical clustering tree. This solves the prior-art problem that constructing the database table structure consumes a large amount of labor and time, and reduces the operation and maintenance cost of network data warehousing.

The embodiment of the present invention may perform the division of the functional modules on the data warehousing device according to the method embodiment, for example, each functional module may be divided corresponding to each function, or two or more functions may be integrated into one processing module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. It should be noted that, the division of the modules in the embodiment of the present invention is schematic, and is only a logic function division, and there may be another division manner in actual implementation.

In the case of dividing each functional module according to each function, fig. 2 shows a schematic diagram of a possible structure of the data warehousing device 20 related to the above embodiment, where the data warehousing device 20 includes:

an obtaining unit 201, configured to obtain a network data sample of a user terminal.

The processing unit 202 is configured to extract a column of data with the fewest categories from the network data sample acquired by the acquiring unit 201 as a feature sample.

The processing unit 202 is further configured to perform similarity calculation on the feature samples to generate a similarity matrix of the feature samples, and classify data in the feature samples according to the similarity matrix to generate a new class data set.

The processing unit 202 is further configured to input the new class data group to a corresponding position of the database table according to a preset sequence of the hierarchical clustering tree.

In an exemplary scheme, the processing unit 202 is specifically configured to perform similarity calculation on the feature samples according to the following formula to generate a similarity matrix of the feature samples:

D(x,x)＝D(y,y)＝0；

where x and y represent different data in the feature sample, respectively.

In an exemplary scheme, the obtaining unit 201 is specifically configured to collect network data of a user terminal, and parse the network data into a list format of field key key_m and feature value value_mn to generate the network data sample, where m denotes a column and n denotes a row.

In an exemplary scheme, the processing unit 202 is further configured to clean the network data by using a machine learning algorithm, so as to handle repeated and invalid feature values in the network data.

Since the data storage device in the embodiment of the present invention may be applied to implement the method embodiment, the technical effect obtained by the data storage device may also refer to the method embodiment, and the embodiment of the present invention is not described herein again.

In the case of an integrated unit, fig. 3 shows a possible structural schematic diagram of the data warehousing device 20 involved in the above embodiment. The data warehousing device 20 includes: a processing module 301, a communication module 302 and a storage module 303. The processing module 301 is used for controlling and managing the actions of the data warehousing device 20, for example, the processing module 301 is used for supporting the data warehousing device 20 to execute the processes 102 and 103 in fig. 1. The communication module 302 is used to support communication between the data warehousing device 20 and other entities. The storage module 303 is used for storing program codes and data of the data warehousing device 20.

The processing module 301 may be a processor or a controller, such as a Central Processing Unit (CPU), a general-purpose processor, a Digital Signal Processor (DSP), an application-specific integrated circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof. Which may implement or perform the various illustrative logical blocks, modules, and circuits described in connection with the disclosure. A processor may also be a combination of computing functions, e.g., comprising one or more microprocessors, a DSP and a microprocessor, or the like. The communication module 302 may be a transceiver, a transceiving circuit or a communication interface, etc. The storage module 303 may be a memory.

When the processing module 301 is a processor as shown in fig. 4, the communication module 302 is a transceiver as shown in fig. 4, and the storage module 303 is a memory as shown in fig. 4, the data warehousing device 20 according to the embodiment of the present application may be the following data warehousing device 20.

Referring to fig. 4, the data warehousing device 20 includes: a processor 401, a transceiver 402, a memory 403, and a bus 404.

The processor 401, the transceiver 402 and the memory 403 are connected to each other through a bus 404; the bus 404 may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus.

The processor 401 may be a general-purpose Central Processing Unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits for controlling the execution of programs in accordance with the present invention.

The memory 403 may be a read-only memory (ROM) or other type of static storage device that can store static information and instructions, a Random Access Memory (RAM) or other type of dynamic storage device that can store information and instructions, an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disk storage, optical disk storage (including compact disc, laser disc, optical disc, digital versatile disc, blu-ray disc, etc.), magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited to these. The memory may be self-contained and coupled to the processor via a bus. The memory may also be integral to the processor.

The memory 403 is used for storing the application program code that implements the scheme of the application, and its execution is controlled by the processor 401. The transceiver 402 is configured to receive content input by an external device, and the processor 401 is configured to execute the application program code stored in the memory 403, so as to implement the data warehousing method in the embodiment of the present application.

It should be understood that, in the various embodiments of the present application, the sequence numbers of the above-mentioned processes do not mean the execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present application.

Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.

It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described apparatuses, devices and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.

In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus, and method may be implemented in other ways. For example, the above-described embodiments of the apparatus are merely illustrative, and for example, a division of a unit is merely a logical division, and an actual implementation may have another division, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.

Units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.

In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented using a software program, the embodiments may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the procedures or functions according to the embodiments of the present application are generated in whole or in part. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored on a computer readable storage medium or transmitted from one computer readable storage medium to another computer readable storage medium; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center via wired (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, radio, microwave) means. A computer-readable storage medium can be any available medium that can be accessed by a computer, or a data storage device, such as a server or data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.

The embodiment of the present invention further provides a computer program product, which can be directly loaded into the memory and contains software codes, and the computer program product can implement the data storage method after being loaded and executed by the computer.

The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims

1. A method for data warehousing, comprising:

acquiring a network data sample of a user terminal, and extracting a column of data with the fewest categories from the network data sample as a feature sample;

performing similarity calculation on the characteristic samples to generate a similarity matrix of the characteristic samples, and classifying data in the characteristic samples according to the similarity matrix to generate a new data group;

and inputting the new data group to a corresponding position of a database table according to a preset sequence of the hierarchical clustering tree.

2. The data warehousing method according to claim 1, wherein performing similarity calculation on the feature samples to generate a similarity matrix of the feature samples specifically includes:

performing similarity calculation on the feature samples according to the following formula to generate a similarity matrix of the feature samples:

D(x,x)＝D(y,y)＝0；

wherein x and y represent different data in the feature sample, respectively.

3. The method according to claim 1, wherein the obtaining the network data sample of the user terminal specifically includes:

collecting the network data of the user terminal, and parsing the network data into a list format of field key key_m and feature value value_mn to generate the network data sample; where m represents a column and n represents a row.

4. The method according to claim 3, further comprising, after collecting the network data of the user terminal:

and cleaning the repeated and invalid characteristic values in the network data by adopting a machine learning algorithm.

5. A data warehousing apparatus, comprising:

an acquisition unit, configured to acquire a network data sample of a user terminal;

a processing unit, configured to extract a column of data with the fewest categories from the network data sample acquired by the acquisition unit as a feature sample;

the processing unit is further configured to perform similarity calculation on the feature samples to generate a similarity matrix of the feature samples, and classify data in the feature samples according to the similarity matrix to generate a new class data set;

and the processing unit is also used for inputting the new data group to the corresponding position of the database table according to the preset sequence of the hierarchical clustering tree.

6. The data warehousing device of claim 5, comprising:

the processing unit is specifically configured to perform similarity calculation on the feature sample according to the following formula to generate a similarity matrix of the feature sample:

D(x,x)＝D(y,y)＝0；

wherein x and y represent different data in the feature sample, respectively.

7. The data warehousing device of claim 5, comprising:

the acquiring unit is specifically configured to collect network data of the user terminal, and parse the network data into a list format of field key key_m and feature value value_mn to generate the network data sample; where m represents a column and n represents a row.

8. The data warehousing device of claim 5, comprising:

the processing unit is further used for cleaning the network data by adopting a machine learning algorithm to process repeated and invalid characteristic values in the network data.

9. A data-warehousing device characterized in that the data-warehousing device comprises a processor and a memory, the memory is coupled to the processor and used for storing necessary program instructions and data of the data-warehousing device, and the processor is used for executing the program instructions stored in the memory to make the data-warehousing device execute the data-warehousing method as claimed in any one of claims 1 to 4.

10. A computer storage medium having computer program code stored therein, which when run on a data-warehousing apparatus causes the data-warehousing apparatus to perform the data-warehousing method of any one of claims 1-4.

11. A computer program product having computer software instructions stored thereon for causing a data-warehousing apparatus to perform the data-warehousing method of any one of claims 1-4 when the computer software instructions are run on the data-warehousing apparatus.