WO2018145228A1 - Quality of access reports in distributed storage systems - Google Patents

Info

Publication number
WO2018145228A1
Authority
WO
WIPO (PCT)
Prior art keywords
measure
storage component
computer storage
data
determining
Prior art date
Application number
PCT/CN2017/000155
Other languages
French (fr)
Inventor
Kuien LIU
Haozhou WANG
Original Assignee
Pivotal Software, Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Pivotal Software, Inc.
Priority to PCT/CN2017/000155
Publication of WO2018145228A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00: Network arrangements or protocols for supporting network services or applications
    • H04L67/01: Protocols
    • H04L67/06: Protocols specially adapted for file transfer, e.g. file transfer protocol [FTP]
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00: Network arrangements or protocols for supporting network services or applications
    • H04L67/01: Protocols
    • H04L67/10: Protocols in which an application is distributed across nodes in the network
    • H04L67/1097: Protocols in which an application is distributed across nodes in the network for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS]

Definitions

  • This disclosure relates to distributed storage systems.
  • a distributed storage component serves one or more client systems that may be located remotely from the storage component.
  • the performance of the client systems may depend on the performance associated with both a client system and the storage component.
  • Various performance measures can indicate quality of access from the client system to the storage component. To estimate the quality of access to the storage component and improve that quality, the client system needs to choose the right performance measures that provide accurate indication about the quality of access to the storage component.
  • This specification describes techniques for estimating and improving quality of access to a distributed storage component.
  • a system estimates the quality of access to the storage component by determining one or more of a measure of diversity of file types stored on the storage component, a measure of diversity of sizes of files stored on the storage component, and a measure of transport capability associated with accessing the storage component.
  • the system generates a report estimating the quality of access to the distributed storage component using the performance measures associated with suggestions about improving the quality of access.
  • the disclosed techniques enable estimating and improving quality of access to a distributed computer storage component.
  • the disclosed techniques improve upon conventional technologies by providing a user with more detailed status of the storage including, for example, the data model, network capacity and data distribution patterns. The more detailed status can enable the user to fine-tune the system to gain higher performance in data transfer than a conventional system can achieve.
  • differences in the sizes of files, e.g., objects, can hamper parallel processing performance of operations on the data.
  • the disclosed techniques improve upon conventional technologies by computing a measure of data skew to describe the size of files and using the measure to suggest improvements to file structures in an object-based storage model.
  • FIG. 1 is a block diagram illustrating an example distributed storage system.
  • FIG. 2 is a block diagram illustrating an example performance estimation engine.
  • FIG. 3 is a flowchart of an example process of generating a report on the quality of access to a computer storage component based on a measure of data diversity for the computer storage component.
  • FIG. 4 is a flowchart of an example process of generating a report on the quality of access to a computer storage component based on a measure of data skew for the computer storage component.
  • FIG. 5 is a flowchart of an example process of generating a report on the quality of access to a computer storage component by a client system based on a measure of transport capability for the client system and the computer storage component.
  • FIG. 6 is an example report on quality of access to a computer storage component by a client system.
  • FIG. 1 is a block diagram illustrating an example distributed storage system 100.
  • the distributed storage system 100 is configured to estimate the performance of a distributed storage medium in serving the storage needs of a client system.
  • Each component of the distributed storage system 100 can be implemented on one or more computers each including one or more computer processors.
  • the distributed storage system 100 includes a data warehouse system 101 and a cloud storage medium 104.
  • the data warehouse system 101 is an example of a client system that stores data on the cloud storage medium 104.
  • the cloud storage medium 104 is a computer storage component that stores data in one or more data files 106 and communicates with client systems, such as the data warehouse system 101, through an access interface 105.
  • the data warehouse system 101 includes a data warehouse 102.
  • the data warehouse 102 includes a master node 111 and multiple segment nodes, including a first segment node 141A, a second segment node 141B, and a third segment node 141C.
  • the master node 111 receives a request to the data warehouse 102, processes the request to formulate a task, divides the task into subtasks, and assigns the subtasks to the segment nodes 141A-C.
  • the segment nodes 141A-C perform the subtasks by communicating with the cloud storage medium 104 through the access interface 105.
  • the master node 111 and the segment nodes 141A-C communicate and share data using an interconnect switch 131.
  • the master node 111 and the individual segment nodes 141A-C can each be computer nodes with separate operating system, processing unit, storage unit, and memory unit components.
  • the data warehouse system 101 also includes an access performance estimation engine 103.
  • the access performance estimation engine 103 estimates the performance of the cloud storage medium 104 in serving the data warehouse system 101 by determining one or more performance measures.
  • the access performance estimation engine 103 then generates a report on the quality of access to the cloud storage medium 104 by the data warehouse 102 based on the performance measures.
  • An example access performance estimation engine 103 is described in greater detail below with reference to FIG. 2.
  • FIG. 2 is a block diagram illustrating an example access performance estimation engine 103.
  • Each component of the access performance estimation engine 103 can be implemented on one or more computers each including one or more computer processors.
  • the access performance estimation engine 103 includes a data diversity module 201, a data skew module 202, a transport capability module 203, a bias module 204, and a report generation module 205.
  • the data diversity module 201, the data skew module 202, the transport capability module 203, and the bias module 204 each determine a respective measure estimating the performance of the cloud storage medium 104 in serving the data warehouse system 101.
  • the report generation module 205 generates a report on the quality of access to the cloud storage medium 104 by the data warehouse 102 using the produced performance measures.
  • the data diversity module 201 determines a measure of data diversity for the cloud storage medium 104.
  • the measure of data diversity for the cloud storage medium 104 describes the diversity of file types, e.g., different file formats, associated with a group of the data files 106 stored on the cloud storage medium 104.
  • the measure of data diversity for the cloud storage medium 104 may describe a count of file types for a group of files that exist in a given path or segment of the cloud storage medium 104. More file types correspond to higher data diversity.
  • the type of a file is a property of a file determined based on the way the file encodes information.
  • Examples of computer file types include CSV, AVRO, TEXT, and PARQUET types. Determining a measure of data diversity for a cloud storage medium 104 is described in greater detail below with reference to FIG. 3.
  • the data skew module 202 determines a measure of data skew for the cloud storage medium 104.
  • the measure of data skew for the cloud storage medium 104 indicates a distribution of data size among a group of the data files 106 stored on the cloud storage medium 104. Determining a measure of data skew for a cloud storage medium 104 is described in greater detail below with reference to FIG. 4.
  • the transport capability module 203 determines a measure of transport capability for the data warehouse system 101 and the cloud storage medium 104.
  • the measure of transport capability for the data warehouse system 101 and the cloud storage medium 104 describes one or both of a measure of effective network throughput between the data warehouse system 101 and the cloud storage medium 104 and a measure of data movement cost on the cloud storage medium 104.
  • the measure of effective network throughput for the data warehouse system 101 and the cloud storage medium 104 describes a maximum rate of communication between the data warehouse system 101 and the cloud storage medium 104.
  • the measure of data movement cost on the cloud storage medium 104 describes a network cost incurred by moving one or more of the data files 106 on the cloud storage medium 104. Determining the measure of transport capability for the data warehouse system 101 and the cloud storage medium 104 is described in greater detail below with reference to FIG. 5.
  • the bias module 204 obtains an expected performance for the cloud storage medium 104.
  • the bias module 204 generates a measured performance for the cloud storage medium 104 to compare to the expected performance.
  • the bias module 204 then computes a measure of deviation of the measured performance from expected performance.
  • the expected performance for the cloud storage medium 104 can be a measure of ideal or recommended performance for the cloud storage medium 104 given a configuration of the cloud storage medium 104.
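The text does not state how the bias module 204 computes the deviation. As a minimal sketch, assuming a relative deviation (the difference between measured and expected performance, divided by the expected performance), the comparison could look like the following. The function name `relative_deviation` and the sample throughput figures are hypothetical and only serve the illustration.

```python
def relative_deviation(measured: float, expected: float) -> float:
    """Return (measured - expected) / expected.

    This relative form is an assumption; the source only says that a
    measure of deviation from the expected performance is computed.
    """
    if expected <= 0:
        raise ValueError("expected performance must be positive")
    return (measured - expected) / expected


# Hypothetical figures: 80 MB/s measured against 100 MB/s expected.
bias = relative_deviation(measured=80.0, expected=100.0)
print(f"deviation from expected performance: {bias:+.0%}")
```

A negative value indicates that the measured performance falls short of the expected performance for the given configuration.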
  • the report generation module 205 generates a report on the quality of access to the cloud storage medium 104 by the data warehouse 102 in serving the data warehouse system 101 based on the performance measures produced by one or more of the data diversity module 201, the data skew module 202, the transport capability module 203, and the bias module 204.
  • the generated report may also include recommendations for improving the performance of the cloud storage medium 104 in serving the data warehouse system 101.
  • An example performance report is described in greater detail below with reference to FIG. 6.
  • FIG. 3 is a flowchart of an example process 300 of generating a report on the quality of access to a computer storage component based on a measure of data diversity for the computer storage component.
  • the process 300 may be performed by a system of one or more computers.
  • a distributed storage system, e.g., the distributed storage system 100 of FIG. 1, appropriately programmed in accordance with this specification can perform process 300.
  • the system identifies (302) the computer storage component.
  • the computer storage component stores multiple files including a first group of files and serves a client system.
  • the first group of files may include files that exist in a given path or segment of the computer storage component. Identifying the computer storage component can occur in response to a user input of selecting the computer storage component from multiple computer storage components for generating a report.
  • the system determines (304) a respective file type for each file of the first group of files stored on the computer storage component.
  • file types include CSV, AVRO, TEXT, and PARQUET types.
  • the system determines (306) a measure of data diversity associated with the computer storage component based on a number of file types of the first group of files and the count of the files having each of the number of file types.
  • the system determines the measure of data diversity D based on the following formula as shown below in Equation 1.
  • $R$ is the number of file types of the first group of files and $p_i$ is a measure of abundance of a file type $i$.
  • the measure of abundance of a file type is a ratio of how many files out of the first group of files have the file type i.
  • the system generates (308) a report on estimated quality of access of a client system to the computer storage component based on the measure of data diversity.
  • the report includes a suggestion on reducing the data diversity to improve the quality of access, where reducing the data diversity includes reducing the number of file types of the first group of files.
  • the system can provide the report for storage or provide the report to a user device for presentation on a display device or on a printer.
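Steps 304 and 306 of process 300 can be sketched as follows. Equation 1 itself did not survive in this text, so the diversity index below uses the Shannon index, -Σ p_i ln p_i, purely as an assumed stand-in; the actual formula may differ. Classifying files by their name extension is likewise an assumption, and the helper names `file_type`, `abundances`, and `shannon_diversity` are hypothetical.

```python
import math
from collections import Counter
from pathlib import PurePosixPath


def file_type(name: str) -> str:
    """Classify a file by its extension (an assumption for this sketch)."""
    suffix = PurePosixPath(name).suffix.lower().lstrip(".")
    return suffix or "unknown"


def abundances(names: list[str]) -> dict[str, float]:
    """Return p_i for each file type i: the share of files having that type."""
    counts = Counter(file_type(n) for n in names)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}


def shannon_diversity(p: dict[str, float]) -> float:
    """Shannon index over the abundances; an ASSUMED stand-in for Equation 1."""
    return -sum(pi * math.log(pi) for pi in p.values() if pi > 0)


# Hypothetical first group of files under one path of the storage component.
files = ["a.csv", "b.csv", "c.avro", "d.json"]
p = abundances(files)
R = len(p)                 # number of file types
D = shannon_diversity(p)   # assumed diversity measure
print(R, p, round(D, 4))
```

Under this assumed index, a group that contains a single file type scores zero, and the score grows as more types appear or as the types become more evenly represented.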
  • FIG. 4 is a flowchart of an example process 400 for generating a report on the quality of access to a computer storage component based on a measure of data skew for the computer storage component.
  • the process 400 may be performed by a system of one or more computers.
  • a distributed storage system, e.g., the distributed storage system 100 of FIG. 1, appropriately programmed in accordance with this specification can perform process 400.
  • the system identifies (402) a computer storage component.
  • the computer storage component stores multiple files including a first group of files and serves a client system.
  • the system determines (404) a respective file size for each file of the first group of files stored on the computer storage component.
  • the size of a file is a measure of the amount of data stored in that file, or how much storage space the file consumes.
  • the system determines (406) a measure of data skew for the computer storage component based on each file size associated with each file of the first group of files.
  • the system determines the measure of data skew for the computer storage component based on a measure of central tendency, e.g., a mean or median, of the file sizes of the first group of files stored on the computer storage component and a measure of statistical variation of those file sizes. In some of those implementations, the system divides the measure of statistical variation by the measure of central tendency to generate the measure of data skew for the computer storage component.
  • the measure of data skew is a coefficient of variation, sometimes referred to as relative standard deviation, of the file sizes of the first group of files stored on the computer storage component.
  • the system determines the coefficient of variation based on the following formula as shown below in Equation 2: $C_v = \sigma / \mu$
  • $C_v$ is the coefficient of variation of the file sizes of the first group of files stored on the computer storage component
  • $\sigma$ is the standard deviation of the file sizes
  • $\mu$ is the mean of the file sizes
  • the system generates (408) a report on estimated quality of access of the client system to the computer storage component based on the measure of data skew.
  • the report includes a suggestion on reducing the measure of data skew to improve the quality of access.
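The coefficient of variation described for process 400 (standard deviation of the file sizes divided by their mean) can be computed in a few lines. This is a minimal sketch: the use of the population standard deviation rather than the sample standard deviation, the handling of all-empty files, and the function name `coefficient_of_variation` are assumptions not taken from the source.

```python
import statistics


def coefficient_of_variation(sizes: list[int]) -> float:
    """Return sigma / mu for the given file sizes in bytes."""
    if not sizes:
        raise ValueError("need at least one file size")
    mu = statistics.mean(sizes)
    if mu == 0:
        return 0.0  # all files empty: treated as no skew (assumption)
    sigma = statistics.pstdev(sizes)  # population std dev (assumption)
    return sigma / mu


# Hypothetical file sizes: one evenly sized group, one dominated by a large file.
even = [100, 100, 100, 100]
skewed = [1, 1, 1, 397]
print(coefficient_of_variation(even))
print(coefficient_of_variation(skewed))
```

Both groups hold the same total amount of data, but only the second has a high coefficient of variation, which is the situation in which a report would suggest restructuring the files.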
  • FIG. 5 is a flowchart of an example process 500 for generating a report on the quality of access to a computer storage component by a client system based on a measure of transport capability for the client system and the computer storage component.
  • the process 500 may be performed by a system of one or more computers.
  • a distributed storage system, e.g., the distributed storage system 100 of FIG. 1, appropriately programmed in accordance with this specification can perform process 500.
  • the system identifies (502) a computer storage component.
  • the computer storage component serves the client system.
  • the system determines (504) a measure of effective network throughput between the storage component and the client system.
  • the system determines the measure of effective network throughput by performing the following operations.
  • the system uploads a data file from the client system to the computer storage component, obtains a data receive window (RWIN) and a round-trip time (RTT) associated with the upload, and determines the measure of network throughput based on the RWIN and RTT associated with the upload.
  • the RWIN associated with a data transfer is an amount of data that a recipient of the data transfer can accept without acknowledging the sender of the data transfer.
  • the RTT associated with a data transfer is a length of time between the sender sending the transferred data and receiving an acknowledgment of the receipt of the transferred data by the recipient.
  • the system divides the RWIN associated with the upload by the RTT associated with the upload to determine the measure of network throughput.
  • the system downloads at least a portion of a data file from the computer storage component to the client system, obtains a data receive window (RWIN) and a round-trip time (RTT) associated with the download, and determines the measure of network throughput based on the RWIN and RTT associated with the download. In some of those implementations, the system divides the RWIN associated with the download by the RTT associated with the download to determine the measure of network throughput.
  • the system may determine a portion of a data file to download from the computer storage component to the client system based on sampling data fields from the data file in accordance with a sampling rate.
  • the system may generate the sampling rate or obtain the sampling rate from an end user.
  • the system determines (506) a measure of data movement cost for the computer storage component.
  • the system copies a data file from a first location on the computer storage medium to a second location on the computer storage medium.
  • the system obtains a data receive window (RWIN) and a round-trip time (RTT) associated with the copying.
  • the system determines the measure of data movement cost based on the RWIN and RTT associated with the copying.
  • the system divides the RWIN associated with the copying by the RTT associated with the copying to determine the measure of data movement cost.
  • the system determines (508) the measure of transport capability for the computer storage component and the client system based on the measure of effective network throughput and the measure of data movement cost. In some implementations, the system multiplies the measure of effective network throughput and the measure of data movement cost to generate the measure of transport capability.
  • the system generates (510) a report on estimated quality of access of the client system to the computer storage component based on the measure of transport capability.
  • the report can include a suggestion on reducing the measure of transport capability to improve the quality of access.
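Steps 504 through 508 of process 500 reduce to two divisions and one multiplication. The sketch below assumes the RWIN and RTT values have already been observed for an upload (or download) and for a copy within the storage component; how they are obtained is outside its scope. The class `TransferSample`, the function `transport_capability`, and the sample numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TransferSample:
    """RWIN and RTT observed for one transfer (values are assumed inputs)."""
    rwin_bytes: float
    rtt_seconds: float

    def rate(self) -> float:
        """RWIN divided by RTT, in bytes per second."""
        if self.rtt_seconds <= 0:
            raise ValueError("RTT must be positive")
        return self.rwin_bytes / self.rtt_seconds


def transport_capability(transfer: TransferSample, copy: TransferSample) -> float:
    throughput = transfer.rate()       # step 504: effective network throughput
    movement_cost = copy.rate()        # step 506: data movement cost
    return throughput * movement_cost  # step 508: product of the two measures


# Hypothetical observations: a 64 KiB window with 50 ms and 10 ms round trips.
upload = TransferSample(rwin_bytes=65536, rtt_seconds=0.05)
copy = TransferSample(rwin_bytes=65536, rtt_seconds=0.01)
print(upload.rate(), copy.rate(), transport_capability(upload, copy))
```

Because both inputs are rates, the product is only meaningful for comparing measurements taken the same way; the sketch does not attempt to normalize it.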
  • FIG. 6 is an example report 600 on quality of access to a computer storage component by a client system.
  • the report 600 may be generated by a system of one or more computers.
  • a distributed storage system, e.g., the distributed storage system 100 of FIG. 1, appropriately programmed in accordance with this specification can generate the report 600.
  • the system can provide the report 600 for presentation on a display device or a printer, or for storage for subsequent analysis.
  • the report 600 includes a header section 601, a status section 602, and a suggestions section 603.
  • the header section 601 includes information about a client system (in the field designated by the word “From”), a computer storage component (in the field designated by the word “To”), and a time frame within which the system measured the performance measures noted in the status section 602 (in the field designated by the word “Time”).
  • the status section 602 includes the values of performance measures pertaining to status of quality of access to the computer storage component by the client system (e.g., data skew and data diversity).
  • the status section 602 also includes, for each performance measure, detailed information about performance values and metrics pertaining to the measure.
  • detailed information for the data skew measure includes the total number of files in a group of files on the computer storage medium, the minimum, maximum, and average of the sizes of those files, and a visual representation of the distribution of the file sizes.
  • the detailed information for the data diversity measure includes the total number of file types for a group of files on the computer storage medium, the names of those file types, and the respective abundance ratio of those file types.
  • the suggestions section 603 includes one or more suggestions for improving estimated quality of access of the client system to the computer storage component.
  • the system may generate the suggestions in the suggestions section 603 based on the performance measures included in the status section 602. For instance, the first suggestion in FIG. 6 (i.e., “data defragment, merge small files into proper sizes”) is based on the data skew measure, while the second suggestion (i.e., “data transformation, convert JSON to CSV”) is based on the data diversity measure.
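The three sections of report 600 map naturally onto a small record type. The following sketch shows one way suggestions could be derived from the status values. The threshold `SKEW_LIMIT`, the rule of converting minority file types to the most abundant type, and all names (`Report`, `build_report`) are assumptions made for illustration; the source does not specify how suggestions are selected.

```python
from dataclasses import dataclass, field

SKEW_LIMIT = 1.0  # hypothetical threshold on the coefficient of variation


@dataclass
class Report:
    source: str       # "From" field: the client system
    target: str       # "To" field: the computer storage component
    time_frame: str   # "Time" field: when the measures were taken
    status: dict[str, float] = field(default_factory=dict)
    suggestions: list[str] = field(default_factory=list)


def build_report(source: str, target: str, time_frame: str,
                 data_skew: float, abundance: dict[str, float]) -> Report:
    report = Report(source, target, time_frame)
    report.status["data skew"] = data_skew
    report.status["data diversity"] = float(len(abundance))  # count of file types
    if data_skew > SKEW_LIMIT:
        report.suggestions.append(
            "data defragment, merge small files into proper sizes")
    if len(abundance) > 1:
        # Assumed rule: convert every minority type to the most abundant type.
        major = max(abundance, key=abundance.get)
        for minor in (t for t in abundance if t != major):
            report.suggestions.append(
                f"data transformation, convert {minor} to {major}")
    return report


# Hypothetical inputs chosen to reproduce the two suggestions quoted from FIG. 6.
report = build_report("data warehouse", "cloud storage", "last 24 hours",
                      data_skew=1.7, abundance={"CSV": 0.75, "JSON": 0.25})
for line in report.suggestions:
    print(line)
```

A group of evenly sized files in a single format yields a report with an empty suggestions section under these assumed rules.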
  • Embodiments of the subject matter and the functional operations described in this specification can be implemented in digital electronic circuitry, in tangibly-embodied computer software or firmware, in computer hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them.
  • Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on a tangible non-transitory program carrier for execution by, or to control the operation of, data processing apparatus.
  • the program instructions can be encoded on an artificially-generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode information for transmission to suitable receiver apparatus for execution by a data processing apparatus.
  • the computer storage medium can be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them.
  • data processing apparatus refers to data processing hardware and encompasses all kinds of apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers.
  • the apparatus can also be or further include special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit).
  • the apparatus can optionally include, in addition to hardware, code that creates an execution environment for computer programs, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them.
  • a computer program which may also be referred to or described as a program, software, a software application, a module, a software module, a script, or code, can be written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages, and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
  • a computer program may, but need not, correspond to a file in a file system.
  • a program can be stored in a portion of a file that holds other programs or data, e.g., one or more scripts stored in a markup language document, in a single file dedicated to the program in question, or in multiple coordinated files, e.g., files that store one or more modules, sub-programs, or portions of code.
  • a computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a communication network.
  • the processes and logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform functions by operating on input data and generating output.
  • the processes and logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit).
  • Computers suitable for the execution of a computer program include, by way of example, general or special purpose microprocessors or both, or any other kind of central processing unit.
  • a central processing unit will receive instructions and data from a read-only memory or a random access memory or both.
  • the essential elements of a computer are a central processing unit for performing or executing instructions and one or more memory devices for storing instructions and data.
  • a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical disks, or optical disks.
  • a computer need not have such devices.
  • a computer can be embedded in another device, e.g., a mobile telephone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device, e.g., a universal serial bus (USB) flash drive, to name just a few.
  • Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks.
  • the processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.
  • a computer having a display device, e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor, for displaying information to the user and a keyboard and a pointing device, e.g., a mouse or a trackball, by which the user can provide input to the computer.
  • Other kinds of devices can be used to provide for interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback, e.g., visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, or tactile input.
  • a computer can interact with a user by sending documents to and receiving documents from a device that is used by the user; for example, by sending web pages to
  • Embodiments of the subject matter described in this specification can be implemented in a computing system that includes a back-end component, e.g., as a data server, or that includes a middleware component, e.g., an application server, or that includes a front-end component, e.g., a client computer having a graphical user interface or a Web browser through which a user can interact with an implementation of the subject matter described in this specification, or any combination of one or more such back-end, middleware, or front-end components.
  • the components of the system can be interconnected by any form or medium of digital data communication, e.g., a communication network. Examples of communication networks include a local area network (LAN) and a wide area network (WAN), e.g., the Internet.
  • the computing system can include clients and servers.
  • a client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
  • a server transmits data, e.g., an HTML page, to a user device, e.g., for purposes of displaying data to and receiving user input from a user interacting with the user device, which acts as a client.
  • Data generated at the user device, e.g., a result of the user interaction, can be received from the user device at the server.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Debugging And Monitoring (AREA)

Abstract

Systems, methods, and computer program products for estimating and improving quality of access to a distributed storage component. A system estimates the quality of access to the storage component by determining one or more of a measure of diversity of file types stored on the storage component, a measure of diversity of sizes of files stored on the storage component, and a measure of transport capability associated with accessing the storage component. Using the performance measures, the system generates a report that estimates the quality of access to the distributed storage component and includes suggestions about improving the quality of access.

Description

QUALITY OF ACCESS REPORTS IN DISTRIBUTED STORAGE SYSTEMS
This disclosure relates to distributed storage systems.
In distributed storage systems, a distributed storage component serves one or more client systems that may be located remotely from the storage component. The performance of the client systems may depend on the performance associated with both a client system and the storage component. Various performance measures can indicate quality of access from the client system to the storage component. To estimate the quality of access to the storage component and improve that quality, the client system needs to choose performance measures that accurately indicate the quality of access to the storage component.
SUMMARY
This specification describes techniques for estimating and improving quality of access to a distributed storage component. A system estimates the quality of access to the storage component by determining one or more of a measure of diversity of file types stored on the storage component, a measure of diversity of sizes of files stored on the storage component, and a measure of transport capability associated with accessing the storage component. The system generates a report that estimates the quality of access to the distributed storage component using the performance measures and includes associated suggestions for improving the quality of access.
The subject matter described in this specification can be implemented in various embodiments so as to realize one or more of the following advantages. The disclosed techniques enable estimating and improving quality of access to a distributed computer storage component. The disclosed techniques improve upon conventional technologies by providing a user with more detailed status of the storage including, for example, the data model, network capacity and data distribution patterns. The more detailed status can enable the user to fine-tune the system to gain higher performance in data transfer than a conventional system can achieve. In computer storage systems that use object-based storage models, difference in sizes of files, e.g., objects, can hamper parallel processing performance of operations on the data. The disclosed techniques improve upon conventional technologies by computing a measure of data skew to describe the size of files and using the measure to suggest improvements to file structures in an object-based storage model.
The details of one or more implementations of the subject matter described in this specification are set forth in the accompanying drawings and the description below. Other features, aspects, and advantages of the subject matter will become apparent from the description, the drawings, and the claims.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a block diagram illustrating an example distributed storage system.
FIG. 2 is a block diagram illustrating an example performance estimation engine.
FIG. 3 is a flowchart of an example process of generating a report on the quality of access to a computer storage component based on a measure of data diversity for the computer storage component.
FIG. 4 is a flowchart of an example process of generating a report on the quality of access to a computer storage component based on a measure of data skew for the computer storage component.
FIG. 5 is a flowchart of an example process of generating a report on the quality of access to a computer storage component by a client system based on a measure of transport capability for the client system and the computer storage component.
FIG. 6 is an example report on quality of access to a computer storage component by a client system.
Like reference numbers and designations in the various drawings indicate like elements.
DETAILED DESCRIPTION
FIG. 1 is a block diagram illustrating an example distributed storage system 100. The distributed storage system 100 is configured to estimate the performance of a distributed storage medium in serving the storage needs of a client system. Each component of the distributed storage system 100 can be implemented on one or more computers each including one or more computer processors.
The distributed storage system 100 includes a data warehouse system 101 and a cloud storage medium 104. The data warehouse system 101 is an example of a client system that stores data on the cloud storage medium 104. The cloud storage medium 104 is a computer storage component that stores data in one or more data files 106 and communicates with client systems, such as the data warehouse system 101, through an access interface 105.
The data warehouse system 101 includes a data warehouse 102. The data warehouse 102 includes a master node 111 and multiple segment nodes, including a first segment node 141A, a second segment node 141B, and a third segment node 141C. The master node 111 receives a request to the data warehouse 102, processes the request to formulate a task, divides the task into subtasks, and assigns the subtasks to the segment nodes 141A-C. The segment nodes 141A-C perform the subtasks by communicating with the cloud storage medium 104 through the access interface 105. The master node 111 and the segment nodes 141A-C communicate and share data using an interconnect switch 131. The master node 111 and the individual segment nodes 141A-C can each be computer nodes with separate operating system, processing unit, storage unit, and memory unit components.
The data warehouse system 101 also includes an access performance estimation engine 103. The access performance estimation engine 103 estimates the performance of the cloud storage medium 104 in serving the data warehouse system 101 by determining one or more performance measures. The access performance estimation engine 103 then generates a report on the quality of access to the cloud storage medium 104 by the data warehouse 102 based on the performance measures. An example access performance estimation engine 103 is described in greater detail below with reference to FIG. 2.
FIG. 2 is a block diagram illustrating an example access performance estimation engine 103. Each component of the access performance estimation engine 103 can be implemented on one or more computers each including one or more computer processors.
The access performance estimation engine 103 includes a data diversity module 201, a data skew module 202, a transport capability module 203, a bias module 204, and a report generation module 205.
The data diversity module 201, the data skew module 202, the transport capability module 203, and the bias module 204 each determine a respective measure estimating the performance of the cloud storage medium 104 in serving the data warehouse system 101. The report generation module 205 generates a report on the quality of access to the cloud storage medium 104 by the data warehouse 102 using the produced performance measures.
The data diversity module 201 determines a measure of data diversity for the cloud storage medium 104. The measure of data diversity for the cloud storage medium 104 describes the diversity of file types, e.g., different file formats, associated with a group of the data files 106 stored on the cloud storage medium 104. For instance, the measure of data diversity for the cloud storage medium 104 may describe a count of file types for a group of files that exist in a given path or segment of the cloud storage medium 104. More file types correspond to higher data diversity.
The type of a file is a property of the file determined by the way the file encodes information. Examples of computer file types include CSV, AVRO, TEXT, and PARQUET types. Determining a measure of data diversity for a cloud storage medium 104 is described in greater detail below with reference to FIG. 3.
The data skew module 202 determines a measure of data skew for the cloud storage medium 104. The measure of data skew for the cloud storage medium 104 indicates a distribution of data size among a group of the data files 106 stored on the cloud storage medium 104. Determining a measure of data skew for a cloud storage medium 104 is described in greater detail below with reference to FIG. 4.
The transport capability module 203 determines a measure of transport capability for the data warehouse system 101 and the cloud storage medium 104. The measure of transport capability for the data warehouse system 101 and the cloud storage medium 104 describes one or both of a measure of effective network throughput between the data warehouse system 101 and the cloud storage medium 104 and a measure of data movement cost on the cloud storage medium 104. The measure of effective network throughput for the data warehouse system 101 and the cloud storage medium 104 describes a maximum rate of communication between the data warehouse system 101 and the cloud storage medium 104. The measure of data movement cost on the cloud storage medium 104 describes a network cost incurred by moving one or more of the data files 106 on the cloud storage medium 104. Determining the measure of transport capability for the data warehouse system 101 and the cloud storage medium 104 is described in greater detail below with reference to FIG. 5.
The bias module 204 obtains an expected performance for the cloud storage medium 104. The bias module 204 generates a measured performance for the cloud storage medium 104 to compare to the expected performance. The bias module 204 then computes a measure of deviation of the measured performance from the expected performance. The expected performance for the cloud storage medium 104 can be a measure of ideal or recommended performance for the cloud storage medium 104 given a configuration of the cloud storage medium 104.
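The specification does not state how the measure of deviation is computed. As a minimal sketch, assuming a signed relative deviation is acceptable, the comparison could look like the following. The function name `performance_deviation` and the choice of relative (rather than absolute) deviation are illustrative assumptions, not part of the disclosure.

```python
def performance_deviation(measured: float, expected: float) -> float:
    """Signed relative deviation of measured from expected performance.

    Assumption: deviation is expressed as a fraction of the expected value,
    so -0.2 means the measured performance is 20% below expectation.
    """
    if expected == 0:
        raise ValueError("expected performance must be non-zero")
    return (measured - expected) / expected


# Example: 80 MB/s measured against 100 MB/s expected.
deviation = performance_deviation(80.0, 100.0)
print(f"deviation: {deviation:+.0%}")
```

A negative value indicates that the storage medium performs below the expected level for its configuration.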
The report generation module 205 generates a report on the quality of access to the cloud storage medium 104 by the data warehouse 102 in serving the data warehouse system 101 based on the performance measures produced by one or more of the data diversity module 201, the data skew module 202, the transport capability module 203, and the bias module 204. The generated report may also include recommendations for improving the performance of the cloud storage medium 104 in serving the data warehouse system 101. An example performance report is described in greater detail below with reference to FIG. 6.
FIG. 3 is a flowchart of an example process 300 of generating a report on the quality of access to a computer storage component based on a measure of data diversity for the computer storage component. The process 300 may be performed by a system of one or more computers. For instance, a distributed storage system, e.g., the distributed storage system 100 of FIG. 1, programmed in accordance with this specification can perform process 300.
The system identifies (302) the computer storage component. The computer storage component stores multiple files including a first group of files and serves a client system. The first group of files may include files that exist in a given path or segment of the computer storage component. Identifying the computer storage component can occur in response to a user input of selecting the computer storage component from multiple computer storage components for generating a report.
The system determines (304) a respective file type for each file of the first group of files stored on the computer storage component. Examples of file types include CSV, AVRO, TEXT, and PARQUET types.
The system determines (306) a measure of data diversity associated with the computer storage component based on a number of file types of the first group of files and the count of the files having each of the number of file types. In some implementations, the system determines the measure of data diversity D based on the following formula as shown below in Equation 1.
[Equation 1: Figure PCTCN2017000155-appb-000001]
where R is the number of file types of the first group of files and p_i is a measure of abundance of a file type i. The measure of abundance of a file type i is the fraction of files in the first group of files that have the file type i.
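Equation 1 is available only as an image, so its exact form cannot be confirmed from this text. The sketch below assumes a Shannon-style index, D = -Σ p_i ln p_i summed over the R file types, which is one common diversity index built from the same quantities; the actual Equation 1 may differ. Inferring the file type from the file extension (`file_type`) is also an assumption made for illustration.

```python
import math
from collections import Counter
from pathlib import PurePosixPath


def file_type(name: str) -> str:
    """Hypothetical type detection: the upper-cased file extension."""
    suffix = PurePosixPath(name).suffix
    return suffix[1:].upper() if suffix else "UNKNOWN"


def abundances(names):
    """Return {file type: p_i}, the fraction of files having each type."""
    counts = Counter(file_type(n) for n in names)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}


def shannon_diversity(names):
    """Assumed form of D: the sum of -p_i * ln(p_i) over the R file types."""
    return sum(-p * math.log(p) for p in abundances(names).values())


group = ["sales/a.csv", "sales/b.csv", "sales/c.avro", "sales/d.txt"]
print(abundances(group))         # per-type fractions p_i
print(shannon_diversity(group))  # larger when more types are evenly mixed
```

Under this assumed index, a group containing a single file type scores 0, and the score grows as more file types appear in comparable proportions.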
The system generates (308) a report on estimated quality of access of a client system to the computer storage component based on the measure of data diversity. The report includes a suggestion on reducing the data diversity to improve the quality of access, where reducing the data diversity includes reducing the number of file types of the first group of files. The system can provide the report for storage or provide the report to a user device for presentation on a display device or on a printer.
FIG. 4 is a flowchart of an example process 400 for generating a report on the quality of access to a computer storage component based on a measure of data skew for the computer storage component. The process 400 may be performed by a system of one or more computers. For instance, a distributed storage system, e.g., the distributed storage system 100 of FIG. 1, appropriately programmed in accordance with this specification can perform process 400.
The system identifies (402) a computer storage component. The computer storage component stores multiple files including a first group of files and serves a client system.
The system determines (404) a respective file size for each file of the first group of files stored on the computer storage component. The size of a file is a measure of the amount of data stored in that file, or how much storage space the file consumes.
The system determines (406) a measure of data skew for the computer storage component based on each file size associated with each file of the first group of files.
In some implementations, the system determines the measure of data skew for the computer storage component based on a measure of central tendency, e.g., a mean or median, of the file sizes of the first group of files stored on the computer storage component and a measure of statistical variation of those file sizes. In some of those implementations, the system divides the measure of statistical variation by the measure of central tendency to generate the measure of data skew for the computer storage component.
In some implementations, the measure of data skew is a coefficient of variation, sometimes referred to as relative standard deviation, of the file sizes of the first group of files stored on the computer storage component. In some implementations, the system determines the coefficient of variation based on the following formula as shown below in Equation 2:
$$C_v = \frac{\sigma}{\mu} \qquad (2)$$
where C_v is the coefficient of variation of the file sizes of the first group of files stored on the computer storage component, σ is the standard deviation of the file sizes, and μ is the mean of the file sizes.
The system generates (408) a report on estimated quality of access of the client system to the computer storage component based on the measure of data skew. The report includes a suggestion on reducing the measure of data skew to improve the quality of access.
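The coefficient of variation can be sketched in a few lines. The text does not say whether the standard deviation is the population or the sample form; this example assumes the population standard deviation, and the function name `data_skew` is illustrative.

```python
import statistics


def data_skew(file_sizes):
    """Coefficient of variation: standard deviation divided by mean.

    Assumption: the population standard deviation is used, since the
    group of files is treated as the whole population being measured.
    """
    mean = statistics.fmean(file_sizes)
    if mean == 0:
        raise ValueError("mean file size must be non-zero")
    return statistics.pstdev(file_sizes) / mean


uniform = [64, 64, 64, 64]             # sizes in MB, all equal
mixed = [2, 4, 4, 4, 5, 5, 7, 9]       # sizes in MB, moderately spread
print(data_skew(uniform))  # 0.0: no skew
print(data_skew(mixed))    # standard deviation 2 over mean 5
```

Because the measure is dimensionless, it can be compared across groups of files whose typical sizes differ by orders of magnitude.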
FIG. 5 is a flowchart of an example process 500 for generating a report on the quality of access to a computer storage component by a client system based on a measure of transport capability for the client system and the computer storage component. The process 500 may be performed by a system of one or more computers. For instance, a distributed storage system, e.g., the distributed storage system 100 of FIG. 1, appropriately programmed in accordance with this specification can perform process 500.
The system identifies (502) a computer storage component. The computer storage component serves the client system.
The system determines (504) a measure of effective network throughput between the storage component and the client system.
In some implementations, the system determines the measure of effective network throughput by performing the following operations. The system uploads a data file from the client system to the computer storage component, obtains a data receive window (RWIN) and a round-trip time (RTT) associated with the upload, and determines the measure of network throughput based on the RWIN and RTT associated with the upload. The RWIN associated with a data transfer is the amount of data that a recipient of the data transfer can accept without sending an acknowledgment to the sender of the data transfer. The RTT associated with a data transfer is the length of time between the sender sending the transferred data and the sender receiving an acknowledgment of the receipt of the transferred data by the recipient. In some of those implementations, the system divides the RWIN associated with the upload by the RTT associated with the upload to determine the measure of network throughput.
In some implementations, the system downloads at least a portion of a data file from the computer storage component to the client system, obtains a data receive window (RWIN) and a round-trip time (RTT) associated with the download, and determines the measure of network throughput based on the RWIN and RTT associated with the download. In some of those implementations, the system divides the RWIN associated with the download by the RTT associated with the download to determine the measure of network throughput.
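The division of RWIN by RTT described above can be sketched as follows. How the RWIN and RTT values are obtained for a given upload or download is outside this sketch; the function name and the units (bytes and seconds) are assumptions for illustration.

```python
def effective_throughput(rwin_bytes: int, rtt_seconds: float) -> float:
    """Throughput bound in bytes per second: RWIN divided by RTT."""
    if rtt_seconds <= 0:
        raise ValueError("round-trip time must be positive")
    return rwin_bytes / rtt_seconds


# Example: a 64 KiB receive window over a 50 ms round trip.
rate = effective_throughput(64 * 1024, 0.050)
print(f"{rate / 1e6:.2f} MB/s")
```

With a fixed receive window, the achievable rate falls in proportion to the round-trip time, which is why a remote storage component can appear slow even on a high-bandwidth link.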
The system may determine a portion of a data file to download from the computer storage component to the client system based on sampling data fields from the data file in accordance with a sampling rate. The system may generate the sampling rate or obtain the sampling rate from an end user.
The system determines (506) a measure of data movement cost for the computer storage component. In some implementations, the system copies a data file from a first location on the computer storage medium to a second location on the computer storage medium. The system obtains a data receive window (RWIN) and a round-trip time (RTT) associated with the copying. The system then determines the measure of data movement cost based on the RWIN and RTT associated with the copying. In some of those implementations, the system divides the RWIN associated with the copying by the RTT associated with the copying to determine the measure of data movement cost.
The system determines (508) the measure of transport capability for the computer storage component and the client system based on the measure of effective network throughput and the measure of data movement cost. In some implementations, the system multiplies the measure of effective network throughput and the measure of data movement cost to generate the measure of transport capability.
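Steps 504 through 508 can be combined in a short sketch, assuming both component measures are computed as RWIN divided by RTT as in the implementations described above. The `TransferSample` type and the sample values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TransferSample:
    """RWIN and RTT observed for one transfer (hypothetical container)."""
    rwin_bytes: int
    rtt_seconds: float

    def rate(self) -> float:
        """RWIN divided by RTT, in bytes per second."""
        if self.rtt_seconds <= 0:
            raise ValueError("round-trip time must be positive")
        return self.rwin_bytes / self.rtt_seconds


def transport_capability(access: TransferSample, copy: TransferSample) -> float:
    """Product of effective network throughput and data movement cost.

    `access` is sampled from an upload or download between the client and
    the storage component; `copy` is sampled from a copy between two
    locations on the storage component.
    """
    return access.rate() * copy.rate()


access = TransferSample(rwin_bytes=1000, rtt_seconds=0.5)
copy = TransferSample(rwin_bytes=3000, rtt_seconds=1.5)
print(transport_capability(access, copy))
```

The product has no direct physical unit; it serves as a single figure for comparing one client and storage pairing against another.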
The system generates (510) a report on estimated quality of access of the client system to the computer storage component based on the measure of transport capability. The report can include a suggestion on reducing the measure of transport capability to improve the quality of access.
FIG. 6 is an example report 600 on quality of access to a computer storage component by a client system. The report 600 may be generated by a system of one or more computers. For instance, a distributed storage system, e.g., the distributed storage system 100 of FIG. 1, appropriately programmed in accordance with this specification can generate the report 600. The system can provide the report 600 for presentation on a display device or a printer, or for storage for subsequent analysis.
The report 600 includes a header section 601, a status section 602, and a suggestions section 603.
The header section 601 includes information about a client system (in the field designated by the word “From”), a computer storage component (in the field designated by the word “To”), and a time frame within which the system measured the performance measures noted in the status section 602 (in the field designated by the word “Time”).
The status section 602 includes the values of performance measures pertaining to the status of quality of access to the computer storage component by the client system (e.g., data skew and data diversity). The status section 602 also includes, for each performance measure, detailed information about performance values and metrics pertaining to the measure.
For instance, the detailed information for the data skew measure includes the total number of files in a group of files on the computer storage medium, the minimum, maximum, and average of the sizes of those files, and a visual representation of the distribution of the file sizes. The detailed information for the data diversity measure includes the total number of file types for a group of files on the computer storage medium, the names of those file types, and the respective abundance ratio of those file types.
The suggestions section 603 includes one or more suggestions for improving estimated quality of access of the client system to the computer storage component. The system may generate the suggestions in the suggestions section 603 based on the performance measures included in the status section 602. For instance, the first suggestion in FIG. 6 (i.e., “data defragment, merge small files into proper sizes”) is based on the data skew measure, while the second suggestion (i.e., “data transformation, convert JSON to CSV”) is based on the data diversity measure.
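The three report sections can be modeled with a small data structure. The sketch below is illustrative only: the class and function names, the skew threshold of 1.0, and the rule of converting minority file types to the most abundant type are assumptions, since the specification does not state how suggestions are derived from the measures.

```python
from dataclasses import dataclass, field


@dataclass
class AccessReport:
    client: str       # header "From" field
    storage: str      # header "To" field
    time_frame: str   # header "Time" field
    status: dict = field(default_factory=dict)
    suggestions: list = field(default_factory=list)


def build_report(client, storage, time_frame, data_skew, type_abundance,
                 skew_limit=1.0):
    """Fill the status section, then derive suggestions from it.

    Assumptions: `skew_limit` is a hypothetical threshold, and minority
    file types are suggested for conversion to the most abundant type.
    """
    report = AccessReport(client, storage, time_frame, status={
        "data skew": data_skew,
        "data diversity": dict(type_abundance),
    })
    if data_skew > skew_limit:
        report.suggestions.append(
            "data defragment, merge small files into proper sizes")
    if len(type_abundance) > 1:
        dominant = max(type_abundance, key=type_abundance.get)
        for file_type in sorted(type_abundance):
            if file_type != dominant:
                report.suggestions.append(
                    f"data transformation, convert {file_type} to {dominant}")
    return report


report = build_report("data warehouse", "cloud storage", "2017-02-01",
                      data_skew=1.8, type_abundance={"CSV": 0.75, "JSON": 0.25})
for line in report.suggestions:
    print(line)
```

Keeping the status values alongside the suggestions lets a reader of the report trace each suggestion back to the measure that triggered it.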
Embodiments of the subject matter and the functional operations described in this specification can be implemented in digital electronic circuitry, in tangibly-embodied computer software or firmware, in computer hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them. Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on a tangible non-transitory program carrier for execution by, or to control the operation of, data processing apparatus. Alternatively, or in addition, the program instructions can be encoded on an artificially-generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode information for transmission to suitable receiver apparatus for execution by a data processing apparatus. The computer storage medium can be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them.
The term “data processing apparatus” refers to data processing hardware and encompasses all kinds of apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers. The apparatus can also be or further include special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit). The apparatus can optionally include, in addition to hardware, code that creates an execution environment for computer programs, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them.
A computer program, which may also be referred to or described as a program, software, a software application, a module, a software module, a script, or code, can be written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages, and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A computer program may, but need not, correspond to a file in a file system. A program can be stored in a portion of a file that holds other programs or data, e.g., one or more scripts stored in a markup language document, in a single file dedicated to the program in question, or in multiple coordinated files, e.g., files that store one or more modules, sub-programs, or portions of code. A computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a communication network.
The processes and logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform functions by operating on input data and generating output. The processes and logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit).
Computers suitable for the execution of a computer program include, by way of example, general or special purpose microprocessors or both, or any other kind of central processing unit. Generally, a central processing unit will receive instructions and data from a read-only memory or a random access memory or both. The essential elements of a computer are a central processing unit for performing or executing instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical disks, or optical disks. However, a computer need not have such devices. Moreover, a computer can be embedded in another device, e.g., a mobile telephone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device, e.g., a universal serial bus (USB) flash drive, to name just a few.
Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.
To provide for interaction with a user, embodiments of the subject matter described in this specification can be implemented on a computer having a display device, e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor, for displaying information to the user and a keyboard and a pointing device, e.g., a mouse or a trackball, by which the user can provide input to the computer. Other kinds of devices can be used to provide for interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback, e.g., visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, or tactile input. In addition, a computer can interact with a user by sending documents to and receiving documents from a device that is used by the user; for example, by sending web pages to a web browser on a user’s device in response to requests received from the web browser.
Embodiments of the subject matter described in this specification can be implemented in a computing system that includes a back-end component, e.g., as a data server, or that includes a middleware component, e.g., an application server, or that includes a front-end component, e.g., a client computer having a graphical user interface or a Web browser through which a user can interact with an implementation of the subject matter described in this specification, or any combination of one or more such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication, e.g., a communication network. Examples of communication networks include a local area network (LAN) and a wide area network (WAN), e.g., the Internet.
The computing system can include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. In some embodiments, a server transmits data, e.g., an HTML page, to a user device, e.g., for purposes of displaying data to and receiving user input from a user interacting with the user device, which acts as a client. Data generated at the user device, e.g., a result of the user interaction, can be received from the user device at the server.
While this specification contains many specific implementation details, these should not be construed as limitations on the scope of what may be claimed, but rather as descriptions of features that may be specific to particular embodiments. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.
Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.
Particular embodiments of the subject matter have been described. Other embodiments are within the scope of the following claims. For example, the actions recited in the claims can be performed in a different order and still achieve desirable results. As one example, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some cases, multitasking and parallel processing may be advantageous.
What is claimed is:

Claims (20)

  1. A computer-implemented method comprising:
    identifying a computer storage component, the computer storage component storing a plurality of files including a first group of files and serving a client system;
    determining, for each file of the first group of files stored on the computer storage component, a respective file type;
    determining, from a number of file types associated with the first group of files stored on the computer storage component and a respective count of files having each file type, a measure of data diversity associated with the computer storage component; and
    generating a report on estimated quality of access of the client system to the computer storage component based on the measure of data diversity, the report including a suggestion on reducing the measure of data diversity to improve the quality of access, wherein reducing the measure of data diversity includes reducing the number of file types of the first group of files.
  2. The computer-implemented method of claim 1, wherein determining the measure of data diversity associated with the computer storage component comprises:
    determining, for each file type associated with the first group of files, a respective measure of proportional abundance on the computer storage component based on the respective count of the first group of files having each file type and a total count of the first group of files on the computer storage component;
    combining the respective measures of proportional abundance for the file types; and
    determining the measure of data diversity based on the combined measures of proportional abundance.
  3. The computer-implemented method of claim 1, further comprising:
    determining, for each file of the first group of files stored on the computer storage component, a respective file size indicator; and
    determining, from each respective file size indicator associated with each file of the first group of files stored on the computer storage component, a measure of data skew associated with the computer storage component,
    wherein generating the report on the quality of access to the computer storage component is further based on the measure of data skew associated with the computer storage component and the report includes a suggestion on reducing the measure of data skew to improve the quality of access.
  4. The computer-implemented method of claim 3, wherein determining the measure of data skew associated with the computer storage component comprises:
    determining a measure of central tendency of each file size indicator associated with each file of the first group of files stored on the computer storage component;
    determining a measure of statistical variance of each file size indicator associated with each file of the first group of files stored on the computer storage component; and
    determining the measure of data skew based on the measure of central tendency and the measure of statistical variance.
  5. The computer-implemented method of claim 1, further comprising:
    determining a measure of effective network throughput between the computer storage component and the client system;
    determining a measure of data movement cost on the computer storage component; and
    determining a measure of transport capability for the computer storage component and the client system based on:
    the measure of effective network throughput between the computer storage component and the client system, and
    the measure of data movement cost on the computer storage component,
    wherein generating the report on the quality of access to the computer storage component is further based on the measure of transport capability for the computer storage component and the client system and the report includes a suggestion on reducing the measure of transport capability to improve the quality of access.
  6. The computer-implemented method of claim 5, wherein determining each measure of network throughput comprises:
    uploading a data file from the client system to the computer storage component;
    obtaining a data receive window (RWIN) and a round-trip time (RTT) associated with the upload; and
    determining the measure of network throughput based on the RWIN and RTT associated with the upload.
  7. The computer-implemented method of claim 5, wherein determining each measure of network throughput comprises:
    downloading at least a portion of a data file from the computer storage component to the client system;
    obtaining a data receive window (RWIN) and a round-trip time (RTT) associated with the download; and
    determining the measure of network throughput based on the RWIN and RTT associated with the download.
  8. The computer-implemented method of claim 7, wherein downloading at least a portion of the data file from the computer storage component to the client system comprises:
    obtaining a sampling rate;
    sampling one or more data fields from the data file according to the sampling rate to generate a sampled portion of the data file; and
    downloading the sampled portion of the data file from the computer storage component to the client system.
  9. The computer-implemented method of claim 5, wherein determining the measure of data movement cost comprises:
    copying a data file from a first location on the computer storage component to a second location on the computer storage component;
    obtaining a data receive window (RWIN) and a round-trip time (RTT) associated with the copying; and
    determining the measure of data movement cost based on the RWIN and RTT associated with the copying.
  10. The computer-implemented method of claim 1, further comprising:
    obtaining an expected performance for the computer storage component;
    generating a measured performance corresponding to the expected performance;
    computing a measure of deviation of the measured performance from the expected performance; and
    wherein generating the report on the quality of access to the computer storage component is further based on the measure of deviation of the measured performance from the expected performance and the report includes a suggestion on reducing the measure of deviation to improve the quality of access.
  11. The computer-implemented method of claim 1, wherein the client system is a data warehousing platform.
  12. A system comprising one or more computers and one or more storage devices storing instructions that are operable, when executed by the one or more computers, to cause the one or more computers to perform operations comprising:
    identifying a computer storage component, the computer storage component storing a plurality of files including a first group of files and serving a client system;
    determining, for each file of the first group of files stored on the computer storage component, a respective file type;
    determining, from a number of file types associated with the first group of files stored on the computer storage component and a respective count of files having each file type, a measure of data diversity associated with the computer storage component; and
    generating a report on estimated quality of access of the client system to the computer storage component based on the measure of data diversity, the report including a suggestion on reducing the measure of data diversity to improve the quality of access, wherein reducing the measure of data diversity includes reducing the number of file types of the first group of files.
  13. The system of claim 12, wherein determining the measure of data diversity associated with the computer storage component comprises:
    determining, for each file type associated with the first group of files, a respective measure of proportional abundance on the computer storage component based on the respective count of the first group of files having each file type and a total count of the first group of files on the computer storage component;
    combining the respective measures of proportional abundance for the file types; and
    determining the measure of data diversity based on the combined measures of proportional abundance.
  14. The system of claim 12, wherein the operations further comprise:
    determining, for each file of the first group of files stored on the computer storage component, a respective file size indicator; and
    determining, from each respective file size indicator associated with each file of the first group of files stored on the computer storage component, a measure of data skew associated with the computer storage component,
    wherein generating the report on the quality of access to the computer storage component is further based on the measure of data skew associated with the computer storage component and the report includes a suggestion on reducing the measure of data skew to improve the quality of access.
  15. The system of claim 14, wherein determining the measure of data skew associated with the computer storage component comprises:
    determining a measure of central tendency of each file size indicator associated with each file of the first group of files stored on the computer storage component;
    determining a measure of statistical variance of each file size indicator associated with each file of the first group of files stored on the computer storage component; and
    determining the measure of data skew based on the measure of central tendency and the measure of statistical variance.
  16. The system of claim 12, wherein the operations further comprise:
    determining a measure of effective network throughput between the computer storage component and the client system;
    determining a measure of data movement cost on the computer storage component; and
    determining a measure of transport capability for the computer storage component and the client system based on:
    the measure of effective network throughput between the computer storage component and the client system, and
    the measure of data movement cost on the computer storage component,
    wherein generating the report on the quality of access to the computer storage component is further based on the measure of transport capability for the computer storage component and the client system and the report includes a suggestion on reducing the measure of transport capability to improve the quality of access.
  17. The system of claim 16, wherein determining each measure of network throughput comprises:
    uploading a data file from the client system to the computer storage component;
    obtaining a data receive window (RWIN) and a round-trip time (RTT) associated with the upload; and
    determining the measure of network throughput based on the RWIN and RTT associated with the upload.
  18. The system of claim 16, wherein determining the measure of data movement cost comprises:
    copying a data file from a first location on the computer storage component to a second location on the computer storage component;
    obtaining a data receive window (RWIN) and a round-trip time (RTT) associated with the copying; and
    determining the measure of data movement cost based on the RWIN and RTT associated with the copying.
  19. The system of claim 12, wherein the operations further comprise:
    obtaining an expected performance for the computer storage component;
    generating a measured performance corresponding to the expected performance;
    computing a measure of deviation of the measured performance from the expected performance; and
    wherein generating the report on the quality of access to the computer storage component is further based on the measure of deviation of the measured performance from the expected performance and the report includes a suggestion on reducing the measure of deviation to improve the quality of access.
  20. A non-transitory computer storage medium encoded with instructions that, when executed by one or more computers, cause the one or more computers to perform operations comprising:
    identifying a computer storage component, the computer storage component storing a plurality of files including a first group of files and serving a client system;
    determining, for each file of the first group of files stored on the computer storage component, a respective file type;
    determining, from a number of file types associated with the first group of files stored on the computer storage component and a respective count of files having each file type, a measure of data diversity associated with the computer storage component; and
    generating a report on estimated quality of access of the client system to the computer storage component based on the measure of data diversity, the report including a suggestion on reducing the measure of data diversity to improve the quality of access, wherein reducing the measure of data diversity includes reducing the number of file types of the first group of files.
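The measures recited in claims 2, 4, 6 and 7 can be illustrated with a minimal sketch. The claims do not fix particular formulas, so the choices here are assumptions made only for illustration: a Shannon-style index as the way of combining proportional abundances, the coefficient of variation (standard deviation divided by mean) as the way of combining central tendency and variance, and the common window-limited bound RWIN / RTT as the throughput estimate. The function names are hypothetical and do not come from the application.

```python
import math
import statistics
from collections import Counter


def data_diversity(file_types):
    """Combine per-type proportional abundances into one diversity value.

    Assumption: a Shannon-style index, sum of p * ln(1 / p), is used.
    """
    counts = Counter(file_types)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    diversity = 0.0
    for count in counts.values():
        p = count / total  # proportional abundance of this file type
        diversity += p * math.log(1 / p)
    return diversity


def data_skew(file_sizes):
    """Combine central tendency and variance into one skew value.

    Assumption: coefficient of variation (population std dev / mean).
    """
    if not file_sizes:
        return 0.0
    mean = statistics.fmean(file_sizes)
    if mean == 0:
        return 0.0
    return statistics.pstdev(file_sizes) / mean


def window_limited_throughput(rwin_bytes, rtt_seconds):
    """Estimate throughput in bytes per second as RWIN / RTT."""
    if rtt_seconds <= 0:
        raise ValueError("rtt_seconds must be positive")
    return rwin_bytes / rtt_seconds
```

Under these assumptions, a storage component holding only one file type has diversity 0, and a component whose files are all the same size has skew 0; larger values of either would be what the report suggests reducing.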
PCT/CN2017/000155 2017-02-13 2017-02-13 Quality of access reports in distributed storage systems WO2018145228A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2017/000155 WO2018145228A1 (en) 2017-02-13 2017-02-13 Quality of access reports in distributed storage systems

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2017/000155 WO2018145228A1 (en) 2017-02-13 2017-02-13 Quality of access reports in distributed storage systems

Publications (1)

Publication Number Publication Date
WO2018145228A1 true WO2018145228A1 (en) 2018-08-16

Family

ID=63106906

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/000155 WO2018145228A1 (en) 2017-02-13 2017-02-13 Quality of access reports in distributed storage systems

Country Status (1)

Country Link
WO (1) WO2018145228A1 (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103389884A (en) * 2013-07-29 2013-11-13 华为技术有限公司 Method for processing input/output request, host, server and virtual machine
CN103747047A (en) * 2013-12-24 2014-04-23 乐视网信息技术(北京)股份有限公司 CDN file storage method, file distribution control center and system thereof

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110018978A (en) * 2019-04-15 2019-07-16 北京硬壳科技有限公司 Data transmission method and system
CN110018978B (en) * 2019-04-15 2021-08-13 北京硬壳科技有限公司 Data transmission method and system

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17895656

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 17895656

Country of ref document: EP

Kind code of ref document: A1