US11989210B2 - Providing energy efficient dynamic redundancy elimination for stored data - Google Patents
- Publication number
- US11989210B2 (application US17/812,028)
- Authority
- US
- United States
- Prior art keywords
- data
- data objects
- objects
- semantically
- semantic
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active, expires
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/355—Creation or modification of classes or clusters
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/284—Relational databases
- G06F16/285—Clustering or classification
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
- G06F16/215—Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
Definitions
- Software applications may generate and/or store data in data structures (e.g., databases, tables, lists, and/or the like). Carbon emissions generated by a process of managing and storing data in data structures are a major contributing factor in overall energy costs of maintaining software applications across industries.
- the method may include receiving data objects from an object corpus stored in a data structure, and identifying unique segments within the data objects as elements.
- the method may include replacing all equivalent segments with one representative segment, and generating an embedding space based on unique elements and mappings of the data objects to embeddings.
- the method may include estimating semantic proximities among the data objects based on the mappings of the data objects to the embeddings, and building a semantic cohesion network among the data objects based on the semantic proximities among the data objects.
- the method may include identifying semantically cohesive data clusters in the semantic cohesion network, and sorting the data objects in the semantically cohesive data clusters to generate semantically cohesive and sorted data clusters.
- the method may include receiving a new data object, and determining, from the semantically cohesive and sorted data clusters, a home data cluster for the new data object.
- the method may include determining whether the new data object is semantically similar, within a threshold, to a data object in the home data cluster, and storing bookkeeping details of the new data object in the data structure based on the new data object being semantically similar to the data object in the home data cluster.
- the device may include one or more memories and one or more processors coupled to the one or more memories.
- the one or more processors may be configured to receive data objects from an object corpus stored in a data structure, and identify unique segments within the data objects as elements.
- the one or more processors may be configured to replace all equivalent segments with one representative segment, and generate an embedding space based on unique elements and mappings of the data objects to embeddings.
- the one or more processors may be configured to estimate semantic proximities among the data objects based on the mappings of the data objects to the embeddings, and build a semantic cohesion network among the data objects based on the semantic proximities among the data objects.
- the semantic cohesion network may include a set of nodes corresponding to the data objects, links between the set of nodes that are based on the semantic proximities among the data objects, and weights associated with the links.
- the one or more processors may be configured to identify semantically cohesive data clusters in the semantic cohesion network, and sort the data objects in the semantically cohesive data clusters to generate semantically cohesive and sorted data clusters.
- the one or more processors may be configured to receive a new data object, and determine, from the semantically cohesive and sorted data clusters, a home data cluster for the new data object.
- the one or more processors may be configured to determine whether the new data object is semantically similar, within a threshold, to a data object in the home data cluster, and store bookkeeping details of the new data object in the data structure based on the new data object being semantically similar to the data object in the home data cluster.
- Some implementations described herein relate to a non-transitory computer-readable medium that stores a set of instructions for a device.
- the set of instructions when executed by one or more processors of the device, may cause the device to receive data objects from an object corpus stored in a data structure, and identify unique segments within the data objects as elements.
- the set of instructions when executed by one or more processors of the device, may cause the device to replace all equivalent segments with one representative segment, and generate an embedding space based on unique elements and mappings of the data objects to embeddings.
- the set of instructions when executed by one or more processors of the device, may cause the device to estimate semantic proximities among the data objects based on the mappings of the data objects to the embeddings, and build a semantic cohesion network among the data objects based on the semantic proximities among the data objects.
- the set of instructions when executed by one or more processors of the device, may cause the device to identify semantically cohesive data clusters in the semantic cohesion network, and sort the data objects in the semantically cohesive data clusters to generate semantically cohesive and sorted data clusters.
- the set of instructions when executed by one or more processors of the device, may cause the device to receive a new data object, and determine, from the semantically cohesive and sorted data clusters, a home data cluster for the new data object.
- the set of instructions when executed by one or more processors of the device, may cause the device to determine whether the new data object is semantically similar, within a threshold, to a data object in the home data cluster, and selectively store bookkeeping details of the new data object in the data structure based on the new data object being semantically similar to the data object in the home data cluster, or prevent the new data object from being stored in the data structure based on the new data object being semantically similar to the data object in the home data cluster.
- FIGS. 1A-1H are diagrams of an example implementation described herein.
- FIG. 2 is a diagram of an example environment in which systems and/or methods described herein may be implemented.
- FIG. 3 is a diagram of example components of one or more devices of FIG. 2.
- FIG. 4 is a flowchart of an example process for providing energy efficient dynamic redundancy elimination for stored data.
- a majority of data stored in data structures fails to provide practically useful insights without application of resource intensive analytics of the data. Therefore, storage and management of such data in data structures is increasingly becoming an overhead with high computational and energy costs.
- Current techniques for storing data focus on removal of redundant data via database deduplication techniques (e.g., that operate at a level of meta-characteristics of data objects for identifying duplicates by matching size, type, modification date, and/or the like), compression of data objects for specific data types (e.g., audio encoding techniques), generic compression techniques (e.g., that operate at syntactic levels by identifying repeated byte sequences in a file), and/or the like.
- current techniques for storing data therefore consume computing resources (e.g., processing resources, memory resources, communication resources, and/or the like), networking resources, and/or the like associated with performing data analytics on large data structures with redundant data, unnecessarily storing large quantities of redundant data in data structures, unnecessarily storing large quantities of useless data in data structures, integrating data in data structures, and/or the like.
- the redundancy elimination system may receive data objects from an object corpus stored in a data structure, and may identify unique segments within the data objects as elements.
- the redundancy elimination system may replace all equivalent segments with one representative segment, and may generate an embedding space based on unique elements and mappings of the data objects to embeddings.
- the redundancy elimination system may estimate semantic proximities among the data objects based on the mappings of the data objects to the embeddings, and may build a semantic cohesion network among the data objects based on the semantic proximities among the data objects.
- the redundancy elimination system may identify semantically cohesive data clusters in the semantic cohesion network, and may sort the data objects in the semantically cohesive data clusters to generate semantically cohesive and sorted data clusters.
- the redundancy elimination system may receive a new data object, and may determine, from the semantically cohesive and sorted data clusters, a home data cluster for the new data object.
- the redundancy elimination system may determine whether the new data object is semantically similar, within a threshold, to a data object in the home data cluster, and may store bookkeeping details of the new data object in the data structure based on the new data object being semantically similar to the data object in the home data cluster.
- the redundancy elimination system provides energy efficient dynamic redundancy elimination for stored data.
- the redundancy elimination system may dynamically identify semantically redundant data objects as the data objects are generated and are received for storage. By efficiently identifying redundancies in the data objects, the redundancy elimination system may provide energy cost savings by a factor of at least two relative to the current techniques.
- the redundancy elimination system may identify redundancies in the data objects based on semantic matching at a level of constituent elements of the data objects (e.g., unique phrases in a text document and relative information contained in those phrases).
- the redundancy elimination system may semantically compress a data structure as an element matrix by storing data objects in terms of constituent elements and information content of the data objects.
- FIGS. 1A-1H are diagrams of an example 100 associated with providing energy efficient dynamic redundancy elimination for stored data.
- example 100 includes a redundancy elimination system associated with a user device and a data structure.
- the redundancy elimination system may include a system that provides energy efficient dynamic redundancy elimination for stored data. Further details of the redundancy elimination system, the user device, and the data structure are provided elsewhere herein.
- the redundancy elimination system may receive data objects from an object corpus stored in a data structure.
- the data structure may store the data objects of the object corpus.
- the data objects may include data objects generated by software applications and/or stored by the software applications in the data structure.
- the redundancy elimination system may continuously receive the data objects from the data structure, may periodically receive the data objects from the data structure, may receive the data objects based on providing a request to the data structure, and/or the like.
- a data object may include a region of storage that contains a value or a group of values. Each value may be accessed using an identifier or a more complex expression that refers to the data object.
- the data objects may include unique identifiers, data types, and attributes. In this way, the data objects may vary across data structures and different programming languages.
- the redundancy elimination system may identify unique segments within the data objects as elements and may replace all equivalent segments with one representative segment. For example, the redundancy elimination system may identify the unique segments within the data objects as the elements based on data types associated with the data objects.
- if a data object is a text document, the redundancy elimination system may identify the unique segments within the data objects as words or phrases; if a data object is an image, the redundancy elimination system may identify the unique segments within the data objects as various identified objects within the image; if a data object is a record, the redundancy elimination system may identify the unique segments within the data objects as fields; if a data object is an audio file, the redundancy elimination system may identify the unique segments within the data objects as different types of sound patterns, such as speech, music, background noise, silence, and/or the like; if a data object is a complex data type (e.g., an extensible markup language (XML) file), the redundancy elimination system may identify the unique segments within the data objects as constituent elements in the schema; and/or the like.
- the redundancy elimination system may replace all equivalent segments (e.g., elements) with one representative segment in the entire object corpus.
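- The replacement of equivalent segments can be sketched as follows. This is a minimal illustration for text segments only; the equivalence rule (case and whitespace normalization) and the names `normalize` and `canonicalize_corpus` are assumptions, since the source does not state how equivalence is decided.

```python
import re
from typing import Dict, List


def normalize(segment: str) -> str:
    """Hypothetical equivalence key: lowercase and collapse whitespace."""
    return re.sub(r"\s+", " ", segment.strip().lower())


def canonicalize_corpus(corpus: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Replace each segment with the first-seen representative of its class."""
    representative: Dict[str, str] = {}
    result: Dict[str, List[str]] = {}
    for name, segments in corpus.items():
        # setdefault keeps the first segment seen for each equivalence key.
        result[name] = [
            representative.setdefault(normalize(s), s) for s in segments
        ]
    return result


corpus = {
    "doc1": ["Energy cost", "data  store"],
    "doc2": ["energy COST", "new phrase"],
}
canonical = canonicalize_corpus(corpus)
```

After this pass, every occurrence of an equivalent segment across the corpus points to the same representative string, so later steps can treat representatives as the unique elements.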
- the redundancy elimination system may generate an embedding space based on unique elements and mappings of the data objects to embeddings. For example, the redundancy elimination system may estimate an information theoretic significance for each element within a data object of the object corpus using a technique, such as BM25.
- BM25 is a ranking function used by search engines to estimate a relevance of a document to a given search query.
- bm25(w) may be the information theoretic significance for an element w.
- the redundancy elimination system may build embeddings of unique elements within the object corpus (e.g., by applying a word2vec model or a GloVe model for textual data objects).
- e(w) may be an embedding for the element w.
- the redundancy elimination system may transform each embedding using an information theoretic significance of the element associated with each embedding, as follows: $e(w) \leftarrow \mathrm{bm25}(w) \cdot e(w)$
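- The significance weighting can be sketched as follows. BM25 is normally a query-document score, so this sketch assumes that bm25(w) is approximated by the BM25 inverse document frequency term of the element over the corpus; that choice, the toy embeddings, and the function names are illustrative assumptions and not the method stated in the source.

```python
import math
from typing import Dict, List


def idf_significance(corpus: List[List[str]]) -> Dict[str, float]:
    """BM25-style IDF per element: ln((N - n + 0.5) / (n + 0.5) + 1)."""
    n_docs = len(corpus)
    doc_freq: Dict[str, int] = {}
    for elements in corpus:
        for w in set(elements):  # count each element once per object
            doc_freq[w] = doc_freq.get(w, 0) + 1
    return {
        w: math.log((n_docs - n + 0.5) / (n + 0.5) + 1.0)
        for w, n in doc_freq.items()
    }


def reweight(
    embeddings: Dict[str, List[float]], significance: Dict[str, float]
) -> Dict[str, List[float]]:
    """Apply e(w) <- bm25(w) * e(w) to every element embedding."""
    return {
        w: [significance[w] * x for x in vec] for w, vec in embeddings.items()
    }


corpus = [["a", "b"], ["a", "c"], ["a"]]
sig = idf_significance(corpus)
emb = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}  # toy embeddings
weighted = reweight(emb, sig)
```

Elements that appear in every object (here `"a"`) receive a small weight, so they contribute less to the object embeddings than rare elements.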
- the redundancy elimination system may generate embeddings of the data objects. For example, for each data object A in the object corpus, the redundancy elimination system may map the data object A into the embedding space based on embeddings of constituent elements of the data object A, as follows:
- $e(A) = \dfrac{\sum_{w \in A} n_w \cdot e(w)}{|A|}$, where
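- Reading the mapping as a count-weighted average of element embeddings, a data object can be embedded as sketched below. The definitions of n_w and |A| are cut off in this text, so the sketch assumes n_w is the number of occurrences of element w in A and |A| is the total number of element occurrences in A.

```python
from collections import Counter
from typing import Dict, List


def embed_object(elements: List[str], e: Dict[str, List[float]]) -> List[float]:
    """Map a data object to the embedding space from its element embeddings."""
    counts = Counter(elements)  # n_w (assumed: occurrences of w in A)
    size = len(elements)        # |A| (assumed: total element occurrences)
    dim = len(next(iter(e.values())))
    total = [0.0] * dim
    for w, n_w in counts.items():
        for i, x in enumerate(e[w]):
            total[i] += n_w * x
    return [t / size for t in total]


e = {"x": [1.0, 0.0], "y": [0.0, 2.0]}  # toy (already weighted) embeddings
vec = embed_object(["x", "x", "y", "y"], e)
```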
- the redundancy elimination system may estimate semantic proximities among the data objects based on the mappings of the data objects to the embeddings. For example, the redundancy elimination system may measure the semantic proximities among the data objects based on the mappings of the data objects to the embeddings.
- two data objects $A_1$ and $A_2$ in the object corpus may include mappings to the embedding space as $e(A_1)$ and $e(A_2)$.
- the redundancy elimination system may estimate a semantic proximity between data objects $A_1$ and $A_2$ as:
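- The proximity formula is not reproduced in this text. A minimal sketch, assuming cosine similarity between the two object embeddings (a common choice for embedding spaces, not confirmed by the source):

```python
import math
from typing import List


def semantic_proximity(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two embeddings; 0.0 if either is a zero vector."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


same_direction = semantic_proximity([1.0, 0.0], [2.0, 0.0])
orthogonal = semantic_proximity([1.0, 0.0], [0.0, 3.0])
```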
- the redundancy elimination system may build a semantic cohesion network among the data objects based on the semantic proximities among the data objects.
- $V_{DB}$ may include a set of nodes corresponding to the data objects, such that node $v_i$ may correspond to data object $A_i$.
- $E_{DB} \subseteq V_{DB} \times V_{DB}$ may include a set of undirected links between pairs of nodes having a semantic proximity of at least a threshold parameter (e.g., 0.75, 0.80, 0.85, and/or the like).
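- Building the network can be sketched as below: nodes are object names, and an undirected weighted link is kept for each pair whose proximity reaches the threshold. Cosine similarity as the proximity measure, the function names, and the sample vectors are assumptions for illustration.

```python
import math
from itertools import combinations
from typing import Dict, List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity, used here as an assumed proximity measure."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def build_network(
    embeddings: Dict[str, List[float]], threshold: float = 0.8
) -> Dict[Tuple[str, str], float]:
    """Return weighted undirected links keyed by sorted node pairs."""
    edges: Dict[Tuple[str, str], float] = {}
    for u, v in combinations(sorted(embeddings), 2):
        weight = cosine(embeddings[u], embeddings[v])
        if weight >= threshold:  # keep only sufficiently proximate pairs
            edges[(u, v)] = weight
    return edges


objects = {"A1": [1.0, 0.0], "A2": [0.9, 0.1], "A3": [0.0, 1.0]}
edges = build_network(objects, threshold=0.8)
```

This pairwise pass is quadratic in the number of objects; it is run once over the existing corpus, while new objects are later compared only against a home cluster.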
- the redundancy elimination system may identify maximal groups of the data objects, such that data objects within each group are strongly semantically and cohesively connected to each other and such that the groups include maximum possible data objects (e.g., if a new data object is added to a group, the group may not remain semantically cohesive for at least one of the existing data objects in the group).
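- Groups in which every pair of objects is linked and to which no further object can be added correspond to maximal cliques of the network. The sketch below uses the Bron-Kerbosch algorithm as one way to enumerate them; the source does not name an algorithm, so this choice and the adjacency-set input format are assumptions. Note that maximal cliques may overlap.

```python
from typing import Dict, List, Set


def maximal_cliques(adj: Dict[str, Set[str]]) -> List[Set[str]]:
    """Enumerate maximal cliques with the basic Bron-Kerbosch recursion."""
    cliques: List[Set[str]] = []

    def expand(r: Set[str], p: Set[str], x: Set[str]) -> None:
        if not p and not x:
            cliques.append(r)  # r cannot be extended by any node
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}  # v is fully explored
            x = x | {v}  # exclude v from later cliques at this level

    expand(set(), set(adj), set())
    return cliques


adj = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"},
    "d": {"e"}, "e": {"d"},
    "f": set(),
}
clusters = maximal_cliques(adj)
```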
- the redundancy elimination system may sort the data objects in each data cluster by the quantity of distinct constituent elements that the data objects include. For example, the redundancy elimination system may sort text files based upon the different phrases appearing in each text file. Such intra-cluster sorting of the data objects may reduce a time required by the redundancy elimination system to detect an identical data object when a new data object is received, as described below.
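- Intra-cluster sorting can be sketched as below, assuming each cluster is held as a mapping from object name to its list of elements; the tie-break on object name is an added assumption to make the order deterministic.

```python
from typing import Dict, List


def sort_cluster(cluster: Dict[str, List[str]]) -> List[str]:
    """Order object names by their number of distinct constituent elements."""
    return sorted(cluster, key=lambda name: (len(set(cluster[name])), name))


cluster = {
    "notes": ["a", "b", "c", "a"],  # 3 distinct elements
    "memo": ["a"],                  # 1 distinct element
    "draft": ["a", "b"],            # 2 distinct elements
}
order = sort_cluster(cluster)
```

With the cluster kept in this order, a lookup for a new object can start near objects with a similar element count instead of scanning the whole cluster.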
- the redundancy elimination system may receive a new data object.
- a software application executing on the user device may generate the new data object.
- the user device may provide the new data object to the redundancy elimination system, and the redundancy elimination system may receive the new data object from the user device.
- the new data object may include a region of storage that contains a value or a group of values.
- the redundancy elimination system may determine, from the semantically cohesive and sorted data clusters, a home data cluster for the new data object. For example, when the new data object is received, the redundancy elimination system may identify the home data cluster ($C_{home}$) for the new data object.
- the home data cluster may be a data cluster of data objects with a maximum proximity with the new data object ($o_{new}$). In this way, the redundancy elimination system may ensure computational resource optimization, in contrast to current techniques for deduplication (e.g., which estimate a distance of the new data object from all data objects in the object corpus to determine whether the new data object is a duplicate).
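- Home cluster selection can be sketched as below. The sketch assumes each cluster is summarized by a centroid vector and that proximity to a cluster is the cosine similarity to that centroid; both are assumptions for illustration, as are the names used.

```python
import math
from typing import Dict, List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity, used here as an assumed proximity measure."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def find_home_cluster(
    new_vec: List[float], centroids: Dict[str, List[float]]
) -> str:
    """Return the cluster whose centroid is most proximate to the new object."""
    return max(centroids, key=lambda c: cosine(new_vec, centroids[c]))


centroids = {"c1": [1.0, 0.0], "c2": [0.0, 1.0]}
home = find_home_cluster([0.2, 0.9], centroids)
```

Only one comparison per cluster is needed here, rather than one per stored object, which is where the saving over corpus-wide deduplication comes from.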
- a sorting parameter e.g., size of the text
- the redundancy elimination system may determine whether the data object ($o_{id}$) in the home data cluster is semantically identical to the new data object ($o_{new}$) (e.g., whether a proximity of the data object $o_{id}$ and the new data object $o_{new}$ is greater than a predetermined high threshold, between 0 and 1, set by the redundancy elimination system).
- the redundancy elimination system may store bookkeeping details of the new data object in the data structure based on the new data object being semantically similar to the data object in the home data cluster. For example, if the redundancy elimination system determines that the data object ($o_{id}$) in the home data cluster is semantically identical to the new data object ($o_{new}$) (e.g., that the proximity of the data object $o_{id}$ and the new data object $o_{new}$ is greater than the predetermined high threshold), the redundancy elimination system may determine that the new data object is redundant and need not be stored in the data structure.
- the redundancy elimination system may not store the new data object in the data structure but may store bookkeeping details of the new data object in the data structure.
- the bookkeeping details of the new data object may include a name of the new data object, a size of the new data object, a name of the application that generated the new data object, and/or the like, and may require very little storage space in the data structure relative to the storage space required for the new data object.
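- The redundancy decision can be sketched as below. The `Bookkeeping` record mirrors the fields named in the text (name, size, generating application); the cosine proximity, the 0.95 threshold, and the function names are assumptions for illustration.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Bookkeeping:
    name: str
    size: int
    application: str


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity, used here as an assumed proximity measure."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def find_redundant_match(
    new_vec: List[float], home_cluster: Dict[str, List[float]], threshold: float
) -> Optional[str]:
    """Return the name of a stored object that the new object duplicates."""
    for name, vec in home_cluster.items():
        if cosine(new_vec, vec) > threshold:
            return name
    return None


def handle_new_object(
    name: str, payload: bytes, application: str, new_vec: List[float],
    home_cluster: Dict[str, List[float]], ledger: List[Bookkeeping],
    threshold: float = 0.95,
) -> bool:
    """Return True if the object must be stored, False if only bookkeeping is kept."""
    if find_redundant_match(new_vec, home_cluster, threshold) is not None:
        ledger.append(Bookkeeping(name, len(payload), application))
        return False
    return True


ledger: List[Bookkeeping] = []
home_cluster = {"report_v1": [1.0, 0.0]}
must_store = handle_new_object(
    "report_v2", b"same text", "editor", [0.99, 0.01], home_cluster, ledger
)
```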
- the redundancy elimination system may execute a chunking process to identify semantically unique elements in the new data object based on the new data object not being semantically similar to the data object in the home data cluster. For example, if the redundancy elimination system determines that no data object ($o_{id}$) in the home data cluster is semantically identical to the new data object ($o_{new}$) (e.g., that the proximity of the data object $o_{id}$ and the new data object $o_{new}$ is less than or equal to the predetermined high threshold), the redundancy elimination system may determine that the new data object is to be stored in the data structure and may execute the chunking process to identify the semantically unique elements in the new data object.
- // add new rows for all unique elements in $o_{new}$
- c ← c + 1 // add new column $o_{new}$
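- Only fragments of the chunking pseudocode survive in this text (adding rows for unique elements and adding a column for the new object). The sketch below is one possible reading, assuming an element matrix whose rows are unique elements, whose columns are data objects, and whose cells hold occurrence counts; the class and its layout are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple


class ElementMatrix:
    """Sparse element-by-object matrix (hypothetical layout)."""

    def __init__(self) -> None:
        self.row_of: Dict[str, int] = {}            # element -> row index
        self.columns: List[str] = []                # column index -> object name
        self.cells: Dict[Tuple[int, int], int] = {}  # (row, column) -> count

    def add_object(self, name: str, elements: List[str]) -> None:
        c = len(self.columns)
        self.columns.append(name)  # add a new column for the new object
        for w, n in Counter(elements).items():
            if w not in self.row_of:
                # add a new row only for elements not seen before
                self.row_of[w] = len(self.row_of)
            self.cells[(self.row_of[w], c)] = n


matrix = ElementMatrix()
matrix.add_object("o1", ["a", "b", "a"])
matrix.add_object("o_new", ["b", "c"])
```

Elements shared with earlier objects (here `"b"`) reuse their existing row, so only the genuinely new elements of an object add storage.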
- the redundancy elimination system may store the semantically unique elements of the new data object in the data structure.
- the redundancy elimination system may store the semantically unique elements of the new data object in the data structure based on the new data object not being semantically similar to the data object in the home data cluster.
- the redundancy elimination system may add the new data object to the home data cluster, and may update a centroid of the home data cluster based on the addition of the new data object to the home data cluster.
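- The centroid update can be done incrementally, without revisiting the existing members. A minimal sketch, assuming the centroid is the arithmetic mean of the member embeddings (the source does not define the centroid):

```python
from typing import List, Tuple


def update_centroid(
    centroid: List[float], size: int, new_vec: List[float]
) -> Tuple[List[float], int]:
    """Return the mean and member count after adding one embedding."""
    updated = [(size * c + v) / (size + 1) for c, v in zip(centroid, new_vec)]
    return updated, size + 1


centroid, size = update_centroid([1.0, 1.0], 1, [3.0, 5.0])
```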
- the redundancy elimination system may provide computational gain relative to current data storage techniques.
- a computational gain ($\mathrm{gain}_{comp}$) when a new object is received may be provided by:
- $DB(o_1), \ldots, DB(o_i)$, at the time when new objects are received, are $\alpha \geq 1$ times larger than home data clusters of $o_1, \ldots, o_i$
- the computational gain may be approximated as: $\mathrm{gain}_{comp} \approx \alpha$, or $(\alpha - 1) \times 100\%$.
- FIGS. 1A-1H are provided as an example. Other examples may differ from what is described with regard to FIGS. 1A-1H.
- the number and arrangement of devices shown in FIGS. 1A-1H are provided as an example. In practice, there may be additional devices, fewer devices, different devices, or differently arranged devices than those shown in FIGS. 1A-1H.
- two or more devices shown in FIGS. 1A-1H may be implemented within a single device, or a single device shown in FIGS. 1A-1H may be implemented as multiple, distributed devices.
- a set of devices (e.g., one or more devices) shown in FIGS. 1A-1H may perform one or more functions described as being performed by another set of devices shown in FIGS. 1A-1H.
- FIG. 2 is a diagram of an example environment 200 in which systems and/or methods described herein may be implemented.
- the environment 200 may include a redundancy elimination system 201, which may include one or more elements of and/or may execute within a cloud computing system 202.
- the cloud computing system 202 may include one or more elements 203-213, as described in more detail below.
- the environment 200 may include a network 220, a user device 230, and/or a data structure 240. Devices and/or elements of the environment 200 may interconnect via wired connections and/or wireless connections.
- the cloud computing system 202 includes computing hardware 203, a resource management component 204, a host operating system (OS) 205, and/or one or more virtual computing systems 206.
- the resource management component 204 may perform virtualization (e.g., abstraction) of the computing hardware 203 to create the one or more virtual computing systems 206.
- the resource management component 204 enables a single computing device (e.g., a computer, a server, and/or the like) to operate like multiple computing devices, such as by creating multiple isolated virtual computing systems 206 from the computing hardware 203 of the single computing device. In this way, the computing hardware 203 can operate more efficiently, with lower power consumption, higher reliability, higher availability, higher utilization, greater flexibility, and lower cost than using separate computing devices.
- the computing hardware 203 includes hardware and corresponding resources from one or more computing devices.
- the computing hardware 203 may include hardware from a single computing device (e.g., a single server) or from multiple computing devices (e.g., multiple servers), such as multiple computing devices in one or more data centers.
- the computing hardware 203 may include one or more processors 207, one or more memories 208, one or more storage components 209, and/or one or more networking components 210. Examples of a processor, a memory, a storage component, and a networking component (e.g., a communication component) are described elsewhere herein.
- the resource management component 204 includes a virtualization application (e.g., executing on hardware, such as the computing hardware 203) capable of virtualizing the computing hardware 203 to start, stop, and/or manage the one or more virtual computing systems 206.
- the resource management component 204 may include a hypervisor (e.g., a bare-metal or Type 1 hypervisor, a hosted or Type 2 hypervisor, and/or the like) or a virtual machine monitor, such as when the virtual computing systems 206 are virtual machines 211.
- the resource management component 204 may include a container manager, such as when the virtual computing systems 206 are containers 212.
- the resource management component 204 executes within and/or in coordination with a host operating system 205.
- a virtual computing system 206 includes a virtual environment that enables cloud-based execution of operations and/or processes described herein using computing hardware 203.
- a virtual computing system 206 may include a virtual machine 211, a container 212, a hybrid environment 213 that includes a virtual machine and a container, and/or the like.
- a virtual computing system 206 may execute one or more applications using a file system that includes binary files, software libraries, and/or other resources required to execute applications on a guest operating system (e.g., within the virtual computing system 206) or the host operating system 205.
- although the redundancy elimination system 201 may include one or more elements 203-213 of the cloud computing system 202, may execute within the cloud computing system 202, and/or may be hosted within the cloud computing system 202, in some implementations, the redundancy elimination system 201 may not be cloud-based (e.g., may be implemented outside of a cloud computing system) or may be partially cloud-based.
- the redundancy elimination system 201 may include one or more devices that are not part of the cloud computing system 202, such as a device 300 of FIG. 3, which may include a standalone server or another type of computing device.
- the redundancy elimination system 201 may perform one or more operations and/or processes described in more detail elsewhere herein.
- the network 220 includes one or more wired and/or wireless networks.
- the network 220 may include a cellular network, a public land mobile network (PLMN), a local area network (LAN), a wide area network (WAN), a private network, the Internet, and/or the like, and/or a combination of these or other types of networks.
- the network 220 enables communication among the devices of the environment 200 .
- the user device 230 includes one or more devices capable of receiving, generating, storing, processing, and/or providing information, as described elsewhere herein.
- the user device 230 may include a communication device and/or a computing device.
- the user device 230 may include a wireless communication device, a mobile phone, a user equipment, a laptop computer, a tablet computer, a desktop computer, a gaming console, a set-top box, a wearable communication device (e.g., a smart wristwatch, a pair of smart eyeglasses, a head mounted display, or a virtual reality headset), or a similar type of device.
- the data structure 240 includes one or more devices capable of receiving, generating, storing, processing, and/or providing information, as described elsewhere herein.
- the data structure 240 may include a communication device and/or a computing device.
- the data structure 240 may include a database, a server, a database server, an application server, a client server, a web server, a host server, a proxy server, a virtual server (e.g., executing on computing hardware), a server in a cloud computing system, a device that includes computing hardware used in a cloud computing environment, or a similar type of device.
- the data structure 240 may communicate with one or more other devices of the environment 200, as described elsewhere herein.
- the number and arrangement of devices and networks shown in FIG. 2 are provided as an example. In practice, there may be additional devices and/or networks, fewer devices and/or networks, different devices and/or networks, or differently arranged devices and/or networks than those shown in FIG. 2. Furthermore, two or more devices shown in FIG. 2 may be implemented within a single device, or a single device shown in FIG. 2 may be implemented as multiple, distributed devices. Additionally, or alternatively, a set of devices (e.g., one or more devices) of the environment 200 may perform one or more functions described as being performed by another set of devices of the environment 200.
- FIG. 3 is a diagram of example components of a device 300, which may correspond to the redundancy elimination system 201, the user device 230, and/or the data structure 240.
- the redundancy elimination system 201, the user device 230, and/or the data structure 240 may include one or more devices 300 and/or one or more components of the device 300.
- the device 300 may include a bus 310, a processor 320, a memory 330, an input component 340, an output component 350, and a communication component 360.
- the bus 310 includes a component that enables wired and/or wireless communication among the components of device 300 .
- the processor 320 includes a central processing unit, a graphics processing unit, a microprocessor, a controller, a microcontroller, a digital signal processor, a field-programmable gate array, an application-specific integrated circuit, and/or another type of processing component.
- the processor 320 is implemented in hardware, firmware, or a combination of hardware and software. In some implementations, the processor 320 includes one or more processors capable of being programmed to perform a function.
- the memory 330 includes a random-access memory, a read only memory, and/or another type of memory (e.g., a flash memory, a magnetic memory, and/or an optical memory).
- the input component 340 enables the device 300 to receive input, such as user input and/or sensed inputs.
- the input component 340 may include a touch screen, a keyboard, a keypad, a mouse, a button, a microphone, a switch, a sensor, a global positioning system component, an accelerometer, a gyroscope, an actuator, and/or the like.
- the output component 350 enables the device 300 to provide output, such as via a display, a speaker, and/or one or more light-emitting diodes.
- the communication component 360 enables the device 300 to communicate with other devices, such as via a wired connection and/or a wireless connection.
- the communication component 360 may include a receiver, a transmitter, a transceiver, a modem, a network interface card, an antenna, and/or the like.
- the device 300 may perform one or more processes described herein.
- a non-transitory computer-readable medium (e.g., the memory 330 ) may store a set of instructions for execution by the processor 320 .
- the processor 320 may execute the set of instructions to perform one or more processes described herein.
- execution of the set of instructions, by one or more processors 320 , causes the one or more processors 320 and/or the device 300 to perform one or more processes described herein.
- hardwired circuitry may be used instead of or in combination with the instructions to perform one or more processes described herein. Thus, implementations described herein are not limited to any specific combination of hardware circuitry and software.
- the number and arrangement of components shown in FIG. 3 are provided as an example.
- the device 300 may include additional components, fewer components, different components, or differently arranged components than those shown in FIG. 3 .
- a set of components (e.g., one or more components) of the device 300 may perform one or more functions described as being performed by another set of components of the device 300 .
- FIG. 4 is a flowchart of an example process 400 for providing energy efficient dynamic redundancy elimination for stored data.
- one or more process blocks of FIG. 4 may be performed by a device (e.g., the redundancy elimination system 201 ).
- one or more process blocks of FIG. 4 may be performed by another device or a group of devices separate from or including the device, such as a user device (e.g., the user device 230 ).
- one or more process blocks of FIG. 4 may be performed by one or more components of the device 300 , such as the processor 320 , the memory 330 , the input component 340 , the output component 350 , and/or the communication component 360 .
- process 400 may include receiving data objects from an object corpus stored in a data structure (block 405 ).
- the device may receive data objects from an object corpus stored in a data structure, as described above.
- process 400 may include identifying unique segments within the data objects as elements (block 410 ).
- the device may identify unique segments within the data objects as elements, as described above.
- the elements include one or more of words or phrases for data objects that are text documents, identifying objects for data objects that are images, fields for data objects that are records, sounding patterns for data objects that are audio files, or elements in a schema for data objects that are complex data types.
- process 400 may include replacing all equivalent segments with one representative segment (block 415 ).
- the device may replace all equivalent segments with one representative segment, as described above.
- process 400 may include generating an embedding space based on unique elements and mappings of the data objects to embeddings (block 420 ).
- the device may generate an embedding space based on unique elements and mappings of the data objects to embeddings, as described above.
- generating the embedding space based on the unique elements and the mappings of the data objects to the embeddings includes estimating an information theoretic significance for each element within each data object with respect to the object corpus, building embeddings of the unique elements within the object corpus, transforming each embedding based on an information theoretic significance associated with each embedding, and mapping the data objects to the embeddings to generate the embedding space.
- process 400 may include estimating semantic proximities among the data objects based on the mappings of the data objects to the embeddings (block 425 ).
- the device may estimate semantic proximities among the data objects based on the mappings of the data objects to the embeddings, as described above.
- estimating the semantic proximities among the data objects based on the mappings of the data objects to the embeddings includes one or more of estimating semantic proximities among numeric data objects based on absolute differences between values of the numeric data objects, estimating semantic proximities among categorical data objects based on differences between categories of the categorical data objects, or estimating semantic proximities among composite data objects based on absolute differences between values of the composite data objects and based on differences between categories of the composite data objects.
- process 400 may include building a semantic cohesion network among the data objects based on the semantic proximities among the data objects (block 430 ).
- the device may build a semantic cohesion network among the data objects based on the semantic proximities among the data objects, as described above.
- the semantic cohesion network includes a set of nodes corresponding to the data objects, links between the set of nodes that are based on the semantic proximities among the data objects, and weights associated with the links.
- process 400 may include identifying semantically cohesive data clusters in the semantic cohesion network (block 435 ).
- the device may identify semantically cohesive data clusters in the semantic cohesion network, as described above.
- identifying semantically cohesive data clusters in the semantic cohesion network includes identifying maximal groups of data objects, wherein data objects within each of the maximal groups are semantically cohesively connected together, and wherein the maximal groups of data objects correspond to the semantically cohesive data clusters in the semantic cohesion network.
- process 400 may include sorting the data objects in the semantically cohesive data clusters to generate semantically cohesive and sorted data clusters (block 440 ).
- the device may sort the data objects in the semantically cohesive data clusters to generate semantically cohesive and sorted data clusters, as described above.
- sorting the data objects in the semantically cohesive data clusters to generate the semantically cohesive and sorted data clusters includes sorting the data objects in the semantically cohesive data clusters, based on a quantity of distinct constituent elements of the data objects in the semantically cohesive data clusters, to generate the semantically cohesive and sorted data clusters.
- process 400 may include receiving a new data object (block 445 ).
- the device may receive a new data object, as described above.
- process 400 may include determining, from the semantically cohesive and sorted data clusters, a home data cluster for the new data object (block 450 ).
- the device may determine, from the semantically cohesive and sorted data clusters, a home data cluster for the new data object, as described above.
- determining, from the semantically cohesive and sorted data clusters, the home data cluster for the new data object includes determining the home data cluster for the new data object based on a maximum proximity of the home data cluster with the new data object.
- determining, from the semantically cohesive and sorted data clusters, the home data cluster for the new data object includes determining cluster centroids of the semantically cohesive and sorted data clusters, and determining the home data cluster for the new data object based on the cluster centroids.
- process 400 may include determining whether the new data object is semantically similar, within a threshold, to a data object in the home data cluster (block 455 ).
- the device may determine whether the new data object is semantically similar, within a threshold, to a data object in the home data cluster, as described above.
- determining whether the new data object is semantically similar, within the threshold, to the data object in the home data cluster includes identifying, as the data object in the home data cluster, a data object that is most semantically proximate to the new data object, and determining whether the data object that is most semantically proximate to the new data object satisfies the threshold.
- process 400 may include storing bookkeeping details of the new data object in the data structure based on the new data object being semantically similar to the data object in the home data cluster (block 460 ).
- the device may store bookkeeping details of the new data object in the data structure based on the new data object being semantically similar to the data object in the home data cluster, as described above.
- process 400 includes executing a chunking process to identify semantically unique elements in the new data object based on the new data object not being semantically similar to the data object in the home data cluster, and storing the semantically unique elements of the new data object in the data structure.
- process 400 includes adding the new data object to the home data cluster, and updating a centroid of the home data cluster.
- process 400 includes preventing the new data object from being stored in the data structure based on the new data object being semantically similar to the data object in the home data cluster.
- process 400 may include additional blocks, fewer blocks, different blocks, or differently arranged blocks than those depicted in FIG. 4 . Additionally, or alternatively, two or more of the blocks of process 400 may be performed in parallel.
- the term “component” is intended to be broadly construed as hardware, firmware, or a combination of hardware and software. It will be apparent that systems and/or methods described herein may be implemented in different forms of hardware, firmware, and/or a combination of hardware and software. The actual specialized control hardware or software code used to implement these systems and/or methods is not limiting of the implementations. Thus, the operation and behavior of the systems and/or methods are described herein without reference to specific software code—it being understood that software and hardware can be used to implement the systems and/or methods based on the description herein.
- satisfying a threshold may, depending on the context, refer to a value being greater than the threshold, greater than or equal to the threshold, less than the threshold, less than or equal to the threshold, equal to the threshold, and/or the like.
- the terms “has,” “have,” “having,” or the like are intended to be open-ended terms. Further, the phrase “based on” is intended to mean “based, at least in part, on” unless explicitly stated otherwise. Also, as used herein, the term “or” is intended to be inclusive when used in a series and may be used interchangeably with “and/or,” unless explicitly stated otherwise (e.g., if used in combination with “either” or “only one of”).
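Blocks 430 through 440 of process 400 (building the semantic cohesion network, identifying semantically cohesive clusters, and sorting each cluster) can be illustrated with a minimal sketch. Several choices here are assumptions made only for illustration and are not stated in the description: Jaccard distance stands in for the semantic proximity measure, two objects are linked when their distance is at most a hypothetical `max_dist`, the "maximal groups" are interpreted as connected components of the network, and clusters are sorted in ascending order of distinct element count.

```python
from itertools import combinations

def jaccard_distance(a, b):
    """Stand-in proximity measure (assumption): 1 - |A ∩ B| / |A ∪ B|."""
    sa, sb = set(a), set(b)
    return 1.0 - len(sa & sb) / len(sa | sb)

def build_network(objects, max_dist):
    """Nodes are object ids; a weighted link joins each sufficiently close pair."""
    links = {}
    for a, b in combinations(sorted(objects), 2):
        d = jaccard_distance(objects[a], objects[b])
        if d <= max_dist:
            links[(a, b)] = d  # link weight is the distance
    return links

def cohesive_clusters(objects, links):
    """Return connected components of the network as lists of object ids."""
    neighbours = {o: set() for o in objects}
    for a, b in links:
        neighbours[a].add(b)
        neighbours[b].add(a)
    seen, clusters = set(), []
    for start in sorted(objects):
        if start in seen:
            continue
        seen.add(start)
        stack, group = [start], []
        while stack:  # depth-first traversal of one component
            node = stack.pop()
            group.append(node)
            for nxt in neighbours[node] - seen:
                seen.add(nxt)
                stack.append(nxt)
        clusters.append(group)
    return clusters

def sort_cluster(group, objects):
    """Sort a cluster by the quantity of distinct constituent elements."""
    return sorted(group, key=lambda o: (len(set(objects[o])), o))

objects = {
    "d1": ["a", "b", "c"],
    "d2": ["a", "b"],
    "d3": ["x", "y"],
    "d4": ["x", "y", "y", "z"],
}
links = build_network(objects, max_dist=0.5)
clusters = [sort_cluster(g, objects) for g in cohesive_clusters(objects, links)]
```

With this toy corpus, `d1` and `d2` fall into one cluster and `d3` and `d4` into another, because only those pairs share enough elements to be linked.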
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Quality & Reliability (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Description
$e(w) \leftarrow \mathrm{bm25}(w) \cdot e(w)$
where $|A|$ corresponds to a total quantity of elements in the data object $A$ and $n_w$ corresponds to a quantity of times that the element $w$ appears in the object $A$.
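The weighting step $e(w) \leftarrow \mathrm{bm25}(w) \cdot e(w)$ can be sketched as follows. The exact BM25 expression is not reproduced in this text, so the sketch assumes the widely used Okapi BM25 form with a smoothed, always-positive inverse document frequency; the parameters `k1` and `b`, the function names, and the toy embeddings are all hypothetical.

```python
import math
from collections import Counter

def bm25_weights(obj, corpus, k1=1.2, b=0.75):
    """Return {element: BM25 weight} for one data object (a list of elements).

    Assumed form: idf(w) * n_w * (k1 + 1) / (n_w + k1 * (1 - b + b * |A| / avg_len)).
    """
    n_docs = len(corpus)
    avg_len = sum(len(o) for o in corpus) / n_docs
    counts = Counter(obj)  # n_w for each element w in the object
    weights = {}
    for w, n_w in counts.items():
        df = sum(1 for o in corpus if w in o)  # objects that contain w
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        tf = n_w * (k1 + 1) / (n_w + k1 * (1 - b + b * len(obj) / avg_len))
        weights[w] = idf * tf
    return weights

def weight_embeddings(obj, corpus, embeddings):
    """Apply e(w) <- bm25(w) * e(w) to every element of one object."""
    weights = bm25_weights(obj, corpus)
    return {w: [weights[w] * x for x in embeddings[w]] for w in weights}

corpus = [["a", "b", "a"], ["b", "c"]]
embeddings = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
weighted = weight_embeddings(corpus[0], corpus, embeddings)
```

In this toy corpus the element `a` is both rarer across objects and more frequent within the first object than `b`, so its embedding is scaled by a larger factor.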
$\mathrm{semCh}(A_1, A_2) = |A_1 - A_2|$.
If data objects $A_1$ and $A_2$ are categorical data types, the redundancy elimination system may estimate the semantic proximity between the data objects $A_1$ and $A_2$ based on whether the data objects are the same or are different:
If data objects $A_1$ and $A_2$ are composite data types (e.g., records with identical schema of $n \ge 1$ fields, where each field is a basic data type), the redundancy elimination system may estimate the semantic proximity between the data objects $A_1$ and $A_2$ as follows:
$\mathrm{semCh}(A_1, A_2) = \sqrt{\sum_{i \in 1 \ldots n} (A_1[i] - A_2[i])^2}$.
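The three cases of `semCh` can be combined into one small function, sketched here under two assumptions that the text does not confirm: the categorical case returns 0 for equal categories and 1 otherwise (the corresponding formula is not reproduced above), and in the composite case the per-field difference $A_1[i] - A_2[i]$ is taken to be the `semCh` value of that field, so that categorical fields can contribute.

```python
import math

def sem_ch(a1, a2):
    """Semantic change between two data objects of the same type."""
    # Composite objects: records with an identical schema of n >= 1 fields.
    if isinstance(a1, (tuple, list)):
        if len(a1) != len(a2):
            raise ValueError("composite objects must share a schema")
        return math.sqrt(sum(sem_ch(x, y) ** 2 for x, y in zip(a1, a2)))
    # Numeric objects: absolute difference between the values.
    if isinstance(a1, (int, float)) and isinstance(a2, (int, float)):
        return abs(a1 - a2)
    # Categorical objects (assumed 0/1 rule): same category or not.
    return 0.0 if a1 == a2 else 1.0

numeric = sem_ch(3, 7.5)
same_category = sem_ch("red", "red")
other_category = sem_ch("red", "blue")
composite = sem_ch((1, "a"), (4, "b"))
```

For the composite pair, the numeric field contributes a difference of 3 and the categorical field contributes 1, giving $\sqrt{3^2 + 1^2} = \sqrt{10}$.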
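$\mathrm{dis}(o_{new}, C_i) = \mathrm{semCh}(o_{new}, \mathrm{centroid}(C_i))$
if $\mathrm{dis}(o_{new}, C_i) < \theta$,
$\theta \leftarrow \mathrm{dis}(o_{new}, C_i)$, and
$\mathrm{home}(o_{new}) \leftarrow C_i$.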
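The update rule above amounts to a running minimum over cluster centroids. A minimal sketch follows, assuming that objects are already mapped to numeric vectors, that Euclidean distance plays the role of `semCh`, that the centroid is the coordinate-wise mean, and that $\theta$ starts at infinity; none of these choices is fixed by the text, and the names are hypothetical.

```python
import math

def centroid(members):
    """Coordinate-wise mean of the vectors in one cluster (assumed definition)."""
    return [sum(col) / len(members) for col in zip(*members)]

def find_home(o_new, clusters, theta=math.inf):
    """Return (home cluster id, distance to its centroid) for a new object."""
    home = None
    for cluster_id, members in clusters.items():
        dis = math.dist(o_new, centroid(members))
        if dis < theta:  # closer centroid found
            theta = dis
            home = cluster_id
    return home, theta

clusters = {
    "c1": [[0.0, 0.0], [2.0, 0.0]],
    "c2": [[10.0, 10.0], [12.0, 10.0]],
}
home, theta = find_home([1.5, 0.5], clusters)
```

Comparing against one centroid per cluster, rather than against every stored object, is what keeps the lookup cost proportional to the number of clusters.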
o id:semCh(o id ,o new)=mino∈home(o
The redundancy elimination system may determine whether the data object ($o_{id}$) in the home data cluster is semantically identical to the new data object ($o_{new}$) (e.g., whether a proximity of the data object $o_{id}$ and the new data object $o_{new}$ is greater than a predetermined threshold ($\delta_{high} \in 0 \ldots 1$) set by the redundancy elimination system).
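The duplicate check can be sketched as a nearest-member lookup followed by a threshold test. The sketch assumes vector-valued objects and maps Euclidean distance into a proximity score in (0, 1] with `1 / (1 + distance)`; that mapping, the default `delta_high`, and the shape of the bookkeeping record are illustrative assumptions rather than details given in the text.

```python
import math

def most_proximate(o_new, home_cluster):
    """Return (object id, distance) of the cluster member closest to o_new."""
    return min(
        ((oid, math.dist(vec, o_new)) for oid, vec in home_cluster.items()),
        key=lambda pair: pair[1],
    )

def ingest(new_id, o_new, home_cluster, bookkeeping, delta_high=0.9):
    """Store only bookkeeping details for a near-duplicate, else store the object."""
    o_id, distance = most_proximate(o_new, home_cluster)
    proximity = 1.0 / (1.0 + distance)  # assumed distance-to-proximity mapping
    if proximity > delta_high:
        bookkeeping.append((new_id, o_id))  # record which object it duplicates
        return "bookkeeping"
    home_cluster[new_id] = list(o_new)  # not a duplicate: add to the home cluster
    return "stored"

home_cluster = {"a": [0.0, 0.0], "b": [5.0, 5.0]}
bookkeeping = []
first = ingest("n1", [0.0, 0.05], home_cluster, bookkeeping)
second = ingest("n2", [2.0, 2.0], home_cluster, bookkeeping)
```

Here `n1` lies almost on top of `a`, so only a bookkeeping entry is kept, while `n2` is far from every member and is added to the cluster. A fuller version would also update the cluster centroid after the addition.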
EMr×c←EMr
$r^+ = r + |M_{new}|$ // add new rows for all unique elements in $o_{new}$
$c^+ = c + 1$ // add new column $o_{new}$
- For each $e \in ch(o_{new}) \cup ch(o_{id})$:
  $EM_{r^+ \times c^+}[k, c^+] = \mathrm{Inf}(e)$, where
  - $k \le r$ is the index for $e$ in the updated matrix
  - $\mathrm{Inf}(e)$ is unique information contained in $e$
- For each $e \in M_{new}$:
  $EM_{r^+ \times c^+}[k, c^+] = \mathrm{Inf}(e)$, where
  - $k \ge r$ is the index for $e$ in the updated matrix.
where $|DB(o_i)|$ is a size of the data structure when the new object ($o_i$) arrives, $|home(o_i)|$ is a size of a home data cluster for the new object, $\delta$ is a computation required for initial clustering, centroid estimation, and sorting, $n_i$ is a fraction of the data structure evaluated before detecting a duplicate, and $w_i$ is a fraction of the home data cluster evaluated before detecting a duplicate. If $DB(o_1), \ldots, DB(o_i)$, at the time when new objects are received, are ≥1 times larger than home data clusters of $o_1, \ldots, o_i$, the computational gain may be approximated as: $gain_{comp}$≥ or (−1)*100%. In some implementations, a corresponding energy gain for the process of redundancy elimination in the data structure is $gain_{energy} = cf_c \cdot gain_{comp}$, where $cf_c$ is a conversion factor for execution of unit computation (e.g., a quantity of carbon dioxide emitted on executing one unit of computation, such as a CPU cycle).
Claims (20)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US17/812,028 US11989210B2 (en) | 2022-07-12 | 2022-07-12 | Providing energy efficient dynamic redundancy elimination for stored data |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US17/812,028 US11989210B2 (en) | 2022-07-12 | 2022-07-12 | Providing energy efficient dynamic redundancy elimination for stored data |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| US20240020317A1 US20240020317A1 (en) | 2024-01-18 |
| US11989210B2 true US11989210B2 (en) | 2024-05-21 |
Family
ID=89509964
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| US17/812,028 Active 2042-07-15 US11989210B2 (en) | 2022-07-12 | 2022-07-12 | Providing energy efficient dynamic redundancy elimination for stored data |
Country Status (1)
| Country | Link |
|---|---|
| US (1) | US11989210B2 (en) |
Families Citing this family (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN119398906B (en) * | 2024-10-22 | 2025-06-03 | 中国建材检验认证集团湖南有限公司 | Dynamic adjustment method for carbon credit evaluation and distribution |
Citations (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US8074043B1 (en) * | 2009-01-30 | 2011-12-06 | Symantec Corporation | Method and apparatus to recover from interrupted data streams in a deduplication system |
Non-Patent Citations (2)
| Title |
|---|
| Weis et al., "Industry-Scale Duplicate Detection," Proceedings of the VLDB Endowment 1.2, Aug. 24-30, 2008, 12 Pages. |
| Xia et al., "A Comprehensive Study of the Past, Present, and Future of Data Deduplication," Proceedings of the IEEE, 2016, 31 Pages. |
Also Published As
| Publication number | Publication date |
|---|---|
| US20240020317A1 (en) | 2024-01-18 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US11514054B1 (en) | Supervised graph partitioning for record matching | |
| US11361243B2 (en) | Recommending machine learning techniques, features, and feature relevance scores | |
| JP7136752B2 (en) | Methods, devices, and non-transitory computer-readable media for generating data related to scarcity data based on received data input | |
| CN110297868B (en) | Building an enterprise-specific knowledge graph | |
| US10380162B2 (en) | Item to vector based categorization | |
| US12229193B2 (en) | Search systems and methods utilizing search based user clustering | |
| US11237749B2 (en) | System and method for backup data discrimination | |
| US12112133B2 (en) | Multi-model approach to natural language processing and recommendation generation | |
| US11080265B2 (en) | Dynamic hash function composition for change detection in distributed storage systems | |
| US20180114136A1 (en) | Trend identification using multiple data sources and machine learning techniques | |
| CN110362968B (en) | Information detection method, device and server | |
| US11061936B2 (en) | Property grouping for change detection in distributed storage systems | |
| US11175965B2 (en) | Systems and methods for dynamically evaluating container compliance with a set of rules | |
| US20190057297A1 (en) | Leveraging knowledge base of groups in mining organizational data | |
| AU2019203747B2 (en) | Scoring mechanism for discovery of extremist content | |
| US11544285B1 (en) | Automated transformation of hierarchical data from a source data format to a target data format | |
| US20240248882A1 (en) | Record management for database systems using fuzzy field matching | |
| US11989210B2 (en) | Providing energy efficient dynamic redundancy elimination for stored data | |
| US11055274B2 (en) | Granular change detection in distributed storage systems | |
| US11275893B1 (en) | Reference document generation using a federated learning system | |
| US20210357453A1 (en) | Query usage based organization for very large databases | |
| US12164867B1 (en) | Comparing code repositories | |
| US20240394252A1 (en) | Data enrichment using parallel search | |
| US12197421B2 (en) | Cross-provider topic conflation | |
| CN107622129A (en) | Method and device for organizing knowledge base and computer storage medium |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | AS | Assignment | Owner name: ACCENTURE GLOBAL SOLUTIONS LIMITED, IRELAND. Free format text: ASSIGNMENT OF ASSIGNORS INTEREST; ASSIGNORS: MISRA, JANARDAN; BALANI, NAVEEN GORDHAN; REEL/FRAME: 060485/0920. Effective date: 20220707 |
| | FEPP | Fee payment procedure | Free format text: ENTITY STATUS SET TO UNDISCOUNTED (ORIGINAL EVENT CODE: BIG.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY |
| | STPP | Information on status: patent application and granting procedure in general | Free format text: NOTICE OF ALLOWANCE MAILED -- APPLICATION RECEIVED IN OFFICE OF PUBLICATIONS |
| | STPP | Information on status: patent application and granting procedure in general | Free format text: PUBLICATIONS -- ISSUE FEE PAYMENT VERIFIED |
| | STCF | Information on status: patent grant | Free format text: PATENTED CASE |