AU2021102043A4 - System and method for data classification for efficient memory utilization and avoiding duplicate data - Google Patents

Publication number
AU2021102043A4
Authority
AU
Australia
Prior art keywords
data
folder
files
duplicate
list
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
AU2021102043A
Inventor
Ved Prakash Bhardwaj
Piyush Chauhan
Saurabh Jain
Nitin
Deepak Kumar Sharma
Dhirendra Kumar Sharma
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Bhardwaj Ved Prakash Dr
Chauhan Piyush Dr
Sharma Dhirendra Kumar Dr
Nitin Dr
Original Assignee
Bhardwaj Ved Prakash Dr
Chauhan Piyush Dr
Sharma Dhirendra Kumar Dr
Nitin Dr
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Bhardwaj Ved Prakash Dr, Chauhan Piyush Dr, Sharma Dhirendra Kumar Dr, Nitin Dr
Priority to AU2021102043A
Application granted
Publication of AU2021102043A4
Legal status: Ceased

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/17 Details of further file system functions
    • G06F16/174 Redundancy elimination performed by the file system
    • G06F16/1748 De-duplication implemented within the file system, e.g. based on file segments
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60 Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F16/65 Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to a system and method for data classification for efficient memory utilization and avoiding duplicate data. The objective of the present invention is to solve the problems in the prior art related to inadequacies in technologies for data classification and for removing duplicate data in data processing.

Description

SYSTEM AND METHOD FOR DATA CLASSIFICATION FOR EFFICIENT MEMORY UTILIZATION AND AVOIDING DUPLICATE DATA
FIELD OF INVENTION
[001]. The present invention relates to the technical field of information
technology and data management, in particular to a method for determining
the number of copies and duplicates of data in a storage system.
[002]. The present invention also relates to the technical field of computer
program processing, in particular to duplicate data removal in a storage
medium.
[003]. More particularly, the present invention relates to a system and
method for data classification for efficient memory utilization and avoiding
duplicate data.
BACKGROUND & PRIOR ART
[004]. The subject matter discussed in the background section should not
be assumed to be prior art merely as a result of its mention in the background
section. Similarly, a problem mentioned in the background section or
associated with the subject matter of the background section should not be
assumed to have been previously recognized in the prior art. The subject
matter in the background section merely represents different approaches,
which in-and-of-themselves may also be inventions.
[005]. Databases play an important role in today's IT-based market. Many
organizations and systems depend on the accuracy and quality of databases
to perform their operations. The quality of the data stored in a database
can therefore have considerable cost implications for a system that relies
on information to function and conduct business.
[006]. Database administration is a job whose primary function is to
provide overall support for a computer database. These operations are carried
out by a person called an administrator or database administrator (DBA).
Databases require consistent administration and maintenance, and a DBA is
specially trained to perform all of the functions essential to doing so. As
the amount of information increases, the database administrator faces
several problems, such as how to maintain data availability, security,
quality assurance, privacy, and searchability.
[007]. The data available in data repositories such as digital libraries
and e-commerce brokers is obtained by gathering data from different data
sources, and this data may have different structures. The existence of dirty
data (i.e., replicas, data without any standard representation, etc.) in
data repositories can cause several problems, such as performance
degradation, quality loss, and increased operational cost and time. Dirty
data is not always bad; its impact depends on how that data is managed. To
avoid the problems specified above, it is necessary to study the cause of
"dirty" data in repositories. During the aggregation or integration of
different data sources, duplicates, quasi-replicas, or near-duplicates arise
in these repositories, and this is the major root of dirty data.
[008]. It is therefore necessary to detect and remove the duplicate entries
in data repositories. This problem is known as record deduplication. Record
deduplication is the task of identifying, in a data repository, whether two
records refer to the same real-world item or object in spite of misspelled
terms, typos, different writing styles, or even different schema
representations or data types.
[009]. In this paper, we present a genetic programming (GP) approach to
record deduplication that combines the best pieces of evidence extracted
from the dataset to produce a deduplication function that can identify
whether two or more entries in a repository refer to the same real-world
object. Record deduplication is a time-consuming task even for small
repositories, and it is difficult for the user to select the best evidence
from the dataset. Our aim is therefore to find a method that finds an apt
combination of the best pieces of evidence present in the dataset, thus
obtaining a deduplication function that maximizes performance using a small
representative portion of the corresponding data for training purposes. The
resulting function can also be used on the remaining data or even applied to
other repositories with similar characteristics. With that deduplication
function, we calculate the gene value for each record in order to perform
the record deduplication.
[0010]. Some of the related work is listed herewith:
CN109614506A - Method and device for importing duplicate removal data
into gallery based on rocksdb and storage medium presents "a method and
device for importing data into a gallery based on rocksdb and a storage
medium, and the method comprises the steps: classifying the data to be
imported into the gallery to obtain a plurality of data categories, setting a
category identifier for each data category, and setting an edge relation
between the data in each data category; based on a rocksdb vertex duplicate
removal database, performing duplicate removal on data in each data
category, and inserting the data into the image library as a vertex; and based
on the rocksdb edge duplicate removal database, performing duplicate
removal on the edge relationship among the data, and inserting the edge
relationship into the image library as an edge. According to the method,
firstly, a rocksdb vertex duplicate removal database and a rocksdb edge
duplicate removal database which correspond to data types are constructed;
the method comprises the following steps of: importing data into a gallery;
according to the technical scheme
JP2004234582A - DICTIONARY CONSTRUCTION METHOD,
SYSTEM, AND SCREEN presents "a dictionary with further high utility
value by easily constructing a field term dictionary which has required an
enormous man-hour for work in the past by comparison of a term extracted
from history data of a retrieval function with an original dictionary term,
and enabling adaptation of a term actually used at present as a registration
candidate of the dictionary by using retrieval history data. SOLUTION: This
system comprises a means for extracting and storing a retrieval keyword or
other retrieval attribute information from the use history data of the retrieval
function; a means for comparing an existing dictionary or term classification
data of dictionary with the extracted retrieval keyword and extracting and
storing only terms not overlapped; a means for displaying the existing
dictionary or the term classification data; a means for narrowing and
displaying a dictionary registration candidate from the term data from which
duplication is removed; an editing means for associating the narrowed
registration candidate term with a term in the existing dictionary or the term
classification data; and a means for storing the editing results."
[0011]. CN108241639B - Data de-duplication method presents "a data
duplicate removal method. The method comprises the steps of classifying
data blocks based on the last bytes of the data blocks, and setting a database
server for performing processing and storage corresponding to each type of
data block; setting a minimal data block length by an interface server, and
for data files needed to be subjected to duplicate removal, if the file length
is smaller than the minimum length, directly sending the data files to the
database servers corresponding to the data blocks; otherwise, performing
block segmentation on the data files by using different trail bytes; in six
block segmentation modes with maximum block numbers, selecting two
block segmentation modes with maximum repeated data quantities by the
interface server, and instructing the corresponding database servers to
perform storage; for the repeated data blocks, only storing a pointer by the
database servers, wherein the pointer points to the stored same data blocks;
and for the non-repeated data blocks, storing the whole data blocks and the
hash values of the data blocks."
[0012]. US20080281847A1 - METHOD OF PROCESSING PROTEIN
PEPTIDE DATA AND SYSTEM presents "a method of processing protein
peptide data obtained from healthy or pathological samples for analysis,
comprising the steps of: providing a list of peptide sequences and associated
auxiliary information representing an input data set; compiling from the
input data set a new peptide sequence list by removing peptide sequence
redundancy in the peptide sequence list, said new peptide sequence list
representing a peptide data set; and grouping together members of the
peptide data set originating from the same protein thus generating a protein
data set."
[0013]. CN110222139A - Road entity data deduplication method and
device, computing equipment and medium presents "a road entity data
deduplication method and device, computing equipment and a medium. The
method comprises: acquiring road source data, classifying the road source
data into at least one data subset according to road entity event types, one
data subset corresponding to one road entity event type, and the road source
data being used for describing the road entity events; determining a road
name and a geographic area name in the text content corresponding to each
piece of road source data in each data subset; and according to the road name
and the geographic area name in the text content corresponding to each piece
of road source data, performing text matching in historical road source data
belonging to the same road entity event type as the corresponding data
subset, and determining a duplicate text in each data subset. According to
the embodiment of the invention, the de-duplication effectiveness of the
road entity data in the internet data can be improved, so that the processing
efficiency of mass road entity data is improved."
[0014]. CN103997512B - A faces the cloud storage system method for
determining the number of copies of the data presents "a data duplicate
quantity determination method for a cloud storage system. The method is
based on data popularity and node popularity, takes satisfying service
demands and controlling a data duplicate quantity as targets, classifies data,
predicts the data duplicate demand quantity of the different data, increases
data duplicates in advance, or timely deletes excessive data duplicates. The
method comprises the following links: analyzing a data popularity
prediction model; predicting a data duplicate change quantity; calculating
the node popularity; increasing/deleting the data duplicates; and migrating
the data duplicates. The method reduces the data duplicate demand quantity,
reduces the hardware cost, mitigates data maintenance burden of the system,
reduces the generation probability of hot spot problems, and effectively
improves the utilization rate of the data duplicates."
[0015]. Groupings of alternative elements or embodiments of the invention
disclosed herein are not to be construed as limitations. Each group member
can be referred to and claimed individually or in any combination with other
members of the group or other elements found herein. One or more members
of a group can be included in, or deleted from, a group for reasons of
convenience and/or patentability. When any such inclusion or deletion
occurs, the specification is herein deemed to contain the group as modified,
thus fulfilling the written description of all Markush groups used in the
appended claims.
[0016]. As used in the description herein and throughout the claims that
follow, the meaning of "a," "an," and "the" includes plural reference unless
the context clearly dictates otherwise. Also, as used in the description
herein, the meaning of "in" includes "in" and "on" unless the context clearly
dictates otherwise.
[0017]. The recitation of ranges of values herein is merely intended to serve
as a shorthand method of referring individually to each separate value
falling within the range. Unless otherwise indicated herein, each individual
value is incorporated into the specification as if it were individually recited
herein. All methods described herein can be performed in any suitable order
unless otherwise indicated herein or otherwise clearly contradicted by
context.
[0018]. The use of any and all examples, or exemplary language (e.g., "such
as") provided with respect to certain embodiments herein is intended merely
to better illuminate the invention and does not pose a limitation on the scope
of the invention otherwise claimed. No language in the specification should
be construed as indicating any non-claimed element essential to the practice
of the invention.
[0019]. The above information disclosed in this Background section is only
for the enhancement of understanding of the background of the invention
and therefore it may contain information that does not form the prior art that
is already known in this country to a person of ordinary skill in the art.
SUMMARY
[0020]. Before the present systems and methods are described, it is to be
understood that this application is not limited to the particular systems and
methodologies described, as there can be multiple possible embodiments
which are not expressly illustrated in the present disclosure. It is also to be
understood that the terminology used in the description is for the purpose of
describing the particular versions or embodiments only and is not intended
to limit the scope of the present application.
[0021]. The present invention mainly addresses the technical problems
existing in the prior art. In response to these problems, the present
invention discloses a system and method for data classification for efficient
memory utilization and avoiding duplicate data.
[0022]. In one aspect, the present invention provides a computer-implemented
method for data classification for efficient memory utilization and avoiding
duplicate data, the method comprising the steps of:
performing a scan of the data of a folder in a memory storage, wherein the
scan is performed only on a folder that has not already been scanned;
categorizing the data of the folder according to the extension of the files;
storing the files of the same extension in a group for each type of file;
creating subgroups of the groups of files based on the size of the files; and
removing duplicate data according to the type of data, wherein:
for a text file, keywords are extracted and put in a list A, the files are
compared based on the keywords and duplicate copies are removed from list A,
the selected files of list A are arranged in ascending order and put in a
sub-folder A, and the duplicate copies of the files are stored in a
sub-folder C; or
for an image file, the images are compared and put in a list B and duplicate
copies are removed, the selected files of list B are arranged in ascending
order and put in a sub-folder B, and the duplicate copies are stored in the
sub-folder C.
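By way of non-limiting illustration, the categorizing and sub-grouping steps recited above may be sketched in Python as follows. The function name `group_by_extension_and_size` is hypothetical, and the use of the exact byte size as the sub-group key is an assumption; the specification states only that sub-groups are based on file size.

```python
from collections import defaultdict
from pathlib import Path


def group_by_extension_and_size(folder):
    """Return {extension: {size_in_bytes: [paths]}} for files directly in folder.

    Assumption: a sub-group is keyed by exact byte size; the specification
    does not say whether exact sizes or size ranges are intended.
    """
    groups = defaultdict(lambda: defaultdict(list))
    for path in sorted(Path(folder).iterdir()):
        if path.is_file():
            # Lower-case the suffix so ".TXT" and ".txt" share one group.
            extension = path.suffix.lower()
            groups[extension][path.stat().st_size].append(path)
    return groups
```

Files that land in the same extension group and the same size sub-group are the natural candidates for the later comparison step, since files of different sizes cannot be byte-identical copies.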
OBJECTIVE OF THE INVENTION
[0023]. The principal objective of the present invention is to provide a
system and method for data classification for efficient memory utilization
and avoiding duplicate data.
BRIEF DESCRIPTION OF DRAWINGS
[0024]. To clarify various aspects of some example embodiments of the
present invention, a more particular description of the invention will be
rendered by reference to specific embodiments thereof which are illustrated
in the appended drawings. It is appreciated that these drawings depict only
illustrated embodiments of the invention and are therefore not to be
considered limiting of its scope. The invention will be described and explained with additional specificity and detail through the use of the accompanying drawings.
[0025]. In order that the advantages of the present invention will be easily
understood, a detailed description of the invention is discussed below in
conjunction with the appended drawings, which, however, should not be
considered to limit the scope of the invention to the accompanying
drawings, in which:
[0026]. Figure 1 shows a flow-diagram representation of the method for data
classification for efficient memory utilization and avoiding duplicate data,
according to one embodiment of the present invention.
DETAILED DESCRIPTION
[0027]. The present invention relates to a system and method for data
classification for efficient memory utilization and avoiding duplicate data.
[0028]. Figure 1 shows a flow-diagram representation of the method for data
classification for efficient memory utilization and avoiding duplicate data,
according to one embodiment of the present invention.
[0029]. Although the present disclosure is described in terms of a system
and method for data classification for efficient memory utilization and
avoiding duplicate data, it should be appreciated that this is done merely
to illustrate the invention in an exemplary manner; any other purpose or
function for which the explained structures or configurations could be used
is covered within the scope of the present disclosure.
[0030]. Some embodiments of this disclosure, illustrating all its features,
will now be discussed in detail. The words and other forms thereof, are
intended to be open ended in that an item or items following any one of
these words are not meant to be an exhaustive listing of such item or items,
or meant to be limited to only the listed item or items. It must also be noted
that as used herein and in the appended claims, the singular forms "a," "an,"
and "the" include plural references unless the context clearly dictates
otherwise. Although any systems and methods similar or equivalent to those
described herein can be used in the practice or testing of embodiments of
the present disclosure, the exemplary systems and methods are now
described. The disclosed embodiments are merely exemplary of the
disclosure, which may be embodied in various forms.
[0031]. A system and method for data classification for efficient memory
utilization and avoiding duplicate data are disclosed in the present invention.
[0032]. The present method first scans the data of a folder on any storage
device. If the folder has already been scanned, the algorithm stops the
scanning process; otherwise, the data is categorized based on the extension
of the files. In the next step, data of the same kind is further stored in a
sub-group. If the data is in text format, specific keywords (at least 5) are
extracted from each file, and each set of keywords is stored in a list A.
The text files are then compared based on the keywords listed in A. If two
or more text files have the same keywords, one copy is chosen at random as
the original, and the rest of the copies are stored in sub-folder C. The
remaining text files are arranged in ascending order and stored in
sub-folder A. If the data is in image format, the images are compared using
the Mean Squared Error method or the Structural Similarity Measure method.
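By way of non-limiting illustration, the text-file branch described above may be sketched in Python as follows. The specification does not state how keywords are chosen, so the heuristic here (the five most frequent words of four or more letters) is an assumption, as are the hypothetical names `extract_keywords` and `split_text_duplicates`. For reproducibility the sketch keeps the first file in name order as the original, whereas the description selects one copy at random.

```python
import re
from collections import Counter


def extract_keywords(text, k=5):
    """Return the k most frequent words of 4+ letters (assumed heuristic)."""
    words = re.findall(r"[a-z]{4,}", text.lower())
    counts = Counter(words)
    # Sort by descending frequency, then alphabetically, so ties are stable.
    ranked = sorted(counts, key=lambda word: (-counts[word], word))
    return frozenset(ranked[:k])


def split_text_duplicates(texts):
    """texts maps file name -> content; return (originals, duplicates)."""
    kept = {}        # keyword set -> name kept as the original (list A)
    duplicates = []  # names destined for sub-folder C
    for name in sorted(texts):
        keywords = extract_keywords(texts[name])
        if keywords in kept:
            duplicates.append(name)
        else:
            kept[keywords] = name
    return sorted(kept.values()), duplicates
```

Matching on keyword sets alone treats any two files with the same keywords as duplicates even if their full contents differ, which is why the description retains the copies in sub-folder C rather than deleting them outright.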
[0033]. In the next step, if two or more images are similar, any one copy is
considered the original, and the rest of the copies are stored in sub-folder
C. The remaining image files are arranged in ascending order and stored in
sub-folder B. In this way, redundant data is removed from the folder, and
the remaining data is organized systematically. The data stored in
sub-folder C can be deleted permanently if it is no longer required.
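By way of non-limiting illustration, the Mean Squared Error comparison of two images may be sketched in Python as follows. The sketch assumes grayscale images already decoded into equally sized lists of pixel rows; the names `mean_squared_error` and `is_duplicate_image` and the `threshold` parameter are hypothetical, since the specification does not state how similar two images must be to count as duplicates.

```python
def mean_squared_error(img_a, img_b):
    """MSE between two equally sized grayscale images (lists of pixel rows)."""
    same_shape = len(img_a) == len(img_b) and all(
        len(row_a) == len(row_b) for row_a, row_b in zip(img_a, img_b)
    )
    if not same_shape:
        raise ValueError("images must have the same dimensions")
    total = 0
    count = 0
    for row_a, row_b in zip(img_a, img_b):
        for pixel_a, pixel_b in zip(row_a, row_b):
            total += (pixel_a - pixel_b) ** 2
            count += 1
    if count == 0:
        raise ValueError("images must contain at least one pixel")
    return total / count


def is_duplicate_image(img_a, img_b, threshold=0.0):
    """Assumed rule: duplicates when the MSE does not exceed threshold."""
    return mean_squared_error(img_a, img_b) <= threshold
```

An MSE of zero means the two pixel grids are identical; a small positive threshold would also catch near-identical copies, such as re-encoded images, at the cost of occasional false matches.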
[0034]. In this way, the current method also saves memory space.
Uniqueness of the Present Solution
The present approach is unique for the following reasons:
[0035]. 1. At present, identifying the duplicate copies in a folder is a
difficult and time-consuming task; most of the time, users try to do it
manually and waste a lot of time on this activity.
[0036]. 2. The present approach identifies the duplicate text files and
image files in a folder automatically.
[0037]. 3. Further, it creates three sub-folders: one for storing the
original copies of text files, a second for storing the original copies of
image files, and a third for storing duplicate image and text files.
[0038]. 4. The folder which contains the duplicate copies can be deleted by
the user at their convenience.
[0039]. 5. In this way, the present approach can free up memory space and
reduce the user's effort in finding the required data.
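By way of non-limiting illustration, the three sub-folders described above may be populated as in the following Python sketch. The function name `organize` is hypothetical, and ascending order is assumed to mean ascending file name, since the specification does not name the sort key.

```python
import shutil
from pathlib import Path


def organize(folder, text_originals, image_originals, duplicates):
    """Move files (given by name) into sub-folders A, B and C of folder."""
    root = Path(folder)
    plan = {
        "A": sorted(text_originals),   # original text files
        "B": sorted(image_originals),  # original image files
        "C": sorted(duplicates),       # duplicate text and image files
    }
    for sub_folder, names in plan.items():
        target = root / sub_folder
        target.mkdir(exist_ok=True)
        for name in names:
            shutil.move(str(root / name), str(target / name))
    # Report the resulting contents of each sub-folder in ascending name order.
    return {
        sub_folder: [path.name for path in sorted((root / sub_folder).iterdir())]
        for sub_folder in plan
    }
```

Because the duplicates are moved rather than deleted, the user can inspect sub-folder C before removing it, which matches the description's statement that deletion is left to the user's convenience.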
[0040]. Although implementations of the invention have been described in
a language specific to structural features and/or methods, it is to be
understood that the appended claims are not necessarily limited to the specific features or methods described. Rather, the specific features and methods are disclosed as examples of implementations of the invention.

Claims (2)

CLAIMS
We claim:
1. A computer implemented method for data
classification for efficient memory utilization and
avoiding duplicate data, wherein the computer
implemented method comprising steps of:
Performing a scan of the data of a folder in a memory
storage, wherein the scanning is performed on the
folder that is not scanned;
Categorizing the data of the folder according to the
extension of the files;
Storing the files of the same extension in a group for
each type of file;
Creating subgroup of the groups of file based on the
size of the files; and
Performing step of removing duplication of data
according to the type of data, wherein for a text file,
Perform Extraction of keywords and put in a List A,
Comparing of files based on keywords and Removal
of duplicate copies in the list A,
Arranging the selected Files of the List A in ascending
order and put in a sub-folder A, &
Storing the duplicate copies of the files in a sub-folder
C; OR
Performing step of removing duplication of data
according to the type of data, wherein for an image file,
Performing Comparison of Images and put in a List B
and Removal of duplicate copies,
Arranging the selected Files of List B in ascending
order and put in a sub-folder B, and
Storing the duplicate copies in the sub-folder C.
2. The computer implemented method for data
classification for efficient memory utilization and
avoiding duplicate data as claimed in claim 1, wherein,
when the data is available in image format, the images
are compared based on the Mean Squared Error method or
the Structural Similarity Measure method.
AU2021102043A 2021-04-19 2021-04-19 System and method for data classification for efficient memory utilization and avoiding duplicate data Ceased AU2021102043A4 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU2021102043A AU2021102043A4 (en) 2021-04-19 2021-04-19 System and method for data classification for efficient memory utilization and avoiding duplicate data

Publications (1)

Publication Number Publication Date
AU2021102043A4 true AU2021102043A4 (en) 2021-06-10

Family

ID=76215557

Family Applications (1)

Application Number Title Priority Date Filing Date
AU2021102043A Ceased AU2021102043A4 (en) 2021-04-19 2021-04-19 System and method for data classification for efficient memory utilization and avoiding duplicate data

Country Status (1)

Country Link
AU (1) AU2021102043A4 (en)

Legal Events

Date Code Title Description
FGI Letters patent sealed or granted (innovation patent)
MK22 Patent ceased section 143a(d), or expired - non payment of renewal fee or expiry