AU2021102043A4

AU2021102043A4 - System and method for data classification for efficient memory utilization and avoiding duplicate data

Info

Publication number: AU2021102043A4
Application number: AU2021102043A
Authority: AU
Inventors: Ved Prakash Bhardwaj; Piyush Chauhan; Saurabh Jain; Nitin; Deepak Kumar Sharma; Dhirendra Kumar Sharma
Original assignee: Bhardwaj Ved Prakash Dr; Chauhan Piyush Dr; Sharma Dhirendra Kumar Dr; Nitin Dr
Current assignee: Bhardwaj Ved Prakash Dr; Chauhan Piyush Dr; Sharma Dhirendra Kumar Dr
Priority date: 2021-04-19
Filing date: 2021-04-19
Publication date: 2021-06-10
Anticipated expiration: 2029-04-19

Abstract

The present invention relates to a system and method for data classification for efficient memory utilization and avoiding duplicate data. The objective of the present invention is to solve problems in the prior art related to inadequacies in technologies for data classification and for removing duplicate data in data processing.

Description

DRAWINGS

FIGURE 1 (Applicants: Dr. Ved Prakash Bhardwaj & Other)

SYSTEM AND METHOD FOR DATA CLASSIFICATION FOR EFFICIENT MEMORY UTILIZATION AND AVOIDING DUPLICATE DATA

FIELD OF INVENTION

[001]. The present invention relates to the technical field of information technology and data management, and in particular to a method for determining the number of copies and the duplication of data in a storage system.

[002]. The present invention relates to the technical field of computer program processing, and in particular to duplicate data removal in a storage medium.

[003]. More particularly, the present invention relates to a system and method for data classification for efficient memory utilization and avoiding duplicate data.

BACKGROUND & PRIOR ART

[004]. The subject matter discussed in the background section should not

be assumed to be prior art merely as a result of its mention in the background

section. Similarly, a problem mentioned in the background section or

associated with the subject matter of the background section should not be

assumed to have been previously recognized in the prior art. The subject

matter in the background section merely represents different approaches,

which in-and-of-themselves may also be inventions.

[005]. Databases play an important role in today's IT-based market. Many organizations and systems depend on the accuracy and quality of databases to perform their operations. The quality of the data stored in a database can therefore have considerable cost implications for a system that relies on that information to function and conduct business.

[006]. Database administration is a job whose primary function is to provide overall support for a computer database. These operations are carried out by a person called an administrator or database administrator (DBA). Databases require consistent administration and maintenance, and a DBA is specially trained to perform all of the functions essential to do so. As the amount of information increases, the database administrator faces several problems, such as how to maintain data availability, security, quality assurance, privacy and searchability.

[007]. The data available in data repositories such as digital libraries and e-commerce brokers is obtained by gathering data from different data sources, and this data may have different structures. The existence of dirty data (i.e. replicas, data without any standard representation, etc.) in data repositories can cause several problems, such as performance degradation, quality loss and increased operational cost and time. Dirty data is not always bad; its impact depends on how well the data is managed. To avoid the problems specified above, it is necessary to study the causes of dirty data in repositories. During the aggregation or integration of different data sources, duplicates, quasi-replicas or near-duplicates arise in these repositories, and that is the major root of dirty data.

[008]. It is therefore necessary to detect and remove the duplicate entries in data repositories. This problem is known as record deduplication. Record deduplication is the task of identifying, in a data repository, whether two records refer to the same real-world item or object in spite of misspelled terms, typos, different writing styles or even different schema representations or data types.

[009]. In this paper, we present a genetic programming (GP) approach to record deduplication that combines the best pieces of evidence extracted from the dataset to produce a deduplication function able to identify whether two or more entries in a repository refer to the same real-world object. Record deduplication is a time-consuming task even for small repositories, and it is difficult for the user to select the best evidence from the dataset. Our aim is therefore to find a method that finds an apt combination of the best pieces of evidence present in the dataset, and thus obtains a deduplication function that maximizes performance using a small representative portion of the corresponding data for training purposes. The resulting function can also be used on the remaining data or even applied to other repositories with similar characteristics. With that deduplication function, the gene value is calculated for each record in order to perform the record deduplication.

[0010]. Some of the work listed herewith:

CN109614506A - Method and device for importing duplicate removal data

into gallery based on rocksdb and storage medium presents "a method and

device for importing data into a gallery based on rocksdb and a storage

medium, and the method comprises the steps: classifying the data to be

imported into the gallery to obtain a plurality of data categories, setting a

category identifier for each data category, and setting an edge relation

between the data in each data category; based on a rocksdb vertex duplicate

removal database, performing duplicate removal on data in each data

category, and inserting the data into the image library as a vertex; and based

on the rocksdb edge duplicate removal database, performing duplicate

removal on the edge relationship among the data, and inserting the edge

relationship into the image library as an edge. According to the method,

firstly, a rocksdb vertex duplicate removal database and a rocksdb edge

duplicate removal database which correspond to data types are constructed;

the method comprises the following steps of: importing data into a gallery;

according to the technical scheme

JP2004234582A - DICTIONARY CONSTRUCTION METHOD,

SYSTEM, AND SCREEN presents "a dictionary with further high utility

value by easily constructing a field term dictionary which has required an

enormous man-hour for work in the past by comparison of a term extracted

from history data of a retrieval function with an original dictionary term,

and enabling adaptation of a term actually used at present as a registration

candidate of the dictionary by using retrieval history data. SOLUTION: This

system comprises a means for extracting and storing a retrieval keyword or

other retrieval attribute information from the use history data of the retrieval

function; a means for comparing an existing dictionary or term classification

data of dictionary with the extracted retrieval keyword and extracting and

storing only terms not overlapped; a means for displaying the existing

dictionary or the term classification data; a means for narrowing and

displaying a dictionary registration candidate from the term data from which

duplication is removed; an editing means for associating the narrowed

registration candidate term with a term in the existing dictionary or the term

classification data; and a means for storing the editing results."

[0011]. CN108241639B - Data de-duplication method presents "a data

duplicate removal method. The method comprises the steps of classifying

data blocks based on the last bytes of the data blocks, and setting a database

server for performing processing and storage corresponding to each type of

data block; setting a minimal data block length by an interface server, and

for data files needed to be subjected to duplicate removal, if the file length

is smaller than the minimum length, directly sending the data files to the

database servers corresponding to the data blocks; otherwise, performing

block segmentation on the data files by using different trail bytes; in six

block segmentation modes with maximum block numbers, selecting two

block segmentation modes with maximum repeated data quantities by the

interface server, and instructing the corresponding database servers to

perform storage; for the repeated data blocks, only storing a pointer by the

database servers, wherein the pointer points to the stored same data blocks;

and for the non-repeated data blocks, storing the whole data blocks and the

hash values of the data blocks."

[0012]. US20080281847A1 - METHOD OF PROCESSING PROTEIN

PEPTIDE DATA AND SYSTEM presents "a method of processing protein

peptide data obtained from healthy or pathological samples for analysis,

comprising the steps of: providing a list of peptide sequences and associated

auxiliary information representing an input data set; compiling from the

input data set a new peptide sequence list by removing peptide sequence

redundancy in the peptide sequence list, said new peptide sequence list

representing a peptide data set; and grouping together members of the

peptide data set originating from the same protein thus generating a protein

data set."

[0013]. CN110222139A - Road entity data deduplication method and

device, computing equipment and medium presents "a road entity data

deduplication method and device, computing equipment and a medium. The

method comprises: acquiring road source data, classifying the road source

data into at least one data subset according to road entity event types, one

data subset corresponding to one road entity event type, and the road source

data being used for describing the road entity events; determining a road

name and a geographic area name in the text content corresponding to each

piece of road source data in each data subset; and according to the road name

and the geographic area name in the text content corresponding to each piece

of road source data, performing text matching in historical road source data

belonging to the same road entity event type as the corresponding data

subset, and determining a duplicate text in each data subset. According to

the embodiment of the invention, the de-duplication effectiveness of the

road entity data in the internet data can be improved, so that the processing

efficiency of mass road entity data is improved."

[0014]. CN103997512B - A faces the cloud storage system method for

determining the number of copies of the data presents "a data duplicate

quantity determination method for a cloud storage system. The method is

based on data popularity and node popularity, takes satisfying service

demands and controlling a data duplicate quantity as targets, classifies data,

predicts the data duplicate demand quantity of the different data, increases

data duplicates in advance, or timely deletes excessive data duplicates. The

method comprises the following links: analyzing a data popularity

prediction model; predicting a data duplicate change quantity; calculating

the node popularity; increasing/deleting the data duplicates; and migrating

the data duplicates. The method reduces the data duplicate demand quantity,

reduces the hardware cost, mitigates data maintenance burden of the system,

reduces the generation probability of hot spot problems, and effectively

improves the utilization rate of the data duplicates."

[0015]. Groupings of alternative elements or embodiments of the invention

disclosed herein are not to be construed as limitations. Each group member

can be referred to and claimed individually or in any combination with other

members of the group or other elements found herein. One or more members

of a group can be included in, or deleted from, a group for reasons of

convenience and/or patentability. When any such inclusion or deletion

occurs, the specification is herein deemed to contain the group as modified,

thus fulfilling the written description of all Markush groups used in the

appended claims.

[0016]. As used in the description herein and throughout the claims that

follow, the meaning of "a," "an," and "the" includes plural reference unless

the context clearly dictates otherwise. Also, as used in the description

herein, the meaning of "in" includes "in" and "on" unless the context clearly

dictates otherwise.

[0017]. The recitation of ranges of values herein is merely intended to serve

as a shorthand method of referring individually to each separate value

falling within the range. Unless otherwise indicated herein, each individual

value is incorporated into the specification as if it were individually recited

herein. All methods described herein can be performed in any suitable order

unless otherwise indicated herein or otherwise clearly contradicted by

context.

[0018]. The use of any and all examples, or exemplary language (e.g. "such

as") provided with respect to certain embodiments herein is intended merely

to better illuminate the invention and does not pose a limitation on the scope

of the invention otherwise claimed. No language in the specification should

be construed as indicating any non-claimed element essential to the practice

of the invention.

[0019]. The above information disclosed in this Background section is only

for the enhancement of understanding of the background of the invention

and therefore it may contain information that does not form the prior art that

is already known in this country to a person of ordinary skill in the art.

SUMMARY

[0020]. Before the present systems and methods are described, it is to be

understood that this application is not limited to the particular systems and

methodologies described, as there can be multiple possible embodiments

which are not expressly illustrated in the present disclosure. It is also to be

understood that the terminology used in the description is for the purpose of

describing the particular versions or embodiments only and is not intended

to limit the scope of the present application.

[0021]. The present invention mainly addresses the technical problems existing in the prior art. In response to these problems, the present invention discloses a system and method for data classification for efficient memory utilization and avoiding duplicate data.

[0022]. In one aspect, the present invention provides a computer-implemented method for data classification for efficient memory utilization and avoiding duplicate data, the method comprising the steps of: performing a scan of the data of a folder in a memory storage, wherein the scan is performed only on a folder that has not already been scanned; categorizing the data of the folder according to the extension of the files; storing the files of the same extension in a group for each type of file; creating subgroups of the groups of files based on the size of the files; and removing duplicate data according to the type of data. For a text file, keywords are extracted and put in a List A, files are compared based on their keywords and duplicate copies are removed from List A, the selected files of List A are arranged in ascending order and put in a sub-folder A, and the duplicate copies are stored in a sub-folder C. For an image file, images are compared and put in a List B and duplicate copies are removed, the selected files of List B are arranged in ascending order and put in a sub-folder B, and the duplicate copies are stored in the sub-folder C.
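The scanning, categorizing and sub-grouping steps may be easier to follow as a short sketch. The Python below is only an illustration under stated assumptions, not the claimed implementation: the marker file name `.scanned`, the function name `classify_folder`, and the choice to sub-group by exact size in bytes are all hypothetical, since the specification does not say how the scanned state is recorded or how size subgroups are formed.

```python
import os
import tempfile
from collections import defaultdict

# Hypothetical marker file used to remember that a folder was already scanned;
# the specification does not say how this state is recorded.
SCAN_MARKER = ".scanned"


def classify_folder(folder):
    """Group files by extension, then by size, unless the folder was scanned before.

    Returns None for an already scanned folder, otherwise a nested dict
    {extension: {size_in_bytes: [file names]}}.
    """
    marker = os.path.join(folder, SCAN_MARKER)
    if os.path.exists(marker):
        return None  # already scanned: stop the scanning process

    groups = defaultdict(lambda: defaultdict(list))
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        if not os.path.isfile(path):
            continue
        extension = os.path.splitext(name)[1].lower()
        # Assumption: a subgroup is the set of files with the same byte size.
        groups[extension][os.path.getsize(path)].append(name)

    # Record that this folder has now been scanned.
    with open(marker, "w"):
        pass
    return {ext: dict(by_size) for ext, by_size in groups.items()}


def demo_classify():
    """Build a throwaway folder, classify it twice and return both results."""
    with tempfile.TemporaryDirectory() as folder:
        samples = [("a.txt", "hello"), ("b.txt", "hello"),
                   ("c.txt", "longer text"), ("d.png", "xx")]
        for name, content in samples:
            with open(os.path.join(folder, name), "w") as handle:
                handle.write(content)
        first = classify_folder(folder)
        second = classify_folder(folder)  # second pass sees the marker
    return first, second
```

Files that share both an extension and a size are the natural candidates for the later duplicate comparison, which is one plausible reason for sub-grouping by size before comparing content.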

OBJECTIVE OF THE INVENTION

[0023]. The principal objective of the present invention is to provide a

System and method for data classification for efficient memory utilization

and avoiding duplicate data.

BRIEF DESCRIPTION OF DRAWINGS

[0024]. To clarify various aspects of some example embodiments of the

present invention, a more particular description of the invention will be

rendered by reference to specific embodiments thereof which are illustrated

in the appended drawings. It is appreciated that these drawings depict only

illustrated embodiments of the invention and are therefore not to be

considered limiting of its scope. The invention will be described and explained with additional specificity and detail through the use of the accompanying drawings.

[0025]. In order that the advantages of the present invention will be easily

understood, a detailed description of the invention is discussed below in

conjunction with the appended drawings, which, however, should not be

considered to limit the scope of the invention to the accompanying

drawings, in which:

[0026]. Figure 1 shows a flow diagram of the method for data classification for efficient memory utilization and avoiding duplicate data, according to one embodiment of the present invention.

DETAILED DESCRIPTION

[0027]. The present invention relates to a system and method for data classification for efficient memory utilization and avoiding duplicate data.

[0028]. Figure 1 shows a flow diagram of the method for data classification for efficient memory utilization and avoiding duplicate data, according to one embodiment of the present invention.

[0029]. Although the present disclosure has been described with the purpose

of System and method for data classification for efficient memory

utilization and avoiding duplicate data, it should be appreciated that the

same has been done merely to illustrate the invention in an exemplary

manner and to highlight any other purpose or function for which explained

structures or configurations could be used and is covered within the scope

of the present disclosure.

[0030]. Some embodiments of this disclosure, illustrating all its features,

will now be discussed in detail. The words "comprising," "having," "containing," and "including," and other forms thereof, are

intended to be open ended in that an item or items following any one of

these words are not meant to be an exhaustive listing of such item or items,

or meant to be limited to only the listed item or items. It must also be noted

that as used herein and in the appended claims, the singular forms "a," "an,"

and "the" include plural references unless the context clearly dictates

otherwise. Although any systems and methods similar or equivalent to those

described herein can be used in the practice or testing of embodiments of

the present disclosure, the exemplary systems and methods are now

described. The disclosed embodiments are merely exemplary of the

disclosure, which may be embodied in various forms.

[0031]. The system and method for data classification for efficient memory

utilization and avoiding duplicate data is disclosed in this present invention.

[0032]. The present method first scans the data of a folder on any storage device. If the folder has already been scanned, the algorithm stops the scanning process; otherwise, the data is categorized based on the extension of the files. In the next step, data of the same kind is further stored in a subgroup. If the data is in text format, specific keywords (at least 5) are extracted from each file and each set of keywords is stored in a List A. The text files are then compared based on the keywords listed in A. If two or more text files have the same keywords, one copy is chosen at random as the original and the rest of the copies are stored in sub-folder C. The remaining text files are arranged in ascending order and stored in sub-folder A. If the data is in image format, the images are compared using the Mean Squared Error method or the Structural Similarity Measure method.
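The text-file branch can be illustrated with a minimal Python sketch. Several details here are assumptions, because the specification leaves them open: keywords are taken to be the five most frequent words outside a small hypothetical stop-word list, two files count as copies when their keyword sets are equal, and the first file name in sorted order is kept as the original (the specification picks one at random; a fixed choice is used here only to make the sketch repeatable).

```python
import re
from collections import Counter

# Assumed stop-word list and keyword count; the specification only says
# that at least 5 keywords are extracted per text file.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "on", "for"}
KEYWORD_COUNT = 5


def extract_keywords(text, count=KEYWORD_COUNT):
    """Return the `count` most frequent non-stop-words as a frozenset."""
    words = [w for w in re.findall(r"[a-z0-9]+", text.lower())
             if w not in STOP_WORDS]
    # Sort by descending frequency, then alphabetically, so ties are stable.
    ranked = sorted(Counter(words).items(), key=lambda item: (-item[1], item[0]))
    return frozenset(word for word, _ in ranked[:count])


def split_text_duplicates(documents):
    """Split {file name: text} into (originals, duplicates).

    List A holds one keyword set per file. Files with equal keyword sets are
    treated as copies; the first name in sorted order is kept as the original.
    """
    list_a = {name: extract_keywords(text) for name, text in documents.items()}
    seen = set()
    originals, duplicates = [], []
    for name in sorted(list_a):  # ascending order by file name (assumed)
        keywords = list_a[name]
        if keywords in seen:
            duplicates.append(name)  # destined for sub-folder C
        else:
            seen.add(keywords)
            originals.append(name)  # destined for sub-folder A
    return originals, duplicates
```

Comparing keyword sets rather than whole files means that two different documents with the same dominant vocabulary would be flagged as copies, so a stricter check (for example a byte-for-byte comparison of the flagged pair) may be worth adding before anything is deleted.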

[0033]. In the next step, if two or more images are similar, any one copy is considered the original and the rest of the copies are stored in sub-folder C. The remaining image files are arranged in ascending order and stored in sub-folder B. In this way, the redundant data is removed from the folder and the data is presented in a systematic manner. The data stored in sub-folder C can be deleted permanently if it is no longer required.
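The image branch can be sketched in the same spirit using the Mean Squared Error option. This is an illustration under assumptions rather than the disclosed implementation: images are represented as plain lists of grayscale pixel rows (a real system would first decode the files with an imaging library), the similarity cut-off `MSE_THRESHOLD` is a hypothetical value, and images of different dimensions are simply treated as different. The Structural Similarity Measure alternative is not shown.

```python
# Assumed cut-off: the specification does not state how small the error
# must be for two images to count as similar.
MSE_THRESHOLD = 1.0


def same_shape(image_a, image_b):
    """True when both images have the same number of rows and row lengths."""
    return len(image_a) == len(image_b) and all(
        len(row_a) == len(row_b) for row_a, row_b in zip(image_a, image_b)
    )


def mean_squared_error(image_a, image_b):
    """Mean squared pixel difference of two non-empty, equally sized images."""
    if not same_shape(image_a, image_b):
        raise ValueError("images must have the same dimensions")
    squared = [
        (pixel_a - pixel_b) ** 2
        for row_a, row_b in zip(image_a, image_b)
        for pixel_a, pixel_b in zip(row_a, row_b)
    ]
    return sum(squared) / len(squared)


def split_image_duplicates(images, threshold=MSE_THRESHOLD):
    """Split {file name: pixel rows} into (originals, duplicates).

    An image is a duplicate when its MSE against an already kept original
    is at or below the threshold.
    """
    originals, duplicates = [], []
    for name in sorted(images):  # ascending order by file name (assumed)
        candidate = images[name]
        is_copy = any(
            same_shape(candidate, images[kept])
            and mean_squared_error(candidate, images[kept]) <= threshold
            for kept in originals
        )
        (duplicates if is_copy else originals).append(name)
    return originals, duplicates
```

An MSE of zero means the two pixel grids are identical; a small positive threshold also catches copies that differ only by slight noise, at the cost of possibly merging images that are merely very similar.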

[0034]. In this way, the present method also saves memory space.

Uniqueness of the Present Solution

The present approach is unique for the following reasons:

[0035]. 1. At present, identifying the duplicate copies in a folder is a difficult and time-consuming task; users usually try to do it manually and waste a lot of time on this activity.

[0036]. 2. The present approach identifies duplicate text files and image files in a folder automatically.

[0037]. 3. Further, it creates three sub-folders: one for storing the original copies of text files, a second for storing the original copies of image files, and a third for storing duplicate image and text files.

[0038]. 4. The folder that contains the duplicate copies can be deleted by the user at their convenience.

[0039]. 5. In this way, the present approach frees memory space and reduces the user's effort in finding the required data.
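The three-sub-folder arrangement listed above can be sketched as follows. This is a hedged illustration only: the function name `file_into_subfolders` is hypothetical, the sub-folders are assumed to be literally named A, B and C inside the scanned folder, and the lists of originals and duplicates are assumed to come from an earlier comparison step.

```python
import os
import shutil
import tempfile


def file_into_subfolders(folder, text_originals, image_originals, duplicates):
    """Move files into sub-folders A (text), B (images) and C (duplicates).

    Returns the sorted contents of each sub-folder after the move.
    """
    plan = {"A": text_originals, "B": image_originals, "C": duplicates}
    for subfolder, names in plan.items():
        target = os.path.join(folder, subfolder)
        os.makedirs(target, exist_ok=True)
        for name in sorted(names):
            shutil.move(os.path.join(folder, name), os.path.join(target, name))
    return {sub: sorted(os.listdir(os.path.join(folder, sub))) for sub in plan}


def demo_filing():
    """Create four sample files in a throwaway folder and file them away."""
    with tempfile.TemporaryDirectory() as folder:
        for name in ["a.txt", "a_copy.txt", "b.png", "b_copy.png"]:
            with open(os.path.join(folder, name), "w") as handle:
                handle.write(name)
        return file_into_subfolders(
            folder, ["a.txt"], ["b.png"], ["a_copy.txt", "b_copy.png"]
        )
```

Moving duplicates into sub-folder C instead of deleting them immediately keeps the operation reversible; the user can inspect C and then remove it in one step once the contents are confirmed to be unneeded.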

[0040]. Although implementations of the invention have been described in

a language specific to structural features and/or methods, it is to be

understood that the appended claims are not necessarily limited to the specific features or methods described. Rather, the specific features and methods are disclosed as examples of implementations of the invention.

Claims

We claim:

1. A computer-implemented method for data classification for efficient memory utilization and avoiding duplicate data, wherein the computer-implemented method comprises the steps of:

performing a scan of the data of a folder in a memory storage, wherein the scan is performed on a folder that has not been scanned;

categorizing the data of the folder according to the extension of the files;

storing the files of the same extension in a group for each type of file;

creating subgroups of the groups of files based on the size of the files; and

removing duplicate data according to the type of data, wherein for a text file: extracting keywords and putting them in a List A; comparing files based on the keywords and removing duplicate copies from List A; arranging the selected files of List A in ascending order and putting them in a sub-folder A; and storing the duplicate copies of the files in a sub-folder C; OR

removing duplicate data according to the type of data, wherein for an image file: comparing images and putting them in a List B and removing duplicate copies; arranging the selected files of List B in ascending order and putting them in a sub-folder B; and storing the duplicate copies in the sub-folder C.

2. The computer-implemented method for data classification for efficient memory utilization and avoiding duplicate data as claimed in claim 1, wherein the data is available in image format and the images are compared using the Mean Squared Error method or the Structural Similarity Measure method.