CN112817920A - Distributed big data cleaning method - Google Patents

Info

Publication number
CN112817920A
Authority
CN
China
Prior art keywords
data
picture
cleaning
address
determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202110233271.1A
Other languages
Chinese (zh)
Inventor
陈卿
徐弘�
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Zhixiaobing Science & Technology Co ltd
Original Assignee
Shenzhen Zhixiaobing Science & Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2021-03-03
Filing date
2021-03-03
Publication date
2021-05-18
Application filed by Shenzhen Zhixiaobing Science & Technology Co ltd
Priority to CN202110233271.1A
Publication of CN112817920A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/16 File or folder operations, e.g. details of user interfaces specifically adapted to file systems
    • G06F16/162 Delete operations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/18 File system types
    • G06F16/182 Distributed file systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/51 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/587 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using geographical or spatial information, e.g. location

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Quality & Reliability (AREA)
  • Computing Systems (AREA)
  • Library & Information Science (AREA)
  • Human Computer Interaction (AREA)
  • Software Systems (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The application provides a distributed big data cleaning method comprising the following steps: a terminal acquires the storage time of data in the distributed big data and determines the data whose storage time is greater than a time threshold to be the data to be cleaned; the terminal performs an association operation on first cleaning data among the data to be cleaned, the association operation being: looking up the number α of data items associated with the first cleaning data; the terminal retains the first cleaning data when it determines that α is greater than a number threshold, and cleans the first cleaning data when it determines that α is less than the number threshold. The technical solution provided by the application has the advantage of reducing cost.

Description

Distributed big data cleaning method
Technical Field
The application relates to the field of big data, in particular to a distributed big data cleaning method.
Background
Big data, an IT industry term, refers to a data set that cannot be captured, managed, and processed with conventional software tools within a given time frame. It is a massive, fast-growing, and diversified information asset that requires new processing modes in order to deliver stronger decision-making power, insight discovery, and process optimization.
Existing big data takes up a large amount of storage space and occupies excessive storage resources, which increases cost.
Disclosure of Invention
The invention aims to provide a distributed big data cleaning method. The technical solution classifies and cleans big data, which reduces the storage resources used and therefore the cost.
In a first aspect, a distributed big data cleaning method is provided, in which:
a terminal acquires the storage time of data in the distributed big data and determines the data whose storage time is greater than a time threshold to be the data to be cleaned;
the terminal performs an association operation on first cleaning data among the data to be cleaned, the association operation being: looking up the number α of data items associated with the first cleaning data;
the terminal retains the first cleaning data if it determines that α is greater than a number threshold, and cleans the first cleaning data if it determines that α is less than the number threshold.
In this solution, the number α of associated data items indicates whether the first cleaning data is still useful, and therefore whether it should be cleaned or retained.
Drawings
To illustrate the technical solutions in the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present application; a person skilled in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic diagram of a hardware structure of a terminal according to the present invention.
Fig. 2 is a schematic flow chart of a distributed big data cleaning method provided by the present invention.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
The embodiments of the present application will be described below with reference to the drawings.
The term "and/or" in this application merely describes an association between associated objects and covers three possible relationships. For example, "A and/or B" may mean: A exists alone, A and B exist simultaneously, or B exists alone. In addition, the character "/" in this document indicates an "or" relationship between the objects before and after it.
"Plurality" in the embodiments of the present application means two or more. Terms such as "first" and "second" in the embodiments of the present application are used only to illustrate and distinguish the objects described; they do not indicate an order or a particular limit on the number of devices, and do not limit the embodiments in any way. The term "connect" in the embodiments of the present application covers various connection manners, such as direct connection or indirect connection, used to implement communication between devices, and is not limited in the embodiments of the present application.
In the present application, "| |" denotes an absolute value.
Fig. 1 is a hardware block diagram of a terminal provided in an embodiment of the present application, including a processor, a memory, a camera, and a display screen. A plurality of terminals as shown in fig. 1 may constitute a distributed big data system.
Referring to Fig. 2, Fig. 2 shows a distributed big data cleaning method that may be applied in a terminal of a distributed big data system. The terminal may have the structure shown in Fig. 1. As shown in Fig. 2, the method may include the following steps:
Step S201: the terminal acquires the storage time of data in the distributed big data and determines the data whose storage time is greater than a time threshold to be the data to be cleaned.
Step S202: the terminal performs an association operation on first cleaning data among the data to be cleaned, the association operation being: looking up the number α of data items associated with the first cleaning data.
Step S203: the terminal retains the first cleaning data if it determines that α is greater than a number threshold, and cleans the first cleaning data if it determines that α is less than the number threshold.
In this solution, the number α of associated data items indicates whether the first cleaning data is still useful, and therefore whether it should be cleaned or retained.
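Steps S201 to S203 can be illustrated with a minimal Python sketch. The `Record` type, its field names, and the choice to keep data when α exactly equals the threshold are assumptions made for illustration; the application does not specify a data model or the equal case.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """A stored item; every field name here is a hypothetical stand-in."""
    key: str
    stored_at: float                           # storage timestamp, in seconds
    related: set = field(default_factory=set)  # keys of associated records


def select_stale(records, now, time_threshold):
    """Step S201: pick data stored for longer than the time threshold."""
    return [r for r in records if now - r.stored_at > time_threshold]


def decide(record, count_threshold):
    """Steps S202-S203: count associated data (alpha), then keep or clean."""
    alpha = len(record.related)
    if alpha > count_threshold:
        return "keep"
    if alpha < count_threshold:
        return "clean"
    # The source does not say what happens when alpha equals the threshold;
    # keeping the data is a conservative assumption made here.
    return "keep"


def clean(records, now, time_threshold, count_threshold):
    """Return the keys of the stale records that would be cleaned."""
    return [r.key for r in select_stale(records, now, time_threshold)
            if decide(r, count_threshold) == "clean"]
```

A caller would pass the current time and both thresholds; records younger than the time threshold are never considered for cleaning.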
Take picture data as an example. The more pictures in the distributed big data that are associated with a given picture, the more likely it is that the picture needs to be retained, because it is related to other pictures. Consider travel photos taken in Beijing: if the first picture, corresponding to the first cleaning data, has a large number of related pictures, it is determined that the first picture needs to be kept. Conversely, if the first picture is an isolated picture, there is little value in retaining it, so it is likely to be deleted. Likewise, if all, rather than only some, of the pictures related to the first picture have already been deleted, the number α found for the first picture will necessarily be small. The cleaning decision can therefore be made from this number.
Optionally, if the first cleaning data is a picture, looking up the number α of data items associated with the cleaning data may specifically include:
determining the first picture of the cleaning data; performing recognition on the first picture to determine whether it contains a person; if so, deleting the person from the first picture to obtain a first background picture; classifying and recognizing the first background picture to obtain the landmark information corresponding to it; determining a first address of the first background picture from the landmark information; querying the pictures stored in the distributed big data storage for pictures that match the first address; and taking the number of matching pictures as α.
The background picture may be obtained by deleting the person in any existing manner, and the classification and recognition may likewise use an existing service, such as Baidu AI recognition or Google recognition.
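The picture pipeline above can be sketched as a single function that takes the recognition steps as injected callables. The four callables and the `"address"` key on stored pictures are hypothetical placeholders, not real service APIs; the sketch also assumes that a picture without a person is used as its own background, which the application does not state.

```python
def count_associated_pictures(first_picture, stored_pictures,
                              has_person, remove_person,
                              recognize_landmark, landmark_to_address):
    """Return alpha, the number of stored pictures matching the first address.

    All four callables are hypothetical hooks for existing detection,
    person-removal and landmark-recognition services; none is a real API.
    """
    background = first_picture
    if has_person(first_picture):
        background = remove_person(first_picture)
    landmark = recognize_landmark(background)
    first_address = landmark_to_address(landmark)
    # Assumption: each stored picture already carries a resolved address.
    return sum(1 for p in stored_pictures if p["address"] == first_address)
```

Passing the recognition steps in as parameters keeps the counting logic independent of whichever recognition service is actually used.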
Again taking travel photos in Beijing as an example: a first picture taken during the trip, for example at Sanlitun, has the person removed and its background is then compared directly. If the pictures stored in the distributed big data storage include multiple pictures with the Sanlitun address, the first picture is retained; otherwise, the first picture is cleaned (deleted).
It should be noted that querying the pictures stored in the distributed big data storage for pictures that match the first address specifically includes:
performing face recognition on the person to determine the person's identity, and determining the person's resident city from the identity; if the resident city contains the first address, only pictures whose address matches the first address are treated as pictures matching the first address; if the resident city does not contain the first address, determining the first city corresponding to the first address and treating pictures that match the first city as pictures matching the first address.
In this scheme, for a person who lives in Beijing, matching on the whole city would be too broad, so the range is narrowed to the specific address. For a tourist, however, the range needs to be widened to the city. For example, if Zhang San travels from Shenzhen to Beijing, pictures taken at Sanlitun, Tsinghua University, and other addresses in Beijing should all be treated as address-matched pictures.
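The resident-versus-tourist rule can be sketched as follows. The sketch assumes that each stored picture carries both an `"address"` and a `"city"` value and that the person's resident cities are already known from face recognition; these names are illustrative and do not come from the application.

```python
def matching_pictures(first_address, first_city, resident_cities,
                      stored_pictures):
    """Pick stored pictures that count as matching the first address.

    Residents are matched on the exact address; visitors on the whole city.
    The dictionary keys "address" and "city" are assumptions for this sketch.
    """
    if first_city in resident_cities:
        # The first address lies in the person's resident city: narrow match.
        return [p for p in stored_pictures if p["address"] == first_address]
    # The person is a visitor: widen the match to the whole first city.
    return [p for p in stored_pictures if p["city"] == first_city]
```

The number α is then the length of the returned list.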
In an optional scheme, if the first cleaning data is retained, the method further includes:
when the terminal determines that the type of the first cleaning data is picture data, performing a grid operation on each picture in the picture data to obtain operation data. The grid operation may specifically include: dividing the picture into grids (similar to the grid in Visio drawings, that is, a number of square cells of equal area); calculating the similarity between every two grids; grouping grids whose similarity is greater than a similarity threshold into a similar grid group; setting a bitmap that records the positions of the similar grid group in the picture; storing the pixel data of one grid of the similar grid group; and deleting the pixel data of the remaining grids in the group, which completes the grid operation for the picture. The similarity may be calculated as follows: for each grid, build a three-dimensional matrix of size m × n × 3 from the R, G, B values of its pixels and their positions in the grid, where m is the length of the matrix, n is its width, and 3 is its depth, each depth layer holding one of the R, G, B values; subtract the three-dimensional matrices of the two grids to obtain a three-dimensional difference matrix; count the number x of elements in the difference matrix whose value is smaller than a pixel-value threshold; then similarity = x / (m × n × 3) × 100%.
After gridding, many grids in a picture are essentially identical, for example regions showing a road surface, and storing every one of them clearly wastes storage space. Once the similarity has been calculated, the pixel data of only one grid per similar grid group is stored, and the positions of the other grids are recorded in a bitmap. This reduces the storage space taken by the picture and lowers the storage cost.
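The similarity formula and the store-one-grid-per-group idea can be sketched in plain Python. Two points are assumptions rather than statements from the application: the difference is taken as an absolute value, and grids are grouped greedily against the first similar stored grid instead of comparing every pair. The `index` list stands in for the bitmap.

```python
def similarity(grid_a, grid_b, pixel_threshold):
    """Percentage of R, G, B values whose difference is below the threshold.

    Each grid is an m x n list of (R, G, B) tuples. Taking the absolute
    difference is an assumption; the source only says "differences".
    """
    total = 0
    close = 0
    for row_a, row_b in zip(grid_a, grid_b):
        for px_a, px_b in zip(row_a, row_b):
            for ch_a, ch_b in zip(px_a, px_b):
                total += 1
                if abs(ch_a - ch_b) < pixel_threshold:
                    close += 1
    return close / total * 100.0   # x / (m * n * 3) * 100%


def deduplicate(grids, pixel_threshold, similarity_threshold):
    """Store one grid per similar group plus a per-grid index into the store.

    `index` plays the role of the bitmap: index[i] tells which stored grid
    stands in for grid i. Greedy first-match grouping is a simplification
    of the pairwise comparison described in the text.
    """
    stored, index = [], []
    for grid in grids:
        for k, kept in enumerate(stored):
            if similarity(grid, kept, pixel_threshold) > similarity_threshold:
                index.append(k)
                break
        else:
            # No similar grid stored yet: keep this grid's pixel data.
            stored.append(grid)
            index.append(len(stored) - 1)
    return stored, index
```

Reconstructing the picture then amounts to reading `stored[index[i]]` for each grid position i, at the cost of replacing near-duplicate grids with their representative.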
In an optional scheme, if the first cleaning data is retained, the method may further include:
if the terminal determines that the type of the first cleaning data is position coordinates, determining the area covered by the position coordinates and storing a regular area and an irregular area in place of the individual position coordinates of that area. If the regular area is a circle, the terminal acquires a first position coordinate in the middle of the position coordinate area and takes it as the candidate circle center; casts β rays from the candidate center, spread over 360 degrees; acquires the β intersection points between the β rays and the boundary line of the position coordinate area; calculates the β distances between these points and the candidate center; and takes the minimum of the β distances as the candidate radius. The terminal then draws a circle with the candidate center and candidate radius, randomly picks w points on the circumference, and acquires their w coordinates. If all w coordinates lie within the boundary of the map area, the circle is determined to be the regular area. The remaining part of the position coordinate area, that is, the area inside the boundary but outside the circle, is the irregular area.
This method reduces the amount of position coordinate data that must be stored, because massive numbers of individual position coordinates no longer need to be kept.
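The circle construction can be sketched with basic geometry, assuming the area boundary is given as a simple polygon (a list of vertices). The polygon representation, the evenly spaced ray angles, the default values of β and w, and the small shrink factor applied before the inside test are all assumptions made for this sketch.

```python
import math
import random


def ray_hit_distance(origin, angle, a, b):
    """Distance along a ray from `origin` to segment a-b, or None on a miss."""
    ox, oy = origin
    dx, dy = math.cos(angle), math.sin(angle)
    ex, ey = b[0] - a[0], b[1] - a[1]
    denom = dx * ey - dy * ex
    if abs(denom) < 1e-12:          # ray is parallel to the segment
        return None
    t = ((a[0] - ox) * ey - (a[1] - oy) * ex) / denom   # distance along ray
    s = ((a[0] - ox) * dy - (a[1] - oy) * dx) / denom   # position on segment
    return t if t >= 0 and 0 <= s <= 1 else None


def inside(point, polygon):
    """Even-odd ray-casting test for a point inside a polygon."""
    x, y = point
    result = False
    n = len(polygon)
    for i in range(n):
        (x1, y1), (x2, y2) = polygon[i], polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):
            if x < x1 + (y - y1) * (x2 - x1) / (y2 - y1):
                result = not result
    return result


def regular_circle(center, polygon, beta=72, w=16, seed=0):
    """Return (center, radius) of the circular regular area, or None."""
    n = len(polygon)
    distances = []
    for k in range(beta):
        angle = 2 * math.pi * k / beta      # beta rays spread over 360 degrees
        hits = [ray_hit_distance(center, angle, polygon[i], polygon[(i + 1) % n])
                for i in range(n)]
        hits = [h for h in hits if h is not None]
        if hits:
            distances.append(min(hits))     # nearest boundary point on this ray
    if not distances:
        return None
    radius = min(distances)                 # minimum of the ray distances
    rng = random.Random(seed)
    for _ in range(w):
        theta = rng.uniform(0, 2 * math.pi)
        # Shrink slightly so points exactly on the boundary count as inside.
        p = (center[0] + 0.999 * radius * math.cos(theta),
             center[1] + 0.999 * radius * math.sin(theta))
        if not inside(p, polygon):
            return None
    return center, radius
```

Only the center and radius need to be stored for the regular area; the irregular remainder (inside the boundary but outside the circle) still has to be represented separately.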
The foregoing detailed description of the embodiments of the present application has been presented to illustrate the principles and implementations of the present application, and the above description of the embodiments is only provided to help understand the method and the core concept of the present application; meanwhile, for a person skilled in the art, according to the idea of the present application, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present application.

Claims (4)

1. A distributed big data cleaning method, characterized in that:
a terminal acquires the storage time of data in the distributed big data and determines the data whose storage time is greater than a time threshold to be the data to be cleaned;
the terminal performs an association operation on first cleaning data among the data to be cleaned, the association operation being: looking up the number α of data items associated with the first cleaning data;
the terminal retains the first cleaning data if it determines that α is greater than a number threshold, and cleans the first cleaning data if it determines that α is less than the number threshold.
2. The method of claim 1, wherein, if the first cleaning data is a picture, looking up the number α of data items associated with the cleaning data specifically comprises:
determining the first picture of the cleaning data; performing recognition on the first picture to determine whether it contains a person; if so, deleting the person from the first picture to obtain a first background picture; classifying and recognizing the first background picture to obtain the landmark information corresponding to it; determining a first address of the first background picture from the landmark information; querying the pictures stored in the distributed big data storage for pictures that match the first address; and taking the number of matching pictures as α.
3. The method according to claim 2, wherein querying the pictures in the distributed big data storage for pictures that match the first address specifically comprises:
performing face recognition on the person to determine the person's identity, and determining the person's resident city from the identity; if the resident city contains the first address, only pictures whose address matches the first address are treated as pictures matching the first address; if the resident city does not contain the first address, determining the first city corresponding to the first address and treating pictures that match the first city as pictures matching the first address.
4. The method of claim 1, wherein, if the first cleaning data is retained, the method further comprises:
when the terminal determines that the type of the first cleaning data is picture data, performing a grid operation on each picture in the picture data to obtain operation data; the grid operation may specifically include: dividing the picture into grids; calculating the similarity between every two grids; grouping grids whose similarity is greater than a similarity threshold into a similar grid group; setting a bitmap that records the positions of the similar grid group in the picture; storing the pixel data of one grid of the similar grid group; and deleting the pixel data of the remaining grids in the group, which completes the grid operation for the picture;
the similarity is calculated as follows: for each grid, build a three-dimensional matrix of size m × n × 3 from the R, G, B values of its pixels and their positions in the grid, where m is the length of the matrix, n is its width, and 3 is its depth, each depth layer holding one of the R, G, B values; subtract the three-dimensional matrices of the two grids to obtain a three-dimensional difference matrix; count the number x of elements in the difference matrix whose value is smaller than a pixel-value threshold; then similarity = x / (m × n × 3) × 100%.
CN202110233271.1A 2021-03-03 2021-03-03 Distributed big data cleaning method Withdrawn CN112817920A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110233271.1A CN112817920A (en) 2021-03-03 2021-03-03 Distributed big data cleaning method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110233271.1A CN112817920A (en) 2021-03-03 2021-03-03 Distributed big data cleaning method

Publications (1)

Publication Number Publication Date
CN112817920A (en) 2021-05-18

Family

ID=75862748

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110233271.1A Withdrawn CN112817920A (en) 2021-03-03 2021-03-03 Distributed big data cleaning method

Country Status (1)

Country Link
CN (1) CN112817920A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102467572A (en) * 2010-11-17 2012-05-23 英业达股份有限公司 Data block inquiring method for supporting data de-duplication program
CN105069083A (en) * 2015-07-31 2015-11-18 小米科技有限责任公司 Determination method and device of associated user
CN107077487A (en) * 2014-10-23 2017-08-18 微软技术许可有限责任公司 Personal photo is tagged using depth network
CN109716324A (en) * 2016-09-28 2019-05-03 微软技术许可有限责任公司 Direct table association in in-memory data library
CN111597369A (en) * 2020-05-18 2020-08-28 Oppo广东移动通信有限公司 Photo viewing method and device, storage medium and terminal
CN111712806A (en) * 2017-09-21 2020-09-25 深圳传音通讯有限公司 Method and device for clearing residual file and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20210518