CN112817920A - Distributed big data cleaning method - Google Patents
- Publication number
- CN112817920A (application CN202110233271.1A)
- Authority
- CN
- China
- Prior art keywords
- data
- picture
- cleaning
- address
- determining
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/16—File or folder operations, e.g. details of user interfaces specifically adapted to file systems
- G06F16/162—Delete operations
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/18—File system types
- G06F16/182—Distributed file systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
- G06F16/215—Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/27—Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/50—Information retrieval; Database structures therefor; File system structures therefor of still image data
- G06F16/51—Indexing; Data structures therefor; Storage structures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/50—Information retrieval; Database structures therefor; File system structures therefor of still image data
- G06F16/58—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/587—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using geographical or spatial information, e.g. location
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Quality & Reliability (AREA)
- Computing Systems (AREA)
- Library & Information Science (AREA)
- Human Computer Interaction (AREA)
- Software Systems (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The application provides a distributed big data cleaning method comprising the following steps: a terminal acquires the storage time of data in the distributed big data and determines data whose storage time is greater than a time threshold as the data to be cleaned; the terminal performs an association operation on first cleaning data among the data to be cleaned, the association operation specifically comprising searching for the number α of data associated with the first cleaning data; the terminal retains the first cleaning data if it determines that α is greater than a quantity threshold, and cleans the first cleaning data if it determines that α is less than the quantity threshold. The technical scheme provided by the application has the advantage of reducing cost.
Description
Technical Field
The application relates to the field of big data, in particular to a distributed big data cleaning method.
Background
Big data, an IT industry term, refers to a data set that cannot be captured, managed, and processed with conventional software tools within a given time range. It is a massive, fast-growing, and diversified information asset that requires new processing modes in order to provide stronger decision-making power, insight discovery, and process optimization capability.
Existing big data takes up a large amount of storage space, occupies excessive storage resources, and increases cost.
Disclosure of Invention
The invention aims to provide a distributed big data cleaning method. The technical scheme classifies and cleans big data, which reduces the storage resources used and reduces cost.
In a first aspect, a distributed big data cleaning method is provided, in which:
the terminal acquires the storage time of data in the distributed big data, and determines data whose storage time is greater than a time threshold as the data to be cleaned;
the terminal performs an association operation on first cleaning data among the data to be cleaned, the association operation specifically comprising: searching for the number α of data associated with the first cleaning data;
the terminal retains the first cleaning data if it determines that α is greater than a quantity threshold, and cleans the first cleaning data if it determines that α is less than the quantity threshold.
With this scheme, the number α of associated data indicates whether the first cleaning data is still useful, and therefore whether the first cleaning data should be cleaned or retained.
Drawings
In order to illustrate the technical solutions in the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present application; those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic diagram of a hardware structure of a terminal according to the present invention.
Fig. 2 is a schematic flow chart of a distributed big data cleaning method provided by the present invention.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
The embodiments of the present application will be described below with reference to the drawings.
The term "and/or" in this application describes only an association relationship between associated objects and means that three relationships are possible. For example, "A and/or B" may mean: A exists alone, A and B exist simultaneously, or B exists alone. In addition, the character "/" in this document indicates that the objects before and after it are in an "or" relationship.
The "plurality" appearing in the embodiments of the present application means two or more. The descriptions of the first, second, etc. appearing in the embodiments of the present application are only for illustrating and differentiating the objects, and do not represent the order or the particular limitation of the number of the devices in the embodiments of the present application, and do not constitute any limitation to the embodiments of the present application. The term "connect" in the embodiments of the present application refers to various connection manners, such as direct connection or indirect connection, to implement communication between devices, which is not limited in this embodiment of the present application.
In the present application, "|" means an absolute value.
Fig. 1 is a hardware block diagram of a terminal provided in an embodiment of the present application, including a processor, a memory, a camera, and a display screen. A plurality of terminals as shown in fig. 1 may constitute a distributed big data system.
Referring to fig. 2, fig. 2 provides a distributed big data cleaning method, which may be applied in a terminal in a distributed big data system, where the terminal may have the structure shown in fig. 1. As shown in fig. 2, the method may include the following steps:
step S201, the terminal acquires the storage time of data in the distributed big data, and determines data whose storage time is greater than a time threshold as the data to be cleaned;
step S202, the terminal performs an association operation on first cleaning data among the data to be cleaned, the association operation specifically comprising: searching for the number α of data associated with the first cleaning data;
step S203, the terminal retains the first cleaning data if it determines that α is greater than the quantity threshold, and cleans the first cleaning data if it determines that α is less than the quantity threshold.
With this scheme, the number α of associated data indicates whether the first cleaning data is still useful, and therefore whether the first cleaning data should be cleaned or retained.
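Steps S201 to S203 can be sketched in Python as follows. This is only a minimal illustration, not the applicant's implementation: the `Record` type, the `clean` function, and the `count_associated` callback are hypothetical names, "storage time" is assumed to mean the age of a record (`now - stored_at`), and the case where α equals the threshold, which the text leaves open, is treated here as "clean".

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass(frozen=True)
class Record:
    """Hypothetical stored item; the field names are illustrative only."""
    key: str
    stored_at: float  # storage timestamp, in seconds


def clean(
    records: Iterable[Record],
    now: float,
    time_threshold: float,
    count_threshold: int,
    count_associated: Callable[[Record], int],
) -> Tuple[List[Record], List[Record]]:
    """Return (kept, cleaned) following steps S201 to S203."""
    kept: List[Record] = []
    cleaned: List[Record] = []
    for rec in records:
        # S201: only data stored longer than the time threshold is a candidate.
        if now - rec.stored_at <= time_threshold:
            kept.append(rec)
            continue
        # S202: look up the number alpha of associated data.
        alpha = count_associated(rec)
        # S203: retain when alpha exceeds the threshold, otherwise clean.
        # Assumption: alpha == threshold is cleaned (unspecified in the text).
        if alpha > count_threshold:
            kept.append(rec)
        else:
            cleaned.append(rec)
    return kept, cleaned
```

Passing the association lookup in as a callback keeps the sketch independent of the data type; for pictures, the lookup could be the address-based count described in the following paragraphs.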
For example, take picture data. The more pictures in the distributed big data that are associated with a given picture, the more likely it is that the picture needs to be retained, because the pictures are related to one another. Consider tour pictures taken in Beijing: if the first picture corresponding to the first cleaning data has a large number of related pictures, the first picture is determined to be worth storing. Conversely, if the first picture is an isolated picture, it has little value and is likely to be deleted. Likewise, if all (not just some) of the pictures related to the first picture have already been deleted, the number α found for the first picture will be small. The cleaning decision can therefore be made from this number.
Optionally, if the first cleaning data is a picture, searching for the number α of data associated with the first cleaning data may specifically include:
determining the first picture of the first cleaning data; recognizing the first picture to determine whether it contains a person; if so, removing the person from the first picture to obtain a first background picture; classifying and recognizing the first background picture to obtain the landmark information corresponding to it; determining a first address of the first background picture from the landmark information; querying the pictures stored in the distributed big data storage for pictures matching the first address; and taking the number of matching pictures as α.
The background picture can be obtained by removing the person in any existing manner, and the classification and recognition can likewise use an existing service, such as Baidu AI recognition or Google recognition.
Taking tour pictures of Beijing as an example again: for a first picture taken during the tour, for example at Sanlitun, the person is removed and the remaining background is compared directly. If the pictures stored in the distributed big data storage include a plurality of pictures with the Sanlitun address, the first picture is retained; otherwise, the first picture is cleaned (deleted).
It should be noted that querying the pictures stored in the distributed big data storage for pictures matching the first address specifically includes:
performing face recognition on the person to determine the person's identity; determining the person's resident city from the identity; if the resident city includes the first address, determining that only pictures whose address is the first address match the first address; if the resident city does not include the first address, determining a first city corresponding to the first address, and determining that pictures matching the first city are pictures matching the first address.
In this scheme, for a person who lives in Beijing, matching by city would be too broad, so the range is narrowed to the specific address. For a tourist, the range needs to be expanded to the city. For example, if Zhang San travels from Shenzhen to Beijing, pictures taken at Sanlitun, Tsinghua University, and other Beijing addresses should all be determined to be pictures with matching addresses.
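The resident/tourist matching rule can be illustrated with a short Python sketch. It assumes that landmark recognition and face recognition have already produced an address, a city, and a set of resident cities; the `Picture` type and the `count_matching` function are hypothetical names, and "the resident city includes the first address" is modelled as the city of the first address being one of the person's resident cities.

```python
from dataclasses import dataclass
from typing import Iterable, Set


@dataclass(frozen=True)
class Picture:
    """Hypothetical picture metadata; address and city are assumed already recognised."""
    name: str
    address: str
    city: str


def count_matching(
    first_address: str,
    first_city: str,
    resident_cities: Set[str],
    stored: Iterable[Picture],
) -> int:
    """Return alpha, the number of stored pictures matched with the first address."""
    if first_city in resident_cities:
        # Resident: match only pictures taken at the same specific address.
        return sum(1 for p in stored if p.address == first_address)
    # Tourist: widen the match to every picture taken in the same city.
    return sum(1 for p in stored if p.city == first_city)
```

With made-up sample data in which three stored pictures were taken in Beijing, two of them at Sanlitun, a Beijing resident gets α = 2 for a Sanlitun picture, while a visitor from Shenzhen gets α = 3.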
In an optional scheme, if the first cleaning data is retained, the method further includes:
when the terminal determines that the type of the first cleaning data is picture data, performing a grid operation on each picture in the picture data to obtain operation data. The grid operation may specifically include: establishing grids for a picture (similar to the grid in a Visio drawing, that is, a plurality of square cells of equal area); calculating the similarity between every two grids; determining grids whose similarity is greater than a similarity threshold to be a similar grid group; setting a bitmap to represent the positions of the similar grid group in the picture; storing the pixel data of one grid in the similar grid group; and deleting the pixel data of the remaining grids in the group, which completes the grid operation for the picture. The similarity may specifically be calculated as follows: establish a three-dimensional matrix m × n × 3 from the R, G, B values of the pixel points of each grid and the positions of the pixel points in the grid, where m is the length of the three-dimensional matrix, n is its width, and 3 is its depth, each depth layer corresponding to one of the R, G, B values; calculate the difference between the two three-dimensional matrices of the two grids to obtain a three-dimensional difference matrix; count the number x of elements in the three-dimensional difference matrix whose value is smaller than a pixel threshold; then similarity = x / (m × n × 3) × 100%.
After gridding, many grids in a picture are completely identical, for example the grids covering a road surface, and storing every one of them clearly wastes storage space. After the similarity calculation, the pixel data of only one grid in each similar grid group is stored, and the positions of the other grids are represented by a bitmap, so the storage space of the picture, and with it the storage cost, can be reduced.
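The similarity formula and the grid operation can be sketched in pure Python as below. Several points are assumptions rather than statements from the text: the difference is taken as an absolute difference per element, grids are grouped greedily against the first grid of each group instead of through a full pairwise comparison, and a plain list of representative indices stands in for the bitmap. The names `similarity` and `grid_operation` are hypothetical.

```python
from typing import Dict, List, Tuple

Pixel = Tuple[int, int, int]  # (R, G, B)
Grid = List[List[Pixel]]      # m rows by n columns


def similarity(a: Grid, b: Grid, pixel_threshold: int) -> float:
    """Percentage of R, G, B elements whose absolute difference is below the threshold."""
    m, n = len(a), len(a[0])
    x = sum(
        1
        for i in range(m)
        for j in range(n)
        for c in range(3)
        if abs(a[i][j][c] - b[i][j][c]) < pixel_threshold
    )
    return x / (m * n * 3) * 100.0


def grid_operation(
    grids: List[Grid], pixel_threshold: int, similarity_threshold: float
) -> Tuple[Dict[int, Grid], List[int]]:
    """Return (stored, position_map).

    stored maps a representative grid index to its pixel data;
    position_map[k] names the representative that stands in for grid k.
    """
    stored: Dict[int, Grid] = {}
    position_map: List[int] = []
    for k, grid in enumerate(grids):
        for rep_index, rep in stored.items():
            if similarity(grid, rep, pixel_threshold) > similarity_threshold:
                position_map.append(rep_index)  # pixel data of grid k is dropped
                break
        else:
            stored[k] = grid  # first grid of a new similar grid group
            position_map.append(k)
    return stored, position_map
```

Only the grids in `stored` keep their pixel data; every other grid is reduced to one entry in `position_map`, which is where the storage saving described above comes from.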
In an optional scheme, if the first cleaning data is retained, the method may further include:
if the terminal determines that the type of the first cleaning data is position coordinates, the terminal determines the area covered by the position coordinates and stores a regular area and an irregular area in place of the individual position coordinates of that area. If the regular area is a circle, the terminal acquires a first position coordinate in the middle of the position coordinate area and takes it as the candidate circle center. With the candidate center as the ray origin, the terminal emits β rays spread over 360 degrees, acquires the β intersection points between the β rays and the boundary of the position coordinate area, and calculates the β distances between these points and the candidate center. The minimum of the β distances is taken as the candidate radius. The terminal draws a circle with the candidate center and the candidate radius, randomly extracts w points on the circumference, and acquires the w coordinates of these points. If the w coordinates all lie within the boundary of the map area, the circle is determined to be the regular area, and the remaining area within the boundary, outside the circle, is the irregular area.
This method reduces the amount of position coordinate data stored, because massive numbers of individual position coordinates no longer need to be stored.
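The circle-fitting procedure can be sketched in Python as follows. The sketch assumes the boundary of the position coordinate area is available as a simple polygon and that the candidate center is supplied by the caller; neither detail is given in the text. The function names, the even-odd inside test, and the seeded random generator are illustrative choices, not part of the described method.

```python
import math
import random
from typing import List, Optional, Tuple

Point = Tuple[float, float]


def ray_hit_distance(origin: Point, angle: float, a: Point, b: Point) -> Optional[float]:
    """Distance from origin along the ray at `angle` to segment a-b, or None if missed."""
    dx, dy = math.cos(angle), math.sin(angle)
    ex, ey = b[0] - a[0], b[1] - a[1]
    denom = dx * ey - dy * ex
    if abs(denom) < 1e-12:  # ray (almost) parallel to the segment
        return None
    ax, ay = a[0] - origin[0], a[1] - origin[1]
    t = (ax * ey - ay * ex) / denom  # distance along the ray
    s = (ax * dy - ay * dx) / denom  # position along the segment, 0..1
    return t if t >= 0 and 0 <= s <= 1 else None


def point_in_polygon(p: Point, poly: List[Point]) -> bool:
    """Even-odd ray casting test."""
    inside = False
    for i in range(len(poly)):
        (x1, y1), (x2, y2) = poly[i], poly[(i + 1) % len(poly)]
        if (y1 > p[1]) != (y2 > p[1]):
            x_cross = x1 + (p[1] - y1) * (x2 - x1) / (y2 - y1)
            if p[0] < x_cross:
                inside = not inside
    return inside


def fit_circle(
    poly: List[Point], centre: Point, beta: int, w: int, seed: int = 0
) -> Optional[Tuple[Point, float]]:
    """Return (centre, radius) if all w sampled circumference points lie inside, else None."""
    edges = [(poly[i], poly[(i + 1) % len(poly)]) for i in range(len(poly))]
    distances = []
    for k in range(beta):
        angle = 2 * math.pi * k / beta  # beta rays spread evenly over 360 degrees
        hits = [ray_hit_distance(centre, angle, a, b) for a, b in edges]
        hits = [d for d in hits if d is not None]
        if hits:
            distances.append(min(hits))  # nearest boundary point on this ray
    if not distances:
        return None
    radius = min(distances)  # minimum of the beta distances
    rng = random.Random(seed)
    for _ in range(w):
        theta = rng.uniform(0.0, 2 * math.pi)
        p = (centre[0] + radius * math.cos(theta), centre[1] + radius * math.sin(theta))
        if not point_in_polygon(p, poly):
            return None  # the circle pokes outside the boundary; reject it
    return centre, radius
```

A larger β makes it less likely that the rays miss the nearest stretch of boundary, and a larger w makes the inside check stricter. When the check passes, the whole circle is stored as a center and a radius, and only the irregular remainder needs any further representation.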
The foregoing detailed description of the embodiments of the present application has been presented to illustrate the principles and implementations of the present application, and the above description of the embodiments is only provided to help understand the method and the core concept of the present application; meanwhile, for a person skilled in the art, according to the idea of the present application, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present application.
Claims (4)
1. A distributed big data cleaning method is characterized in that,
the terminal acquires the storage time of data in the distributed big data, and determines data whose storage time is greater than a time threshold as the data to be cleaned;
the terminal performs an association operation on first cleaning data among the data to be cleaned, the association operation specifically comprising: searching for the number α of data associated with the first cleaning data;
the terminal retains the first cleaning data if it determines that α is greater than a quantity threshold, and cleans the first cleaning data if it determines that α is less than the quantity threshold.
2. The method of claim 1, wherein if the first cleaning data is a picture, the searching for the number α of the associated data of the cleaning data specifically comprises:
determining a first picture of cleaning data, identifying the first picture to determine whether the first picture has a person, if so, deleting the person from the first picture to obtain a first background picture, classifying and identifying the first background picture to obtain landmark information corresponding to the first background picture, determining a first address of the first background picture according to the landmark information, inquiring pictures matched with the first address from the pictures stored in the distributed big data storage, and determining the number of the pictures as alpha.
3. The method according to claim 2, wherein the querying the pictures matching the first address from the pictures in the distributed big data storage specifically comprises:
performing face recognition on the person to determine the person's identity; determining the person's resident city from the identity; if the resident city includes the first address, determining that only pictures whose address is the first address match the first address; if the resident city does not include the first address, determining a first city corresponding to the first address, and determining that pictures matching the first city are pictures matching the first address.
4. The method of claim 1, wherein if the first cleaning data is retained, the method further comprises:
when the terminal determines that the type of the first cleaning data is picture data, performing a grid operation on each picture in the picture data to obtain operation data; the grid operation specifically comprises: establishing grids for a picture, calculating the similarity between every two grids, determining grids whose similarity is greater than a similarity threshold to be a similar grid group, setting a bitmap to indicate the positions of the similar grid group in the picture, storing the pixel data of one grid in the similar grid group, and deleting the pixel data of the remaining grids in the group to complete the grid operation for the picture;
the similarity is specifically calculated as follows: establishing a three-dimensional matrix m × n × 3 from the R, G, B values of the pixel points of each grid and the positions of the pixel points in the grid, wherein m is the length of the three-dimensional matrix, n is its width, and 3 is its depth, each depth layer corresponding to one of the R, G, B values; calculating the difference between the two three-dimensional matrices of the two grids to obtain a three-dimensional difference matrix; counting the number x of elements in the three-dimensional difference matrix whose value is smaller than a pixel threshold; similarity = x / (m × n × 3) × 100%.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110233271.1A CN112817920A (en) | 2021-03-03 | 2021-03-03 | Distributed big data cleaning method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110233271.1A CN112817920A (en) | 2021-03-03 | 2021-03-03 | Distributed big data cleaning method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112817920A (en) | 2021-05-18 |
Family
ID=75862748
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110233271.1A Withdrawn CN112817920A (en) | 2021-03-03 | 2021-03-03 | Distributed big data cleaning method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112817920A (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102467572A (en) * | 2010-11-17 | 2012-05-23 | 英业达股份有限公司 | Data block inquiring method for supporting data de-duplication program |
CN105069083A (en) * | 2015-07-31 | 2015-11-18 | 小米科技有限责任公司 | Determination method and device of associated user |
CN107077487A (en) * | 2014-10-23 | 2017-08-18 | 微软技术许可有限责任公司 | Personal photo is tagged using depth network |
CN109716324A (en) * | 2016-09-28 | 2019-05-03 | 微软技术许可有限责任公司 | Direct table association in in-memory data library |
CN111597369A (en) * | 2020-05-18 | 2020-08-28 | Oppo广东移动通信有限公司 | Photo viewing method and device, storage medium and terminal |
CN111712806A (en) * | 2017-09-21 | 2020-09-25 | 深圳传音通讯有限公司 | Method and device for clearing residual file and readable storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WW01 | Invention patent application withdrawn after publication | ||
Application publication date: 20210518 |