CN112817920A

CN112817920A - Distributed big data cleaning method

Info

Publication number: CN112817920A
Application number: CN202110233271.1A
Authority: CN
Inventors: 陈卿; 徐弘
Original assignee: Shenzhen Zhixiaobing Science & Technology Co ltd
Current assignee: Shenzhen Zhixiaobing Science & Technology Co ltd
Priority date: 2021-03-03
Filing date: 2021-03-03
Publication date: 2021-05-18

Abstract

The application provides a distributed big data cleaning method comprising the following steps: a terminal acquires the data storage time of distributed big data and determines data whose storage time is greater than a time threshold as data to be cleaned; the terminal performs an association operation on first cleaning data in the data to be cleaned, the association operation comprising searching for the number α of data items associated with the first cleaning data; the terminal retains the first cleaning data when it determines that α is greater than a number threshold, and cleans the first cleaning data when it determines that α is less than the number threshold. The technical scheme provided by the application has the advantage of reducing cost.

Description

Distributed big data cleaning method

Technical Field

The application relates to the field of big data, in particular to a distributed big data cleaning method.

Background

Big data, an IT industry term, refers to a data set that cannot be captured, managed, and processed with conventional software tools within a certain time range. It is a massive, fast-growing, and diversified information asset that requires new processing modes in order to provide stronger decision-making power, insight discovery, and process optimization capability.

Existing big data occupies a large amount of storage space and consumes excessive memory resources, which increases cost.

Disclosure of Invention

The invention aims to provide a distributed big data cleaning method. The technical scheme can classify and clean big data, reducing the storage resources used and thereby reducing cost.

In a first aspect, a distributed big data cleaning method is provided, comprising:

the terminal acquires the data storage time of the distributed big data and determines data whose storage time is greater than a time threshold as data to be cleaned;

the terminal performs an association operation on first cleaning data in the data to be cleaned, the association operation specifically comprising: searching for the number α of data items associated with the first cleaning data;

the terminal retains the first cleaning data when it determines that the number α is greater than a number threshold, and cleans the first cleaning data when it determines that the number α is less than the number threshold.

In this scheme, the terminal marks data whose storage time exceeds the time threshold as data to be cleaned, searches for the number α of data items associated with the first cleaning data, and retains or cleans the first cleaning data depending on whether α is greater than or less than the number threshold. The scheme can therefore determine from α whether the first cleaning data is still useful, and hence whether it should be cleaned or retained.

Drawings

In order to illustrate the technical solutions in the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present application; those skilled in the art can obtain other drawings from them without creative effort.

Fig. 1 is a schematic diagram of a hardware structure of a terminal according to the present invention.

Fig. 2 is a schematic flow chart of a distributed big data cleaning method provided by the present invention.

Detailed Description

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.

The embodiments of the present application will be described below with reference to the drawings.

The term "and/or" in this application merely describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may mean: A exists alone, A and B exist simultaneously, or B exists alone. In addition, the character "/" in this document indicates that the objects before and after it are in an "or" relationship.

The "plurality" appearing in the embodiments of the present application means two or more. The descriptions of the first, second, etc. appearing in the embodiments of the present application are only for illustrating and differentiating the objects, and do not represent the order or the particular limitation of the number of the devices in the embodiments of the present application, and do not constitute any limitation to the embodiments of the present application. The term "connect" in the embodiments of the present application refers to various connection manners, such as direct connection or indirect connection, to implement communication between devices, which is not limited in this embodiment of the present application.

In the present application, "|" means an absolute value.

Fig. 1 is a hardware block diagram of a terminal provided in an embodiment of the present application, including a processor, a memory, a camera, and a display screen. A plurality of terminals as shown in fig. 1 may constitute a distributed big data system.

Referring to fig. 2, fig. 2 provides a distributed big data cleaning method, which may be applied in a terminal in a distributed big data system, where the terminal may be in the structure shown in fig. 1, and as shown in fig. 2, the method may include the following steps:

Step S201: the terminal acquires the data storage time of the distributed big data and determines data whose storage time is greater than a time threshold as data to be cleaned;

Step S202: the terminal performs an association operation on first cleaning data in the data to be cleaned, the association operation specifically comprising: searching for the number α of data items associated with the first cleaning data;

Step S203: the terminal retains the first cleaning data when it determines that the number α is greater than the number threshold, and cleans the first cleaning data when it determines that the number α is less than the number threshold.
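As a minimal illustrative sketch only, steps S201 to S203 might be expressed in Python as follows. The `Record` type, its field names, the `clean` function, and the `count_associated` callback are all hypothetical names introduced for illustration; the application does not specify them. The application also does not state what happens when α equals the number threshold, so the sketch assumes the data is retained in that case.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Record:
    """Hypothetical stored item; the field names are assumptions."""
    key: str
    stored_seconds: float  # how long the item has been stored


def clean(records: Iterable[Record],
          time_threshold: float,
          number_threshold: int,
          count_associated: Callable[[Record], int]) -> List[Record]:
    """Return the records that survive cleaning (steps S201 to S203)."""
    kept = []
    for rec in records:
        # S201: only data stored longer than the time threshold is a
        # candidate for cleaning; everything else is kept untouched.
        if rec.stored_seconds <= time_threshold:
            kept.append(rec)
            continue
        # S202: look up alpha, the number of associated data items.
        alpha = count_associated(rec)
        # S203: retain when alpha exceeds the threshold, clean when below.
        # Assumption: alpha equal to the threshold is treated as "retain".
        if alpha >= number_threshold:
            kept.append(rec)
    return kept


# Usage with made-up values: "c" is old and has few associated items.
records = [Record("a", 10), Record("b", 100), Record("c", 100)]
associated = {"a": 0, "b": 5, "c": 1}
survivors = clean(records, 50, 3, lambda r: associated[r.key])
print([r.key for r in survivors])  # prints ['a', 'b']
```

Passing the association lookup in as a callback keeps the sketch independent of how α is actually computed for a given data type.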

For example, take picture data. For distributed big data, the more pictures in the distributed big data that are associated with a given picture, the more likely it is that the picture needs to be retained, because the pictures are related to one another, as with the pictures from a tour. Among photos taken in Beijing, if the first picture corresponding to the first cleaning data has a large number of related pictures, the first picture is determined to need storing. Conversely, if the first picture is only an isolated picture, there is little value in retaining it, so it is likely to be deleted. If all, rather than only some, of the pictures related to the first picture have been deleted, the number α of associated data found for the first picture will necessarily be small. The decision on the cleaning data can therefore be made on the basis of this number.

Optionally, if the first cleaning data is a picture, the searching for the number α of the associated data of the cleaning data may specifically include:

determining the first picture of the cleaning data; performing recognition on the first picture to determine whether it contains a person; if so, deleting the person from the first picture to obtain a first background picture; performing classification and recognition on the first background picture to obtain the landmark information corresponding to the first background picture; determining a first address of the first background picture according to the landmark information; querying the pictures stored in the distributed big data storage for pictures matching the first address; and determining the number of matching pictures as α.

The background picture may be obtained by deleting the person using an existing method, and the classification and recognition may likewise use an existing method, such as Baidu AI recognition or Google recognition.
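The search for α described above could be sketched as follows. This is an assumption-laden outline, not the applicant's implementation: person detection, person removal, and landmark recognition are passed in as hypothetical callbacks (`has_person`, `remove_person`, `landmark_to_address`) because the application leaves them to existing methods, and the addresses of the stored pictures are assumed to be known already. The case where the picture contains no person is not covered by the text; the sketch then uses the picture itself as the background.

```python
from typing import Callable, Iterable, Optional, TypeVar

Picture = TypeVar("Picture")


def count_associated_pictures(
    first_picture: Picture,
    stored_addresses: Iterable[str],
    has_person: Callable[[Picture], bool],
    remove_person: Callable[[Picture], Picture],
    landmark_to_address: Callable[[Picture], Optional[str]],
) -> int:
    """Return alpha, the number of stored pictures matching the first address."""
    # Delete the person, if any, to obtain the first background picture.
    # Assumption: a picture without a person is used as its own background.
    if has_person(first_picture):
        background = remove_person(first_picture)
    else:
        background = first_picture
    # Classify the background to obtain a landmark and hence a first address.
    first_address = landmark_to_address(background)
    if first_address is None:
        return 0  # assumption: no recognizable landmark means no matches
    # Count stored pictures whose address matches the first address.
    return sum(1 for address in stored_addresses if address == first_address)


# Usage with toy dictionaries standing in for real pictures.
picture = {"person": True, "landmark": "Sanlitun"}
alpha = count_associated_pictures(
    picture,
    ["Sanlitun", "Sanlitun", "Tsinghua University"],
    has_person=lambda p: p["person"],
    remove_person=lambda p: {**p, "person": False},
    landmark_to_address=lambda p: p.get("landmark"),
)
print(alpha)  # prints 2
```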

Alternatively, taking tour pictures in Beijing as an example, a first picture obtained during the tour, for example a picture taken at Sanlitun, has the person removed and its background is then compared directly. If the pictures stored in the distributed big data storage include a plurality of pictures with the Sanlitun address, the first picture is retained; otherwise, the first picture is cleaned (deleted).

It should be noted that querying the pictures stored in the distributed big data storage for pictures matching the first address specifically includes:

performing face recognition on the person to determine the identity of the person, and determining the resident city of the person according to the identity; if the resident city contains the first address, determining that the pictures whose address matches the first address are the pictures matching the first address; if the resident city does not contain the first address, determining a first city corresponding to the first address, and determining that the pictures matching the first city are the pictures matching the first address.

In this scheme, for a person who lives in Beijing, matching at the city level would be too broad, so the range is narrowed to the specific address. For a tourist, however, the range needs to be expanded to the city. For example, if Zhang San travels from Shenzhen to Beijing, pictures taken at Sanlitun, Tsinghua University, and other Beijing addresses should all be determined to be pictures with a matching address.
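The resident and tourist matching rule can be illustrated with a short Python sketch. The function name `match_pictures`, the `address_to_city` lookup table, and the representation of stored pictures as (identifier, address) pairs are assumptions made for illustration; face recognition and the determination of the resident city are taken as already done.

```python
from typing import Dict, Iterable, List, Tuple


def match_pictures(first_address: str,
                   resident_city: str,
                   address_to_city: Dict[str, str],
                   stored: Iterable[Tuple[str, str]]) -> List[str]:
    """Return identifiers of stored pictures that match the first address.

    `stored` holds (picture_id, address) pairs; `address_to_city` is a
    hypothetical lookup from an address to the city that contains it.
    """
    first_city = address_to_city[first_address]
    if first_city == resident_city:
        # Resident: the resident city contains the first address, so the
        # match is narrowed to the specific address.
        return [pid for pid, addr in stored if addr == first_address]
    # Tourist: the match is widened to every address in the first city.
    return [pid for pid, addr in stored
            if address_to_city.get(addr) == first_city]


# Usage with illustrative data only.
cities = {
    "Sanlitun": "Beijing",
    "Tsinghua University": "Beijing",
    "Window of the World": "Shenzhen",
}
stored = [("p1", "Sanlitun"), ("p2", "Tsinghua University"),
          ("p3", "Window of the World")]
print(match_pictures("Sanlitun", "Beijing", cities, stored))   # prints ['p1']
print(match_pictures("Sanlitun", "Shenzhen", cities, stored))  # prints ['p1', 'p2']
```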

In an optional scheme, if the first cleaning data is reserved, the method further includes:

when the terminal determines that the type of the first cleaning data is picture data, performing a grid operation on each picture in the picture data to obtain operation data. The grid operation may specifically include: establishing grids for a picture (similar to the grid in a Visio drawing, namely a plurality of square grids of equal area); calculating the similarity between every two grids; determining the grids whose similarity is greater than a similarity threshold as a similar grid group; setting a bitmap to represent the positions of the similar grid group in the picture; storing the pixel data of one grid in the similar grid group; and deleting the pixel data of the remaining grids in the group, which completes the grid operation of the picture.

The calculation method of the similarity may specifically include: establishing a three-dimensional matrix of size m × n × 3 for each grid according to the R, G, B values of its pixel points and the positions of the pixel points in the grid, where m represents the length of the three-dimensional matrix, n represents its width, and 3 represents its depth, each depth layer corresponding to one of the R, G, B values; calculating the difference between the two three-dimensional matrices of the two grids to obtain a three-dimensional difference matrix; and counting the number x of elements in the three-dimensional difference matrix whose values are smaller than a pixel threshold value. Then similarity = x / (m × n × 3) × 100%.
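As an illustration of the similarity formula, the following Python sketch compares two grids stored as nested sequences of (R, G, B) values. The name `grid_similarity` and the nested-list representation are assumptions, and the sketch assumes that the difference is taken as an absolute value, consistent with the "|" notation defined earlier.

```python
from typing import Sequence

# An m x n x 3 grid: rows of pixels, each pixel a sequence of R, G, B values.
Grid = Sequence[Sequence[Sequence[int]]]


def grid_similarity(a: Grid, b: Grid, pixel_threshold: int) -> float:
    """Return x / (m * n * 3) * 100, where x counts close R, G, B elements."""
    total = 0  # m * n * 3, the number of elements in the difference matrix
    close = 0  # x, the number of differences below the pixel threshold
    for row_a, row_b in zip(a, b):
        for pixel_a, pixel_b in zip(row_a, row_b):
            for value_a, value_b in zip(pixel_a, pixel_b):
                total += 1
                # Assumption: the absolute difference is compared.
                if abs(value_a - value_b) < pixel_threshold:
                    close += 1
    if total == 0:
        raise ValueError("grids must contain at least one pixel")
    return close / total * 100.0


# Worked example: two 2 x 2 grids that differ in one pixel only, so
# x = 9 of m * n * 3 = 12 elements are within the threshold.
grid_a = [[(10, 10, 10), (20, 20, 20)], [(30, 30, 30), (40, 40, 40)]]
grid_b = [[(10, 10, 10), (20, 20, 20)], [(30, 30, 30), (90, 90, 90)]]
print(grid_similarity(grid_a, grid_b, pixel_threshold=5))  # prints 75.0
```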

After gridding, many grids in a picture may be completely or nearly identical, for example the grids covering a road surface, and storing every one of them clearly wastes storage space. After the similarity calculation, the pixel data of only one grid in each similar grid group is stored, and the positions of the other grids in the group are represented by a bitmap, which reduces the storage space of the picture and hence the storage cost.
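The grouping and bitmap storage might look like the sketch below. It is a simplification under stated assumptions: grids are given in row-major order, each new grid is compared only against the stored representative of each group rather than against every other grid, and the `is_similar` predicate (for example, a similarity value compared with the similarity threshold) is supplied by the caller. The names `deduplicate_grids` and `rebuild` are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

Cell = TypeVar("Cell")


def deduplicate_grids(
    cells: Sequence[Cell],
    is_similar: Callable[[Cell, Cell], bool],
) -> Tuple[List[Cell], List[List[bool]]]:
    """Keep one grid per similar group plus a bitmap of the group's positions."""
    representatives: List[Cell] = []  # pixel data actually stored
    bitmaps: List[List[bool]] = []    # bitmaps[g][i] is True if cell i is in group g
    for index, cell in enumerate(cells):
        for rep, bitmap in zip(representatives, bitmaps):
            if is_similar(rep, cell):
                bitmap[index] = True  # record the position, drop the pixels
                break
        else:
            # No similar group yet: this cell becomes a new representative.
            bitmap = [False] * len(cells)
            bitmap[index] = True
            representatives.append(cell)
            bitmaps.append(bitmap)
    return representatives, bitmaps


def rebuild(representatives: List[Cell],
            bitmaps: List[List[bool]]) -> List[Cell]:
    """Restore the grid sequence; similar grids come back as the representative."""
    size = len(bitmaps[0]) if bitmaps else 0
    cells: list = [None] * size
    for rep, bitmap in zip(representatives, bitmaps):
        for index, flag in enumerate(bitmap):
            if flag:
                cells[index] = rep
    return cells


# Usage with strings standing in for pixel data.
reps, maps = deduplicate_grids(["road", "road", "sky", "road"],
                               lambda a, b: a == b)
print(reps)  # prints ['road', 'sky']
```

Because similar grids are replaced by their representative, the reconstruction is exact only when the grouped grids are identical; with a looser threshold the scheme is lossy.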

In an optional scheme, if the first cleaning data is reserved, the method may further include:

if the terminal determines that the type of the first cleaning data is position coordinates, determining the position coordinate area, and storing a regular area and an irregular area in place of the position coordinates of the area. If the regular area is a circle, the terminal acquires first position coordinates in the middle range of the position coordinate area and takes them as the circle center to be determined; emits β rays over 360 degrees with the circle center to be determined as the ray endpoint; acquires the β intersection points between the β rays and the boundary line of the position coordinate area; calculates the β distances between the β intersection points and the circle center to be determined; and takes the minimum of the β distances as the circle radius to be determined. The terminal then draws a circle with the circle center to be determined as its center and the circle radius to be determined as its radius, randomly extracts w points on the circumference, and acquires the w coordinates of the w points. If the w coordinates all lie within the boundary of the map area, the circle is determined to be the regular area, and the remaining area of the position coordinates within the boundary of the map area, that is, the area within the boundary other than the circle, is the irregular area.

This reduces the amount of position coordinate data to be stored, since massive numbers of individual position coordinates no longer need to be stored.
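The circle construction can be sketched in Python as follows, under several assumptions not stated in the application: the area boundary is given as a polygon (a list of vertices), the β rays are evenly spaced, the candidate center is supplied by the caller, and a fixed random seed is used for repeatability. The names `ray_hit`, `inside`, and `fit_circle` are hypothetical.

```python
import math
import random
from typing import List, Optional, Tuple

Point = Tuple[float, float]


def ray_hit(origin: Point, angle: float, p: Point, q: Point) -> Optional[float]:
    """Distance from origin along the ray at `angle` to segment p-q, or None."""
    dx, dy = math.cos(angle), math.sin(angle)
    ex, ey = q[0] - p[0], q[1] - p[1]
    denom = dx * ey - dy * ex
    if abs(denom) < 1e-12:  # ray is parallel to the segment
        return None
    wx, wy = p[0] - origin[0], p[1] - origin[1]
    t = (wx * ey - wy * ex) / denom  # distance along the ray
    u = (wx * dy - wy * dx) / denom  # position along the segment, 0 to 1
    return t if t >= 0 and 0 <= u <= 1 else None


def inside(point: Point, polygon: List[Point]) -> bool:
    """Even-odd test for whether a point lies within the polygon boundary."""
    x, y = point
    result = False
    for i in range(len(polygon)):
        (x1, y1), (x2, y2) = polygon[i], polygon[(i + 1) % len(polygon)]
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                result = not result
    return result


def fit_circle(polygon: List[Point], center: Point,
               beta: int = 360, w: int = 16, seed: int = 0) -> Optional[float]:
    """Return the radius of the circular regular area, or None if rejected."""
    distances = []
    for k in range(beta):
        angle = 2 * math.pi * k / beta  # assumption: evenly spaced rays
        hits = []
        for i in range(len(polygon)):
            d = ray_hit(center, angle, polygon[i], polygon[(i + 1) % len(polygon)])
            if d is not None:
                hits.append(d)
        if hits:
            distances.append(min(hits))  # nearest boundary point on this ray
    if not distances:
        return None
    radius = min(distances)  # minimum of the beta distances
    rng = random.Random(seed)
    for _ in range(w):  # check w random points on the circumference
        a = rng.uniform(0, 2 * math.pi)
        pt = (center[0] + radius * math.cos(a), center[1] + radius * math.sin(a))
        if not inside(pt, polygon):
            return None
    return radius
```

For a square area with corners at (0, 0) and (10, 10) and a center of (5, 5), the sketch yields a radius of about 5. With too few rays the minimum distance can overestimate the radius, which is the case the check on the w random points is intended to catch.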

The foregoing detailed description of the embodiments of the present application has been presented to illustrate the principles and implementations of the present application, and the above description of the embodiments is only provided to help understand the method and the core concept of the present application; meanwhile, for a person skilled in the art, according to the idea of the present application, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present application.

Claims

1. A distributed big data cleaning method is characterized in that,

2. The method of claim 1, wherein if the first cleaning data is a picture, the searching for the number α of the associated data of the cleaning data specifically comprises:

3. The method according to claim 2, wherein the querying the pictures matching the first address from the pictures in the distributed big data storage specifically comprises:

4. The method of claim 1, wherein if the first cleaning data is retained, the method further comprises:

when the terminal determines that the type of the first cleaning data is picture data, performing a grid operation on each picture in the picture data to obtain operation data; the grid operation may specifically include: establishing grids for a picture, calculating the similarity between every two grids, determining the grids whose similarity is greater than a similarity threshold as a similar grid group, setting a bitmap to indicate the positions of the similar grid group in the picture, storing the pixel data of one grid in the similar grid group, and deleting the pixel data of the remaining grids in the group to complete the grid operation of the picture;

the calculation method of the similarity specifically includes: establishing a three-dimensional matrix of size m × n × 3 for each grid according to the R, G, B values of its pixel points and the positions of the pixel points in the grid, wherein m represents the length of the three-dimensional matrix, n represents its width, and 3 represents its depth, each depth layer corresponding to one of the R, G, B values; calculating the difference between the two three-dimensional matrices of the two grids to obtain a three-dimensional difference matrix; and counting the number x of elements in the three-dimensional difference matrix whose values are smaller than a pixel threshold value, wherein similarity = x / (m × n × 3) × 100%.