CN117493319A - Data deduplication method and device, electronic equipment and storage medium

Publication number
CN117493319A
CN117493319A CN202311433803.1A CN202311433803A CN117493319A CN 117493319 A CN117493319 A CN 117493319A CN 202311433803 A CN202311433803 A CN 202311433803A CN 117493319 A CN117493319 A CN 117493319A
Authority
CN
China
Prior art keywords
signaling data
data
determined
information
determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311433803.1A
Other languages
Chinese (zh)
Inventor
王云朋
穆纯进
姜雨彤
霍勇杰
李振豪
张逸明
郝树运
冯佳佳
茅矛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China United Network Communications Group Co Ltd
Unicom Digital Technology Co Ltd
Original Assignee
China United Network Communications Group Co Ltd
Unicom Digital Technology Co Ltd
Application filed by China United Network Communications Group Co Ltd, Unicom Digital Technology Co Ltd
Priority to CN202311433803.1A
Publication of CN117493319A

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W4/00 Services specially adapted for wireless communication networks; Facilities therefor
    • H04W4/20 Services signaling; Auxiliary data signalling, i.e. transmitting data via a non-traffic channel
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G06F16/24553 Query execution of query operations
    • G06F16/24554 Unary operations; Data partitioning operations
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00 Reducing energy consumption in communication networks
    • Y02D30/70 Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Quality & Reliability (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a data deduplication method and apparatus, an electronic device, and a storage medium. The method comprises the following steps: acquiring a data partition and a consumption task corresponding to the data partition; performing parallel computation on the signaling data according to the data partition and the consumption task, and determining the processed signaling data in the data partition and a comparison identifier of the processed signaling data; grouping the target signaling data according to attribute information in the target signaling data to obtain grouped signaling data; determining primary key information in the grouped signaling data; and determining processing state information of signaling data to be determined according to the comparison identifier and the to-be-determined identifier of the signaling data to be determined, together with the primary key information and the to-be-determined primary key of the signaling data to be determined. The method achieves global deduplication and improves data processing efficiency.

Description

Data deduplication method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a data deduplication method, apparatus, electronic device, and storage medium.
Background
In Internet and telecom operator enterprises, data volumes are large, as seen in user recharge and payment data, user signaling, traffic data and the like. The volume of real-time log information is also large, and a single data source can reach the order of tens of thousands per second, so real-time deduplication for marketing has become a common requirement.
At present, stream data deduplication is implemented mainly by accessing an external database, such as HBase or Redis.
However, when massive data is deduplicated, the external database uses a great deal of memory, which makes it harder for enterprises to control cost. In addition, the deduplication effect is poor and global deduplication cannot be achieved.
Disclosure of Invention
The application provides a data deduplication method and apparatus, an electronic device, and a storage medium, which are used to solve the problems of excessive memory usage and the inability to achieve global deduplication when handling massive data.
In a first aspect, the present application provides a data deduplication method, including:
acquiring a data partition and a consumption task corresponding to the data partition, wherein the data partition comprises signaling data, and the consumption task comprises processed target signaling data in the signaling data;
according to the data partition and the consumption task, carrying out parallel calculation processing on the signaling data, and determining the processed signaling data in the data partition and a comparison identifier of the processed signaling data;
grouping the target signaling data according to the attribute information in the target signaling data to obtain grouped signaling data;
determining primary key information in the grouped signaling data, wherein the primary key information is determined according to attribute information in the grouped signaling data and base station information corresponding to the attribute information;
and determining processing state information of signaling data to be determined according to the comparison identifier and the to-be-determined identifier of the signaling data to be determined, together with the primary key information and the to-be-determined primary key of the signaling data to be determined.
In the embodiment of the application, acquiring the data partition and the consumption task corresponding to the data partition includes:
acquiring signaling data, wherein the signaling data represents signaling data in a target area;
according to a preset alternate transmission sequence, carrying out data partition processing on the signaling data, and determining the data partitions and the signaling data in each data partition;
and obtaining the consumption task according to the determined data partition and the signaling data in the data partition.
In an embodiment of the present application, acquiring signaling data includes:
acquiring an original code stream output by a base station in a target area;
transcoding and information extraction processing are carried out on the original code stream, and a user number, an IMSI, a base station code, a base station longitude and latitude and a signaling event type are obtained;
and obtaining signaling data according to the user number, the IMSI, the base station code, the longitude and latitude of the base station and the signaling event type.
In an embodiment of the present application, performing parallel computation on the signaling data according to the data partition and the consumption task, and determining the processed signaling data in the data partition and the comparison identifier of the processed signaling data, includes:
sequentially numbering the signaling data in the data partition, and determining an identifier of the signaling data;
determining unprocessed information data in the signaling data and identifiers of the unprocessed information data according to the consumption task and the identifiers of the signaling data;
and determining the comparison identifier of the processed signaling data according to the identifier of the signaling data and the identifier of the unprocessed information data.
In the embodiment of the present application, grouping target signaling data according to attribute information in the target signaling data to obtain grouped signaling data includes:
determining attribute information in the target signaling data and a user number in the attribute information;
and carrying out grouping processing on the target signaling data according to the user number in the attribute information to obtain the grouped signaling data.
In this embodiment of the present application, after grouping the target signaling data according to the attribute information in the target signaling data to obtain the grouped signaling data, and before determining the processing state information of the signaling data to be determined according to the comparison identifier and the to-be-determined identifier of the signaling data to be determined, together with the primary key information and the to-be-determined primary key of the signaling data to be determined, the method further includes:
determining a key value of the grouped signaling data, wherein the key value is generated according to the processing time of the grouped signaling data;
generating key-value pair information of the grouped signaling data according to the key value and the primary key information;
and storing the key-value pair information in a key-value pair library, so that the primary key information can be retrieved from the key-value pair library when the step of determining the processing state information of the signaling data to be determined is executed.
In an embodiment of the present application, determining the processing state information of the signaling data to be determined according to the comparison identifier and the to-be-determined identifier of the signaling data to be determined, together with the primary key information and the to-be-determined primary key of the signaling data to be determined, includes:
determining the order relationship between the target identifier and the comparison identifier;
if the order relationship indicates that the target identifier is before or the same as the comparison identifier in sequence, determining that the processing state information of the signaling data to be determined is processed, and storing the key-value pair information of the signaling data to be determined in the key-value pair library;
if the order relationship indicates that the target identifier is after the comparison identifier in sequence, determining the primary key information of the signaling data to be determined;
querying the key-value pair library according to the primary key information of the signaling data to be determined to obtain a query result;
and determining the processing state information of the signaling data to be determined according to the query result.
In this embodiment of the present application, determining, according to a query result, processing state information of signaling data to be determined includes:
if the query result represents that the primary key information of the signaling data to be determined is recorded in the key value pair library, determining that the processing state information of the signaling data to be determined is processed;
if the query result represents that the primary key information of the signaling data to be determined is not recorded in the key value pair library, determining that the processing state information of the signaling data to be determined is unprocessed;
and storing the key value pair information of the signaling data to be determined into a key value pair library, and carrying out service processing on the signaling data to be determined.
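The two-stage decision described above can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the applicant's implementation: the class `DedupDecider`, the enum `State`, and the method `decide` are hypothetical names, identifiers are assumed to be increasing integers, and the key-value pair library is replaced by an in-memory set that models only the recorded primary keys.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch; none of these names come from the source text.
class DedupDecider {
    enum State { PROCESSED, UNPROCESSED }

    // In-memory stand-in for the key-value pair library. Only the recorded
    // primary keys are modelled here (an assumption made for brevity).
    private final Set<String> recordedPrimaryKeys = new HashSet<>();

    State decide(long targetIdentifier, long comparisonIdentifier, String primaryKey) {
        if (targetIdentifier <= comparisonIdentifier) {
            // At or before the comparison identifier: treated as already processed,
            // and its key information is still recorded.
            recordedPrimaryKeys.add(primaryKey);
            return State.PROCESSED;
        }
        // After the comparison identifier: fall back to the primary key lookup.
        if (recordedPrimaryKeys.contains(primaryKey)) {
            return State.PROCESSED;
        }
        recordedPrimaryKeys.add(primaryKey);
        return State.UNPROCESSED; // the caller would now run business processing
    }
}
```

A caller would run business processing only when `decide` returns `UNPROCESSED`; every other record is skipped as a duplicate.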
In a second aspect, the present application provides a data deduplication apparatus, comprising:
the acquisition module is used for acquiring a data partition and a consumption task corresponding to the data partition, wherein the data partition comprises signaling data, and the consumption task comprises processed target signaling data in the signaling data;
the calculation module is used for carrying out parallel calculation processing on the signaling data according to the data partition and the consumption task, and determining the processed signaling data in the data partition and the comparison identifier of the processed signaling data;
the grouping module is used for grouping the target signaling data according to the attribute information in the target signaling data to obtain grouped signaling data;
the first determining module is used for determining primary key information in the grouped signaling data, wherein the primary key information is determined according to attribute information in the grouped signaling data and base station information corresponding to the attribute information;
and the second determining module is used for determining the processing state information of the signaling data to be determined according to the comparison identifier and the to-be-determined identifier of the signaling data to be determined, together with the primary key information and the to-be-determined primary key of the signaling data to be determined.
In a third aspect, the present application provides an electronic device, including: a processor, a memory communicatively coupled to the processor;
the memory stores computer-executable instructions;
the processor executes the computer-executable instructions stored in the memory to implement the data deduplication method of the embodiments of the present application.
In a fourth aspect, the present application provides a computer-readable storage medium storing computer-executable instructions that, when executed by a processor, implement the data deduplication method according to the embodiments of the present application.
According to the data deduplication method and apparatus, the electronic device, and the storage medium provided by the present application, a data partition and a consumption task corresponding to the data partition are acquired, wherein the data partition contains signaling data and the consumption task includes processed target signaling data in the signaling data; parallel computation is performed on the signaling data according to the data partition and the consumption task, and the processed signaling data in the data partition and a comparison identifier of the processed signaling data are determined; the target signaling data is grouped according to attribute information in the target signaling data to obtain grouped signaling data; primary key information in the grouped signaling data is determined, wherein the primary key information is determined according to attribute information in the grouped signaling data and base station information corresponding to the attribute information; and processing state information of signaling data to be determined is determined according to the comparison identifier and the to-be-determined identifier of the signaling data to be determined, together with the primary key information and the to-be-determined primary key of the signaling data to be determined. In this way, once each data partition corresponds to a consumption task, parallel computation can be performed, which ensures maximum data reading performance.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application.
Fig. 1 is a schematic view of a scenario of a data deduplication method according to an embodiment of the present application;
Fig. 2 is a flow chart of a data deduplication method according to an embodiment of the present application;
Fig. 3 is a flow chart of another data deduplication method according to an embodiment of the present application;
Fig. 4 is a schematic structural diagram of a data deduplication apparatus according to an embodiment of the present application;
Fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Specific embodiments thereof have been shown by way of example in the drawings and will herein be described in more detail. These drawings and the written description are not intended to limit the scope of the inventive concepts in any way, but to illustrate the concepts of the present application to those skilled in the art by reference to specific embodiments.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present application as detailed in the accompanying claims.
It should be noted that, the user information (including but not limited to user equipment information, user personal information, etc.) and the data (including but not limited to data for analysis, stored data, presented data, etc.) referred to in the present application are information and data authorized by the user or fully authorized by each party, and the collection, use and processing of the related data need to comply with the related laws and regulations and standards, and provide corresponding operation entries for the user to select authorization or rejection.
In the prior art, real-time stream data is generally deduplicated by means of external storage or the built-in RocksDB. With external storage, deduplicating massive data consumes a great deal of memory; with the built-in RocksDB, cases in which data is repeated cannot be handled, so global deduplication is not achieved. Database-backed deduplication alone therefore cannot deliver low-cost, accurate, and efficient deduplication for massive data.
The present application stores the key-value pair information of the signaling data processed so far and the identifiers of the signaling data already processed by the parallel consumption tasks, and obtains the comparison identifier from the parallel processing progress of the signaling data in the consumption tasks. When signaling data is processed, its processing state information can thus be determined quickly and accurately by comparing the to-be-determined identifier and the primary key information, so that processed signaling data is not processed again and the remaining signaling data is processed and recorded in a targeted manner, thereby achieving data deduplication.
The embodiment of the application provides a data deduplication method, a data deduplication device, electronic equipment and a storage medium.
Fig. 1 is a schematic view of a scenario of a data deduplication method according to an embodiment of the present application. As shown in Fig. 1, the execution body of the data deduplication method may be a server, and the server may be a device such as a mobile phone, a tablet, or a computer. This embodiment does not specifically limit the implementation of the execution body, as long as the execution body can: acquire a data partition and a consumption task corresponding to the data partition, where the data partition includes signaling data and the consumption task includes processed target signaling data in the signaling data; perform parallel computation on the signaling data according to the data partition and the consumption task, and determine the processed signaling data in the data partition and a comparison identifier of the processed signaling data; group the target signaling data according to the attribute information in the target signaling data to obtain grouped signaling data; determine primary key information in the grouped signaling data, where the primary key information is determined according to attribute information in the grouped signaling data and base station information corresponding to the attribute information; and determine the processing state information of the signaling data to be determined according to the comparison identifier and the to-be-determined identifier of the signaling data to be determined, together with the primary key information and the to-be-determined primary key of the signaling data to be determined.
The terms referred to in this application are explained first:
Data deduplication may refer to finding and deleting duplicate data in a collection of data files and retaining only unique data units, thereby eliminating redundant data.
Signaling data may refer to communication data between a user and a transmitting base station or micro station, including call ticket data (call detail records), PS domain signaling data, and CS domain signaling data.
Fig. 2 is a flow chart of a data deduplication method according to an embodiment of the present application. The method may be executed by a server or another server, which is not specifically limited in this embodiment. As shown in Fig. 2, the method may include:
S201, acquiring a data partition and a consumption task corresponding to the data partition, wherein the data partition comprises signaling data, and the consumption task comprises processed target signaling data in the signaling data.
The data partition may refer to storing data into multiple directories of multiple hosts, where each directory may refer to a partition, where the data may refer to user recharging payment data, user signaling, flow data, and the like.
The consumption task may refer to a task generated for the signaling data in a data partition in order to read and process that signaling data. In this embodiment of the present application, the number of consumption tasks may be the same as the number of data partitions, and a consumption task may refer to a computing task in the processing engine Flink. The signaling data in the data partitions may be distributed to the consumption tasks in parallel using the following formula:
int startIndex = ((topicName().hashCode() * 31) & 0x7FFFFFFF) % numSubtasks;
int subTaskId = (startIndex + partitionId()) % numSubtasks;
where partitionId may refer to the Kafka partition ID, subTaskId may refer to the Flink parallel task ID, and numSubtasks may refer to the Flink task parallelism, which in this scenario is equal to the number of Kafka partitions.
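To make the formula above concrete, the following is a minimal runnable sketch. The class name `PartitionAssigner` and the method `assign` are hypothetical, and the topic name and partition ID are passed as plain parameters rather than obtained from Kafka or Flink objects.

```java
// Hypothetical wrapper around the assignment formula quoted in the text.
class PartitionAssigner {
    static int assign(String topicName, int partitionId, int numSubtasks) {
        // The 0x7FFFFFFF mask clears the sign bit so the index is never negative.
        int startIndex = ((topicName.hashCode() * 31) & 0x7FFFFFFF) % numSubtasks;
        return (startIndex + partitionId) % numSubtasks;
    }

    public static void main(String[] args) {
        int numSubtasks = 4; // assumed equal to the number of Kafka partitions
        for (int partitionId = 0; partitionId < numSubtasks; partitionId++) {
            System.out.println("partition " + partitionId + " -> subtask "
                    + assign("signal_011", partitionId, numSubtasks));
        }
    }
}
```

When the parallelism equals the partition count, consecutive partition IDs land on distinct subtasks, so each consumption task reads exactly one partition.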
The processed target signaling data may refer to the signaling data in the data partition that is read to the memory in real time through the service parallel processing framework and is subjected to calculation processing.
In some embodiments, there are at least two pieces of signaling data.
In this embodiment of the present application, obtaining the data partition and the consumption task corresponding to the data partition may include:
acquiring signaling data, wherein the signaling data represents signaling data in a target area;
according to a preset alternate transmission sequence, carrying out data partition processing on the signaling data, and determining the data partitions and the signaling data in each data partition;
and obtaining the consumption task according to the data partition and the signaling data in the data partition.
In some embodiments, to distinguish service categories, the signaling data may be classified by province, with the data of each province written to its own topic. For example, the signaling data of province A and of province B may be written to topics named signal_011 and signal_012 respectively. In subsequent service processing, a service that needs the signaling data of province A or province B can obtain it by reading signal_011 or signal_012.
The preset alternate (rotation) transmission sequence may refer to a preset strategy for distributing signaling data evenly across the data partitions. In this embodiment of the present application, the strategy may be a round-robin strategy, that is, a sequential rotating write strategy: all partitions are sorted in lexicographic order, and the signaling data is then written to the partitions one by one in a polling manner, so that the amount of signaling data in each data partition is substantially the same and memory usage is uniform.
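The round-robin write strategy described above can be illustrated with a small sketch. It is an assumption-laden illustration rather than Kafka's own partitioner: the class `RoundRobinPartitioner` and its methods are hypothetical, and partitions are represented only by their names.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of sequential rotating writes over sorted partitions.
class RoundRobinPartitioner {
    private final List<String> partitions;
    private int next = 0;

    RoundRobinPartitioner(List<String> partitionNames) {
        partitions = new ArrayList<>(partitionNames);
        Collections.sort(partitions); // lexicographic (dictionary) order
    }

    // Returns the partition for the next record, then advances the cursor.
    String nextPartition() {
        String chosen = partitions.get(next);
        next = (next + 1) % partitions.size();
        return chosen;
    }

    // Counts how many of recordCount records each partition would receive.
    Map<String, Integer> distribute(int recordCount) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String p : partitions) {
            counts.put(p, 0);
        }
        for (int i = 0; i < recordCount; i++) {
            counts.merge(nextPartition(), 1, Integer::sum);
        }
        return counts;
    }
}
```

Because the cursor simply cycles through the sorted list, partition sizes differ by at most one record, which is what keeps memory usage uniform.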
In this embodiment of the present application, obtaining signaling data may include:
acquiring an original code stream output by a base station in a target area;
transcoding and information extraction processing are carried out on the original code stream, and a user number, an IMSI, a base station code, a base station longitude and latitude and a signaling event type are obtained;
and obtaining signaling data according to the user number, the IMSI, the base station code, the longitude and latitude of the base station and the signaling event type.
The information extraction processing may refer to associating and filling in the signaling data of different interfaces, extracting information such as the user number, IMSI, base station code, base station longitude and latitude, and signaling event type, cutting the extracted information into fields, selecting the corresponding fields, and splicing them into a complete text record. The resulting complete text record is the signaling data. The signaling event type may refer to information such as voice, short message, and location update.
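As a rough illustration of splicing the selected fields into one text record, consider the sketch below. Everything in it is assumed for illustration: the class `SignalingRecord`, the `|` delimiter, the field order, and the sample values are hypothetical and are not specified by the source.

```java
// Hypothetical holder for the fields named in the text; values are kept as strings.
class SignalingRecord {
    final String userNumber;
    final String imsi;
    final String baseStationCode;
    final String longitude;
    final String latitude;
    final String eventType;

    SignalingRecord(String userNumber, String imsi, String baseStationCode,
                    String longitude, String latitude, String eventType) {
        this.userNumber = userNumber;
        this.imsi = imsi;
        this.baseStationCode = baseStationCode;
        this.longitude = longitude;
        this.latitude = latitude;
        this.eventType = eventType;
    }

    // Splices the selected fields into one delimited text record.
    // The '|' delimiter is an assumption, not taken from the source.
    String toTextRecord() {
        return String.join("|", userNumber, imsi, baseStationCode,
                longitude, latitude, eventType);
    }
}
```

For instance, a record built from made-up values such as user number `13800000000` and event type `LOCATION_UPDATE` would be emitted as a single line with the six fields separated by `|`.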
S202, carrying out parallel calculation processing on the signaling data according to the data partition and the consumption task, and determining the processed signaling data in the data partition and the comparison identifier of the processed signaling data.
Performing parallel computation on the signaling data may mean that the signaling data in the data partitions is distributed to the consumption tasks in parallel by the foregoing method, so that the signaling data in all consumption tasks can be computed simultaneously.
The processed signaling data may refer to signaling data in a data partition for which the computational processing has been completed using a business parallel processing framework.
The comparison identifier may refer to the unique number of the signaling data most recently processed while the consumption task performs sequential computation on the signaling data. In some embodiments, the number may be digits, letters, or any sequence combining the two. The identifier marks the signaling data in a data partition with an ordered sequence, for example the numbers 0, 1, 2, 3 or the letters A, B, C, D, and is incremented by one each time a piece of data is added, so that while the signaling data in a consumption task is being processed, the data currently being read and processed can be determined from the number or letter of the identifier.
In this embodiment of the present application, the identifier may refer to a sequentially incrementing integer determined by the offset mechanism in Kafka. When a Flink task is restarted after a failure, for example at 12:08, the mechanism causes data to be read again from 12:05, so the data in the period from 12:05 to 12:08 is read repeatedly. Therefore, the identifier of the data processed up to 12:08 is recorded; when the data is replayed, data read with an identifier up to the recorded value is recognized as duplicate data and is not computed again, which achieves deduplication. The comparison identifier may be stored in Redis as one identifier record per data partition; for example, if there are 10 data partitions, Redis may store 10 records.
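The replay check described above can be sketched as a per-partition record of the highest processed offset. This is a simplified sketch under assumptions: the class `OffsetTracker` and its methods are hypothetical, and an in-memory map stands in for Redis, so no Redis client calls are shown.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; a HashMap stands in for the Redis records.
class OffsetTracker {
    // partition ID -> highest offset already processed (the comparison identifier)
    private final Map<Integer, Long> processedOffsets = new HashMap<>();

    // True when the offset was already processed, i.e. it is being replayed.
    boolean isReplay(int partition, long offset) {
        Long last = processedOffsets.get(partition);
        return last != null && offset <= last;
    }

    // Records the offset, keeping the maximum seen so far for the partition.
    void markProcessed(int partition, long offset) {
        processedOffsets.merge(partition, offset, Long::max);
    }

    // One record per partition, matching the "10 partitions, 10 records" example.
    int recordCount() {
        return processedOffsets.size();
    }
}
```

After a restart, a consumer would call `isReplay` for each record it reads and skip computation while the method returns true.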
In this embodiment of the present application, performing parallel computing processing on signaling data according to a data partition and a consumption task, and determining processed signaling data in the data partition and a comparison identifier of the processed signaling data may include:
sequentially numbering the signaling data in the data partition, and determining an identifier of the signaling data;
determining unprocessed information data in the signaling data and identifiers of the unprocessed information data according to the consumption task and the identifiers of the signaling data;
and determining the comparison identifier of the processed signaling data according to the identifier of the signaling data and the identifier of the unprocessed information data.
Wherein the identifier may refer to an identifier obtained by sequentially numbering the signaling data using sequentially increasing integers.
S203, grouping the target signaling data according to the attribute information in the target signaling data to obtain grouped signaling data.
The attribute information may refer to feature information contained in the signaling data, including fields such as the mobile phone IMSI (International Mobile Subscriber Identity) number, timestamp, location area number, and event type. In some embodiments, when the target signaling data is grouped, the attribute information used for grouping is determined by the service deduplication requirement and may include the user number, IMSI, base station code, base station longitude and latitude, and signaling event type. In this embodiment, the grouping may be performed according to the user number in the attribute information.
Grouping the target signaling data may refer to assigning target signaling data with the same attribute information to the same group according to a hash formula, so that data with the same attribute information can be deduplicated. For example, in some embodiments, when the service deduplication primary key is the mobile phone number, the grouping may be computed from the user's mobile phone number, ensuring that data with the same number is assigned to the same group for storage and computation.
In this embodiment of the present application, grouping the target signaling data according to attribute information in the target signaling data to obtain grouped signaling data may include:
determining attribute information in the target signaling data and a user number in the attribute information;
and carrying out grouping processing on the target signaling data according to the user number in the attribute information to obtain grouping signaling data.
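The grouping processing may refer to allocating signaling data of the same mobile phone number to the same instance according to the formula int subtaskId = hashCode % numSubtasks, where hashCode is the hash value of the mobile phone number and numSubtasks is the number of parallel instances.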
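The hash-modulo grouping described above can be illustrated with a short sketch. The function name `subtask_for`, the sample records and the use of `zlib.crc32` as the hash function are assumptions made for illustration only; the application does not specify which hash function is used.

```python
import zlib
from typing import Dict, List


def subtask_for(user_number: str, num_subtasks: int) -> int:
    """Map a user number to a subtask index in the range [0, num_subtasks)."""
    # zlib.crc32 is a stand-in hash chosen because it is stable across runs;
    # Python's built-in hash() for strings is randomized per process.
    hash_code = zlib.crc32(user_number.encode("utf-8"))
    return hash_code % num_subtasks


# Hypothetical signaling records; two of them share the same user number.
records = [
    {"user_number": "13800000001", "lac": "4301", "ci": "1"},
    {"user_number": "13800000002", "lac": "4301", "ci": "2"},
    {"user_number": "13800000001", "lac": "4302", "ci": "3"},
]

groups: Dict[int, List[dict]] = {}
for record in records:
    groups.setdefault(subtask_for(record["user_number"], 4), []).append(record)
```

Because the subtask index depends only on the user number, the first and third records always land in the same group, whichever base station reported them.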
S204, determining the primary key information in the grouped signaling data, wherein the primary key information is determined according to the attribute information in the grouped signaling data and the base station information corresponding to the attribute information.
The primary key information may refer to the attribute information of the signaling data, determined according to the service deduplication requirement, together with the corresponding base station information. In some embodiments, the primary key information may be attribute information + base station LAC + base station CI. The base station LAC (Location Area Code) may refer to the administrative area where the base station is located. The base station CI may refer to a sector served by an antenna responsible for both transmitting and receiving (or by a pair of antennas, one solely responsible for transmitting and one solely responsible for receiving); the units digits of the CI numbers under the same base station are consecutive, generally 1, 2 and 3, and indicate the direction the sector faces. In this embodiment, when the service deduplication key is the user's mobile phone number, the primary key information is mobile phone number + base station LAC + base station CI.
In this embodiment of the present application, after grouping the target signaling data according to the attribute information in the target signaling data to obtain grouped signaling data, and before determining the processing state information of the signaling data to be determined according to the comparison identifier and the identifier to be determined of the signaling data to be determined, and the primary key information and the primary key to be determined of the signaling data to be determined, the method may further include:
determining a key value of the grouped signaling data, wherein the key value is generated according to the processing time of the grouped signaling data;
generating key value pair information of the grouped signaling data according to the key value and the primary key information;
and storing the key value pair information into a key value pair library, so that the primary key information can be called from the key value pair library when executing the step of determining the processing state information of the signaling data to be determined according to the comparison identifier and the identifier to be determined of the signaling data to be determined, and the primary key information and the primary key to be determined of the signaling data to be determined.
The key value (value) may refer to information used for auxiliary judgment. In the embodiment of the present application, the key value may refer to the data processing time, which assists in determining whether the data has been processed.
The key value pair may refer to combined information composed of the primary key information (key) and the key value (value). In the embodiment of the present application, the key value pair may be key: mobile phone number + base station LAC + base station CI, value: time; the basic information of the signaling data can be confirmed according to the key value pair.
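As an illustration of the key value pair described above, the following sketch builds a key from the mobile phone number, base station LAC and base station CI, and uses the processing time as the value. The function name `make_key_value`, the "|" separator and the use of a Unix timestamp for the time are assumptions; the application only states which fields make up the key and the value.

```python
import time
from typing import Optional, Tuple


def make_key_value(user_number: str, lac: str, ci: str,
                   processing_time: Optional[float] = None) -> Tuple[str, float]:
    """Build (key, value) = (number + LAC + CI, processing time)."""
    # The "|" separator is an assumption; it keeps the three fields from
    # running together when they are concatenated.
    key = "|".join((user_number, lac, ci))
    # If no time is supplied, the current time is taken as the processing time.
    value = time.time() if processing_time is None else processing_time
    return key, value


key, value = make_key_value("13800000001", "4301", "1", 1700000000.0)
print(key, value)
```

Two records from the same user at the same base station sector therefore produce the same key, which is what allows a later lookup to recognize the second one as a repeat.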
S205, determining the processing state information of the signaling data to be determined according to the comparison identifier and the identifier to be determined of the signaling data to be determined, and the primary key information and the primary key to be determined of the signaling data to be determined.
The signaling data to be determined may refer to signaling data received by a task. Determining the processing state information of the signaling data to be determined may refer to determining whether the data has been processed by using the identifier of the processed data together with the stored service primary key, so as to perform deduplication.
For example, in some embodiments, a primary key to be determined may be spliced by selecting part of attribute information in the signaling, and then processing state information of the data may be determined according to a combined action of a to-be-determined identifier and a to-be-determined primary key of the signaling data.
In this embodiment of the present application, determining the processing state information of the signaling data to be determined according to the comparison identifier and the identifier to be determined of the signaling data to be determined, and the primary key information and the primary key to be determined of the signaling data to be determined, may include:
determining the sequential relationship between the target identifier and the comparison identifier;
if the sequence relation characterizes that the sequence corresponding to the target identifier is before or the same as the sequence corresponding to the comparison identifier, determining that the processing state information of the signaling data to be determined is processed, and storing the key value pair information of the signaling data to be determined into a key value pair library;
if the sequence relation characterizes that the sequence corresponding to the target identifier is behind the sequence corresponding to the comparison identifier, determining the primary key information of the signaling data to be determined;
inquiring a key value pair library according to the primary key information of the signaling data to be determined to obtain an inquiry result;
and determining the processing state information of the signaling data to be determined according to the query result.
The sequential relationship may refer to the order between the signaling data to be determined and the processed signaling data, determined after the data has been numbered in ascending order, and is used to determine the processing state information of the signaling data to be determined. For example, in some embodiments, when the comparison identifier is 5 and the identifier to be determined is 3, the processing state of the signaling data to be determined is determined to be processed; when the comparison identifier is 5 and the identifier to be determined is 7, the processing state of the signaling data to be determined is determined, on the basis of the identifiers, to be unprocessed.
Querying the key value pair library may refer to extracting part of the attribute information of the data to form a primary key and querying the key value pair library with it, after the signaling data to be determined has been found to be unprocessed according to the identifier to be determined and the comparison identifier. Since the key value pairs stored in the key value pair library all belong to processed signaling data, the processing state of the signaling data to be determined can be determined according to whether its primary key information is stored in the key value pair library. Only signaling data to be determined that is in the unprocessed state is calculated, which prevents processed signaling data from being processed repeatedly.
In this embodiment of the present application, determining, according to the query result, processing state information of signaling data to be determined may include:
if the query result represents that the primary key information of the signaling data to be determined is recorded in the key value pair library, determining that the processing state information of the signaling data to be determined is processed;
if the query result represents that the primary key information of the signaling data to be determined is not recorded in the key value pair library, determining that the processing state information of the signaling data to be determined is unprocessed;
and storing the key value pair information of the signaling data to be determined into a key value pair library, and carrying out service processing on the signaling data to be determined.
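The two-stage decision described above (identifier comparison first, then a lookup in the key value pair library) can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `handle` is hypothetical, and a plain Python dict stands in for the key value pair library.

```python
from typing import Dict


def handle(record_id: int, primary_key: str, processing_time: float,
           comparison_id: int, kv_store: Dict[str, float]) -> str:
    """Return "processed" or "unprocessed" for one piece of signaling data."""
    # Stage 1: an identifier at or before the comparison identifier means the
    # data was already processed; its key value pair is stored in the library.
    if record_id <= comparison_id:
        kv_store[primary_key] = processing_time
        return "processed"
    # Stage 2: the identifier is after the comparison identifier, so the
    # library is queried with the primary key.
    if primary_key in kv_store:
        return "processed"  # same primary key seen before: repeated data
    # Not found: store the pair; the caller then performs service processing.
    kv_store[primary_key] = processing_time
    return "unprocessed"


store: Dict[str, float] = {}
print(handle(3, "13800000001|4301|1", 1.0, 5, store))  # identifier 3 <= 5
print(handle(7, "13800000002|4301|2", 2.0, 5, store))  # new primary key
print(handle(8, "13800000002|4301|2", 3.0, 5, store))  # repeated primary key
```

Only the second call returns "unprocessed"; the first is caught by the identifier comparison and the third by the primary key lookup.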
The data deduplication method provided by the embodiment of the present application can perform parallel computing processing after the data partitions are placed in correspondence with the consumption tasks, which ensures maximum data reading performance. Because identifiers of the signaling data are recorded for each data partition, the processed signaling data can be determined from the recorded identifier of the processed data; this prevents data from being read repeatedly when a task is restarted after a failure, so that restarts are deduplicated, that is, signaling data is not issued repeatedly. Meanwhile, the signaling data is grouped according to the attribute information, so that signaling data with the same attribute information is deduplicated. Finally, because the processing state of the task data can be confirmed through the combined action of the processed primary key information and the identifier, whether the task data needs to be processed can be determined quickly, which further improves the processing efficiency of the signaling data and achieves global deduplication without using excessive memory resources.
Fig. 3 is a flow chart of another data deduplication method according to an embodiment of the present application, as shown in fig. 3, the method includes:
S301, acquiring signaling data, and transmitting the signaling data in real time through the message queue Kafka, wherein signaling data of the same province is written into the same Kafka topic.
Wherein, acquiring signaling data may refer to signaling data collection using a base station data acquisition device.
Writing signaling data of the same province into the same Kafka topic may mean that each topic in Kafka further has partitions, and the signaling data is written into the partitions in turn using a round-robin strategy, so as to ensure that the data volume of each partition in Kafka is uniform.
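The effect of the round-robin strategy can be shown with a small sketch that does not use any Kafka API. The function name `round_robin` and the message names are hypothetical; the sketch only demonstrates that writing in turn keeps the partition sizes within one message of each other.

```python
from itertools import cycle
from typing import Dict, Iterable, List


def round_robin(messages: Iterable[str], num_partitions: int) -> Dict[int, List[str]]:
    """Assign messages to partitions 0..num_partitions-1 in turn."""
    partitions: Dict[int, List[str]] = {p: [] for p in range(num_partitions)}
    # cycle() yields 0, 1, ..., num_partitions-1, 0, 1, ... indefinitely;
    # zip() stops as soon as the messages run out.
    for partition_id, message in zip(cycle(range(num_partitions)), messages):
        partitions[partition_id].append(message)
    return partitions


parts = round_robin([f"m{i}" for i in range(7)], 3)
print({p: len(msgs) for p, msgs in parts.items()})
```

With 7 messages and 3 partitions the sizes are 3, 2 and 2, and the order of messages inside each partition is preserved.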
S302, Flink consumes the Kafka data in real time, and Redis records the offset of the processed consumption data.
Flink consuming the Kafka data in real time may mean that the parallelism of the Flink consumption task is kept equal to the number of Kafka partitions, and the data in Kafka is then read into memory in real time by the Flink framework for calculation processing; that is, one parallel task reads the data of one partition.
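The one-task-per-partition arrangement can be mimicked outside of Flink with a thread pool whose size equals the number of partitions. This is only an analogy for illustration: the names `consume_partition` and `consume_all` are hypothetical, and counting records stands in for the actual calculation processing.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def consume_partition(partition_id: int, records: List[str]) -> int:
    """Stand-in for the per-partition calculation: count the records read."""
    return len(records)


def consume_all(partitions: Dict[int, List[str]]) -> Dict[int, int]:
    """Run one worker per partition so parallelism equals the partition count."""
    workers = max(1, len(partitions))  # ThreadPoolExecutor needs at least 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pid: pool.submit(consume_partition, pid, recs)
                   for pid, recs in partitions.items()}
        return {pid: future.result() for pid, future in futures.items()}


print(consume_all({0: ["a", "b"], 1: ["c"]}))
```

Keeping the two numbers equal means no partition waits for a free task and no task sits idle without a partition to read.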
S303, judging whether to process according to the offset.
Judging whether to process according to the offset may refer to recording the offset of the consumed data using Redis and determining whether the data has been processed according to the recorded offset. In this embodiment, the information recorded in Redis may be key: topic name + partition ID + consumer group, value: offset. If the offset of the data is less than or equal to the offset recorded in Redis, the data is marked as processed; otherwise, the data is marked as unprocessed.
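The offset record and the comparison against it can be sketched as follows, replaying a range of offsets after a simulated restart. A plain dict stands in for Redis, and the function names, the ":" separator, the topic name `signaling_province_a` and the consumer group name `dedup_group` are all assumptions made for illustration.

```python
from typing import Dict, List


def offset_key(topic: str, partition_id: int, consumer_group: str) -> str:
    """Build the record key: topic name + partition ID + consumer group."""
    # The ":" separator is an assumption made for readability.
    return f"{topic}:{partition_id}:{consumer_group}"


def needs_primary_key_check(offset: int, key: str,
                            offset_store: Dict[str, int]) -> bool:
    """True if the offset is greater than the recorded processed offset."""
    recorded = offset_store.get(key, -1)  # -1: nothing recorded yet
    return offset > recorded


# A plain dict stands in for Redis in this sketch.
offset_store: Dict[str, int] = {}
key = offset_key("signaling_province_a", 0, "dedup_group")
offset_store[key] = 8  # offsets up to 8 were processed before the failure

# After the restart, the task re-reads offsets 5..10 from the partition.
replayed: List[int] = list(range(5, 11))
still_open = [o for o in replayed if needs_primary_key_check(o, key, offset_store)]
print(still_open)
```

Offsets 5 to 8 are marked as processed by the comparison alone; only offsets 9 and 10 go on to the primary key lookup.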
S304, if yes, determining that the signaling data is processed.
Determining that the signaling data has been processed may refer to determining, by comparing offset values, that the offset of the data is less than or equal to the offset recorded in Redis, determining that the data has been processed, and storing the key and value of the signaling data in RocksDB, wherein the information stored in RocksDB may be key: mobile phone number + base station LAC + base station CI, value: time.
S305, if not, judging whether the key is stored in the RocksDB.
The determining whether the key is stored in the RocksDB may refer to querying the RocksDB using a primary key if it is determined that the signaling data is not processed according to the offset.
S306, if the key is stored in the RocksDB, determining that the signaling data is processed, and recording the signaling data as the repeated data without processing.
S307, if the key is not stored in the RocksDB, determining that the signaling data is not processed, and performing service calculation on the data.
S308, storing the key and the value of the signaling data into the RocksDB.
After the service calculation is performed on the signaling data, the key and the value of the signaling data can be stored in the RocksDB, so that the key value information stored in the RocksDB is continuously increased, and the result of judging the service processing state of the signaling data to be determined is more accurate.
Fig. 4 is a schematic structural diagram of a data deduplication apparatus according to an embodiment of the present application. As shown in fig. 4, the data deduplication apparatus 40 includes: an acquisition module 401, a calculation module 402, a grouping module 403, a first determination module 404, a second determination module 405. Wherein:
the acquiring module 401 is configured to acquire a data partition and a consumption task corresponding to the data partition, where the data partition includes signaling data, and the consumption task includes processed target signaling data in the signaling data;
a calculation module 402, configured to perform parallel calculation processing on the signaling data according to the data partition and the consumption task, and determine processed signaling data in the data partition and a comparison identifier of the processed signaling data;
a grouping module 403, configured to group the target signaling data according to the attribute information in the target signaling data, so as to obtain grouped signaling data;
a first determining module 404, configured to determine primary key information in the grouped signaling data, where the primary key information is determined according to attribute information in the grouped signaling data and base station information corresponding to the attribute information;
the second determining module 405 is configured to determine processing state information of the signaling data to be determined according to the comparison identifier and the identifier to be determined of the signaling data to be determined, and the primary key information and the primary key to be determined of the signaling data to be determined.
In the embodiment of the present application, the obtaining module 401 may also be configured to:
acquiring signaling data, wherein the signaling data represents signaling data in a target area;
according to a preset alternate transmission sequence, carrying out data partition processing on the signaling data, and determining the signaling data in the data partition and the data partition;
and obtaining the consumption task according to the data partition and the signaling data in the data partition.
In the embodiment of the present application, the obtaining module 401 may also be configured to:
acquiring an original code stream output by a base station in a target area;
transcoding and information extraction processing are carried out on the original code stream, and a user number, an IMSI, a base station code, a base station longitude and latitude and a signaling event type are obtained;
and obtaining signaling data according to the user number, the IMSI, the base station code, the longitude and latitude of the base station and the signaling event type.
In the present embodiment, the computing module 402 may also be configured to:
sequentially numbering the signaling data in the data partition, and determining an identifier of the signaling data;
determining unprocessed information data in the signaling data and identifiers of the unprocessed information data according to the consumption task and the identifiers of the signaling data;
and determining the comparison identifier of the processed signaling data according to the identifier of the signaling data and the identifier of the unprocessed information data.
In the embodiment of the present application, the grouping module 403 may be further configured to:
determining attribute information in the target signaling data and a user number in the attribute information;
and carrying out grouping processing on the target signaling data according to the user number in the attribute information to obtain grouping signaling data.
In the embodiment of the present application, the first determining module 404 may also be configured to:
determining a key value of the grouped signaling data, wherein the key value is generated according to the processing time of the grouped signaling data;
generating key value pair information of the grouped signaling data according to the key value and the primary key information;
and storing the key value pair information into a key value pair library, so that the primary key information can be called from the key value pair library when executing the step of determining the processing state information of the signaling data to be determined according to the comparison identifier and the identifier to be determined of the signaling data to be determined, and the primary key information and the primary key to be determined of the signaling data to be determined.
In the embodiment of the present application, the second determining module 405 may further be configured to:
determining the sequential relationship between the target identifier and the comparison identifier;
if the sequence relation characterizes that the sequence corresponding to the target identifier is before or the same as the sequence corresponding to the comparison identifier, determining that the processing state information of the signaling data to be determined is processed, and storing the key value pair information of the signaling data to be determined into a key value pair library;
if the sequence relation characterizes that the sequence corresponding to the target identifier is behind the sequence corresponding to the comparison identifier, determining the primary key information of the signaling data to be determined;
inquiring a key value pair library according to the primary key information of the signaling data to be determined to obtain an inquiry result;
and determining the processing state information of the signaling data to be determined according to the query result.
In the embodiment of the present application, the second determining module 405 may further be configured to:
if the query result represents that the primary key information of the signaling data to be determined is recorded in the key value pair library, determining that the processing state information of the signaling data to be determined is processed;
if the query result represents that the primary key information of the signaling data to be determined is not recorded in the key value pair library, determining that the processing state information of the signaling data to be determined is unprocessed;
and storing the key value pair information of the signaling data to be determined into a key value pair library, and carrying out service processing on the signaling data to be determined.
As can be seen from the above, in the data deduplication device of this embodiment, the obtaining module 401 obtains a data partition and a consumption task corresponding to the data partition, where the data partition includes signaling data, and the consumption task includes processed target signaling data in the signaling data; the calculation module 402 performs parallel calculation processing on the signaling data according to the data partition and the consumption task, and determines processed signaling data in the data partition and a comparison identifier of the processed signaling data; the grouping module 403 groups the target signaling data according to the attribute information in the target signaling data to obtain grouped signaling data; the first determining module 404 determines primary key information in the grouped signaling data, where the primary key information is determined according to attribute information in the grouped signaling data and base station information corresponding to the attribute information; and the second determining module 405 determines processing state information of the signaling data to be determined according to the comparison identifier and the identifier to be determined of the signaling data to be determined, and the primary key information and the primary key to be determined of the signaling data to be determined.
Therefore, the embodiment of the present application can perform parallel computing processing after the data partitions are placed in correspondence with the consumption tasks, which ensures maximum data reading performance. Because identifiers of the signaling data are recorded for each data partition, the processed signaling data can be determined from the recorded identifier of the processed data; this prevents data from being read repeatedly when a task is restarted after a failure, so that restarts are deduplicated and signaling data is not issued repeatedly. Meanwhile, the signaling data is grouped according to the attribute information, so that signaling data with the same attribute information is deduplicated. Finally, because the processing state of the task data can be confirmed through the combined action of the processed primary key information and the identifier, whether the task data needs to be processed can be determined quickly, which improves the processing efficiency of the signaling data and achieves global deduplication without using excessive memory resources.
Fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application. As shown in fig. 5, the electronic device 50 may include a processor 501 having one or more processing cores, a memory 502 comprising one or more computer-readable storage media, a communication component 503, and the like. The processor 501, the memory 502 and the communication component 503 are connected via a bus 504.
In a specific implementation, at least one processor 501 executes computer-executable instructions stored in memory 502, causing at least one processor 501 to perform a data deduplication method as described above.
The specific implementation process of the processor 501 may refer to the above-mentioned method embodiment, and its implementation principle and technical effects are similar, and this embodiment will not be described herein again.
In the embodiment shown in fig. 5, it should be understood that the processor may be a central processing unit (Central Processing Unit, CPU), or may be another general purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), or the like. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the present application may be embodied directly as being executed by a hardware processor, or executed by a combination of hardware and software modules in a processor.
The memory may comprise a high-speed random access memory (Random Access Memory, RAM), and may further comprise a non-volatile memory (Non-Volatile Memory, NVM), such as at least one disk memory.
The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The buses may be divided into address buses, data buses, control buses, etc. For ease of illustration, the buses in the drawings of the present application are not limited to only one bus or one type of bus.
In some embodiments, a computer program product is also presented, comprising a computer program or instructions which, when executed by a processor, implement the steps of any of the data deduplication methods described above.
The specific implementation of each operation above may be referred to the previous embodiments, and will not be described herein.
Those of ordinary skill in the art will appreciate that all or a portion of the steps of the various methods of the above embodiments may be performed by instructions, or by instructions controlling associated hardware, which may be stored in a computer-readable storage medium and loaded and executed by a processor.
To this end, embodiments of the present application provide a computer readable storage medium having stored therein a plurality of instructions capable of being loaded by a processor to perform steps in any of the data deduplication methods provided by embodiments of the present application.
Wherein the storage medium may include: a read-only memory (ROM, Read Only Memory), a random access memory (RAM, Random Access Memory), a magnetic disk, an optical disk, or the like.
According to one aspect of the present application, there is provided a computer program product or computer program comprising computer instructions stored in a computer readable storage medium.
Because the instructions stored in the storage medium can perform the steps in any of the data deduplication methods provided in the embodiments of the present application, they can achieve the beneficial effects achievable by any of those methods; for details, refer to the foregoing embodiments, which are not repeated here.
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the application following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the application pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.
It is to be understood that the present application is not limited to the precise arrangements and instrumentalities shown in the drawings, which have been described above, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims (11)

1. A method for deduplicating data, applied to a data deduplication system, the method comprising:
acquiring a data partition and a consumption task corresponding to the data partition, wherein the data partition comprises signaling data, and the consumption task comprises processed target signaling data in the signaling data;
according to the data partition and the consumption task, carrying out parallel calculation processing on the signaling data, and determining processed signaling data in the data partition and a comparison identifier of the processed signaling data;
grouping the target signaling data according to the attribute information in the target signaling data to obtain grouped signaling data;
determining primary key information in the grouped signaling data, wherein the primary key information is determined according to attribute information in the grouped signaling data and base station information corresponding to the attribute information;
and determining the processing state information of the signaling data to be determined according to the comparison identifier and the identifier to be determined of the signaling data to be determined, and the primary key information and the primary key to be determined of the signaling data to be determined.
2. The method of claim 1, wherein the acquiring the data partition and the consumption task corresponding to the data partition comprises:
acquiring signaling data, wherein the signaling data represents signaling data in a target area;
according to a preset alternate transmission sequence, carrying out data partition processing on the signaling data, and determining the signaling data in the data partition and the data partition;
and obtaining the consumption task according to the data partition and the signaling data in the data partition.
3. The method of claim 2, wherein the acquiring signaling data comprises:
acquiring an original code stream output by a base station in the target area;
transcoding and information extraction processing are carried out on the original code stream, and a user number, an IMSI, a base station code, a base station longitude and latitude and a signaling event type are obtained;
and obtaining the signaling data according to the user number, the IMSI, the base station code, the longitude and latitude of the base station and the signaling event type.
4. The method of claim 1, wherein performing parallel calculation processing on the signaling data according to the data partition and the consumption task, and determining processed signaling data in the data partition and a comparison identifier of the processed signaling data, comprises:
sequentially numbering the signaling data in the data partition, and determining an identifier of the signaling data;
determining unprocessed information data in the signaling data and an identifier of the unprocessed information data according to the consumption task and the identifier of the signaling data;
and determining the comparison identifier of the processed signaling data according to the identifier of the signaling data and the identifier of the unprocessed information data.
5. The method according to claim 1, wherein said grouping the target signaling data according to the attribute information in the target signaling data to obtain grouped signaling data includes:
determining attribute information in the target signaling data and a user number in the attribute information;
and carrying out grouping processing on the target signaling data according to the user number in the attribute information to obtain grouping signaling data.
6. The method according to claim 1, wherein after grouping the target signaling data according to the attribute information in the target signaling data to obtain grouped signaling data, before determining the processing state information of the signaling data to be determined according to the comparison identifier and the identifier to be determined of the signaling data to be determined, and the primary key information and the primary key to be determined of the signaling data to be determined, the method further comprises:
determining a key value of the grouped signaling data, wherein the key value is generated according to the processing time of the grouped signaling data;
generating key value pair information of the grouped signaling data according to the key value and the primary key information;
and storing the key value pair information into a key value pair library, so that the primary key information can be retrieved from the key value pair library when executing the step of determining the processing state information of the signaling data to be determined according to the comparison identifier and the identifier to be determined of the signaling data to be determined, and the primary key information and the primary key to be determined of the signaling data to be determined.
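A minimal sketch of the storage step in claim 6 follows. The claim does not say which element of the pair serves as the lookup key, so this sketch assumes the primary key information is the lookup key and the time-derived key value (milliseconds) is stored against it. The in-memory dict standing in for the key value pair library, the `user_number|base_station` primary key format, and both function names are hypothetical.

```python
import time
from typing import Dict, Optional


def make_primary_key(user_number: str, base_station: str) -> str:
    """Build primary key information from attribute and base station info (assumed format)."""
    return f"{user_number}|{base_station}"


def store_primary_key(
    kv_library: Dict[str, int],
    primary_key: str,
    processing_time: Optional[float] = None,
) -> int:
    """Store a (primary key, time-derived key value) pair and return the key value."""
    if processing_time is None:
        # Default to the current wall-clock time in seconds.
        processing_time = time.time()
    # Key value generated according to the processing time, in milliseconds.
    key_value = int(processing_time * 1000)
    kv_library[primary_key] = key_value
    return key_value
```

A time-derived value of this kind could also support expiring old entries, although the claim itself does not describe expiry.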
7. The method according to claim 1, wherein the determining the processing state information of the signaling data to be determined according to the comparison identifier and the identifier to be determined of the signaling data to be determined, and the primary key information and the primary key to be determined of the signaling data to be determined, comprises:
determining a sequential relationship between the target identifier and the comparison identifier;
if the sequence relation characterizes that the sequence corresponding to the target identifier is before or the same as the sequence corresponding to the comparison identifier, determining that the processing state information of the signaling data to be determined is processed, and storing the key value pair information of the signaling data to be determined into a key value pair library;
if the sequence relation characterizes that the sequence corresponding to the target identifier is behind the sequence corresponding to the comparison identifier, determining the primary key information of the signaling data to be determined;
inquiring the key value pair library according to the primary key information of the signaling data to be determined to obtain an inquiry result;
and determining the processing state information of the signaling data to be determined according to the query result.
8. The method according to claim 7, wherein the determining the processing state information of the signaling data to be determined according to the query result includes:
if the query result represents that the primary key information of the signaling data to be determined is recorded in the key value pair library, determining that the processing state information of the signaling data to be determined is processed;
if the query result represents that the primary key information of the signaling data to be determined is not recorded in the key value pair library, determining that the processing state information of the signaling data to be determined is unprocessed;
and storing the key value pair information of the signaling data to be determined into the key value pair library, and carrying out service processing on the signaling data to be determined.
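The decision procedure of claims 7 and 8 can be sketched as a single function. This is an illustration under assumptions: identifiers are integers that compare by order, the key value pair library is modeled as an in-memory set of primary keys, and the name `processing_state` is hypothetical. Service processing itself is left to the caller.

```python
from typing import Optional, Set


def processing_state(
    record_id: int,
    primary_key: str,
    comparison_id: Optional[int],
    kv_library: Set[str],
) -> str:
    """Return "processed" or "unprocessed" for one record to be determined."""
    # Claim 7, first branch: an identifier at or before the comparison
    # identifier marks the record as processed; its key is still recorded.
    if comparison_id is not None and record_id <= comparison_id:
        kv_library.add(primary_key)
        return "processed"
    # Claim 7, second branch: the identifier is after the comparison
    # identifier, so the library is queried by primary key.
    if primary_key in kv_library:
        # Claim 8: key already recorded, so the record is a duplicate.
        return "processed"
    # Claim 8: key not recorded; store it and report the record as
    # unprocessed so the caller can carry out service processing.
    kv_library.add(primary_key)
    return "unprocessed"
```

A caller would run its service logic only when the function returns "unprocessed"; a second record with the same primary key then returns "processed" and is dropped.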
9. A data deduplication apparatus, the apparatus comprising:
the acquisition module is used for acquiring a data partition and a consumption task corresponding to the data partition, wherein the data partition comprises signaling data, and the consumption task comprises processed target signaling data in the signaling data;
the calculation module is used for carrying out parallel calculation processing on the signaling data according to the data partition and the consumption task, and determining processed signaling data in the data partition and a comparison identifier of the processed signaling data;
the grouping module is used for grouping the target signaling data according to the attribute information in the target signaling data to obtain grouped signaling data;
a first determining module, configured to determine primary key information in the grouped signaling data, where the primary key information is determined according to attribute information in the grouped signaling data and base station information corresponding to the attribute information;
and the second determining module is used for determining the processing state information of the signaling data to be determined according to the comparison identifier and the identifier to be determined of the signaling data to be determined, and the primary key information and the primary key to be determined of the signaling data to be determined.
10. An electronic device, comprising: a processor, and a memory communicatively coupled to the processor;
the memory stores computer-executable instructions;
the processor executes computer-executable instructions stored in the memory to implement the data deduplication method of any of claims 1 to 8.
11. A computer readable storage medium having stored therein computer executable instructions which, when executed by a processor, implement the data deduplication method as claimed in any of claims 1 to 8.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311433803.1A CN117493319A (en) 2023-10-31 2023-10-31 Data deduplication method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN117493319A (en) 2024-02-02

Family

ID=89682053

Family Applications (1)

Application Number Status Publication Number Priority Date Filing Date Title
CN202311433803.1A Pending CN117493319A (en) 2023-10-31 2023-10-31 Data deduplication method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117493319A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117892727A (en) * 2024-03-14 2024-04-16 中国电子科技集团公司第三十研究所 Real-time text data stream deduplication system and method
CN117892727B (en) * 2024-03-14 2024-05-17 中国电子科技集团公司第三十研究所 Real-time text data stream deduplication system and method

Similar Documents

Publication Publication Date Title
CN111352902A (en) Log processing method and device, terminal equipment and storage medium
CN108599973B (en) Log association method, device and equipment
CN117493319A (en) Data deduplication method and device, electronic equipment and storage medium
CN110188103A (en) Data account checking method, device, equipment and storage medium
CN113094434A (en) Database synchronization method, system, device, electronic equipment and medium
CN111815454B (en) Data uplink method and device, electronic equipment and storage medium
US20230030856A1 (en) Distributed table storage processing method, device and system
CN109992469B (en) Method and device for merging logs
US10719497B2 (en) Utilization of optimized ordered metadata structure for container-based large-scale distributed storage
US11341159B2 (en) In-stream data load in a replication environment
CN113282347B (en) Plug-in operation method, device, equipment and storage medium
CN113419792A (en) Event processing method and device, terminal equipment and storage medium
CN112527841A (en) Stream data merging processing method and device
CN111475291A (en) Data processing method, system, server and medium
CN114238419B (en) Data caching method and device based on multi-tenant SaaS application system
CN109446226B (en) Method and equipment for determining data set
CN113486627B (en) Single number generation method and device and electronic equipment
CN117422556B (en) Derivative transaction system, device and computer medium based on replication state machine
CN115103020B (en) Data migration processing method and device
CN114625595B (en) Method, device and system for rechecking dynamic configuration information of service system
CN110134691B (en) Data verification method, device, equipment and medium
CN117493463A (en) Data synchronization method, device, equipment and storage medium
CN110209666B (en) data storage method and terminal equipment
CN117271445A (en) Log data processing method, device, server, storage medium and program product
CN116737417A (en) Data synchronization method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination