CN113934767A - Data processing method and device, computer equipment and storage medium - Google Patents

Publication number
CN113934767A
Authority
CN
China
Prior art keywords
data
processed
initial
bit array
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111240901.4A
Other languages
Chinese (zh)
Inventor
石志林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202111240901.4A
Publication of CN113934767A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2457 Query processing with adaptation to user needs
    • G06F16/24573 Query processing with adaptation to user needs using data annotations, e.g. user-defined metadata
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2291 User-Defined Types; Storage management thereof
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G06F16/24552 Database cache management
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/285 Clustering or classification

Abstract

The application discloses a data processing method and device, computer equipment and a storage medium, which can be applied to various scenarios such as cloud technology, artificial intelligence, intelligent traffic and assisted driving. The method comprises the following steps: acquiring a data set to be processed, which comprises L data to be processed; creating an initial bit array set of N initial bit arrays based on the data set to be processed; processing each data to be processed in the data set based on the initial bit array set to obtain the modulus value corresponding to each data to be processed in each initial bit array; replacing the position parameter of each initial array in the N initial bit arrays with the modulus value corresponding to each data to be processed in each initial bit array to obtain a target bit array set; and determining target data from the L data to be processed based on the target bit array set. With this method, individual pieces of data do not need to access external storage, so the consumption of storage resources can be reduced in scenarios with large data volumes, thereby improving real-time data processing efficiency.

Description

Data processing method and device, computer equipment and storage medium
Technical Field
The present application relates to the field of data statistical analysis technologies, and in particular, to a data processing method and apparatus, a computer device, and a storage medium.
Background
In the digital age, the application range and boundaries of the internet are continuously expanding. Many internet enterprises and traditional enterprises are accelerating the iteration of their application systems, on both computers and mobile phones, and using internet technology to strengthen their services and products. For upgrading and optimizing an application system, statistical analysis of user behavior is the most important technical support. Such analysis reveals the intentions, characteristics and needs of users, and helps an enterprise use big data technology to design products, optimize interfaces and market precisely, thereby improving the user experience. Large-scale real-time data statistics are generally performed with a stream processing framework such as Flink, but messages may be delivered repeatedly when Flink processes data, which causes a large deviation in the final statistical result. Data therefore needs to be deduplicated during statistical analysis, so as to eliminate the repeatedly transmitted data produced by unreliable data sources and make the final statistical result more accurate. At present, whether data is repeated can be judged by recording each piece of data, but with this approach each piece of data needs to access external storage, which increases the consumption of storage resources and reduces real-time data processing efficiency in scenarios with large data volumes. How to process data more efficiently has therefore become an urgent problem to be solved.
Disclosure of Invention
The embodiment of the application provides a data processing method and device, computer equipment and a storage medium. Initial arrays are associated with the data to be processed one by one, that is, the data to be processed is mapped into the initial arrays. The modulus value of each data to be processed in each initial bit array replaces the position parameter of the initial array associated with that data, so as to obtain the position parameter of the target array associated with the data. When the position parameters of the target arrays associated with a piece of data to be processed are different from the initial value, that data is determined to be repeated data. Individual pieces of data do not need to access external storage, so the consumption of storage resources can be reduced in scenarios with large data volumes, thereby improving real-time data processing efficiency.
In view of the above, a first aspect of the present application provides a data processing method, including:
acquiring a data set to be processed, wherein the data set to be processed comprises L data to be processed, and L is an integer greater than 1;
creating an initial bit array set based on the data set to be processed, wherein the initial bit array set comprises N initial bit arrays, each initial bit array comprises L initial arrays, the initial arrays are associated with the data to be processed one by one, at least one of the position parameters of each initial bit array is the initial value, and N is an integer greater than 1;
processing each data to be processed in the data set to be processed based on the initial bit array set to obtain the modulus value corresponding to each data to be processed in each initial bit array, wherein the modulus value corresponds to one of the L initial arrays included in the initial bit array;
replacing the position parameter of each initial array in the N initial bit arrays with the modulus value corresponding to each data to be processed in each initial bit array to obtain a target bit array set, wherein the target bit array set comprises N target bit arrays, each target bit array comprises L target arrays, the target arrays are associated with the data to be processed one by one, and the position parameter of each target array is the modulus value corresponding to each data to be processed in each initial bit array;
and determining target data from the L data to be processed based on the target bit array set, wherein the position parameters of the N target arrays associated with the target data are different from the initial values, and the target data are repeated data.
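The steps of the first aspect can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the applicant's implementation: the SHA-256 hash, the initial value 0, the marker value 1, the function names, and the choice of one distinct prime modulus per bit array (so that each array's length equals its modulus rather than L) are all hypothetical, since the text does not specify them.

```python
import hashlib

# Hypothetical: one distinct prime modulus per bit array (N = 3 here).
MODULI = (1009, 1013, 1019)


def hash_key(key: str) -> int:
    """Single hash function shared by the whole bit array set (SHA-256 is an assumption)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def find_duplicates(keys, moduli=MODULI):
    """Return the indices of keys judged to repeat an earlier key."""
    # Every position of every bit array starts at the initial value 0.
    bit_arrays = [[0] * m for m in moduli]
    duplicates = []
    for index, key in enumerate(keys):
        h = hash_key(key)
        # One modulus value, and therefore one position, per bit array.
        positions = [h % m for m in moduli]
        # Repeated data: all N positions already differ from the initial value.
        if all(bit_arrays[i][p] != 0 for i, p in enumerate(positions)):
            duplicates.append(index)
        # Replace the position parameters so later repeats can be recognised.
        for i, p in enumerate(positions):
            bit_arrays[i][p] = 1
    return duplicates


print(find_duplicates(["u1", "u2", "u1", "u3", "u2"]))
```

As with any filter of this kind, a true repeat is always flagged, while a new key can occasionally be misjudged as a repeat if all of its positions were already set by other keys.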
A second aspect of the present application provides a data processing apparatus comprising:
an acquisition module, configured to acquire a data set to be processed, wherein the data set to be processed comprises L data to be processed, and L is an integer greater than 1;
a creating module, configured to create an initial bit array set based on the data set to be processed, wherein the initial bit array set comprises N initial bit arrays, each initial bit array comprises L initial arrays, the initial arrays are associated with the data to be processed one by one, at least one of the position parameters of each initial bit array is the initial value, and N is an integer greater than 1;
a processing module, configured to process each data to be processed in the data set to be processed based on the initial bit array set to obtain the modulus value corresponding to each data to be processed in each initial bit array, wherein the modulus value corresponds to one of the L initial arrays included in the initial bit array;
a replacing module, configured to replace the position parameter of each initial array in the N initial bit arrays with the modulus value corresponding to each data to be processed in each initial bit array to obtain a target bit array set, wherein the target bit array set comprises N target bit arrays, each target bit array comprises L target arrays, the target arrays are associated with the data to be processed one by one, and the position parameter of each target array is the modulus value corresponding to each data to be processed in each initial bit array;
a determining module, configured to determine target data from the L data to be processed based on the target bit array set, wherein the position parameters of the N target arrays associated with the target data are different from the initial values, and the target data is repeated data.
In a possible embodiment, the processing module is specifically configured to create a hash function corresponding to the initial bit array set;
creating a modulus function corresponding to each initial bit array;
and processing each to-be-processed data in the to-be-processed data based on the hash function corresponding to the initial bit array set and the modulus function corresponding to each initial bit array to obtain a modulus value corresponding to each to-be-processed data in each initial bit array.
In a possible implementation manner, the processing module is specifically configured to perform analysis processing on L pieces of data to be processed in the data set to be processed, and obtain a key value corresponding to each piece of data to be processed;
performing hash calculation on the key value corresponding to each piece of data to be processed by using a hash function corresponding to the initial bit array set to obtain a hash value of the key value corresponding to each piece of data to be processed;
and performing modulo processing on the hash value of the key value corresponding to each to-be-processed data by using the modulo function corresponding to each initial bit array to obtain a modulo value corresponding to each to-be-processed data in each initial bit array.
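The parse, hash and modulo sequence described above might look like the following sketch. The JSON record layout, the `event_id` key field, the SHA-256 hash and the three prime moduli are assumptions made for illustration only; the application does not name any of them.

```python
import hashlib
import json

# Hypothetical moduli, one per initial bit array.
MODULI = (1009, 1013, 1019)


def parse_record(raw: str, key_field: str = "event_id"):
    """Split one raw record into its key value and its non-key values."""
    record = json.loads(raw)
    key = str(record.pop(key_field))
    return key, record


def modulus_values(key: str, moduli=MODULI):
    """Hash the key once, then apply each bit array's modulus function."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    hash_value = int.from_bytes(digest[:8], "big")
    return [hash_value % m for m in moduli]


key, non_key = parse_record('{"event_id": "e1", "page": "home"}')
print(key, non_key, modulus_values(key))
```

Because the hash is computed once per key and only the cheap modulo step is repeated per array, the cost of adding more bit arrays stays small.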
In a possible implementation manner, the obtaining module is further configured to perform analysis processing on L pieces of data to be processed in the data set to be processed, and obtain a non-key value corresponding to each piece of data to be processed;
the processing module is further used for discarding the target data after the determining module determines the target data from the L data to be processed based on the target bit array set.
In one possible embodiment, the data processing apparatus further comprises a read-write module;
the determining module is further configured to determine cache data from the L data to be processed based on the target bit array set, where at least one location parameter of the location parameter of each target array in the N target arrays associated with the cache data is an initial value, and the cache data is non-duplicated data;
and the read-write module is used for writing the key value corresponding to the cache data and the non-key value corresponding to the cache data into the cache.
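A possible shape for the discard and cache-write behaviour is sketched below. It is illustrative only: the `DedupFilter` class, the prime moduli, the SHA-256 hash and the plain dict that stands in for the cache are assumptions, not details given in the application.

```python
import hashlib


class DedupFilter:
    """N bit arrays sharing one hash function (hypothetical helper)."""

    def __init__(self, moduli=(1009, 1013, 1019)):
        self.moduli = moduli
        # bytearray(m) creates m zero bytes, i.e. every position holds the initial value.
        self.bit_arrays = [bytearray(m) for m in moduli]

    def seen_before(self, key: str) -> bool:
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        hash_value = int.from_bytes(digest[:8], "big")
        positions = [hash_value % m for m in self.moduli]
        repeated = all(arr[p] for arr, p in zip(self.bit_arrays, positions))
        for arr, p in zip(self.bit_arrays, positions):
            arr[p] = 1
        return repeated


def process(records, cache):
    """Discard repeated records; write the key and non-key values of the rest to the cache."""
    dedup = DedupFilter()
    for key, non_key in records:
        if dedup.seen_before(key):
            continue  # target data (repeated data) is discarded
        cache[key] = non_key  # cache data (non-repeated data) is kept
    return cache


records = [("k1", {"v": 1}), ("k2", {"v": 2}), ("k1", {"v": 99})]
print(process(records, {}))
```

The second `k1` record never reaches the cache, so the value stored for `k1` is the one from its first occurrence.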
In one possible embodiment, the data processing apparatus further comprises a statistics module;
the determining module is further used for determining the key value corresponding to the cache data as a non-repeated key value;
the statistics module is used for counting the cache data in the L data to be processed to obtain the number of non-repeated key values, wherein the number of non-repeated key values is the number of all the cache data in the L data to be processed;
the statistics module is further used for collecting the modulus values, in each initial bit array, of all the cache data in the L data to be processed;
the determining module is further configured to determine summarized data of the non-repeated key values based on the number of the non-repeated key values and modulus values corresponding to all cache data in the L pieces of data to be processed in each initial bit array, where the summarized data is used for performing data analysis processing on a set of data to be processed.
In a possible embodiment, the obtaining module is specifically configured to obtain an initial data set, where the initial data set includes M initial data, each initial data corresponds to a topic type one to one, and the M initial data are derived from at least two data sources, and M is an integer greater than L;
parsing each initial data in the initial data set to obtain the topic type corresponding to each initial data;
and extracting the initial data whose topic type is the target topic type to generate the data set to be processed.
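The topic-based extraction can be illustrated with a short sketch. The dictionary layout and the `topic` and `source` field names are hypothetical; the application only states that each initial data has one topic type and that the data comes from at least two sources.

```python
def build_pending_set(initial_data, target_topic):
    """Keep only the initial data whose topic type equals the target topic type."""
    return [item for item in initial_data if item.get("topic") == target_topic]


initial_data = [
    {"source": "app", "topic": "click", "event_id": "e1"},
    {"source": "web", "topic": "view", "event_id": "e2"},
    {"source": "web", "topic": "click", "event_id": "e3"},
]
pending = build_pending_set(initial_data, "click")
print([item["event_id"] for item in pending])
```

Here M is 3 and the resulting data set to be processed contains L = 2 items, consistent with M being greater than L.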
A third aspect of the present application provides a computer-readable storage medium having stored therein instructions, which, when run on a computer, cause the computer to perform the method of the above-described aspects.
A fourth aspect of the application provides a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the steps of the method provided by the aspects described above.
According to the technical scheme, the embodiment of the application has the following advantages:
in the embodiment of the application, a data processing method is provided. A data set to be processed comprising L data to be processed is acquired, where L is an integer greater than 1. An initial bit array set is then created based on the data set to be processed, where the initial bit array set comprises N initial bit arrays, each initial bit array comprises L initial arrays, the initial arrays are associated with the data to be processed one by one, at least one of the position parameters of each initial bit array is the initial value, and N is an integer greater than 1. Each data to be processed in the data set is then processed based on the initial bit array set to obtain the modulus value corresponding to each data to be processed in each initial bit array, where the modulus value corresponds to one of the L initial arrays included in the initial bit array. The position parameter of each initial array in the N initial bit arrays is replaced with the modulus value corresponding to each data to be processed in each initial bit array to obtain a target bit array set, where the target bit array set comprises N target bit arrays, each target bit array comprises L target arrays, the target arrays are associated with the data to be processed one by one, and the position parameter of each target array is the modulus value corresponding to each data to be processed in each initial bit array. Finally, target data is determined from the L data to be processed based on the target bit array set, where the position parameters of the N target arrays associated with the target data are different from the initial values, and the target data is repeated data.
In this way, initial arrays are associated with the data to be processed one by one, that is, the data to be processed is mapped into the initial arrays. The modulus value of each data to be processed in each initial bit array replaces the position parameter of the initial array associated with that data, so as to obtain the position parameter of the target array associated with the data. When the position parameters of the target arrays associated with a piece of data to be processed are different from the initial value, that data is determined to be repeated data. Individual pieces of data do not need to access external storage, so the consumption of storage resources can be reduced in scenarios with large data volumes, thereby improving real-time data processing efficiency.
Drawings
Fig. 1 is a schematic system architecture diagram of a data processing method according to an embodiment of the present application;
Fig. 2 is a schematic flowchart of a data processing method according to an embodiment of the present application;
Fig. 3 is a schematic diagram of an embodiment of a data processing method according to an embodiment of the present application;
Fig. 4 is a schematic diagram of an embodiment of an initial bit array set provided by an embodiment of the present application;
Fig. 5 is a schematic diagram of an embodiment of a target bit array set provided by an embodiment of the present application;
Fig. 6 is a schematic diagram of an embodiment of determining target data provided by an embodiment of the present application;
Fig. 7 is a schematic diagram of an embodiment of determining cache data according to an embodiment of the present application;
Fig. 8 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application;
Fig. 9 is a schematic diagram of an embodiment of a server in the embodiment of the present application.
Detailed Description
The embodiment of the application provides a data processing method and device, computer equipment and a storage medium. Initial arrays are associated with the data to be processed one by one, that is, the data to be processed is mapped into the initial arrays. The modulus value of each data to be processed in each initial bit array replaces the position parameter of the initial array associated with that data, so as to obtain the position parameter of the target array associated with the data. When the position parameters of the target arrays associated with a piece of data to be processed are different from the initial value, that data is determined to be repeated data. Individual pieces of data do not need to access external storage, so the consumption of storage resources can be reduced in scenarios with large data volumes, thereby improving real-time data processing efficiency.
The terms "first," "second," "third," "fourth," and the like in the description and in the claims of the present application and in the drawings described above, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are, for example, capable of operation in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "corresponding" and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
In the digital age, the application range and boundaries of the internet are continuously expanding. Many internet enterprises and traditional enterprises are accelerating the iteration of their application systems, on both computers and mobile phones, and using internet technology to strengthen their services and products. For upgrading and optimizing an application system, statistical analysis of user behavior is the most important technical support. Such analysis reveals the intentions, characteristics and needs of users, and helps an enterprise use big data technology to design products, optimize interfaces and market precisely, thereby improving the user experience. Large-scale real-time data statistics are generally performed with a stream processing framework such as Flink, but messages may be delivered repeatedly when Flink processes data, which causes a large deviation in the final statistical result. Data therefore needs to be deduplicated during statistical analysis, so as to eliminate the repeatedly transmitted data produced by unreliable data sources and make the final statistical result more accurate. At present, whether data is repeated can be judged by recording each piece of data, but with this approach each piece of data needs to access external storage, which increases the consumption of storage resources and reduces real-time data processing efficiency in scenarios with large data volumes. How to process data more efficiently has therefore become an urgent problem to be solved.
In order to solve the foregoing problem, an embodiment of the present application provides a data processing method, which can determine that data to be processed is duplicate data when a position parameter of a target array associated with the data to be processed is different from an initial value, and it is not necessary for each piece of data to access an external storage, and consumption of storage resources can be reduced in a scenario with a large data amount, so as to improve real-time data processing efficiency.
First, some terms or concepts related to the present application are explained for convenience of understanding.
One, stream computing
Stream computing means that data is input in the form of a data stream, processed, and then output. In contrast to a batch processing system, which usually requires a complete batch of data before processing can begin, stream computing typically processes each piece of data, or each small batch of data, as soon as it arrives.
Two, Flink
Flink is a framework for unified stream processing and batch processing. Since pipelined data is transmitted between parallel tasks, Flink supports both streaming and batch processing at runtime.
Three, Bloom filter
A Bloom filter is a probabilistic data structure composed of hash functions and a bit array. It supports efficient insertion and query, and a query returns one of two results: the element definitely does not exist, or the element possibly exists.
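To make "possibly exists" concrete, the sketch below evaluates the textbook estimate of the false-positive rate of a classic Bloom filter, (1 - e^(-k*n/m))^k, for an m-bit array, k hash functions and n inserted elements. The function name and parameters are illustrative and are not taken from the application, whose scheme uses N separate bit arrays rather than a single array.

```python
import math


def bloom_false_positive_rate(n_items: int, m_bits: int, k_hashes: int) -> float:
    """Textbook approximation (1 - e^(-k*n/m))^k for a classic Bloom filter."""
    if m_bits <= 0 or k_hashes <= 0:
        raise ValueError("m_bits and k_hashes must be positive")
    return (1.0 - math.exp(-k_hashes * n_items / m_bits)) ** k_hashes


# More bits per stored item lowers the chance of a wrong "possibly exists" answer.
small = bloom_false_positive_rate(n_items=1000, m_bits=10_000, k_hashes=3)
large = bloom_false_positive_rate(n_items=1000, m_bits=100_000, k_hashes=3)
print(f"{small:.4f} vs {large:.6f}")
```

A "does not exist" answer is always correct, so only the "possibly exists" side carries this error rate.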
Four, HBase
HBase is a distributed, column-oriented open-source database that can be used to quickly write and query data by key.
Five, Redis
Redis is an open-source key-value database written in the C language. It is accessed over a network, keeps data in memory, and can optionally persist data to disk.
Six, repeated consumption
During log generation and real-time computation, the same data may be processed multiple times because of dirty data or service restarts. This is called repeated consumption, and data that is processed multiple times (that is, repeatedly consumed) is defined as repeated data in this application.
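How a service restart leads to repeated consumption can be shown with a small simulation. This is a simplified model under assumptions that are not in the application: the consumer commits its read offset only every `commit_every` messages, crashes once just before processing the message at index `crash_at`, and resumes from the last committed offset.

```python
def consume(messages, commit_every, crash_at):
    """Simulate a consumer that restarts once from its last committed offset."""
    processed = []
    committed = 0      # last offset known to be safely recorded
    crashed = False
    position = 0
    while position < len(messages):
        if not crashed and position == crash_at:
            crashed = True
            position = committed  # restart: uncommitted progress is replayed
            continue
        processed.append(messages[position])
        position += 1
        if position % commit_every == 0:
            committed = position
    return processed


print(consume(["m0", "m1", "m2", "m3", "m4"], commit_every=2, crash_at=3))
# prints ['m0', 'm1', 'm2', 'm2', 'm3', 'm4']
```

No message is lost, but `m2` is processed twice, which is exactly the kind of repeated data that the deduplication step is meant to remove.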
Based on this, the application scenarios of the embodiments of the present application are described below. The data processing method is executed by a server, and is described below with the server as the execution subject. Referring to fig. 1, fig. 1 is a schematic diagram of a system architecture of a data processing method according to an embodiment of the present application. As shown in fig. 1, the data processing system includes a server and terminal devices. The server acquires multiple pieces of data to be processed from the terminal devices and processes them using the data processing method of the embodiment of the present application, so that a piece of data to be processed is determined to be repeated data when the position parameters of the target arrays associated with it are different from the initial value. Individual pieces of data do not need to access external storage, so the consumption of storage resources can be reduced in scenarios with large data volumes, thereby improving real-time data processing efficiency.
It should be noted that the server in the present application may be an independent physical server, a server cluster or distributed system formed by multiple physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, a Content Delivery Network (CDN), and big data and artificial intelligence platforms. The terminal devices include, but are not limited to, mobile phones, computers, intelligent voice interaction devices, intelligent household appliances and vehicle-mounted terminals. The terminal device and the server can communicate with each other through a wireless network, a wired network or a removable storage medium. The wireless network uses standard communication techniques and/or protocols. It is typically the internet, but can be any network, including but not limited to Bluetooth, a Local Area Network (LAN), a Metropolitan Area Network (MAN), a Wide Area Network (WAN), a mobile network, a private network, a virtual private network, or any combination thereof. In some embodiments, custom or dedicated data communication techniques may be used in place of or in addition to the data communication techniques described above. The removable storage medium may be a Universal Serial Bus (USB) flash drive, a removable hard drive or another removable storage medium.
Although only five terminal devices and one server are shown in fig. 1, it should be understood that the example in fig. 1 is only used for understanding the present solution, and the number of the specific terminal devices and the number of the servers should be flexibly determined according to actual situations.
Secondly, embodiments of the present invention may be applied to a variety of scenarios including, but not limited to, cloud technology, artificial intelligence, intelligent traffic and assisted driving. Cloud technology is a hosting technology that unifies resources such as hardware, software and networks within a wide area network or a local area network to realize the computation, storage, processing and sharing of data. Cloud technology is the general term for the network technology, information technology, integration technology, management platform technology, application technology and the like that are applied under the cloud computing business model; these can form a resource pool that is used on demand and is flexible and convenient. Cloud computing technology will become an important support. The background services of technical network systems, such as video websites, picture websites and portal websites, require a large amount of computing and storage resources. With the rapid development and application of the internet industry, every item may have its own identifier that needs to be transmitted to a background system for logical processing, data of different levels will be processed separately, and all kinds of industry data need strong back-end system support, which can only be realized through cloud computing.
Cloud computing in the narrow sense refers to a delivery and usage mode of IT infrastructure, in which required resources are obtained through a network in an on-demand and easily scalable manner; cloud computing in the broad sense refers to a delivery and usage mode of services, in which required services are obtained through a network in an on-demand and easily scalable manner. Such services may be IT and software services, internet-related services, or other services. Cloud computing is a product of the development and fusion of traditional computer and network technologies such as grid computing, distributed computing, parallel computing, utility computing, network storage, virtualization and load balancing.
With the diversified development of the internet, real-time data streams and connected devices, and driven by the demands of search services, social networks, mobile commerce, open collaboration and the like, cloud computing has developed rapidly. Unlike earlier parallel distributed computing, the emergence of cloud computing will, conceptually, drive revolutionary changes in the whole internet model and in enterprise management models.
The so-called artificial intelligence cloud service is generally also called AIaaS (AI as a Service). This is a service mode of an artificial intelligence platform: specifically, an AIaaS platform splits several types of common AI services and provides them in the cloud as independent or packaged services. This service mode is similar to opening an AI-themed mall: all developers can access one or more of the artificial intelligence services provided by the platform through an API (application programming interface), and some qualified developers can also use the AI framework and AI infrastructure provided by the platform to deploy, operate and maintain their own dedicated cloud artificial intelligence services.
Fig. 2 is a schematic flow chart of a data processing method provided in an embodiment of the present application, and as shown in fig. 2, a specific flow of data processing in the embodiment of the present application includes obtaining a to-be-processed data set by parsing an initial data set, creating N initial bit arrays based on the to-be-processed data set, creating a hash function corresponding to the N initial bit arrays and a modulo function corresponding to each initial bit array, obtaining a target bit array set, and determining target data based on the target bit array set. The following will describe functions and flows of the respective parts, specifically:
In step A1, the initial data set is parsed to obtain a to-be-processed data set. Specifically, user behaviors and article data from different terminal devices are collected through Software Development Kit (SDK) embedded points and reported to a server; the user behaviors and the article data are the initial data. The server parses and processes the initial data from the different terminal devices and sets an at-least-once mode, so that the initial data is not lost but may be repeated. On this basis, the server parses the initial data from the different terminal devices to obtain the theme type corresponding to each piece of initial data, and initial data of different theme types is distributed and computed separately. Because the embodiment of the application processes data of the target theme type, the server extracts the initial data whose theme type is the target theme type to generate the to-be-processed data set, where the to-be-processed data set includes L pieces of to-be-processed data and L is an integer greater than 1.
In step A2, the server creates N initial bit arrays based on the to-be-processed data set. Specifically, since the to-be-processed data set includes L pieces of to-be-processed data, in order to determine whether each piece of to-be-processed data is repeated data, the server creates N initial bit arrays of length L; that is, each initial bit array includes L initial arrays, each initial array is associated with one piece of to-be-processed data, and the position parameter of each initial array is initialized to an initial value, which is 0 in this embodiment of the present application.
In step A3, after the N initial bit arrays have been created in step A2, a hash function corresponding to the N initial bit arrays is created, and a modulo function corresponding to each initial bit array is created. The specific hash function and modulo functions are described in detail in the following embodiments.
In step A4, each piece of to-be-processed data is processed with the hash function corresponding to the N initial bit arrays created in step A3 and with the modulo function corresponding to each initial bit array, so as to obtain the modulo value corresponding to each piece of to-be-processed data in each initial bit array, where a modulo value corresponds to one of the L initial arrays included in the initial bit array. The position parameter of each initial array in the N initial bit arrays is then replaced with the modulo value corresponding to each piece of to-be-processed data in each initial bit array, so as to obtain a target bit array set. The target bit array set includes N target bit arrays, each target bit array includes L target arrays, each target array is associated with one piece of to-be-processed data, and the position parameter of each target array is the modulo value corresponding to each piece of to-be-processed data in each initial bit array. The modulo value is in fact the remainder obtained when the modulo operation of each initial bit array is applied to each piece of to-be-processed data.
In step A5, target data is determined from the L pieces of to-be-processed data based on the target bit array set obtained in step A4. The position parameters of the N target arrays associated with the target data are all different from the initial value, and the target data is the repeated data described in the embodiment of the present application.
With reference to the above description, taking an execution subject as a server as an example, please refer to fig. 3, where fig. 3 is a schematic diagram of an embodiment of a data processing method provided in the embodiment of the present application, and as shown in fig. 3, the method includes:
101. A data set to be processed is acquired.
In this embodiment, the server obtains a data set to be processed, where the data set to be processed includes L pieces of to-be-processed data, and L is an integer greater than 1. Specifically, the pieces of to-be-processed data in the data set are derived from at least two different terminal devices. For example, the data set includes 3 pieces of to-be-processed data, namely to-be-processed data A, to-be-processed data B, and to-be-processed data C, where to-be-processed data A and to-be-processed data B are derived from terminal device A, and to-be-processed data C is derived from terminal device B.
102. An initial bit array set is created based on the set of data to be processed.
In this embodiment, the server creates an initial bit array set based on the to-be-processed data set, and based on step 101, it can be known that the to-be-processed data set includes L to-be-processed data, and the present application aims to determine whether the to-be-processed data is repeated data, so that the server creates N initial bit arrays with a length of L, that is, each initial bit array includes L initial arrays, one initial array is associated with one to-be-processed data, and N is an integer greater than 1. For example, the initial array 1, the initial array 2, and the initial array 3 are included in the initial bit array a, and the to-be-processed data 1 is associated with the initial array 1, the to-be-processed data 2 is associated with the initial array 2, and the to-be-processed data 3 is associated with the initial array 3. Secondly, the server needs to initialize the position parameter of each initial array to an initial value, which is 0 in the embodiment of the present application.
For ease of understanding, referring to fig. 4, fig. 4 is a schematic diagram of an embodiment of an initial bit array set provided by the present application. As shown in fig. 4, B11, B12, B21, and B22 all indicate initial arrays. Based on this, the initial bit array B1 includes a plurality of initial arrays, such as the initial array B11 and the initial array B12, where to-be-processed data 1 is associated with the initial array B11 and to-be-processed data 2 is associated with the initial array B12. Similarly, the initial bit array B2 includes a plurality of initial arrays, such as the initial array B21 and the initial array B22, where to-be-processed data 1 is associated with the initial array B21 and to-be-processed data 2 is associated with the initial array B22. After initialization, the position parameters of the initial arrays B11, B12, B21, and B22 are all 0. It should be understood that the foregoing examples are for the purpose of understanding the present solution and are not to be construed as limiting thereof.
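For ease of understanding, the creation of the initial bit array set may be sketched as follows. This is only a minimal illustration: the function name `create_initial_bit_arrays` and the list-of-lists layout are assumptions made here, not part of the application.

```python
def create_initial_bit_arrays(n: int, length: int) -> list[list[int]]:
    """Return n initial bit arrays, each with `length` position parameters set to 0."""
    if n < 2 or length < 2:
        raise ValueError("the text requires N > 1 and L > 1")
    # Build every inner list separately so the arrays do not share storage.
    return [[0] * length for _ in range(n)]


# Two bit arrays with two slots each, matching the B1/B2 layout described for fig. 4.
bit_arrays = create_initial_bit_arrays(n=2, length=2)
print(bit_arrays)  # [[0, 0], [0, 0]]
```

Each inner list plays the role of one initial bit array, and each element plays the role of the position parameter of one initial array.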
103. Each to-be-processed data in the to-be-processed data set is processed based on the initial bit array set to obtain a modulus value corresponding to each to-be-processed data in each initial bit array.
In this embodiment, the server processes each to-be-processed data in the to-be-processed data based on the initial bit array set to obtain a modulo value corresponding to each to-be-processed data in each initial bit array, where the modulo value corresponds to one initial array of the L initial arrays included in the initial bit array. For example, based on the example of step 102, including initial array 1, initial array 2, and initial array 3 in initial bit array a, and pending data 1 is associated with initial array 1, pending data 2 is associated with initial array 2, and pending data 3 is associated with initial array 3, then the modulo value of pending data 1 is associated with initial array 1, the modulo value of pending data 2 is associated with initial array 2, and the modulo value of pending data 3 is associated with initial array 3. Specifically, the modulus value is actually the remainder of each data to be processed after modulus processing based on the initial bit array.
104. The position parameter of each initial array in the N initial bit arrays is replaced with the corresponding modulus value of each to-be-processed data in each initial bit array to obtain a target bit array set.
In this embodiment, the server replaces the position parameter of each initial array in the N initial bit arrays with the corresponding modulo value of each to-be-processed data in each initial bit array, so as to obtain the target bit array set. Specifically, the server maps the modulus value corresponding to the to-be-processed data in each initial bit array to the position parameter of each initial array associated with that to-be-processed data.
Specifically, the obtained target bit array set includes N target bit arrays, each target bit array includes L target arrays, the target arrays are associated with the to-be-processed data one by one, and the position parameter of each target array is the modulus value corresponding to each to-be-processed data in each initial bit array. For convenience of understanding, further explanation is made based on the initial bit array set shown in fig. 4. Please refer to fig. 5, which is a diagram illustrating an embodiment of a target bit array set provided by an embodiment of the present application. As shown in fig. 5, C11, C12, C21, and C22 all indicate initial arrays, and C31, C32, C41, and C42 all indicate target arrays. C51 indicates the modulus value corresponding to to-be-processed data 1 in the initial bit array C1, C52 indicates the modulus value corresponding to to-be-processed data 1 in the initial bit array C2, C53 indicates the modulus value corresponding to to-be-processed data 2 in the initial bit array C1, and C54 indicates the modulus value corresponding to to-be-processed data 2 in the initial bit array C2.
Based on this, if the modulus value C51 corresponding to to-be-processed data 1 in the initial bit array C1 is 1, and the corresponding modulus value C52 in the initial bit array C2 is 0, the modulus value C51 may be mapped to the position parameter of the initial array C11, and the modulus value C52 may be mapped to the position parameter of the initial array C21. Similarly, if the modulus value C53 corresponding to to-be-processed data 2 in the initial bit array C1 is 0, and the corresponding modulus value C54 in the initial bit array C2 is 1, the modulus value C53 may be mapped to the position parameter of the initial array C12, and the modulus value C54 may be mapped to the position parameter of the initial array C22. Thus, a target bit array C3 and a target bit array C4 are obtained, where to-be-processed data 1 is associated with the target array C31 and the target array C41, and to-be-processed data 2 is associated with the target array C32 and the target array C42. The position parameter of the target array C31 is then 1, the position parameter of the target array C41 is 0, the position parameter of the target array C32 is 0, and the position parameter of the target array C42 is 1. It should be understood that the foregoing examples are for the purpose of understanding the present solution and are not to be construed as limiting thereof.
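The replacement described in the example above may be sketched as follows. The sketch follows the literal reading of the text, in which the slot associated with a piece of to-be-processed data receives that data's modulus value; the function name `replace_position_parameters` and the index conventions are assumptions made for illustration only.

```python
def replace_position_parameters(initial_bit_arrays, modulus_values):
    """Return target bit arrays whose slots hold the associated data's modulus values.

    initial_bit_arrays[i][j] is the slot of bit array i associated with data j;
    modulus_values[j][i] is the modulus value of data j in bit array i.
    """
    targets = [list(row) for row in initial_bit_arrays]  # copy; the input stays intact
    for j, per_array in enumerate(modulus_values):
        for i, value in enumerate(per_array):
            targets[i][j] = value
    return targets


initial = [[0, 0], [0, 0]]  # two initial bit arrays with two slots each
values = [(1, 0), (0, 1)]   # data 1 -> (1, 0), data 2 -> (0, 1), as in the example
print(replace_position_parameters(initial, values))  # [[1, 0], [0, 1]]
```

The first inner list of the result corresponds to the first target bit array and the second inner list to the second target bit array.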
105. Target data is determined from the L data to be processed based on the target bit array set.
In this embodiment, the server determines target data from the L pieces of to-be-processed data based on the target bit array set, where the position parameters of the N target arrays associated with the target data are all different from the initial value; that is, the target data is repeated data. Specifically, the target bit array set is obtained in step 104, in which the position parameter of each target array in each target bit array has been replaced with the modulus value of the associated to-be-processed data. Following the operation logic of the bloom filter mentioned in the foregoing embodiment, when the position parameters of all target arrays associated with a piece of to-be-processed data are 1, that piece of data is in the set, that is, it has already been processed at least once, and it is therefore determined to be target data (repeated data).
For example, referring to fig. 6, fig. 6 is a schematic diagram of an embodiment of determining target data provided by the present application. As shown in fig. 6, the target bit array set includes the target bit array D1 and the target bit array D2; the target bit array D1 includes the target array D11 associated with to-be-processed data 1, whose position parameter is "1", and the target bit array D2 includes the target array D21 associated with to-be-processed data 1, whose position parameter is also "1". The position parameters of the target arrays D11 and D21 associated with to-be-processed data 1 are therefore both different from the initial value, so to-be-processed data 1 can be determined as target data. It should be understood that this embodiment uses a position parameter of 1 for the judgment. In practical applications the modulus value is a remainder, which may be a non-negative integer such as 0, 1, or 2, so target data can be determined as long as the position parameters of its target arrays are all different from the initial value (0 in the present application). The position parameters of the target arrays associated with the target data may also differ between target bit arrays, and this scheme places no limitation on this.
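The judgment described above may be sketched as follows. The function name `find_target_data` and the zero-based data indices are assumptions, and the values in the second slot of each array are hypothetical, since fig. 6 as described only specifies the slots of to-be-processed data 1.

```python
def find_target_data(target_bit_arrays, initial_value=0):
    """Return indices of data whose slot differs from initial_value in every bit array."""
    count = len(target_bit_arrays[0])
    return [
        j
        for j in range(count)
        if all(array[j] != initial_value for array in target_bit_arrays)
    ]


# Index 0 stands for to-be-processed data 1: its slots in both arrays hold 1.
d1 = [1, 0]
d2 = [1, 2]
print(find_target_data([d1, d2]))  # [0]
```

The second piece of data is not reported because one of its slots still holds the initial value, even though its other slot holds a non-zero remainder.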
In the embodiment of the application, a data processing method is provided. With this method, the initial arrays are associated with the to-be-processed data one by one, that is, one piece of to-be-processed data is mapped to one initial array in each initial bit array. The position parameter of the initial array associated with a piece of to-be-processed data is replaced with the modulus value of that data in the corresponding initial bit array, which yields the position parameter of the target array associated with that data. When the position parameters of the target arrays associated with a piece of to-be-processed data are all different from the initial value, that data is determined to be repeated data. No piece of data needs to access external storage, so the consumption of storage resources can be reduced in scenarios with a large data volume, which improves the efficiency of real-time data processing.
Optionally, on the basis of the embodiment corresponding to fig. 3, in an optional embodiment of the data processing method provided in the embodiment of the present application, each to-be-processed data in the to-be-processed data is processed based on the initial bit array set, so as to obtain a modulus value corresponding to each to-be-processed data in each initial bit array, which specifically includes:
creating a hash function corresponding to the initial bit array set;
creating a modulus function corresponding to each initial bit array;
and processing each to-be-processed data in the to-be-processed data based on the hash function corresponding to the initial bit array set and the modulus function corresponding to each initial bit array to obtain a modulus value corresponding to each to-be-processed data in each initial bit array.
In this embodiment, the server needs to specifically create a hash function corresponding to the initial bit array set, that is, create a corresponding hash function for the initial bit array set. And it also needs to create a modulus function corresponding to each initial bit array, that is, create a corresponding modulus function for all initial bit arrays in the initial bit array set, for example, hash function 1 corresponding to the initial bit array set, modulus function 1 corresponding to initial bit array 1 in the initial bit array set, modulus function 2 corresponding to initial bit array 2, and modulus function 3 corresponding to initial bit array 3. Therefore, the server processes each to-be-processed data in the to-be-processed data based on the hash function corresponding to the initial bit array set and the modulus function corresponding to each initial bit array, and obtains a modulus value corresponding to each to-be-processed data in each initial bit array. How to obtain the modulus value based on the hash function and the modulus function will be described in detail below.
Optionally, on the basis of the embodiment corresponding to fig. 3, in an optional embodiment of the data processing method provided in the embodiment of the present application, based on the hash function corresponding to the initial bit array set and the modulus function corresponding to each initial bit array, each to-be-processed data in the to-be-processed data is processed to obtain a modulus value corresponding to each to-be-processed data in each initial bit array, where the method specifically includes:
analyzing L data to be processed in the data set to be processed to obtain a key value corresponding to each data to be processed;
performing hash calculation on the key value corresponding to each piece of data to be processed by using a hash function corresponding to the initial bit array set to obtain a hash value of the key value corresponding to each piece of data to be processed;
and performing modulo processing on the hash value of the key value corresponding to each to-be-processed data by using the modulo function corresponding to each initial bit array to obtain a modulo value corresponding to each to-be-processed data in each initial bit array.
In this embodiment, the server first parses the L pieces of to-be-processed data in the to-be-processed data set to obtain the key value (key) corresponding to each piece of to-be-processed data. The server then performs hash calculation on each key value using the hash function corresponding to the initial bit array set to obtain the hash value hash(key) of that key value. On this basis, the server performs modulo processing on the hash value hash(key) using the modulo function corresponding to each initial bit array to obtain the modulo value corresponding to each piece of to-be-processed data in each initial bit array. Specifically, the modulo processing is performed through the following formula (1):
$$y_i = \mathrm{mod}_i(\mathrm{hash}(\mathrm{key})), \qquad y_i \in \{1, \ldots, L\} \qquad (1)$$
where y_i is the modulus value corresponding to the to-be-processed data in the initial bit array i, mod_i is the modulo function corresponding to the initial bit array i, hash(key) is the hash value of the key value corresponding to the to-be-processed data, L is the number of pieces of to-be-processed data in the to-be-processed data set, i denotes the initial bit array i, and key is the key value.
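Formula (1) may be illustrated with the following minimal sketch, in which the key value is hashed once and the same hash value is then passed through one modulo function per initial bit array. The application does not specify the hash function or the form of mod_i, so the SHA-256 hash, the Mersenne-prime moduli in `MODULI`, and the function names are all assumptions made for illustration.

```python
import hashlib

# Hypothetical moduli, one per initial bit array (assumption: the source does not give them).
MODULI = (2**31 - 1, 2**61 - 1, 2**89 - 1)


def hash_key(key: str) -> int:
    """Hash the key value once; SHA-256 is an illustrative choice only."""
    return int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest(), "big")


def modulus_values(key: str, length: int, moduli=MODULI) -> list[int]:
    """Return y_i in {1, ..., length} for every bit array i, from a single hash value."""
    h = hash_key(key)
    # One assumed form of mod_i: reduce by a per-array modulus, then into 1..length.
    return [(h % m) % length + 1 for m in moduli]


ys = modulus_values("user-42", length=8)
print(len(ys), all(1 <= y <= 8 for y in ys))  # 3 True
```

Because `hash_key` is called only once per key value, adding further bit arrays adds only cheap modulo operations rather than further hash calculations.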
In the embodiment of the application, another data processing method is provided. With this method, a hash function is used to obtain the hash value of the key value corresponding to the to-be-processed data, and the modulo functions corresponding to the plurality of initial bit arrays are then used to take the modulus of that hash value, so that the resulting modulus values can subsequently be associated with the position parameters of the initial arrays, which ensures the feasibility of the data processing. In addition, the key value corresponding to each piece of to-be-processed data does not need to be hashed multiple times, which reduces the running time of the hash calculation and thus improves the efficiency of the data processing.
Optionally, on the basis of the embodiment corresponding to fig. 3, in an optional embodiment of the data processing method provided in the embodiment of the present application, the data processing method further includes:
analyzing L data to be processed in the data set to be processed to obtain a non-key value corresponding to each data to be processed;
and after determining the target data from the L data to be processed based on the target bit array set, the method of data processing further comprises:
the target data is discarded.
In this embodiment, the server can also parse the L pieces of to-be-processed data in the to-be-processed data set to obtain the non-key value corresponding to each piece of to-be-processed data. It should be understood that this parsing may be performed at the same time as the key value corresponding to the to-be-processed data is obtained, or it may be performed only after the to-be-processed data has been determined not to be target data; this is not specifically limited herein. Secondly, after the server determines the target data from the L pieces of to-be-processed data based on the target bit array set, the target data is discarded. Repeated data can thus be filtered out and discarded while using few storage resources, which avoids wasting storage space, ensures a reasonable allocation of the storage resources of the storage space, and reduces the operation and maintenance cost of the storage system.
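The discarding of target data may be illustrated with a streaming sketch under the bloom filter reading invoked in the foregoing embodiment, in which each modulus value selects one slot of the corresponding bit array and that slot is set to 1. This is a standard Bloom-filter-style check swapped in for illustration, not necessarily the exact procedure of the application; the class `BloomDeduplicator`, the function `drop_duplicates`, the SHA-256 hash, and the three moduli are assumptions. Slots are indexed from 0 here.

```python
import hashlib


class BloomDeduplicator:
    """N bit arrays of length L; one hash per key, one modulo function per array."""

    # Hypothetical per-array moduli; the source does not specify them.
    MODULI = (2**31 - 1, 2**61 - 1, 2**89 - 1)

    def __init__(self, length: int):
        self.length = length
        self.arrays = [[0] * length for _ in self.MODULI]

    def _positions(self, key: str) -> list[int]:
        h = int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest(), "big")
        return [(h % m) % self.length for m in self.MODULI]

    def seen_before(self, key: str) -> bool:
        """Report whether the key looks repeated, then record it in every array."""
        positions = self._positions(key)
        repeated = all(arr[p] != 0 for arr, p in zip(self.arrays, positions))
        for arr, p in zip(self.arrays, positions):
            arr[p] = 1
        return repeated


def drop_duplicates(records, length=1024):
    """Keep (key, non_key) records whose key has not been seen; discard the rest."""
    dedup = BloomDeduplicator(length)
    return [(key, non_key) for key, non_key in records if not dedup.seen_before(key)]


kept = drop_duplicates([("a", 1), ("b", 2), ("a", 3)])
print(("a", 3) in kept)  # False: the second record with key "a" is discarded
```

As with any Bloom filter, a non-repeated key can occasionally be reported as repeated when all of its slots were already set by other keys, whereas a repeated key is never missed; a larger length lowers that probability.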
Optionally, on the basis of the embodiment corresponding to fig. 3, in an optional embodiment of the data processing method provided in the embodiment of the present application, the data processing method further includes:
determining cache data from the L data to be processed based on the target bit array set, wherein at least one of the position parameters of the N target arrays associated with the cache data is the initial value, and the cache data is non-repeated data;
and writing the key value corresponding to the cache data and the non-key value corresponding to the cache data into the cache.
In this embodiment, the server can not only determine the target data in step 105 but can also determine cache data from the L pieces of to-be-processed data based on the target bit array set, where at least one of the position parameters of the N target arrays associated with the cache data is the initial value. Following the operation logic of the bloom filter mentioned in the foregoing embodiment, if any of the position parameters of the target arrays associated with a piece of to-be-processed data is 0, that data has not been processed, and it is therefore determined to be cache data (non-repeated data). Further, since the server obtains the non-key value corresponding to each piece of to-be-processed data by the method described in the foregoing embodiment, the server may write the key value corresponding to the cache data and the non-key value corresponding to the cache data into the cache.
Illustratively, referring to fig. 7, fig. 7 is a schematic diagram of an embodiment of determining cache data provided by the present application. As shown in fig. 7, the target bit array set includes a target bit array E1 and a target bit array E2. The target bit array E1 includes a target array E11 associated with to-be-processed data 1 and a target array E12 associated with to-be-processed data 2; the position parameter of the target array E11 is "0", and the position parameter of the target array E12 is "1". The target bit array E2 includes a target array E21 associated with to-be-processed data 1 and a target array E22 associated with to-be-processed data 2; the position parameter of the target array E21 is "1", and the position parameter of the target array E22 is "0". Therefore, although the position parameter of the target array E21 associated with to-be-processed data 1 is different from the initial value, the position parameter of the target array E11 associated with to-be-processed data 1 is the same as the initial value, so to-be-processed data 1 can be determined as cache data. Similarly, the position parameter of the target array E12 associated with to-be-processed data 2 is different from the initial value, but the position parameter of the target array E22 associated with to-be-processed data 2 is the same as the initial value, so to-be-processed data 2 can also be determined as cache data.
It should be understood that the foregoing example is only used for understanding the present solution. In practical applications, regardless of the values of the other position parameters of the target arrays associated with a piece of to-be-processed data, as long as at least one of those position parameters is the initial value (0 in the present application), the to-be-processed data can be used as cache data, so the foregoing example should not be construed as a limitation of the present solution.
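The determination of cache data and the write to the cache may be sketched as follows, using the position parameters described for fig. 7. The function name `collect_cache_data`, the use of a dictionary as the cache, and the sample key and non-key values are hypothetical and serve only as an illustration.

```python
def collect_cache_data(records, target_bit_arrays, initial_value=0):
    """Cache every record that has at least one slot still equal to initial_value.

    records[j] is a (key, non_key) pair; target_bit_arrays[i][j] is the slot of
    bit array i associated with record j.
    """
    cache = {}
    for j, (key, non_key) in enumerate(records):
        slots = [array[j] for array in target_bit_arrays]
        if any(slot == initial_value for slot in slots):
            cache[key] = non_key  # non-repeated data: write key and non-key value
    return cache


e1 = [0, 1]  # slots of data 1 and data 2 in the first target bit array
e2 = [1, 0]  # slots of data 1 and data 2 in the second target bit array
records = [("user-1", {"action": "click"}), ("user-2", {"action": "view"})]
print(sorted(collect_cache_data(records, [e1, e2])))  # ['user-1', 'user-2']
```

Both records are cached because each of them has one slot that still holds the initial value.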
According to the data processing method, the waste of the storage space is avoided by discarding the repeated data, so that the reasonable distribution of the storage resources of the storage space can be ensured, and the operation and maintenance cost of the storage system is reduced. Secondly, by writing the key value and the non-key value corresponding to the non-repeated cache data into the cache, when the cache data needs to be processed in the storage space, the cache data can be directly obtained from the cache address, and the data processing efficiency is improved on the basis of ensuring the reliability of data processing.
Optionally, on the basis of the embodiment corresponding to fig. 3, in an optional embodiment of the data processing method provided in the embodiment of the present application, the data processing method further includes:
determining a key value corresponding to the cache data as a non-repeated key value;
counting the cache data in the L data to be processed to obtain the number of non-repeated key values, wherein the number of the non-repeated key values is the number of all the cache data in the L data to be processed;
counting the modulus values of all cache data in the L data to be processed in each initial bit array;
and determining summarized data of the non-repeated key values based on the number of the non-repeated key values and modulus values corresponding to all cache data in the L data to be processed in each initial bit array, wherein the summarized data is used for performing data analysis processing on the data set to be processed.
In this embodiment, the server may further determine the key value corresponding to the cache data as a non-repeated key value, and then count the cache data among the L pieces of to-be-processed data to obtain the number of non-repeated key values. Since the server can obtain the modulus value corresponding to each piece of to-be-processed data in each initial bit array through the foregoing embodiment, once all the cache data has been determined from the L pieces of to-be-processed data, the server can also collect the modulus values corresponding to all the cache data in each initial bit array. Finally, the server determines summarized data of the non-repeated key values based on the number of non-repeated key values and the modulus values of all the cache data in each initial bit array. The summarized data is used for performing data analysis processing on the to-be-processed data set, such as data monitoring or abnormal behavior monitoring on the to-be-processed data set, or obtaining user model characteristics based on the to-be-processed data set; this is not specifically limited herein.
Specifically, the server determines summary data of non-duplicate key values by the following equation (2):
Figure BDA0003319187870000121
where s is the summarized data of the non-repeated key values, K is the number of non-repeated key values, and the remaining term of the formula is the modulus value corresponding, in each initial bit array, to all the cache data among the L pieces of to-be-processed data.
According to the data processing method, repeated data are discarded, caching is carried out on non-repeated data, and finally regions in a storage space are non-repeated and can be processed, so that the related information of the non-repeated data is counted to obtain summarized data, operation and maintenance personnel can carry out data analysis processing on a data set to be processed based on the summarized data, and the stability of the system is guaranteed.
Optionally, on the basis of the embodiment corresponding to fig. 3, in an optional embodiment of the data processing method provided in the embodiment of the present application, the obtaining a to-be-processed data set specifically includes:
acquiring an initial data set, wherein the initial data set comprises M initial data, each initial data corresponds to a topic type one by one, the M initial data are sourced from at least two data sources, and M is an integer greater than L;
analyzing each initial data in the initial data set to obtain a theme type corresponding to each initial data;
and extracting initial data with the theme type as the target theme type to generate a to-be-processed data set.
In this embodiment, user behaviors and article data from different terminal devices are collected through SDK embedded points and reported to a server, where the user behaviors and the article data are the initial data. The server parses and processes the initial data from the different terminal devices and sets an at-least-once mode, so that the initial data is not lost but may be repeated.
In the embodiment of the application, another data processing method is provided. With this method, the initial data sets from different terminal devices are processed so that initial data of different theme types is distributed to different servers for processing, which ensures the reliability and efficiency with which each server processes data.
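The extraction of the to-be-processed data set by theme type may be sketched as follows. The function name `extract_pending_data`, the dictionary fields `topic`, `key`, and `source`, and the sample values are hypothetical and are used only to illustrate the filtering step.

```python
def extract_pending_data(initial_data, target_topic):
    """Keep only the initial data whose parsed theme type equals target_topic."""
    return [item for item in initial_data if item.get("topic") == target_topic]


initial_data = [
    {"topic": "click", "key": "u1", "source": "device-A"},
    {"topic": "purchase", "key": "u2", "source": "device-B"},
    # Under at-least-once reporting the same key may arrive more than once.
    {"topic": "click", "key": "u1", "source": "device-B"},
]
pending = extract_pending_data(initial_data, "click")
print(len(pending))  # 2
```

Here M is 3 and L is 2, and the repeated key `u1` remains in the to-be-processed data set until the subsequent duplicate check handles it.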
Fig. 8 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application, and as shown in fig. 8, the data processing apparatus includes:
an obtaining module 801, configured to obtain a to-be-processed data set, where the to-be-processed data set includes L to-be-processed data, and L is an integer greater than 1;
a creating module 802, configured to create an initial bit array set based on a to-be-processed data set, where the initial bit array set includes N initial bit arrays, each initial bit array includes L initial bit arrays, the initial bit arrays are associated with the to-be-processed data one by one, and at least one position parameter of a position parameter of each initial bit array is an initial value, and N is an integer greater than 1;
a processing module 803, configured to process each to-be-processed data in the to-be-processed data based on the initial bit array set, to obtain a modulus value corresponding to each to-be-processed data in each initial bit array, where the modulus value corresponds to one initial array of the L initial bit arrays included in the initial bit array;
a replacing module 804, configured to replace a position parameter of each initial array in the N initial bit arrays with a corresponding modulo value of each to-be-processed data in each initial bit array, to obtain a target bit array set, where the target bit array set includes N target bit arrays, each target bit array includes L target bit arrays, the target bit arrays are associated with the to-be-processed data one by one, and the position parameter of each target bit array is the corresponding modulo value of each to-be-processed data in each initial bit array;
the determining module 805 is configured to determine target data from the L data to be processed based on the target bit array set, where the position parameters of the N target arrays associated with the target data are different from the initial values, and the target data is repeated data.
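As a rough illustration of how the creating, processing, replacing and determining modules could cooperate, the sketch below keeps N arrays of flags and treats a key as repeated when all N of its positions already differ from the initial value. This is one reading of the text, close to a partitioned Bloom filter, so a non-repeated key can occasionally be misreported as repeated. The class and function names, the array sizes, the use of SHA-256, and the choice to store 1 rather than the modulus value are all assumptions.

```python
import hashlib


class BitArraySet:
    """N arrays of flags; every position starts at the initial value 0."""

    def __init__(self, sizes):
        # N = len(sizes). A bytearray spends one byte per position, which
        # keeps the sketch short; a real bit array would pack 8 per byte.
        self.sizes = list(sizes)
        self.arrays = [bytearray(size) for size in self.sizes]

    def positions(self, key):
        # One shared hash, then one modulus per array (assumed scheme).
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        hashed = int.from_bytes(digest, "big")
        return [hashed % size for size in self.sizes]

    def check_and_mark(self, key):
        """Return True if all N positions were already non-initial."""
        positions = self.positions(key)
        repeated = all(a[p] for a, p in zip(self.arrays, positions))
        for a, p in zip(self.arrays, positions):
            a[p] = 1  # overwrite the initial value
        return repeated


def split_repeated(keys, sizes=(1009, 1013, 1019)):
    """Return (target_data, retained) for a sequence of key strings."""
    bits = BitArraySet(sizes)
    target, retained = [], []
    for key in keys:
        (target if bits.check_and_mark(key) else retained).append(key)
    return target, retained
```

Calling `split_repeated(["a", "b", "a", "c", "b"])` places the second occurrences of `"a"` and `"b"` in the target (repeated) list, while the first occurrence of each key is retained.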
Optionally, on the basis of the embodiment corresponding to fig. 8, in another embodiment of the data processing apparatus 800 provided in the embodiment of the present application, the processing module 803 is specifically configured to create a hash function corresponding to the initial bit array set;
creating a modulus function corresponding to each initial bit array;
and processing each to-be-processed data in the to-be-processed data based on the hash function corresponding to the initial bit array set and the modulus function corresponding to each initial bit array to obtain a modulus value corresponding to each to-be-processed data in each initial bit array.
Optionally, on the basis of the embodiment corresponding to fig. 8, in another embodiment of the data processing apparatus 800 provided in this embodiment of the present application, the processing module 803 is specifically configured to perform analysis processing on L pieces of to-be-processed data in the to-be-processed data set, and obtain a key value corresponding to each piece of to-be-processed data;
performing hash calculation on the key value corresponding to each piece of data to be processed by using a hash function corresponding to the initial bit array set to obtain a hash value of the key value corresponding to each piece of data to be processed;
and performing modulo processing on the hash value of the key value corresponding to each to-be-processed data by using the modulo function corresponding to each initial bit array to obtain a modulo value corresponding to each to-be-processed data in each initial bit array.
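The three steps above (parse the key value, hash it, take one modulus per bit array) can be sketched as follows. The JSON record layout, the `event_id` key field, the SHA-256 hash truncated to 64 bits, and the array sizes are assumptions chosen for illustration; the application does not name a specific hash function.

```python
import hashlib
import json


def extract_key(record_json, key_field="event_id"):
    """Parse one record and return its key value as a string."""
    return str(json.loads(record_json)[key_field])


def hash_key(key):
    """Shared hash function for the whole bit array set."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def make_modulus_functions(sizes):
    """One modulus function per bit array; each maps a hash to a position."""
    return [lambda h, m=size: h % m for size in sizes]


def modulus_values(record_json, modulus_functions):
    """Return the modulus value of one record in each bit array."""
    hashed = hash_key(extract_key(record_json))
    return [f(hashed) for f in modulus_functions]
```

Because only the key value is hashed, two records that share a key but differ in their non-key values map to the same positions, which is what allows them to be recognized as repeats.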
Optionally, on the basis of the embodiment corresponding to fig. 8, in another embodiment of the data processing apparatus 800 provided in the embodiment of the present application, the obtaining module 801 is further configured to perform parsing on L pieces of data to be processed in the data set to be processed, and obtain a non-key value corresponding to each piece of data to be processed;
the processing module 803 is further configured to discard the target data after the determining module 805 determines the target data from the L data to be processed based on the target bit array set.
Optionally, on the basis of the embodiment corresponding to fig. 8, in another embodiment of the data processing apparatus 800 provided in the embodiment of the present application, the data processing apparatus further includes a read/write module 806;
the determining module 805 is further configured to determine cache data from the L data to be processed based on the target bit array set, where at least one position parameter among the position parameters of the N target arrays associated with the cache data is the initial value, and the cache data is non-duplicate data;
the read/write module 806 is configured to write the key value corresponding to the cached data and the non-key value corresponding to the cached data into the cache.
Optionally, on the basis of the embodiment corresponding to fig. 8, in another embodiment of the data processing apparatus 800 provided in the embodiment of the present application, the data processing apparatus further includes a statistics module 807;
the determining module 805 is further configured to determine a key value corresponding to the cached data as a non-duplicate key value; determining summarized data of the non-repeated key values based on the number of the non-repeated key values and modulus values of all cache data in the L data to be processed in each initial bit array, wherein the summarized data is used for performing data analysis processing on a data set to be processed;
the processing module 803 is further configured to count the cache data in the L pieces of data to be processed to obtain the number of non-duplicate key values, where the number of non-duplicate key values is the number of all cache data in the L pieces of data to be processed;
the statistics module 807 is configured to collect the modulo values corresponding, in each initial bit array, to all cache data in the L pieces of data to be processed.
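The inputs to the summarized data can be gathered with a short sketch like the one below. It assumes each cached (non-duplicate) item is available as a pair of its key value and its list of N modulo values; this pair layout and the name `collect_summary_inputs` are hypothetical. The formula that turns these inputs into the summarized data s is given earlier in the description and is deliberately not reproduced here.

```python
from collections import Counter


def collect_summary_inputs(cached):
    """cached: list of (key, modulo_values) pairs for non-duplicate data.

    Returns K, the number of non-duplicate key values, and one Counter per
    bit array recording how often each modulo value occurs.
    """
    k = len(cached)  # every cached item counts as one non-duplicate key
    n_arrays = len(cached[0][1]) if cached else 0
    per_array = [
        Counter(values[i] for _, values in cached) for i in range(n_arrays)
    ]
    return {"K": k, "modulo_counts": per_array}


summary = collect_summary_inputs([("a", [1, 2]), ("b", [1, 5])])
```

In this example `summary["K"]` is 2, and the first bit array records the modulo value 1 twice, because both cached keys map to the same position in that array.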
Optionally, on the basis of the embodiment corresponding to fig. 8, in another embodiment of the data processing apparatus 800 provided in the embodiment of the present application, the obtaining module 801 is specifically configured to obtain an initial data set, where the initial data set includes M initial data, each initial data corresponds to a topic type one to one, the M initial data is derived from at least two data sources, and M is an integer greater than L;
analyzing each initial data in the initial data set to obtain a topic type corresponding to each initial data;
and extracting the initial data whose topic type is the target topic type to generate a to-be-processed data set.
Referring to fig. 9, fig. 9 is a schematic diagram of an embodiment of a server in the embodiment of the present application. As shown in fig. 9, the server 1000 may vary considerably depending on its configuration or performance, and may include one or more central processing units (CPUs) 1022 (e.g., one or more processors), a memory 1032, and one or more storage media 1030 (e.g., one or more mass storage devices) storing an application 1042 or data 1044. The memory 1032 and the storage medium 1030 may be transient or persistent storage. The program stored on the storage medium 1030 may include one or more modules (not shown), each of which may include a series of instruction operations for the server. Further, the central processing unit 1022 may be configured to communicate with the storage medium 1030 and to execute, on the server 1000, the series of instruction operations in the storage medium 1030.
The server 1000 may also include one or more power supplies 1026, one or more wired or wireless network interfaces 1050, one or more input-output interfaces 1058, and/or one or more operating systems 1041, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and so on.
The steps performed by the server in the above embodiments may be based on the server structure shown in fig. 9.
The CPU 1022 included in the server is configured to execute the embodiment shown in fig. 3 and the embodiments corresponding to fig. 3.
Also provided in the embodiments of the present application is a computer-readable storage medium, which stores a computer program, and when the computer program runs on a computer, the computer program causes the computer to execute the steps performed by the server in the method described in the foregoing embodiment shown in fig. 3.
Also provided in an embodiment of the present application is a computer program product including a program, which when run on a computer causes the computer to perform the steps performed by the server in the method described in the embodiment of fig. 3.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative: the division of the units is only a logical division, and other divisions are possible in actual implementation; for instance, at least two units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may also be distributed on at least two network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application in essence, or the part of it that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
The above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims (10)

1. A method of data processing, comprising:
acquiring a data set to be processed, wherein the data set to be processed comprises L data to be processed, and L is an integer greater than 1;
creating an initial bit array set based on the data set to be processed, wherein the initial bit array set comprises N initial bit arrays, each initial bit array comprises L initial bit arrays, the initial bit arrays are associated with the data to be processed one by one, at least one position parameter of the position parameters of each initial bit array is an initial value, and N is an integer greater than 1;
processing each piece of data to be processed in the pieces of data to be processed based on the initial bit array set to obtain a modulus value corresponding to each piece of data to be processed in each initial bit array, wherein the modulus value corresponds to one initial array in the L initial arrays included in the initial bit array;
replacing the position parameter of each initial array in the N initial bit arrays with a corresponding modulus value of each to-be-processed data in each initial bit array to obtain a target bit array set, wherein the target bit array set comprises N target bit arrays, each target bit array comprises L target bit arrays, the target bit arrays are associated with the to-be-processed data one by one, and the position parameter of each target bit array is the corresponding modulus value of each to-be-processed data in each initial bit array;
determining target data from the L data to be processed based on the target bit array set, wherein the position parameters of N target arrays associated with the target data are different from the initial values, and the target data are repeated data.
2. The method according to claim 1, wherein the processing each to-be-processed data in the to-be-processed data based on the initial bit array set to obtain a modulus value corresponding to each to-be-processed data in each initial bit array comprises:
creating a hash function corresponding to the initial bit array set;
creating a modulus function corresponding to each initial bit array;
and processing each piece of data to be processed in the data to be processed based on the hash function corresponding to the initial bit array set and the modulus function corresponding to each initial bit array to obtain a modulus value corresponding to each piece of data to be processed in each initial bit array.
3. The method according to claim 2, wherein the processing each to-be-processed data in the to-be-processed data based on the hash function corresponding to the set of initial bit arrays and the modulo function corresponding to each initial bit array to obtain a modulo value corresponding to each to-be-processed data in each initial bit array comprises:
analyzing the L pieces of data to be processed in the data set to be processed to obtain a key value corresponding to each piece of data to be processed;
performing hash calculation on the key value corresponding to each piece of data to be processed by using a hash function corresponding to the initial bit array set to obtain a hash value of the key value corresponding to each piece of data to be processed;
and performing modulo processing on the hash value of the key value corresponding to each piece of data to be processed by using the modulo function corresponding to each initial bit array to obtain a modulo value corresponding to each piece of data to be processed in each initial bit array.
4. The method of claim 3, further comprising:
analyzing the L pieces of data to be processed in the data set to be processed to obtain a non-key value corresponding to each piece of data to be processed;
after the determining target data from the L data to be processed based on the target bit array set, the method further comprises:
discarding the target data.
5. The method of claim 4, further comprising:
determining cache data from the L data to be processed based on the target bit array set, wherein the position parameter of each target array in the N target arrays associated with the cache data is the initial value, and the cache data is non-repeated data;
and writing the key value corresponding to the cache data and the non-key value corresponding to the cache data into a cache.
6. The method of claim 5, further comprising:
determining a key value corresponding to the cache data as a non-repeated key value;
counting the cache data in the L data to be processed to obtain the number of non-repeated key values, wherein the number of the non-repeated key values is the number of all cache data in the L data to be processed;
counting the modulus values of all cache data in the L data to be processed corresponding to each initial bit array;
and determining summarized data of the non-repeated key values based on the number of the non-repeated key values and modulus values corresponding to all cache data in the L data to be processed in each initial bit array, wherein the summarized data is used for performing data analysis processing on the data set to be processed.
7. The method of claim 1, wherein the obtaining the set of data to be processed comprises:
acquiring an initial data set, wherein the initial data set comprises M initial data, each initial data corresponds to a topic type one by one, the M initial data are sourced from at least two data sources, and M is an integer greater than L;
analyzing each initial data in the initial data set to obtain a topic type corresponding to each initial data;
and extracting the initial data whose topic type is the target topic type, and generating the data set to be processed.
8. A data processing apparatus, characterized in that the data processing apparatus comprises:
an obtaining module, configured to obtain a data set to be processed, wherein the data set to be processed comprises L data to be processed, and L is an integer greater than 1;
a creating module, configured to create an initial bit array set based on the to-be-processed data set, where the initial bit array set includes N initial bit arrays, each initial bit array includes L initial bit arrays, the initial bit arrays are associated with the to-be-processed data one by one, and at least one position parameter of a position parameter of each initial bit array is an initial value, and N is an integer greater than 1;
a processing module, configured to process each piece of data to be processed in the pieces of data to be processed based on the initial bit array set to obtain a modulus value corresponding to each piece of data to be processed in each initial bit array, wherein the modulus value corresponds to one initial array in the L initial arrays included in the initial bit array;
a replacement module, configured to replace a position parameter of each initial array in the N initial bit arrays with a corresponding modulo value of each to-be-processed data in each initial bit array, so as to obtain a target bit array set, where the target bit array set includes N target bit arrays, each target bit array includes L target bit arrays, the target bit arrays are associated with the to-be-processed data one by one, and a position parameter of each target bit array is a corresponding modulo value of each to-be-processed data in each initial bit array;
a determining module, configured to determine target data from the L data to be processed based on the target bit array set, where position parameters of N target arrays associated with the target data are different from the initial values, and the target data is repeated data.
9. A computer device, comprising: a memory, a transceiver, a processor, and a bus system;
wherein the memory is used for storing programs;
the processor is configured to execute a program in the memory to implement the method of any one of claims 1 to 7;
the bus system is used for connecting the memory and the processor so as to enable the memory and the processor to communicate.
10. A computer-readable storage medium comprising instructions which, when executed on a computer, cause the computer to perform the method of any one of claims 1 to 7.
CN202111240901.4A 2021-10-25 2021-10-25 Data processing method and device, computer equipment and storage medium Pending CN113934767A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111240901.4A CN113934767A (en) 2021-10-25 2021-10-25 Data processing method and device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113934767A true CN113934767A (en) 2022-01-14

Family

ID=79284267

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111240901.4A Pending CN113934767A (en) 2021-10-25 2021-10-25 Data processing method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113934767A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116702401A (en) * 2023-08-07 2023-09-05 腾讯科技(深圳)有限公司 Data processing method, related device, equipment and storage medium
CN116702401B (en) * 2023-08-07 2023-12-08 腾讯科技(深圳)有限公司 Data processing method, related device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination