CN112347355A - Data processing method, device, server and storage medium - Google Patents

Publication number
CN112347355A
Authority
CN
China
Prior art keywords
content
bit array
filtering
carried
identifier
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011251902.4A
Other languages
Chinese (zh)
Inventor
饶慧林
黄志濠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Kugou Computer Technology Co Ltd
Original Assignee
Guangzhou Kugou Computer Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Kugou Computer Technology Co Ltd
Priority to CN202011251902.4A
Publication of CN112347355A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/283 Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The application provides a data processing method, a data processing device, a server and a storage medium, and belongs to the technical field of big data. The method comprises the following steps: generating a first bit array based on a plurality of first content identifiers in a plurality of first historical behavior data corresponding to any user identifier in a data warehouse, wherein each first content identifier corresponds to a plurality of bits in the first bit array; storing the correspondence between the user identifier and the first bit array; in response to a filtering request, determining a target bit array corresponding to the user identifier carried by the filtering request based on the stored correspondence between user identifiers and bit arrays; and determining a filtering result for the content identifier carried by the filtering request based on the target bit array. Because a large number of content identifiers are represented by a binary bit array and the filtering service is supported by storing bit arrays, the storage space occupied for supporting the filtering service can be greatly reduced, and the utilization rate of the storage space is improved.

Description

Data processing method, device, server and storage medium
Technical Field
The present application relates to the field of big data technologies, and in particular, to a data processing method, an apparatus, a server, and a storage medium.
Background
In some scenarios, there is a need for filtering based on massive amounts of data in order to provide better services to users. For example, in a scenario of content recommendation to a user, it is necessary to filter out content already recommended to the user from candidate content, and then determine content recommended to the user from remaining candidate content.
In the related art, the user identifier and the content identifier of the recommended content are stored correspondingly in advance, so that the recommended content can be determined according to the corresponding relation between the stored user identifier and the content identifier of the recommended content, and then the recommended content is filtered.
Because the content identifier is usually in a string format, in the case of a large storage amount of the content identifier, directly storing the content identifier occupies a huge storage space, resulting in a low utilization rate of the storage space.
Disclosure of Invention
The embodiment of the application provides a data processing method, a data processing device, a server and a storage medium, and can improve the utilization rate of a storage space. The technical scheme is as follows:
according to an aspect of an embodiment of the present application, there is provided a data processing method, including:
generating a first bit array based on a plurality of first content identifications in a plurality of first historical behavior data corresponding to any user identification in a data warehouse, wherein the first historical behavior data is used for recording the content of the executed interactive behavior, and each first content identification corresponds to a plurality of bits in the first bit array respectively;
storing the correspondence between the user identifier and the first bit array;
in response to a filtering request, determining a target bit array corresponding to the user identifier carried by the filtering request based on the stored correspondence between user identifiers and bit arrays;
and determining a filtering result of the content identification carried by the filtering request based on the target bit array.
In an optional implementation manner, the generating a first bit array based on a plurality of first content identifiers in a plurality of first historical behavior data corresponding to any user identifier in a data warehouse includes:
in response to a query compression statement, calling a query compression module, and generating the first bit array, in a memory corresponding to the data warehouse tool hive, based on the plurality of first content identifiers in the plurality of first historical behavior data corresponding to the user identifier in the data warehouse;
the query compression statement is used for indicating to obtain a first bit array used for representing a plurality of first content identifications corresponding to any user identification, and the query compression module is a functional module expanded on the basis of the hive.
In another alternative implementation, the first historical behavior data is used for recording the content of the executed interactive behavior in a first time period;
the storing of the corresponding relationship between the any user identifier and the first bit array includes:
generating a first keyword based on the any user identification and the first time period;
and storing the corresponding relation between the first keyword and the first bit array.
In another optional implementation manner, after storing the corresponding relationship between the any user identifier and the first bit array, the method further includes:
in response to reaching a bit array update time, generating a second bit array based on a plurality of second content identifiers in a plurality of second historical behavior data corresponding to the user identifier in the data warehouse, wherein the second historical behavior data is used for recording the content on which interactive behaviors were performed in a second time period to which the bit array update time belongs;
and updating the stored bit array corresponding to the second time period to the second bit array.
In another alternative implementation, the first historical behavior data is used for recording the content of the interactive behavior of the executed target type in a first time period;
the storing of the corresponding relationship between the any user identifier and the first bit array includes:
generating a second keyword based on the any user identification, the first time period and the target type of interaction behavior in the first historical behavior data;
and storing the corresponding relation between the second keyword and the first bit array.
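The two keyword variants above can be sketched as a single key-building helper. This is only an illustration: the function name `make_key`, the `:` separator, the `YYYY-MM` period format, and the sample identifiers are assumptions, not details given in the application.

```python
from typing import Dict, Optional


def make_key(user_id: str, period: str, behavior_type: Optional[str] = None) -> str:
    """Build a storage key from the user identifier, the time period,
    and (optionally) the target type of interactive behavior."""
    parts = [user_id, period]
    if behavior_type is not None:
        parts.append(behavior_type)
    # The ':' separator is an assumed convention for this sketch.
    return ":".join(parts)


# A plain dict stands in for whatever key-value store is actually used.
store: Dict[str, bytes] = {}
store[make_key("u1001", "2020-11")] = b"\x5d\x02"          # first keyword
store[make_key("u1001", "2020-11", "play")] = b"\x01\x00"  # second keyword
```

Keeping the period and behavior type in the key lets bit arrays for different months or behavior types be stored and replaced independently.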
In another optional implementation manner, before storing the corresponding relationship between the any user identifier and the first bit array, the method further includes:
encoding the first bit array to generate a character string;
the storing of the corresponding relationship between the any user identifier and the first bit array includes:
and storing the corresponding relation between any user identification and the character string.
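The application does not name the encoding used to turn the bit array into a character string. As a hedged illustration, the sketch below assumes Base64 and a `bytearray` representation of the bit array; both choices, and the helper names, are assumptions.

```python
import base64


def encode_bits(bits: bytearray) -> str:
    """Encode a bit array (packed into bytes) as an ASCII string."""
    return base64.b64encode(bytes(bits)).decode("ascii")


def decode_bits(text: str) -> bytearray:
    """Recover the packed bit array from its string form."""
    return bytearray(base64.b64decode(text))


bits = bytearray([0b10010111, 0b01000000])
encoded = encode_bits(bits)
restored = decode_bits(encoded)
```

A string value of this kind can be written to stores that only accept text values, and decoding it yields the original bits unchanged.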
In another optional implementation manner, before determining a filtering result of the content identifier carried by the filtering request based on the target bit array, the method further includes:
receiving at least one piece of real-time behavior data corresponding to any user identification, wherein the real-time behavior data is used for recording the content of the executed interactive behavior;
storing the corresponding relation of the any user identification and at least one content identification in at least one piece of real-time behavior data corresponding to the any user identification;
the determining, based on the target bit array, a filtering result of the content identifier carried by the filtering request includes:
and determining a filtering result of the content identification carried by the filtering request based on the target bit array and at least one third content identification corresponding to the user identification carried by the filtering request, wherein the at least one third content identification is derived from at least one real-time behavior data corresponding to the user identification carried by the filtering request.
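One way to read the combination of the target bit array with real-time behavior data is as a logical OR of two membership checks. The sketch below is written under that assumption; `should_filter`, the callable parameter, and the sample identifiers are hypothetical, and a plain set stands in for the bit-array lookup.

```python
from typing import Callable, Set


def should_filter(content_id: str,
                  in_bit_array: Callable[[str], bool],
                  realtime_ids: Set[str]) -> bool:
    """Filter the content if it appears in the real-time identifiers
    or is reported as present by the bit-array check."""
    return content_id in realtime_ids or in_bit_array(content_id)


# Stand-in for the bit-array membership test (assumption for this sketch).
historical = {"song-1", "song-2"}
# Third content identifiers taken from real-time behavior data.
realtime = {"song-3"}

results = {
    cid: should_filter(cid, historical.__contains__, realtime)
    for cid in ["song-1", "song-3", "song-4"]
}
```

Here `song-3` is filtered even though it has not yet reached the historical data, which is the gap the real-time identifiers are meant to cover.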
In another optional implementation manner, the generating a first bit array based on a plurality of first content identifiers in a plurality of first historical behavior data corresponding to any user identifier in a data warehouse includes:
processing each first content identifier based on a hash processing mode corresponding to the first bit array to obtain a plurality of bits of the first bit array corresponding to each first content identifier;
setting values on a plurality of bits of the first bit array corresponding to each first content identifier to be 1.
In another optional implementation manner, the determining, based on the target bit array, a filtering result of the content identifier carried by the filtering request includes:
processing the content identification carried by the filtering request based on a hash processing mode corresponding to the target bit array to obtain a plurality of bits of the target bit array corresponding to the content identification carried by the filtering request;
in response to the values on the plurality of bits of the target bit array corresponding to the content identifier carried by the filtering request including 0, determining that the content identifier carried by the filtering request is not among the plurality of content identifiers represented by the target bit array, and determining that the filtering result is not to filter the content identifier carried by the filtering request;
and in response to that the values on the bits of the target bit array corresponding to the content identifiers carried by the filtering request are all 1, determining that the content identifiers represented by the target bit array comprise the content identifiers carried by the filtering request, and determining that the filtering result is to filter the content identifiers carried by the filtering request.
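The hash-based generation and lookup described above behave like a Bloom filter. The following is a minimal sketch under stated assumptions: the array size `NUM_BITS`, the number of hash positions `NUM_HASHES`, and the use of salted SHA-256 as the hash processing mode are all illustrative choices that the application does not specify.

```python
import hashlib
from typing import Iterable, List

NUM_BITS = 1024   # assumed size of the bit array
NUM_HASHES = 3    # assumed number of bits per content identifier


def bit_positions(content_id: str, num_bits: int = NUM_BITS,
                  num_hashes: int = NUM_HASHES) -> List[int]:
    """Map a content identifier to several bit positions via salted hashes."""
    positions = []
    for seed in range(num_hashes):
        digest = hashlib.sha256(f"{seed}:{content_id}".encode("utf-8")).digest()
        positions.append(int.from_bytes(digest[:8], "big") % num_bits)
    return positions


def build_bit_array(content_ids: Iterable[str], num_bits: int = NUM_BITS) -> bytearray:
    """Set to 1 every bit that corresponds to one of the content identifiers."""
    bits = bytearray((num_bits + 7) // 8)
    for cid in content_ids:
        for pos in bit_positions(cid, num_bits):
            bits[pos // 8] |= 1 << (pos % 8)
    return bits


def is_filtered(bits: bytearray, content_id: str, num_bits: int = NUM_BITS) -> bool:
    """Return True only if all bits for the identifier are 1."""
    return all(bits[pos // 8] & (1 << (pos % 8))
               for pos in bit_positions(content_id, num_bits))


bits = build_bit_array(["song-1", "song-2"])
```

With these assumed sizes the array occupies 128 bytes regardless of how long the identifier strings are. A 0 at any position means the identifier was never added; as with any hash-based bit array, all 1s can occasionally occur for an identifier that was never added, so the array size and hash count trade storage against that error rate.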
According to another aspect of embodiments of the present application, there is provided a data processing apparatus, including:
the bit array generating module is used for generating a first bit array based on a plurality of first content identifiers in a plurality of first historical behavior data corresponding to any user identifier in a data warehouse, wherein the first historical behavior data is used for recording the content on which interactive behaviors have been performed, and each first content identifier corresponds to a plurality of bits in the first bit array;
a first corresponding relation storage module, configured to store a corresponding relation between the any user identifier and the first bit array;
the bit array determining module is used for responding to a filtering request, and determining a target bit array corresponding to the user identifier carried by the filtering request based on the corresponding relation between the stored user identifier and the bit array;
and the filtering result determining module is used for determining the filtering result of the content identification carried by the filtering request based on the target bit array.
In an optional implementation manner, the bit array generating module is configured to:
in response to a query compression statement, calling a query compression module, and generating the first bit array, in a memory corresponding to the data warehouse tool hive, based on the plurality of first content identifiers in the plurality of first historical behavior data corresponding to the user identifier in the data warehouse;
the query compression statement is used for indicating to obtain a first bit array used for representing a plurality of first content identifications corresponding to any user identification, and the query compression module is a functional module expanded on the basis of the hive.
In another alternative implementation, the first historical behavior data is used for recording the content of the executed interactive behavior in a first time period;
the first correspondence storage module is configured to:
generating a first keyword based on the any user identification and the first time period;
and storing the corresponding relation between the first keyword and the first bit array.
In another optional implementation manner, the bit array generating module is further configured to generate, in response to reaching a bit array update time, a second bit array based on a plurality of second content identifiers in a plurality of second historical behavior data corresponding to the user identifier in the data warehouse, where the second historical behavior data is used to record the content on which interactive behaviors were performed in a second time period to which the bit array update time belongs;
the first corresponding relation storage module is further configured to update the stored bit array corresponding to the second time period to the second bit array.
In another alternative implementation, the first historical behavior data is used for recording the content of the interactive behavior of the executed target type in a first time period;
the first correspondence storage module is configured to:
generating a second keyword based on the any user identification, the first time period and the target type of interaction behavior in the first historical behavior data;
and storing the corresponding relation between the second keyword and the first bit array.
In another optional implementation manner, the apparatus further includes:
the bit array encoding module is used for encoding the first bit array to generate a character string;
the first corresponding relation storage module is further configured to store a corresponding relation between the any user identifier and the character string.
In another optional implementation manner, the apparatus further includes:
the real-time behavior data receiving module is used for receiving at least one piece of real-time behavior data corresponding to any user identifier, and the real-time behavior data is used for recording the content of the executed interactive behavior;
the second corresponding relation storage module is used for storing the corresponding relation between any user identification and at least one content identification in at least one piece of real-time behavior data corresponding to any user identification;
the filtering result determining module is configured to determine a filtering result for the content identifier carried in the filtering request based on the target bit array and at least one third content identifier corresponding to the user identifier carried in the filtering request, where the at least one third content identifier is derived from at least one real-time behavior data corresponding to the user identifier carried in the filtering request.
In another optional implementation manner, the bit array generating module is configured to:
processing each first content identifier based on a hash processing mode corresponding to the first bit array to obtain a plurality of bits of the first bit array corresponding to each first content identifier;
setting values on a plurality of bits of the first bit array corresponding to each first content identifier to be 1.
In another optional implementation manner, the filtering result determining module is configured to:
processing the content identification carried by the filtering request based on a hash processing mode corresponding to the target bit array to obtain a plurality of bits of the target bit array corresponding to the content identification carried by the filtering request;
in response to the values on the plurality of bits of the target bit array corresponding to the content identifier carried by the filtering request including 0, determining that the content identifier carried by the filtering request is not among the plurality of content identifiers represented by the target bit array, and determining that the filtering result is not to filter the content identifier carried by the filtering request;
and in response to that the values on the bits of the target bit array corresponding to the content identifiers carried by the filtering request are all 1, determining that the content identifiers represented by the target bit array comprise the content identifiers carried by the filtering request, and determining that the filtering result is to filter the content identifiers carried by the filtering request.
According to another aspect of the embodiments of the present application, there is provided a server, where the server includes a processor and a memory, where at least one program code is stored in the memory, and the at least one program code is loaded and executed by the processor to implement the data processing method in any one of the above-mentioned optional implementation manners.
According to another aspect of embodiments of the present application, there is provided a computer-readable storage medium having at least one program code stored therein, where the at least one program code is loaded and executed by a processor to implement the data processing method in any one of the above-mentioned optional implementation manners.
According to another aspect of embodiments of the present application, there is provided a computer program product or a computer program, the computer program product or the computer program comprising computer program code, the computer program code being stored in a computer-readable storage medium, the computer program code being read by a processor of a server from the computer-readable storage medium, the computer program code being executed by the processor, so that the server performs the data processing method described in any one of the above-mentioned alternative implementations.
According to the technical solution provided by the embodiments of the present application, the content identifier in each piece of historical behavior data corresponding to any user identifier corresponds to a plurality of bits of a bit array, so that the bit array represents the plurality of content identifiers on which the user identified by the user identifier has performed interactive behaviors. When a filtering request carrying a user identifier is received, the filtering result for the content identifier carried by the filtering request can therefore be determined based on the bit array corresponding to that user identifier. In a big data scenario with a large number of content identifiers, representing the content identifiers by a binary bit array and supporting the filtering service by storing bit arrays greatly reduces the storage space occupied for supporting the filtering service and improves the utilization rate of the storage space.
Drawings
In order to illustrate the technical solutions in the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings in the following description show only some embodiments of the present application, and those skilled in the art can obtain other drawings based on these drawings without creative effort.
Fig. 1 is a schematic diagram of an implementation environment provided by an embodiment of the present application;
Fig. 2 is a flowchart of a data processing method provided in an embodiment of the present application;
Fig. 3 is a flowchart of a data processing method provided in an embodiment of the present application;
Fig. 4 is a schematic diagram of a data processing method provided in an embodiment of the present application;
Fig. 5 is a block diagram of a data processing apparatus according to an embodiment of the present application;
Fig. 6 is a block diagram of a server according to an embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
The terms "first," "second," "third," and "fourth," etc. in the description and claims of this application and the accompanying drawings are used for distinguishing between different objects and not for describing a particular order. Furthermore, the terms "comprising" and "having," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus.
Fig. 1 is a schematic diagram of an implementation environment provided by an embodiment of the present application. Referring to fig. 1, the implementation environment includes a terminal 110, a first server 120, and a second server 130.
Optionally, the terminal 110 is a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, or the like, but is not limited thereto. The terminal 110 is provided with a client having a content recommendation function. For example, the client is an audio playing client and has an audio recommendation function; for another example, the client is a video client and has a video recommendation function.
The first server 120 is used to provide a content filtering service. The second server 130 is used to provide a content recommendation service. Wherein, the terminal 110 is connected with the first server 120 through a wireless or wired network; the terminal 110 is connected with the second server 130 through a wireless or wired network; the first server 120 and the second server 130 are connected through a wireless or wired network. The terminal 110 receives the content which is sent by the second server 130 and is recommended to the user corresponding to the terminal 110, and displays the received content in a recommendation interface; the interactive behavior performed on the displayed content is detected and reported to the first server 120. The first server 120 stores the interactive behavior performed on the displayed content. The second server 130 requests the first server 120 to determine whether the content with the executed interactive behavior exists in the candidate content to be recommended when determining the content recommended to the user corresponding to the terminal 110 next time; filtering out contents on which interactive behaviors are performed from the candidate contents to be recommended according to the feedback of the first server 120; and determining the content recommended to the user corresponding to the terminal 110 from the remaining candidate content, and sending the determined content to the terminal 110, so that the terminal 110 displays the received content in a recommendation interface, and repeated recommendation of the content is reduced.
Fig. 2 is a flowchart of a data processing method according to an embodiment of the present application. Referring to fig. 2, the embodiment includes:
201. The first server generates a first bit array based on a plurality of first content identifiers in a plurality of first historical behavior data corresponding to any user identifier in the data warehouse, wherein the first historical behavior data is used for recording the content on which interactive behaviors have been performed, and each first content identifier corresponds to a plurality of bits in the first bit array.
Wherein, the data warehouse is a large-scale data storage set for storing historical data. The data warehouse is the data base for data mining, analysis and processing by the BI (Business Intelligence) project. Data warehouses support distributed storage of large-scale data. In the embodiment of the application, the data warehouse is used for storing historical behavior data of users, and recommendation of repeated contents is reduced through statistics and analysis of the historical behavior data in the data warehouse.
The historical behavior data is used for recording the content on which interactive behaviors have been performed. Optionally, the historical behavior data includes a user identifier, a content identifier, an interactive behavior, and an execution time, and indicates that the user identified by the user identifier performed the interactive behavior on the content identified by the content identifier at the execution time. The user identifier uniquely identifies a user and is also called a user ID, a user account, and the like. Optionally, the user identifier is a UID (User Identification) assigned to the user. The content identifier uniquely identifies a piece of content and is also referred to as a content ID. Optionally, the type of the content identified by the content identifier is an audio type, a short video type, or the like, which is not limited in this application. Optionally, the interactive behavior is an exposure behavior, a play behavior, a collection behavior, or a download behavior, where the exposure behavior is a behavior in which the user triggers the terminal to display content.
The bit array consists of binary 0s and 1s. A bit array is used for representing a plurality of content identifiers corresponding to a user identifier. Each content identifier corresponds to a different plurality of bits in the bit array. For example, in the bit array 1001011101, the 1st, 4th and 8th bits represent one content identifier, and the 6th, 7th and 10th bits represent another content identifier.
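The example bit array can be checked directly. In this small sketch the bit array is held as a string and positions are counted from 1, as in the text; the helper name `bits_at` is hypothetical.

```python
from typing import Iterable, List

BIT_ARRAY = "1001011101"  # the example bit array from the text


def bits_at(bit_array: str, positions: Iterable[int]) -> List[int]:
    """Return the bit values at the given 1-based positions."""
    return [int(bit_array[p - 1]) for p in positions]


first_content = bits_at(BIT_ARRAY, [1, 4, 8])    # bits of one content identifier
second_content = bits_at(BIT_ARRAY, [6, 7, 10])  # bits of another content identifier
unused = bits_at(BIT_ARRAY, [2, 3, 5, 9])        # positions not used by either
```

Both identifiers have all three of their bits set to 1, and the four remaining positions are 0.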
202. The first server stores the correspondence between the user identifier and the first bit array.
The first server correspondingly stores the user identification and a bit array generated based on a plurality of content identifications corresponding to the user identification. Optionally, the first server stores a correspondence between the any user identifier and the first bit array in a key-value (key value pair) form, where the key is the any user identifier and the value is the first bit array.
203. In response to a filtering request, the first server determines a target bit array corresponding to the user identifier carried by the filtering request, based on the stored correspondence between user identifiers and bit arrays.
Before determining the content to recommend to a user, the second server, which provides the recommendation service, sends a filtering request to the first server to ask whether the content to be recommended includes content on which an interactive behavior has already been performed. The filtering request carries a user identifier and a content identifier: the user identifier identifies the user to whom content is to be recommended, and the content identifier identifies the candidate content to be recommended. The first server queries the stored correspondence between user identifiers and bit arrays based on the user identifier carried by the filtering request, and obtains the target bit array corresponding to the user identifier carried by the filtering request.
204. The first server determines a filtering result of the content identification carried by the filtering request based on the target bit array.
The first server maps the content identifier carried by the filtering request to a plurality of bits of the target bit array, and determines, according to whether the values on those bits are 0 or 1, whether the content identifier carried by the filtering request is among the plurality of content identifiers represented by the target bit array. If it is, an interactive behavior has been performed on the content identified by that content identifier, and the first server determines that this content is to be filtered out of the content to be recommended. If it is not, no interactive behavior has been performed on the content identified by that content identifier, and the first server determines that this content is not to be filtered.
According to the technical solution provided by the embodiments of the present application, the content identifier in each piece of historical behavior data corresponding to any user identifier corresponds to a plurality of bits of a bit array, so that the bit array represents the plurality of content identifiers on which the user identified by the user identifier has performed interactive behaviors. When a filtering request carrying a user identifier is received, the filtering result for the content identifier carried by the filtering request can therefore be determined based on the bit array corresponding to that user identifier. In a big data scenario with a large number of content identifiers, representing the content identifiers by a binary bit array and supporting the filtering service by storing bit arrays greatly reduces the storage space occupied for supporting the filtering service and improves the utilization rate of the storage space.
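Steps 203 and 204 together can be sketched as a single request handler. Everything concrete here is an assumption made for illustration: the in-memory dict used as the store, the Python integer used as the bit array, the CRC32-based hash positions, the sizes `NUM_BITS` and `NUM_HASHES`, and the function names.

```python
import zlib
from typing import Dict, Iterable, List

NUM_BITS = 256   # assumed bit array size
NUM_HASHES = 3   # assumed number of bits per content identifier


def positions(content_id: str) -> List[int]:
    """Assumed hash scheme: salted CRC32 values reduced modulo the array size."""
    return [zlib.crc32(f"{i}:{content_id}".encode("utf-8")) % NUM_BITS
            for i in range(NUM_HASHES)]


def to_bit_array(content_ids: Iterable[str]) -> int:
    """Build a bit array (held as a Python int) from content identifiers."""
    bits = 0
    for cid in content_ids:
        for pos in positions(cid):
            bits |= 1 << pos
    return bits


def handle_filter_request(store: Dict[str, int], user_id: str,
                          candidates: List[str]) -> Dict[str, bool]:
    """Look up the user's target bit array, then decide per candidate
    whether it should be filtered (True) or kept (False)."""
    target = store.get(user_id, 0)  # unknown user: empty bit array
    return {cid: all((target >> pos) & 1 for pos in positions(cid))
            for cid in candidates}


store = {"u1001": to_bit_array(["song-1", "song-2"])}
result = handle_filter_request(store, "u1001", ["song-1", "song-9"])
```

In this sketch a user with no stored bit array gets an all-zero array, so none of that user's candidates are filtered.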
Fig. 3 is a flowchart of a data processing method according to an embodiment of the present application. Referring to fig. 3, the embodiment includes:
301. The first server generates a first bit array based on a plurality of first content identifiers in a plurality of first historical behavior data corresponding to any user identifier in the data warehouse.
It should be noted that, in an optional implementation, the plurality of first historical behavior data are all of the historical behavior data stored in the data warehouse for that user identifier, and the first server generates the first bit array corresponding to the user identifier based on the plurality of first content identifiers in that data.
In another optional implementation, the first server generates bit arrays at the granularity of a time period of a target duration. The plurality of first historical behavior data record the content on which the interactive behavior was performed within one such time period; that is, the execution time in each of the first historical behavior data falls within that time period. This embodiment is described using a target duration of 1 month as an example, with the first server generating bit arrays at monthly granularity. For example, in month 1, the first server generates the month-1 first bit array for a user identifier based on the plurality of first content identifiers in that user's first historical behavior data whose execution time falls in month 1; in month 2, it generates the month-2 first bit array based on the first content identifiers whose execution time falls in month 2; and so on.
In another optional implementation, the first server also distinguishes between types of interactive behavior and generates a separate bit array from the content identifiers on which the same type of interactive behavior was performed. In this case the plurality of first historical behavior data record the content on which an interactive behavior of a target type was performed. Optionally, the target type is one of an exposure behavior, a play behavior, a collection behavior, and a download behavior. For example, the first server generates one bit array based on the content identifiers on which the exposure behavior was performed for a user identifier, another bit array based on the content identifiers on which the play behavior was performed for that user identifier, and so on.
In another optional implementation, the first server also distinguishes between types of content and generates a separate bit array from the first historical behavior data that record content of the same type. Optionally, the types of content include an audio type, a video type, an image type, a text type, and the like. For example, the first server generates one bit array based on the content identifiers in the historical behavior data that record audio content for a user identifier, and another bit array based on the content identifiers in the historical behavior data that record video content for that user identifier.
It should be noted that, optionally, in addition to generating a bit array per user identifier, the first server may combine one or more of the time-period division, the type of interactive behavior, and the type of content to generate more finely classified bit arrays for the user identifier, which is not limited in this embodiment of the present application.
The first server processes the plurality of first content identifiers according to the bloom filter principle to generate a first bit array. The step 301 includes: the first server processes each first content identifier based on a hash processing mode corresponding to the first bit array to obtain a plurality of bits of the first bit array corresponding to each first content identifier; setting values on a plurality of bits of a first bit array corresponding to each first content identification to be 1.
A bloom filter is a space-efficient probabilistic data structure that consists of a bit array and a plurality of hash functions. The hash processing manner corresponding to the first bit array is the plurality of hash functions on which generation of the first bit array is based according to the bloom filter principle. The first server hashes a first content identifier with each hash function to obtain one bit of the first bit array corresponding to that identifier, so that the plurality of hash functions together yield a plurality of bits of the first bit array corresponding to the identifier.
The number of hash functions equals the number of bits of the first bit array that correspond to a first content identifier. That is, the first server hashes the first content identifier with K hash functions to obtain K bits of the first bit array corresponding to that identifier. K is a positive integer smaller than the number of elements in the first bit array; for example, if the first bit array contains 10 elements, that is, consists of 10 bits, K is a positive integer smaller than 10. The number of elements in the first bit array and the value of K can be configured flexibly according to the number of content identifiers to be stored: the more content identifiers are stored, the larger the number of elements in the first bit array and the larger the value of K. This reduces the cases in which several content identifiers map to the same bits, and so improves the accuracy of judging, from the 0 or 1 values of the bits, whether a given content identifier is among the content identifiers represented by the bit array.
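The mapping from content identifiers to bit positions described above can be sketched in Python as follows. This is a minimal illustration rather than the implementation of this embodiment: the function names `bit_positions` and `build_bit_array`, the use of SHA-256, and the double-hashing scheme that derives K positions from a single digest are all assumptions; the embodiment only requires K hash functions.

```python
import hashlib


def bit_positions(content_id, num_bits, num_hashes):
    """Return the K bit positions for one content identifier.

    Assumption: the K hash functions are simulated by double hashing
    (h1 + i * h2) over one SHA-256 digest; any K hash functions would do.
    """
    digest = hashlib.sha256(str(content_id).encode("utf-8")).digest()
    h1 = int.from_bytes(digest[:8], "big")
    h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd, non-zero step
    return [(h1 + i * h2) % num_bits for i in range(num_hashes)]


def build_bit_array(content_ids, num_bits, num_hashes):
    """Set to 1 every bit that any content identifier maps to."""
    bits = bytearray((num_bits + 7) // 8)  # all bits start at 0
    for content_id in content_ids:
        for pos in bit_positions(content_id, num_bits, num_hashes):
            bits[pos // 8] |= 1 << (pos % 8)
    return bits


# Hypothetical content identifiers; 1024 bits and K = 3 are example values.
first_bit_array = build_bit_array(
    ["song_1", "song_2", "song_3"], num_bits=1024, num_hashes=3
)
```

The bit array is packed eight bits per byte, so a 1024-bit array occupies 128 bytes regardless of how many identifiers are added to it.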
It should be noted that, in an optional implementation, the first server queries, analyzes, and processes the data stored in the data warehouse based on hive (a data warehouse tool). hive supports user-defined functions (UDFs), which provide extensibility for data operations. The function of generating a first bit array based on a plurality of first content identifiers in a plurality of first historical behavior data corresponding to any user identifier in the data warehouse is encapsulated in a UDF to form a query compression module. The query compression module is a functional module, extended on the basis of hive, that queries the plurality of content identifiers corresponding to a user identifier, generates a bit array representing them, and outputs the bit array. Correspondingly, in response to a query compression statement, the first server calls the query compression module and, in the memory corresponding to hive, performs the step of generating the first bit array based on the plurality of first content identifiers in the plurality of first historical behavior data corresponding to the user identifier in the data warehouse.
The query compression statement instructs the first server to obtain a first bit array representing the plurality of first content identifiers corresponding to a user identifier. In response to the query compression statement, the first server calls the query compression module to obtain that first bit array. Supported by hive, the query compression module performs, in the memory corresponding to hive, the step of generating the first bit array based on the plurality of first content identifiers in the plurality of first historical behavior data corresponding to the user identifier in the data warehouse, and outputs the first bit array.
In this technical solution, a query compression module for generating bit arrays is added on top of the data warehouse tool hive. Calling the query compression module directly yields the bit array generated by compressing the plurality of content identifiers, which improves the efficiency of obtaining the bit array. Because the query compression module, supported by hive, generates the bit array in the memory corresponding to hive, there is no need to export the content identifiers to a data file, store the data file in another database, and then process it to generate the bit array. Omitting intermediate steps such as exporting and storing the data file makes bit array generation more efficient and convenient. It also avoids the storage space those intermediate steps would occupy, improving storage utilization.
In another optional implementation manner, the first server queries a plurality of first historical behavior data corresponding to any user identifier from the data warehouse; exporting a first content identification in the plurality of first historical behavior data as a data file; storing a plurality of first content identifications in the data file in a staging database; a first bit array is generated based on a plurality of first content identifications in the staging database.
Optionally, after the first server generates the first bit array, the first bit array is further encoded to generate the character string. Optionally, the first server encodes the bit array based on rules of Base64 (representing binary data based on 64 printable characters) encoding, generating a string of characters to facilitate transmission and storage of the bit array. Optionally, the function of encoding the first bit array to generate the character string may also be encapsulated in the query compression module, and the query compression module performs the step of encoding the first bit array to generate the character string.
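The Base64 encoding step can be illustrated with Python's standard `base64` module. This is only a sketch of the idea: the helper names `encode_bit_array` and `decode_bit_array` and the 16-bit example array are assumptions made for illustration.

```python
import base64


def encode_bit_array(bits):
    """Encode a bit array (bytes-like) as a Base64 ASCII string."""
    return base64.b64encode(bytes(bits)).decode("ascii")


def decode_bit_array(text):
    """Recover the bit array from its Base64 string form."""
    return bytearray(base64.b64decode(text))


bits = bytearray([0b10100000, 0b00000001])  # a hypothetical 16-bit array
encoded = encode_bit_array(bits)            # printable string for storage
restored = decode_bit_array(encoded)        # same bits as the original
```

Because the encoded form contains only printable characters, it can be stored as an ordinary string value and decoded back to the same bits before a membership check.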
302. The first server stores the correspondence between the user identifier and the first bit array.
Optionally, the first server stores the correspondence between the user identifier and the first bit array in a mysql (a relational database management system) database to persist the data; or the first server further caches the correspondence in a redis (Remote Dictionary Server) database so that the data can be read efficiently.
In this technical solution, the plurality of content identifiers corresponding to a user identifier are represented by a binary bit array, and the user identifier is stored together with the bit array. When the number of content identifiers corresponding to the user identifier is large, this occupies far less storage space than storing the content identifiers as a plurality of character strings, and improves storage utilization.
In an optional implementation, the first bit array represents all content identifiers on which the interactive behavior has been performed for the user identifier, and the first server accordingly stores the correspondence with the user identifier mapped directly to the first bit array. For example, the first server stores the correspondence in key-value form, where the key is the user identifier, such as 13753, and the value is the first bit array.
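To make the storage comparison concrete, the following sketch estimates both sizes with the standard Bloom filter sizing formulas, which this application does not itself state. The acceptable false-positive rate, the number of identifiers, and the average identifier length are all hypothetical values chosen only for illustration.

```python
import math


def bloom_size_bits(num_items, false_positive_rate):
    """Standard Bloom filter sizing: m = -n * ln(p) / (ln 2)^2, rounded up."""
    return math.ceil(
        -num_items * math.log(false_positive_rate) / (math.log(2) ** 2)
    )


def optimal_num_hashes(num_bits, num_items):
    """Standard choice of K: k = (m / n) * ln 2, at least 1."""
    return max(1, round(num_bits / num_items * math.log(2)))


num_ids = 100_000   # hypothetical number of content identifiers for one user
avg_id_bytes = 10   # hypothetical average length of one identifier string
m = bloom_size_bits(num_ids, 0.01)       # bits needed for a 1% false-positive rate
k = optimal_num_hashes(m, num_ids)       # matching number of hash functions
string_bytes = num_ids * avg_id_bytes    # size when stored as plain strings
bit_array_bytes = (m + 7) // 8           # size when stored as a bit array
```

Under these assumed figures the bit array needs roughly ten bits per identifier, compared with eighty bits for a ten-byte string, at the cost of occasionally reporting an identifier that was never added.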
In another optional implementation, the first server generates the first bit array at the granularity of a time period of the target duration. The first historical behavior data on which the first bit array is based record the content on which the interactive behavior was performed during a first time period. The first time period is one time period of the target duration; for example, if the target duration is 1 month, the first time period is month 1, month 2, month 3, and so on. Correspondingly, the first server generates a first keyword based on the user identifier and the first time period, and stores the correspondence between the first keyword and the first bit array. Optionally, the first keyword has the form user identifier_time period. For example, if the first time period is month 8 and the user identifier is 13753, the first server generates the first keyword 13753_8 and stores the first keyword as the key and the first bit array as the value, in key-value form, to indicate that during the first time period the user identified by the user identifier performed the interactive behavior on the content identified by the content identifiers that the first bit array represents.
In the above technical solution, each bit array corresponding to a user identifier represents the content on which the interactive behavior was performed within one time period of the target duration. Because separate bit arrays are kept for successive time periods of a fixed duration, the bit arrays needed can be selected flexibly according to the filtering requirement when determining whether the interactive behavior has been performed on a given content. For example, if the filtering requirement is to filter out content on which the interactive behavior was performed in the latest 3 months, the target duration is one month, and the current month is month 3, then filtering is performed based on the bit arrays corresponding to month 1, month 2, and month 3. This improves the usability of the stored bit arrays.
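Selecting the bit arrays for the latest few months can be sketched as below. The function names, the wrap-around at the year boundary, and the use of plain Python sets as stand-ins for the per-month bit arrays are assumptions made for illustration; the keys follow the user identifier_time period form, which carries no year.

```python
def recent_period_keys(user_id, current_month, num_months):
    """Keys of the latest `num_months` monthly bit arrays, oldest first.

    Assumption: months wrap around the year boundary (1 is preceded by 12).
    """
    months = [
        ((current_month - 1 - back) % 12) + 1
        for back in range(num_months - 1, -1, -1)
    ]
    return [f"{user_id}_{month}" for month in months]


def seen_in_recent_periods(store, user_id, content_id,
                           current_month, num_months, contains):
    """True if any of the selected per-month arrays reports the content id."""
    for key in recent_period_keys(user_id, current_month, num_months):
        bits = store.get(key)
        if bits is not None and contains(bits, content_id):
            return True
    return False


# Stand-in store: plain sets play the role of the per-month bit arrays.
store = {"13753_1": {"song_1"}, "13753_2": {"song_2"}, "13753_3": set()}
in_set = lambda members, content_id: content_id in members
hit = seen_in_recent_periods(store, "13753", "song_2", 3, 3, in_set)
```

Passing the membership test as the `contains` argument keeps the period selection independent of how each bit array is actually checked.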
In another alternative implementation, the first historical behavior data on which the first bit array is generated is used to record the content of the target type of interactive behavior that was executed within the first time period. Correspondingly, the first server generates a second keyword based on the any user identifier, the first time period and the target type interactive behavior in the first historical behavior data; and storing the corresponding relation between the second key word and the first bit array. Optionally, the second keyword is in the form of an interactive behavior _ user identification _ time period. For example, the exposure behavior is represented by 1, the playing behavior is represented by 2, the collection behavior is represented by 3, the downloading behavior is represented by 4, and if the second keyword is 3_13753_8, the bit array corresponding to the second keyword is used to indicate that the user identified by the user identifier 13753 performed the collection behavior on the content identified by the plurality of content identifiers represented by the bit array in month 8.
According to the technical scheme, the plurality of different bit arrays are divided according to different types of interactive behaviors for corresponding storage, so that the content of any type of interactive behavior which is executed can be independently filtered more flexibly based on the bit array corresponding to any type of interactive behavior according to the filtering requirement, and the usability of the stored bit array is further improved.
In another optional implementation, the plurality of first historical behavior data on which the first bit array is based record the content on which the interactive behavior of the target type was performed within the first time period, and the content identifiers in those data all identify content of the same type. Correspondingly, the first server generates a third keyword based on the user identifier, the first time period, the target type of interactive behavior in the first historical behavior data, and the type of the content identified by the content identifiers in the first historical behavior data, and stores the correspondence between the third keyword and the first bit array. Optionally, the third keyword has the form content type_interactive behavior_user identifier_time period. For example, with the audio type represented by 1, the video type by 2, the image type by 3, and the text type by 4, a third keyword of 1_3_13753_8 means that the corresponding bit array represents the audio content on which the user identified by user identifier 13753 performed the collection behavior in month 8.
It should be noted that the first server may generate more finely classified bit arrays for a user identifier by combining one or more of the time-period division, the type of interactive behavior, and the type of content. Correspondingly, the first server generates the keyword based on the same one or more of these dimensions and stores the correspondence between the keyword and the bit array; the combinations are not listed one by one in this embodiment of the application.
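The keyword forms described above can be produced by one small helper. This is a sketch only: the function name `make_key` and the rule of simply omitting unused parts are assumptions; the numeric codes follow the examples given for behavior types and content types.

```python
def make_key(user_id, period=None, behavior=None, content_type=None):
    """Build a storage key in the order content type, behavior, user, period.

    Parts that are None are omitted, so the same helper yields keys such as
    "13753", "13753_8", "3_13753_8" and "1_3_13753_8".
    """
    parts = [content_type, behavior, user_id, period]
    return "_".join(str(part) for part in parts if part is not None)


COLLECT = 3  # collection behavior, per the numbering in the examples
AUDIO = 1    # audio type, per the numbering in the examples

first_keyword = make_key(13753, period=8)
second_keyword = make_key(13753, period=8, behavior=COLLECT)
third_keyword = make_key(13753, period=8, behavior=COLLECT, content_type=AUDIO)
```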
Another point to be noted is that, if the first server further encodes the first bit array after generating the first bit array, and generates a character string, the first server stores the corresponding relationship between the any user identifier and the character string.
It should be noted that, after the first server stores the correspondence between the user identifier and the first bit array, it periodically updates the stored bit array. This embodiment is described using the example in which the first server generates bit arrays per time period of the target duration and updates the bit array corresponding to one such time period. Optionally, the update period of the bit array can be configured flexibly, for example 1 hour, 1 day, or 1 week; in this embodiment the update period is 1 day, that is, the first server updates the bit array once a day. Correspondingly, the step of updating the bit array includes: in response to the bit array update time arriving, the first server generates a second bit array based on a plurality of second content identifiers in a plurality of second historical behavior data corresponding to the user identifier in the data warehouse, where the second historical behavior data record the content on which the interactive behavior was performed in the second time period to which the bit array update time belongs; and the first server replaces the stored bit array corresponding to the second time period with the second bit array.
For example, the update period of the bit array is 1 day, and the first server updates the bit array at 24:00 each day. When the bit array update time of 24:00 on November 5, 2020 arrives, the first server generates a second bit array based on the plurality of second content identifiers in the plurality of second historical behavior data corresponding to the user identifier. The second historical behavior data record the content on which the interactive behavior was performed in November, that is, from November 1 to November 5. The first server then replaces the stored bit array, which represents the content on which the interactive behavior was performed from November 1 to November 4, with the second bit array.
In this technical solution, bit arrays are generated per time period of the target duration. When the bit array update time arrives, only the bit array for the time period to which the update time belongs is regenerated and updated, based only on the historical behavior data within that time period; the full historical behavior data corresponding to the user identifier need not be reprocessed. With large data volumes, this greatly reduces the amount of data on which bit array generation is based, which greatly improves the update efficiency, reduces the computing resources consumed by updating the bit array, and lowers the update cost.
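The per-period update can be sketched as follows. The function name `refresh_current_period`, the tuple layout of the records, and the use of `frozenset` as a stand-in for the bit array builder are assumptions made for illustration; only the idea of rebuilding the array for the period containing the update time comes from the description above.

```python
from datetime import date


def refresh_current_period(store, user_id, records, update_day, build):
    """Rebuild only the bit array of the month that `update_day` falls in.

    `records` is an iterable of (user_id, content_id, executed_on) tuples and
    `build` turns a list of content identifiers into the stored representation.
    """
    ids = [
        content_id
        for (uid, content_id, executed_on) in records
        if uid == user_id
        and executed_on.year == update_day.year
        and executed_on.month == update_day.month
        and executed_on <= update_day
    ]
    key = f"{user_id}_{update_day.month}"
    store[key] = build(ids)  # overwrite the previous array for this period
    return key


# Stand-in data: frozenset replaces the real bit array builder.
store = {"13753_10": frozenset({"song_0"}), "13753_11": frozenset({"song_1"})}
records = [
    ("13753", "song_0", date(2020, 10, 20)),
    ("13753", "song_1", date(2020, 11, 2)),
    ("13753", "song_2", date(2020, 11, 5)),
    ("99999", "song_3", date(2020, 11, 5)),
]
updated_key = refresh_current_period(
    store, "13753", records, date(2020, 11, 5), build=frozenset
)
```

Only the entry for the current month is overwritten; the entry for the earlier month is left untouched, so older history is never reprocessed.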
303. The first server receives at least one piece of real-time behavior data corresponding to any user identification, and the real-time behavior data is used for recording the content of the executed interactive behavior.
The real-time behavior data are behavior data generated after the bit array update time closest to the current time. For example, if the current time is 18:00 on November 6 and the closest bit array update time is 24:00 on November 5, the real-time behavior data include the behavior data generated between 0:00 and 18:00 on November 6. Optionally, a piece of real-time behavior data includes a user identifier, a content identifier, an interactive behavior, and an execution time, and indicates that the user identified by the user identifier performed the interactive behavior on the content identified by the content identifier at the execution time.
Optionally, the first server subscribes to the real-time behavioral data through Kafka (a high-throughput distributed publish-subscribe messaging system). For example, the terminal displays audio content in the audio recommendation interface, detects interaction behavior with the displayed audio content, generates behavior data, and sends the behavior data to Kafka. Kafka sends the received behavior data to a first server that subscribes to the behavior data. The first server receives the behavior data sent by Kafka.
304. The first server stores the corresponding relation between any user identification and at least one content identification in the at least one piece of real-time behavior data.
Optionally, because the amount of real-time behavior data is small, the first server stores the correspondence between the user identifier and the content identifier in each piece of real-time behavior data directly in key-value form. Alternatively, the first server may generate a bit array based on the at least one content identifier and store the correspondence between the user identifier and that bit array, which is not limited in this embodiment of the application. Optionally, the first server stores the correspondence between the user identifier and the content identifier in each piece of real-time behavior data in a redis database.
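Steps 303 and 304 can be sketched with an in-memory mapping from user identifier to a set of content identifiers. The `BehaviorRecord` class, the `store_real_time` function, and the dictionary that stands in for the key-value store are hypothetical names introduced for this illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class BehaviorRecord:
    """One piece of real-time behavior data."""
    user_id: str
    content_id: str
    behavior: str
    executed_at: datetime


def store_real_time(realtime_store, record, last_update_time):
    """Keep the content id only if the record is newer than the last update."""
    if record.executed_at > last_update_time:
        realtime_store[record.user_id].add(record.content_id)
        return True
    return False


realtime_store = defaultdict(set)           # stand-in for the key-value store
last_update = datetime(2020, 11, 6, 0, 0)   # i.e. 24:00 on November 5
fresh = BehaviorRecord("13753", "song_9", "play", datetime(2020, 11, 6, 18, 0))
stale = BehaviorRecord("13753", "song_1", "play", datetime(2020, 11, 5, 9, 0))
kept_fresh = store_real_time(realtime_store, fresh, last_update)
kept_stale = store_real_time(realtime_store, stale, last_update)
```

Records older than the last update are skipped here because they are already covered by the regenerated bit array.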
305. In response to a filtering request, the first server determines the target bit array corresponding to the user identifier carried by the filtering request based on the stored correspondence between user identifiers and bit arrays.
The first server provides a filtering service. Other servers send filtering requests to the first server to ask it to determine whether the user identified by a user identifier has performed an interactive behavior on the content identified by a content identifier. A filtering request carries a user identifier and a content identifier to be checked for that user identifier. Optionally, one filtering request carries one content identifier for one user identifier, or one filtering request carries a plurality of content identifiers for one user identifier, which is not limited in this embodiment of the present application. This embodiment is described using the example in which one filtering request carries one content identifier; if a filtering request carries a plurality of content identifiers, each is processed in the same way as a single content identifier.
In an optional implementation, the bit array corresponding to the user identifier carried by the filtering request is cached in a redis database, and the first server obtains it from the redis database. In another optional implementation, the bit array is stored in the mysql database, and the first server obtains it from the mysql database. In yet another optional implementation, if the bit array corresponding to the user identifier carried by the filtering request is not cached in the redis database, the first server obtains it from the mysql database and caches it in the redis database; if a filtering request carrying the same user identifier is received again within the cache validity period, the first server obtains the bit array directly from the redis database. In this way the bit array is persisted in the mysql database while the cache allows it to be obtained quickly for processing, which further improves processing efficiency.
306. The first server determines at least one third content identifier corresponding to the user identifier carried by the filtering request based on the stored correspondence between user identifiers and the at least one content identifier in the at least one piece of real-time behavior data.
The at least one third content identifier comes from the at least one piece of real-time behavior data corresponding to the user identifier carried by the filtering request. Optionally, the correspondence between the user identifier and the content identifier in each piece of real-time behavior data is stored in a redis database, and the first server obtains the at least one third content identifier corresponding to the user identifier carried by the filtering request from the redis database.
307. The first server determines a filtering result for the content identifier carried by the filtering request based on the target bit array and the at least one third content identifier corresponding to the user identifier carried by the filtering request.
The first server determines whether the content identifier carried by the filtering request is among the plurality of content identifiers represented by the target bit array, and whether it is among the at least one third content identifier. If the content identifier is among the content identifiers represented by the target bit array or among the at least one third content identifier, the first server determines that the filtering result is to filter the content identifier carried by the filtering request. If the content identifier is neither among the content identifiers represented by the target bit array nor among the at least one third content identifier, the first server determines that the filtering result is not to filter the content identifier carried by the filtering request.
The step in which the first server determines, based on the target bit array, whether the content identifier carried by the filtering request is among the plurality of content identifiers represented by the target bit array includes the following. The first server processes the content identifier carried by the filtering request using the hash processing manner corresponding to the target bit array to obtain the plurality of bits of the target bit array that correspond to the content identifier. If the values of those bits include a 0, the first server determines that the content identifiers represented by the target bit array do not include the content identifier carried by the filtering request. If the values of those bits are all 1, the first server determines that the content identifiers represented by the target bit array include the content identifier carried by the filtering request.
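The third retrieval variant, reading from the cache first and falling back to persistent storage, can be sketched without any database client. In this illustration the class name `BitArrayStore`, the dictionaries standing in for the redis cache and the mysql table, and the one-hour validity period are all assumptions.

```python
import time


class BitArrayStore:
    """Look in the cache first, fall back to the database on a miss."""

    def __init__(self, database, ttl_seconds=3600, clock=time.monotonic):
        self.database = database        # stand-in for the mysql table (a dict)
        self.cache = {}                 # stand-in for redis: key -> (expiry, value)
        self.ttl_seconds = ttl_seconds  # assumed cache validity period
        self.clock = clock
        self.database_reads = 0

    def get(self, key):
        now = self.clock()
        entry = self.cache.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]                 # cache hit within the validity period
        self.database_reads += 1
        value = self.database.get(key)      # cache miss: read the persistent copy
        if value is not None:
            self.cache[key] = (now + self.ttl_seconds, value)
        return value


bit_arrays = BitArrayStore({"13753": "oAE="})  # value is an example encoded array
first_read = bit_arrays.get("13753")   # served from the database, then cached
second_read = bit_arrays.get("13753")  # served from the cache
```

The `clock` parameter exists only so that expiry can be exercised deterministically; a real deployment would rely on the cache's own expiry mechanism.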
The hash processing method corresponding to the target bit array is a plurality of hash functions based on which the target bit array is generated according to the bloom filter principle. For example, the first server generates the target bit array based on K hash functions, and then the first server processes the content identifier carried by the filtering request based on the K hash functions to obtain a plurality of bits in the target bit array corresponding to the content identifier carried by the filtering request.
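The decision in step 307 can be sketched as a membership test on the bit array combined with a lookup in the real-time identifiers. The names `bit_positions`, `might_contain`, and `should_filter` and the SHA-256 double-hashing scheme are assumptions; what matters is that the check uses the same K hash functions that were used when the target bit array was generated.

```python
import hashlib


def bit_positions(content_id, num_bits, num_hashes):
    """K bit positions for one identifier (assumed double-hashing scheme)."""
    digest = hashlib.sha256(str(content_id).encode("utf-8")).digest()
    h1 = int.from_bytes(digest[:8], "big")
    h2 = int.from_bytes(digest[8:16], "big") | 1
    return [(h1 + i * h2) % num_bits for i in range(num_hashes)]


def might_contain(bits, content_id, num_bits, num_hashes):
    """True only if all K bits are 1; a single 0 bit proves absence."""
    return all(
        bits[pos // 8] & (1 << (pos % 8))
        for pos in bit_positions(content_id, num_bits, num_hashes)
    )


def should_filter(bits, realtime_ids, content_id, num_bits, num_hashes):
    """Filter if the id is in the real-time set or reported by the bit array."""
    return content_id in realtime_ids or might_contain(
        bits, content_id, num_bits, num_hashes
    )


NUM_BITS, NUM_HASHES = 1024, 3             # example configuration
target_bits = bytearray(NUM_BITS // 8)
for pos in bit_positions("song_1", NUM_BITS, NUM_HASHES):  # historical content
    target_bits[pos // 8] |= 1 << (pos % 8)
third_content_ids = {"song_9"}             # from real-time behavior data

filter_history = should_filter(target_bits, third_content_ids, "song_1", NUM_BITS, NUM_HASHES)
filter_recent = should_filter(target_bits, third_content_ids, "song_9", NUM_BITS, NUM_HASHES)
```

As with any bloom filter, an all-ones result can occasionally occur for an identifier that was never added, whereas a 0 bit always means the identifier is absent.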
According to the technical scheme, the filtering result can be accurately determined in real time based on the content of the historical executed interactive behavior represented by the bit array and the content of the real-time executed interactive behavior, and the accuracy and the instantaneity of the filtering result are improved.
It should be noted that, optionally, the first server does not refer to the real-time behavior data, and determines the filtering result for the content identifier carried by the filtering request based on the target bit array only. That is, after the first server performs step 302, steps 303 and 304 are not performed, step 305 is performed directly, and after the step 305 is performed, step 306 is not performed, and the following steps of determining the filtering result of the content identifier carried in the filtering request based on the target bit array are performed directly: the first server processes the content identification carried by the filtering request based on a hash processing mode corresponding to the target bit array to obtain a plurality of bits of the target bit array corresponding to the content identification carried by the filtering request; responding to that the values on the bits of the target bit array corresponding to the content identification carried by the filtering request include 0, and determining that the filtering result is the content identification not carried by the filtering request if the content identification represented by the target bit array does not include the content identification carried by the filtering request; and responding to that the values on the bits of the target bit array corresponding to the content identification carried by the filtering request are all 1, wherein the content identifications represented by the target bit array comprise the content identification carried by the filtering request, and determining that the filtering result is the content identification carried by the filtering request.
According to this technical scheme, in a big-data scenario with a large number of content identifiers, the content identifiers are represented by a binary bit array, and the filtering service is supported by storing the bit array, which greatly reduces the storage space occupied to support the filtering service and improves the utilization of the storage space. In addition, when the data is stored, each content identifier is mapped to a plurality of bits of the bit array; correspondingly, when a filtering request is received, the content identifier carried by the filtering request is mapped to a plurality of bits of the bit array, and whether that content identifier is among the plurality of content identifiers represented by the bit array can be determined from the values (0 or 1) on those bits. Compared with determining the filtering result of a content identifier through one-by-one string comparison, this greatly improves the efficiency of determining the filtering result.
After determining the filtering result for the content identifier carried by the filtering request, the first server returns the filtering result to the second server that sent the filtering request. Optionally, the second server is configured to provide a recommendation service. If the filtering result is to filter the content identifier carried by the filtering request, the second server removes the content indicated by that content identifier from the candidate content to be recommended; if the filtering result is not to filter the content identifier carried by the filtering request, the second server keeps that content among the candidate content to be recommended. The second server then makes recommendations based on the remaining candidate content, so as to reduce repeated recommendation of content.
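The handling on the second server can be illustrated with a short sketch. The function name `remaining_candidates` and the shape of the filtering results (a mapping from content identifier to a boolean, where True means "filter") are assumptions made for illustration only.

```python
from typing import Dict, List


def remaining_candidates(candidates: List[str], filter_results: Dict[str, bool]) -> List[str]:
    """Drop candidates whose filtering result is True; keep all others.

    A candidate with no returned result is kept (an assumption of this sketch).
    """
    return [cid for cid in candidates if not filter_results.get(cid, False)]
```

For example, with candidates `["a", "b", "c"]` and results `{"a": True, "b": False}`, only `"a"` is removed and recommendation proceeds on the rest.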
To make the process by which the first server provides the filtering service clearer, the following description is made with reference to fig. 4. Referring to fig. 4, the first server includes a filtering service, a caching service, Kafka, mysql, and a data warehouse. Optionally, the caching service is implemented based on a redis database. The historical behavior data stored in the data warehouse is updated periodically every day. Through steps 301 to 302, a corresponding bit array is generated based on the historical behavior data in the data warehouse and is synchronized to mysql on a schedule. Through steps 303 to 304, the real-time behavior data is subscribed to from Kafka, and the content identifiers in the real-time behavior data are stored in the caching service. Through step 305, the filtering service responds to the filtering request and acquires the corresponding bit array from the caching service; if the corresponding bit array does not exist in the caching service, the bit array is acquired from mysql and cached in the caching service. The filtering service also acquires, from the caching service, the content identifiers derived from the real-time behavior data. Through step 306, the filtering result corresponding to the filtering request is determined based on the acquired bit array and the content identifiers derived from the real-time behavior data. Optionally, the filtering service, the caching service, Kafka, mysql, and the data warehouse are deployed on a plurality of different servers, and the first server is a server cluster composed of the plurality of servers. Optionally, the filtering service uses an LRU (least recently used) mechanism to preferentially evict the least recently used bit array when local cache space is insufficient.
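The cache lookup with fallback and LRU eviction can be sketched as below. This is only an illustration under stated assumptions: the class `BitArrayCache` is hypothetical, a plain dictionary stands in for mysql, and the capacity is counted in entries rather than bytes.

```python
from collections import OrderedDict
from typing import Dict, List, Optional


class BitArrayCache:
    """Local LRU cache in front of a slower store (a dict stands in for mysql)."""

    def __init__(self, capacity: int, backing_store: Dict[str, bytes]) -> None:
        self.capacity = capacity
        self.backing_store = backing_store
        self._entries: "OrderedDict[str, bytes]" = OrderedDict()

    def get(self, key: str) -> Optional[bytes]:
        if key in self._entries:
            # Cache hit: mark the entry as most recently used.
            self._entries.move_to_end(key)
            return self._entries[key]
        # Cache miss: fall back to the backing store and cache the result.
        value = self.backing_store.get(key)
        if value is not None:
            self._entries[key] = value
            if len(self._entries) > self.capacity:
                # Evict the least recently used bit array.
                self._entries.popitem(last=False)
        return value

    def cached_keys(self) -> List[str]:
        """Keys currently cached, ordered from least to most recently used."""
        return list(self._entries)
```

Reading keys `u1`, `u2`, `u1`, `u3` through a cache of capacity 2 leaves `u1` and `u3` cached, since `u2` is the least recently used entry when `u3` arrives.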
All the above optional technical solutions may be combined arbitrarily to form optional embodiments of the present application, and are not described herein again.
Fig. 5 is a block diagram of a data processing apparatus according to an embodiment of the present application. Referring to fig. 5, the apparatus includes:
a bit array generating module 501, configured to generate a first bit array based on a plurality of first content identifiers in a plurality of first historical behavior data corresponding to any user identifier in a data warehouse, where the first historical behavior data is used to record content of an executed interaction behavior, and each first content identifier corresponds to a plurality of bits in the first bit array respectively;
a first corresponding relation storage module 502, configured to store a corresponding relation between the any user identifier and the first bit array;
a bit array determining module 503, configured to respond to the filtering request, and determine, based on a correspondence between the stored user identifier and the bit array, a target bit array corresponding to the user identifier carried in the filtering request;
and a filtering result determining module 504, configured to determine, based on the target bit array, a filtering result for the content identifier carried in the filtering request.
According to the technical scheme provided by the embodiment of the application, the content identifiers in the historical behavior data corresponding to any user identifier are each mapped to a plurality of bits of a bit array, so that the bit array represents the plurality of content identifiers on which the user corresponding to the user identifier has executed interactive behaviors. Therefore, when a filtering request carrying any user identifier is received, the filtering result for the content identifier carried by the filtering request can be determined based on the bit array corresponding to that user identifier. In a big-data scenario with a large number of content identifiers, the content identifiers are represented by a binary bit array, and the filtering service is supported by storing the bit array, which greatly reduces the storage space occupied to support the filtering service and improves the utilization of the storage space.
In an alternative implementation, the bit array generating module 501 is configured to:
in response to a query compression statement, calling a query compression module to generate, in a memory corresponding to the data warehouse tool hive, a first bit array based on a plurality of first content identifiers in a plurality of first historical behavior data corresponding to any user identifier in the data warehouse;
the query compression statement is used for indicating to obtain a first bit array used for representing a plurality of first content identifications corresponding to any user identification, and the query compression module is a function module expanded on the basis of hive.
In another alternative implementation, the first historical behavior data is used for recording the content of the executed interactive behavior in the first time period;
the first correspondence storing module 502 is configured to:
generating a first keyword based on the any user identification and the first time period;
and storing the corresponding relation between the first key word and the first bit array.
In another optional implementation manner, the bit array generating module 501 is further configured to generate, in response to reaching the bit array update time, a second bit array based on a plurality of second content identifiers in a plurality of second historical behavior data corresponding to the any user identifier in the data warehouse, where the second historical behavior data is used to record content of an interaction behavior executed in a second time period to which the bit array update time belongs;
the first corresponding relation storage module 502 is further configured to update the stored bit array corresponding to the second time period to a second bit array.
In another alternative implementation, the first historical behavior data is used for recording the content of the interactive behavior of the executed target type in the first time period;
the first correspondence storing module 502 is configured to:
generating a second keyword based on the any user identifier, the first time period and the target type of interaction behavior in the first historical behavior data;
and storing the corresponding relation between the second keyword and the first bit array.
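The keyword generation described for the first and second keywords can be illustrated as follows. The colon-joined layout, the function name `build_key`, and the example values are assumptions; the text only states which fields the keyword is generated from.

```python
from typing import Optional


def build_key(user_id: str, period: str, behavior_type: Optional[str] = None) -> str:
    """Build a storage keyword from the user identifier and time period.

    When a target type of interaction behavior is given, it is appended,
    which yields the "second keyword" form.
    """
    parts = [user_id, period]
    if behavior_type is not None:
        parts.append(behavior_type)
    return ":".join(parts)
```

Under this layout, `build_key("u42", "2020-11")` gives `u42:2020-11`, and adding the type `"play"` gives `u42:2020-11:play`, so bit arrays for different periods or behavior types of the same user are stored under distinct keywords.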
In another optional implementation manner, the apparatus further includes:
the bit array encoding module is used for encoding the first bit array to generate a character string;
the first corresponding relation storage module 502 is further configured to store a corresponding relation between the any user identifier and the character string.
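One way to encode a bit array as a character string is sketched below. The text does not name an encoding, so Base64 is an assumption chosen for illustration, as are the function names.

```python
import base64


def encode_bit_array(bits: bytes) -> str:
    """Encode the raw bytes of a bit array as an ASCII string (Base64 assumed)."""
    return base64.b64encode(bits).decode("ascii")


def decode_bit_array(text: str) -> bytearray:
    """Recover the bit array from its stored string form."""
    return bytearray(base64.b64decode(text))
```

A string form of this kind can be written to a text column or a string-valued cache entry and decoded back to the same bits when a filtering request arrives.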
In another optional implementation manner, the apparatus further includes:
the real-time behavior data receiving module is used for receiving at least one piece of real-time behavior data corresponding to any user identification, and the real-time behavior data is used for recording the content of the executed interactive behavior;
the second corresponding relation storage module is used for storing the corresponding relation between the any user identification and at least one content identification in at least one piece of real-time behavior data corresponding to the any user identification;
the filtering result determining module 504 is configured to determine a filtering result for the content identifier carried in the filtering request based on the target bit array and at least one third content identifier corresponding to the user identifier carried in the filtering request, where the at least one third content identifier is derived from at least one real-time behavior data corresponding to the user identifier carried in the filtering request.
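How the two sources might be combined is sketched below. The rule used here (filter if the identifier appears in the real-time identifiers or is represented by the target bit array) is an assumption about the combination; the bit array check is passed in as a callable so the sketch stays independent of any particular hash scheme, and all names are hypothetical.

```python
from typing import Callable, Set


def filter_decision(
    content_id: str,
    realtime_ids: Set[str],
    in_bit_array: Callable[[str], bool],
) -> bool:
    """Return True (filter) if either source indicates a prior interaction."""
    # Real-time identifiers are stored exactly, so a direct lookup suffices.
    if content_id in realtime_ids:
        return True
    # Otherwise fall back to the historical data represented by the bit array.
    return in_bit_array(content_id)
```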
In another alternative implementation, the bit array generating module 501 is configured to:
processing each first content identifier based on a hash processing mode corresponding to the first bit array to obtain a plurality of bits of the first bit array corresponding to each first content identifier;
setting the values of a plurality of bits of the first bit array corresponding to each first content identifier as 1.
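The generation steps above can be sketched as follows. As before, this is an illustration under assumptions: the seeded SHA-256 hash scheme, the names `bit_positions` and `build_bit_array`, and the defaults of 1024 bits and three hash positions are not taken from the application.

```python
import hashlib
from typing import Iterable, List


def bit_positions(content_id: str, num_bits: int, num_hashes: int) -> List[int]:
    """Map a content identifier to num_hashes bit positions (hypothetical scheme)."""
    positions = []
    for seed in range(num_hashes):
        digest = hashlib.sha256(f"{seed}:{content_id}".encode("utf-8")).digest()
        positions.append(int.from_bytes(digest[:8], "big") % num_bits)
    return positions


def build_bit_array(
    content_ids: Iterable[str], num_bits: int = 1024, num_hashes: int = 3
) -> bytearray:
    """Set to 1 every bit that corresponds to any of the content identifiers."""
    bits = bytearray((num_bits + 7) // 8)  # all bits start at 0
    for content_id in content_ids:
        for p in bit_positions(content_id, num_bits, num_hashes):
            bits[p // 8] |= 1 << (p % 8)
    return bits
```

With these defaults the array occupies 128 bytes regardless of how many identifiers are added, which is the source of the storage saving; the trade-off is that a larger number of identifiers raises the chance that an unseen identifier maps entirely to bits already set.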
In another alternative implementation manner, the filtering result determining module 504 is configured to:
processing the content identification carried by the filtering request based on the hash processing mode corresponding to the target bit array to obtain a plurality of bits of the target bit array corresponding to the content identification carried by the filtering request;
in response to the values on the plurality of bits of the target bit array corresponding to the content identifier carried by the filtering request including 0, determining that the plurality of content identifiers represented by the target bit array do not include the content identifier carried by the filtering request, and determining that the filtering result is not to filter the content identifier carried by the filtering request;
and in response to the values on the plurality of bits of the target bit array corresponding to the content identifier carried by the filtering request all being 1, determining that the plurality of content identifiers represented by the target bit array include the content identifier carried by the filtering request, and determining that the filtering result is to filter the content identifier carried by the filtering request.
It should be noted that the data processing apparatus provided in the above embodiment is described using the above division of functional modules only as an example. In practical applications, the functions may be allocated to different functional modules as needed; that is, the internal structure of the server may be divided into different functional modules to complete all or part of the functions described above. In addition, the data processing apparatus and the data processing method provided by the above embodiments belong to the same concept; for the specific implementation process, refer to the method embodiments, which is not repeated here.
Fig. 6 is a block diagram of a server 600 according to an embodiment of the present application. The server 600 may vary considerably depending on its configuration or performance, and may include one or more processors (CPUs) 601 and one or more memories 602, where at least one program code is stored in the memory 602, and the at least one program code is loaded and executed by the processors 601 to implement the data processing method provided by the foregoing method embodiments. Of course, the server may also have components such as a wired or wireless network interface, a keyboard, and an input/output interface, so as to perform input and output, and the server may also include other components for implementing the functions of the device, which are not described herein again.
In an exemplary embodiment, there is also provided a computer readable storage medium having at least one program code stored therein, the at least one program code being executable by a processor in a server to perform the data processing method in the above-described embodiment. For example, the computer-readable storage medium may be a ROM (Read-Only Memory), a RAM (Random Access Memory), a CD-ROM (Compact Disc Read-Only Memory), a magnetic tape, a floppy disk, an optical data storage device, and the like.
The present application also provides a computer program product or a computer program comprising computer program code stored in a computer readable storage medium, which is read by a processor of a server from the computer readable storage medium, and which is executed by the processor to cause the server to execute the data processing method in the above-mentioned method embodiments.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The above description is only exemplary of the present application and should not be taken as limiting, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (12)

1. A method of data processing, the method comprising:
generating a first bit array based on a plurality of first content identifications in a plurality of first historical behavior data corresponding to any user identification in a data warehouse, wherein the first historical behavior data is used for recording the content of the executed interactive behavior, and each first content identification corresponds to a plurality of bits in the first bit array respectively;
storing the corresponding relation between the any user identification and the first bit array;
responding to a filtering request, and determining a target bit array corresponding to the user identification carried by the filtering request based on the stored corresponding relation between the user identification and the bit array;
and determining a filtering result of the content identification carried by the filtering request based on the target bit array.
2. The method of claim 1, wherein generating the first bit array based on a plurality of first content identifiers in a plurality of first historical behavior data corresponding to any user identifier in the data warehouse comprises:
in response to a query compression statement, calling a query compression module to generate, in a memory corresponding to a data warehouse tool hive, a first bit array based on a plurality of first content identifications in a plurality of first historical behavior data corresponding to any user identification in the data warehouse;
the query compression statement is used for indicating to obtain a first bit array used for representing a plurality of first content identifications corresponding to any user identification, and the query compression module is a functional module expanded on the basis of the hive.
3. The method of claim 1, wherein the first historical behavior data is used to record the content of the performed interactive behavior during a first time period;
the storing of the corresponding relationship between the any user identifier and the first bit array includes:
generating a first keyword based on the any user identification and the first time period;
and storing the corresponding relation between the first keyword and the first bit array.
4. The method according to claim 1, wherein after storing the correspondence between any of the user identifiers and the first bit array, the method further comprises:
in response to reaching the bit array update time, generating a second bit array based on a plurality of second content identifiers in a plurality of second historical behavior data corresponding to any user identifier in the data warehouse, wherein the second historical behavior data is used for recording the content of the executed interactive behaviors in a second time period to which the bit array update time belongs;
and updating the stored bit array corresponding to the second time period to the second bit array.
5. The method of claim 1, wherein the first historical behavior data is used to record the content of the target type of interactive behavior executed during the first time period;
the storing of the corresponding relationship between the any user identifier and the first bit array includes:
generating a second keyword based on the any user identification, the first time period and the target type of interaction behavior in the first historical behavior data;
and storing the corresponding relation between the second keyword and the first bit array.
6. The method according to claim 1, wherein before storing the correspondence between any of the user identifiers and the first bit array, the method further comprises:
encoding the first bit array to generate a character string;
the storing of the corresponding relationship between the any user identifier and the first bit array includes:
and storing the corresponding relation between any user identification and the character string.
7. The method of claim 1, wherein before determining a filter result for the content identifier carried by the filter request based on the target bit array, the method further comprises:
receiving at least one piece of real-time behavior data corresponding to any user identification, wherein the real-time behavior data is used for recording the content of the executed interactive behavior;
storing the corresponding relation of the any user identification and at least one content identification in at least one piece of real-time behavior data corresponding to the any user identification;
the determining, based on the target bit array, a filtering result of the content identifier carried by the filtering request includes:
and determining a filtering result of the content identification carried by the filtering request based on the target bit array and at least one third content identification corresponding to the user identification carried by the filtering request, wherein the at least one third content identification is derived from at least one real-time behavior data corresponding to the user identification carried by the filtering request.
8. The method of claim 1, wherein generating the first bit array based on a plurality of first content identifiers in a plurality of first historical behavior data corresponding to any user identifier in the data warehouse comprises:
processing each first content identifier based on a hash processing mode corresponding to the first bit array to obtain a plurality of bits of the first bit array corresponding to each first content identifier;
setting values on a plurality of bits of the first bit array corresponding to each first content identifier to be 1.
9. The method of claim 1, wherein determining a filtering result for the content identifier carried by the filtering request based on the target bit array comprises:
processing the content identification carried by the filtering request based on a hash processing mode corresponding to the target bit array to obtain a plurality of bits of the target bit array corresponding to the content identification carried by the filtering request;
in response to that values on a plurality of bits of the target bit array corresponding to the content identifier carried by the filtering request include 0, if the content identifier carried by the filtering request is not included in the plurality of content identifiers represented by the target bit array, determining that the filtering result is that the content identifier carried by the filtering request is not filtered;
and in response to that the values on the bits of the target bit array corresponding to the content identifiers carried by the filtering request are all 1, determining that the content identifiers represented by the target bit array comprise the content identifiers carried by the filtering request, and determining that the filtering result is to filter the content identifiers carried by the filtering request.
10. A data processing apparatus, characterized in that the apparatus comprises:
the bit array generating module is used for generating a first bit array based on a plurality of first content identifications in a plurality of first historical behavior data corresponding to any user identification in a data warehouse, wherein the first historical behavior data is used for recording the content of the executed interactive behavior, and each first content identification corresponds to a plurality of bits in the first bit array respectively;
a first corresponding relation storage module, configured to store a corresponding relation between the any user identifier and the first bit array;
the bit array determining module is used for responding to a filtering request, and determining a target bit array corresponding to the user identifier carried by the filtering request based on the corresponding relation between the stored user identifier and the bit array;
and the filtering result determining module is used for determining the filtering result of the content identification carried by the filtering request based on the target bit array.
11. A server, characterized in that the server comprises a processor and a memory, in which at least one program code is stored, which is loaded and executed by the processor to implement the data processing method according to any of claims 1-9.
12. A computer-readable storage medium, having stored therein at least one program code, which is loaded and executed by a processor, to implement the data processing method of any one of claims 1 to 9.
CN202011251902.4A 2020-11-11 2020-11-11 Data processing method, device, server and storage medium Pending CN112347355A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011251902.4A CN112347355A (en) 2020-11-11 2020-11-11 Data processing method, device, server and storage medium

Publications (1)

Publication Number Publication Date
CN112347355A true CN112347355A (en) 2021-02-09

Family

ID=74363322

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011251902.4A Pending CN112347355A (en) 2020-11-11 2020-11-11 Data processing method, device, server and storage medium

Country Status (1)

Country Link
CN (1) CN112347355A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination