Disclosure of Invention
In order to solve the technical problems mentioned in the background art, the present invention provides a crowd generation method and apparatus, so as to save computing resources and speed up crowd generation.
The embodiments of the invention provide the following specific technical solutions:
In a first aspect, a crowd generation method is provided, the method comprising:
receiving a crowd ID and corresponding crowd conditions for generating a crowd, and storing the crowd ID and the crowd conditions in a relational database in an associated manner;
acquiring a plurality of crowd conditions needing to be calculated from the relational database, and analyzing and converting each crowd condition into a query statement executable by a distributed search engine;
and starting multiple threads through a distributed computing engine, performing index-based data queries for the plurality of query statements in parallel in the distributed search engine, and storing the queried data into a Hive table.
Further, the receiving a crowd ID and corresponding crowd conditions for generating a crowd, and storing the crowd ID and the crowd conditions in a relational database in an associated manner, includes:
receiving the crowd ID and the corresponding crowd condition from the distributed message queue through Spark Streaming;
analyzing and converting the received crowd condition into a query statement executable by the distributed search engine, and querying the distributed search engine to obtain the number of covered people corresponding to the crowd ID;
and storing the crowd ID, the crowd condition and the number of covered people into the relational database in an associated manner, and setting the calculation state of the crowd ID according to the number of covered people.
Further, the setting of the calculation state of the crowd ID according to the number of covered people includes:
judging whether the number of covered people corresponding to the crowd ID is zero;
if yes, setting the state of the crowd ID as a successful calculation state;
and if not, setting the state of the crowd ID as a waiting calculation state.
Further, a user tag database and a corresponding tag index table are pre-stored in the distributed search engine, and the starting multiple threads through the distributed computing engine and performing index-based data queries for the plurality of query statements in parallel includes:
for the plurality of query statements, generating and executing a plurality of crowd computing tasks in a multithreaded manner through the distributed computing engine;
wherein each crowd computing task is used for querying the user tag database according to the tag index value related to the corresponding query statement in the tag index table, to obtain the user data related to that query statement.
Further, the crowd condition is an SQL condition, the relational database is MySQL, the distributed computing engine is Spark, and the distributed search engine is Elasticsearch.
Further, the method further comprises:
comparing the crowd IDs stored in the Hive table with the crowd IDs in the relational database whose state is the calculating state, judging whether any crowd ID is missing, and if so, updating the state of the missing crowd ID to the waiting calculation state.
Further, the method further comprises:
after receiving a data dump instruction, dumping the data stored in the Hive table to a server indicated by the data dump instruction.
In a second aspect, there is provided a crowd generating apparatus, the apparatus comprising:
the receiving module is used for receiving a crowd ID and a corresponding crowd condition for generating a crowd and storing the crowd ID and the crowd condition into a relational database in an associated manner;
the analysis module is used for acquiring a plurality of crowd conditions needing to be calculated from the relational database and analyzing and converting each crowd condition into a query statement executable by a distributed search engine;
and the computing module is used for starting multiple threads through a distributed computing engine, performing index-based data queries for the plurality of query statements in parallel in the distributed search engine, and storing the queried data into a Hive table.
Further, the receiving module specifically includes:
the receiving submodule is used for receiving the crowd ID and the corresponding crowd condition from the distributed message queue through Spark Streaming;
the query submodule is used for analyzing and converting the received crowd condition into a query statement executable by the distributed search engine, and querying the distributed search engine to obtain the number of covered people corresponding to the crowd ID;
and the storage submodule is used for storing the crowd ID, the crowd condition and the number of covered people into the relational database in an associated manner, and setting the calculation state of the crowd ID according to the number of covered people.
Further, the storage submodule is specifically configured to:
judging whether the number of covered people corresponding to the crowd ID is zero;
if yes, setting the state of the crowd ID as a successful calculation state;
and if not, setting the state of the crowd ID as a waiting calculation state.
Further, a user tag database and a corresponding tag index table are pre-stored in the distributed search engine, and the computing module is specifically configured to:
for the plurality of query statements, generate and execute a plurality of crowd computing tasks in a multithreaded manner through the distributed computing engine;
wherein each crowd computing task is used for querying the user tag database according to the tag index value related to the corresponding query statement in the tag index table, to obtain the user data related to that query statement.
Further, the crowd condition is an SQL condition, the relational database is MySQL, the distributed computing engine is Spark, and the distributed search engine is Elasticsearch.
Further, the computing module is further specifically configured to:
compare the crowd IDs stored in the Hive table with the crowd IDs in the relational database whose state is the calculating state, judge whether any crowd ID is missing, and if so, update the state of the missing crowd ID to the waiting calculation state.
Further, the apparatus further comprises:
a service module, used for dumping the data stored in the Hive table to a server indicated by a data dump instruction after receiving the data dump instruction.
In a third aspect, a computer device is provided, comprising:
one or more processors;
storage means for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the crowd generation method according to any one of the first aspect.
In a fourth aspect, there is provided a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the crowd generation method according to any one of the first aspects.
The embodiments of the invention provide a crowd generation method and apparatus. A distributed search engine is introduced into the crowd calculation process, and crowds are generated by combining the distributed computing engine with the distributed search engine, so that operations over the full data set are replaced by index-based queries. This reduces the amount of data operated on, speeds up crowd generation from the minute level to the second level, and saves computing resources. In addition, multithreaded parallel computation is started through the distributed computing engine, which makes it possible to compute multiple crowds quickly and in parallel.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.
It is to be understood that, unless the context clearly requires otherwise, throughout the description and the claims, the words "comprise", "comprising", and the like are to be construed in an inclusive sense as opposed to an exclusive or exhaustive sense; that is, what is meant is "including but not limited to".
Furthermore, in the description of the present invention, it is to be understood that the terms "first", "second", etc. are used merely to distinguish between descriptions and are not to be construed as indicating or implying relative importance. In addition, in the description of the present invention, "a plurality" means two or more unless otherwise specified.
At present, crowd generation usually relies on an offline computing engine such as Hive, so that the full table is operated on each time, many tasks are created, resources are wasted, and the calculation time is generally at the minute level. Therefore, an embodiment of the invention provides a crowd generation method that receives crowd IDs and corresponding crowd conditions for generating crowds, parses the crowd conditions and converts them into query statements of a distributed search engine, and, by combining a distributed computing engine with the distributed search engine, queries the people who satisfy the conditions quickly and in parallel on the basis of indexes.
Example one
An embodiment of the present invention provides a crowd generation method, as shown in fig. 1, the method may include the steps of:
Step S11, receiving the crowd ID and the corresponding crowd condition for generating the crowd, and storing the crowd ID and the crowd condition in a relational database in an associated manner.
The user terminal receives a tag selection operation input by a user (such as a merchant user or a platform operator) on a visual interface and generates the crowd condition and the corresponding crowd ID, where the crowd ID uniquely identifies the crowd condition; that is, a crowd packet is generated. In addition, the user also sets an effective time period when inputting the tag selection operation, so that the user terminal obtains the effective time period of the crowd condition, which comprises an effective start date and an effective end date.
After the user terminal generates the crowd ID and the crowd condition, the crowd ID and the crowd condition are sent to the distributed message queue Kafka to wait for Spark Streaming to consume.
The crowd condition can adopt a structured query statement SQL, and the relational database can adopt a MySQL database.
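As an illustration of the message that the user terminal places on the queue, the following minimal sketch models one crowd request and its JSON serialization. The class name `CrowdRequest`, the field names and the sample condition are assumptions made for this example; the embodiment does not prescribe a message format.

```python
import json
from dataclasses import dataclass, asdict
from datetime import date


@dataclass(frozen=True)
class CrowdRequest:
    """Hypothetical queue message describing one crowd to be generated."""
    crowd_id: str         # uniquely identifies the crowd condition
    condition_sql: str    # crowd condition expressed as an SQL condition
    effective_start: str  # ISO date, first day of the effective time period
    effective_end: str    # ISO date, last day of the effective time period

    def to_json(self) -> str:
        # Serialized form that a producer could send to the message queue.
        return json.dumps(asdict(self), sort_keys=True)

    @staticmethod
    def from_json(payload: str) -> "CrowdRequest":
        # Inverse of to_json, as a consumer would apply it.
        return CrowdRequest(**json.loads(payload))

    def is_effective_on(self, day: date) -> bool:
        # True when `day` falls inside the effective time period (inclusive).
        start = date.fromisoformat(self.effective_start)
        end = date.fromisoformat(self.effective_end)
        return start <= day <= end


request = CrowdRequest(
    "crowd-001", "brand = 'A' AND behavior = 'pay'", "2024-01-01", "2024-01-31"
)
```

The effective start and end dates travel with the condition, so a consumer can decide on any given day whether the crowd still needs to be calculated.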
Specifically, the implementation process of step S11 may include:
Step a, receiving the crowd ID and the corresponding crowd condition from the distributed message queue through Spark Streaming.
Step b, parsing the received crowd condition and converting it into a query statement executable by a distributed search engine, and querying the distributed search engine to obtain the number of covered people corresponding to the crowd ID.
In this embodiment, the distributed computing engine may adopt Spark, and the distributed search engine may adopt Elasticsearch.
Specifically, the crowd condition in the distributed message queue Kafka is consumed through Spark Streaming, the obtained crowd condition SQL statement is converted into a DSL (Domain Specific Language) statement for querying Elasticsearch, and the number of people covered by the crowd ID is obtained by executing the query.
In the process of converting the SQL statement into an Elasticsearch DSL query statement, the SQL statement may be parsed to generate an SQL syntax tree, and the SQL syntax tree may then be converted into the Elasticsearch DSL statement.
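The parsing half of this conversion can be sketched as follows. This is a hand-written recursive-descent parser for a small subset of SQL conditions (comparisons joined by AND, OR and parentheses), shown only to make the notion of a syntax tree concrete; the node classes `Comparison` and `BoolNode` are names introduced for this example, and a real implementation would more likely rely on an existing SQL parser.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass
class Comparison:
    field: str
    op: str
    value: str


@dataclass
class BoolNode:
    op: str  # "AND" or "OR"
    children: List["Node"]


Node = Union[Comparison, BoolNode]

# Parentheses, comparison operators, quoted strings, identifiers, integers.
TOKEN = re.compile(r"\s*(\(|\)|>=|<=|!=|=|>|<|'[^']*'|[A-Za-z_][A-Za-z_0-9.]*|\d+)")


def tokenize(text: str) -> List[str]:
    text = text.strip()
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected input at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


class Parser:
    def __init__(self, tokens: List[str]) -> None:
        self.tokens = tokens
        self.pos = 0

    def peek(self) -> Optional[str]:
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self) -> str:
        token = self.peek()
        if token is None:
            raise ValueError("unexpected end of condition")
        self.pos += 1
        return token

    def parse_or(self) -> Node:
        # OR binds more loosely than AND.
        node = self.parse_and()
        while (self.peek() or "").upper() == "OR":
            self.take()
            node = BoolNode("OR", [node, self.parse_and()])
        return node

    def parse_and(self) -> Node:
        node = self.parse_atom()
        while (self.peek() or "").upper() == "AND":
            self.take()
            node = BoolNode("AND", [node, self.parse_atom()])
        return node

    def parse_atom(self) -> Node:
        token = self.take()
        if token == "(":
            node = self.parse_or()
            if self.take() != ")":
                raise ValueError("expected ')'")
            return node
        # Otherwise the token is a field name followed by operator and value.
        op = self.take()
        value = self.take().strip("'")
        return Comparison(token, op, value)


def parse_condition(text: str) -> Node:
    parser = Parser(tokenize(text))
    tree = parser.parse_or()
    if parser.peek() is not None:
        raise ValueError("trailing tokens after condition")
    return tree
```

For example, `parse_condition("brand = 'A' AND (behavior = 'browse' OR behavior = 'pay')")` yields an AND node whose second child is an OR node, which is the shape a later conversion step would walk.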
Step c, storing the crowd ID, the crowd condition and the number of covered people in the relational database in an associated manner, and setting the calculation state of the crowd ID according to the number of covered people.
Specifically, when the crowd ID, the crowd condition and the number of covered people are stored in the relational database in an associated manner, it is determined whether the number of covered people corresponding to the crowd ID is zero; if so, the state of the crowd ID is set to the successful calculation state in the relational database, and if not, the state of the crowd ID is set to the waiting calculation state.
In this embodiment, if the number of people covered by a crowd is zero, the state of the crowd ID may be directly set to the successful calculation state, and the crowd ID may then be returned as successfully calculated.
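Step c can be sketched as below. The sketch uses an in-memory SQLite database purely as a stand-in for MySQL so that it runs without a server; the table name `crowd`, its columns and the function names are assumptions for illustration, while the state codes follow the "1", "2", "3" encoding given later in this description.

```python
import sqlite3

# State codes as described in this document: waiting, calculating, successful.
WAITING, CALCULATING, SUCCESS = "1", "2", "3"


def initial_state(coverage: int) -> str:
    # A crowd that covers nobody needs no further calculation.
    return SUCCESS if coverage == 0 else WAITING


def store_crowd(conn: sqlite3.Connection, crowd_id: str,
                condition_sql: str, coverage: int) -> str:
    """Store the crowd ID, condition and coverage together with the state."""
    state = initial_state(coverage)
    conn.execute(
        "INSERT INTO crowd (crowd_id, condition_sql, coverage, state) "
        "VALUES (?, ?, ?, ?)",
        (crowd_id, condition_sql, coverage, state),
    )
    conn.commit()
    return state


# Hypothetical schema; SQLite stands in for the relational database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE crowd (crowd_id TEXT PRIMARY KEY, condition_sql TEXT, "
    "coverage INTEGER, state TEXT)"
)
```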
Step S12, acquiring a plurality of crowd conditions needing to be calculated from the relational database, and parsing and converting each crowd condition into a query statement executable by the distributed search engine.
Specifically, obtaining a plurality of crowd conditions needing to be calculated from the relational database may include:
acquiring, at scheduled intervals through the distributed computing engine, a plurality of crowd IDs in the waiting calculation state from the relational database, and correspondingly modifying the states of these crowd IDs from the waiting calculation state to the calculating state.
In this embodiment, a Spark crowd calculation task may be started at scheduled intervals to scan the data in MySQL and determine whether any crowd ID is in the waiting calculation state (i.e., a crowd that needs to be calculated). If so, the crowd IDs are sorted in descending order of receiving time, the first N crowd IDs (e.g., N = 700) are taken out for calculation, and the state of the taken-out crowd IDs is modified from the waiting calculation state to the calculating state.
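The scan described above might look like the following sketch. SQLite again stands in for MySQL, and the table layout, the `received_at` column and the function names are assumptions for illustration; only the ordering, the limit N and the state change come from the description.

```python
import sqlite3

WAITING, CALCULATING = "1", "2"


def claim_waiting_crowds(conn: sqlite3.Connection, limit: int):
    """Take the first `limit` waiting crowds, newest first, and mark them."""
    rows = conn.execute(
        "SELECT crowd_id, condition_sql FROM crowd "
        "WHERE state = ? ORDER BY received_at DESC LIMIT ?",
        (WAITING, limit),
    ).fetchall()
    # Move the selected crowds from the waiting state to the calculating state.
    conn.executemany(
        "UPDATE crowd SET state = ? WHERE crowd_id = ?",
        [(CALCULATING, crowd_id) for crowd_id, _ in rows],
    )
    conn.commit()
    return rows


def make_demo_db() -> sqlite3.Connection:
    # Hypothetical sample data: three waiting crowds and one finished crowd.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE crowd (crowd_id TEXT PRIMARY KEY, condition_sql TEXT, "
        "state TEXT, received_at INTEGER)"
    )
    conn.executemany(
        "INSERT INTO crowd VALUES (?, ?, ?, ?)",
        [
            ("c1", "brand = 'A'", WAITING, 100),
            ("c2", "brand = 'B'", WAITING, 200),
            ("c3", "brand = 'C'", "3", 300),
            ("c4", "brand = 'D'", WAITING, 400),
        ],
    )
    conn.commit()
    return conn
```

Calling `claim_waiting_crowds(make_demo_db(), 2)` returns the two most recently received waiting crowds and leaves the older waiting crowd for a later run.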
Specifically, parsing and converting each crowd condition into a query statement executable by the distributed search engine includes:
parsing each crowd condition to generate a syntax tree, and converting each syntax tree into a query statement executable by the distributed search engine.
Specifically, after each crowd condition is parsed to generate a syntax tree, the syntax tree may be processed recursively to generate a query statement executable by the distributed search engine. During the recursive processing, when the nodes of the syntax tree are traversed, it may be determined whether a parent node and its child node belong to the same nested structure, and the query statement is generated according to the result of this determination.
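A recursive conversion of this kind can be sketched as follows. The sketch rests on one interpretation that the description does not confirm: a parent and child are treated as being "in the same nested structure" when they use the same Boolean operator, in which case the child's clauses are merged into the parent's clause list instead of being wrapped in a further `bool` query. If the intended test concerns Elasticsearch nested field paths instead, the grouping rule would differ. The node classes are names introduced for this example; the emitted `bool`, `must`, `should`, `must_not`, `term` and `range` keys are standard Elasticsearch Query DSL.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Comparison:
    field: str
    op: str
    value: str


@dataclass
class BoolNode:
    op: str  # "AND" or "OR"
    children: List[Union["BoolNode", Comparison]]


CLAUSE_KEY = {"AND": "must", "OR": "should"}
RANGE_OPS = {">": "gt", ">=": "gte", "<": "lt", "<=": "lte"}


def leaf_to_dsl(leaf: Comparison) -> dict:
    if leaf.op == "=":
        return {"term": {leaf.field: leaf.value}}
    if leaf.op == "!=":
        return {"bool": {"must_not": [{"term": {leaf.field: leaf.value}}]}}
    if leaf.op in RANGE_OPS:
        return {"range": {leaf.field: {RANGE_OPS[leaf.op]: leaf.value}}}
    raise ValueError(f"unsupported operator: {leaf.op}")


def to_dsl(node: Union[BoolNode, Comparison]) -> dict:
    if isinstance(node, Comparison):
        return leaf_to_dsl(node)
    key = CLAUSE_KEY[node.op]
    clauses = []
    for child in node.children:
        if isinstance(child, BoolNode) and child.op == node.op:
            # Assumed rule: same operator as the parent, so merge the
            # child's clauses into the parent's list.
            clauses.extend(to_dsl(child)["bool"][key])
        else:
            clauses.append(to_dsl(child))
    body = {key: clauses}
    if node.op == "OR":
        body["minimum_should_match"] = 1
    return {"bool": body}


def build_query(tree: Union[BoolNode, Comparison]) -> dict:
    # Wrap the converted tree in the top-level "query" element.
    return {"query": to_dsl(tree)}
```

Merging same-operator children keeps the generated query shallow, which makes it easier to read and avoids needless nesting when a long chain of AND conditions is parsed into a left-leaning tree.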
Step S13, starting multiple threads through the distributed computing engine, performing index-based data queries for the plurality of query statements in parallel in the distributed search engine, and storing the queried data into the Hive table.
The distributed search engine stores a user tag database and a corresponding tag index table in advance. The user tag database stores user data, which comprises a user ID and corresponding user tags; the user data may be obtained by computing the original data in a data warehouse through the distributed computing engine and then dumped into the distributed search engine.
In practical applications, the user tag may include a user identification, a business object, a behavior type, and a timestamp, the business object including at least one of a brand of goods, a category of goods, and a store, the behavior type including at least one of browsing, searching, shopping, collecting, submitting an order, and paying an order for the business object.
Specifically, for the plurality of query statements, a plurality of crowd computing tasks are generated and executed in a multithreaded manner through the distributed computing engine; each crowd computing task is used for querying the user tag database according to the tag index value related to the corresponding query statement in the tag index table, to obtain the user data related to that query statement.
In this embodiment, when the calculation is performed through Spark, M threads (for example, M = 10) may be started in a thread pool at the Driver end of Spark, where each thread processes one crowd computing task and the plurality of crowd computing tasks are distributed across the threads. For each query statement, the tag index value related to the key field in the query statement may be obtained from the tag index table, and the user tag database is queried according to the obtained tag index value to obtain the user data related to the query statement. The required data is thus queried and stored in the Hive table in such a way that one crowd packet corresponds to one partition, where one crowd packet includes a crowd ID and the corresponding user data.
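The thread-pool pattern described above can be illustrated in plain Python. This is only a sketch of the control flow: the in-memory dictionary `TAG_INDEX` is a hypothetical stand-in for the tag index table in the search engine, each task handles only conjunctions of tags, and the returned dictionary stands in for the Hive table with one entry per crowd packet. It does not use Spark or Elasticsearch.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical tag index: tag index value -> IDs of users carrying that tag.
TAG_INDEX = {
    ("brand", "A"): {1, 2, 3, 5},
    ("behavior", "pay"): {2, 3, 4},
    ("category", "shoes"): {3, 5, 6},
}


def run_crowd_task(crowd_id, tag_keys):
    """One crowd computing task: look up each tag and intersect the results."""
    user_sets = [TAG_INDEX.get(key, set()) for key in tag_keys]
    users = set.intersection(*user_sets) if user_sets else set()
    return crowd_id, sorted(users)


def compute_crowds(tasks, max_threads=10):
    """Run one task per crowd on a pool of at most `max_threads` threads."""
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        results = pool.map(lambda item: run_crowd_task(*item), tasks.items())
        # One entry per crowd packet, mirroring one partition per crowd.
        return dict(results)
```

Because every task reads only the index entries named in its own condition, the tasks do not depend on one another and can run concurrently, which is what allows several crowds to be computed in one batch.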
It should be noted that, generally, one crowd condition corresponds to one effective time period, and on each day within the effective time period the crowd is recalculated according to the crowd condition and used for service delivery or activity delivery. This embodiment can therefore quickly update, in batches, the crowd delivered on the current day and the crowd to be delivered on the next day, ensuring the timeliness of the data delivered on the next day.
The embodiment of the invention provides a crowd generation method, which comprises: receiving a crowd ID and corresponding crowd conditions for generating a crowd, and storing the crowd ID and the crowd conditions into a relational database in an associated manner; acquiring a plurality of crowd conditions needing to be calculated from the relational database, and parsing and converting each crowd condition into a query statement executable by a distributed search engine; and starting multiple threads through a distributed computing engine, performing index-based data queries for the plurality of query statements in parallel in the distributed search engine, and storing the queried data into a Hive table. Crowd generation is thus performed by combining the distributed computing engine with the distributed search engine, so that operations over the full data set are replaced by index-based queries; this reduces the amount of data operated on, speeds up crowd generation from the minute level to the second level, and saves computing resources. Multithreaded parallel computation started through the distributed computing engine also makes it possible to compute multiple crowds quickly and in parallel.
Example two
On the basis of the first embodiment, the embodiment of the present invention further provides a crowd generating method, as shown in fig. 2, the method may include the steps of:
Step S21, receiving the crowd ID and the corresponding crowd condition for generating the crowd, and storing the crowd ID and the crowd condition into a relational database in an associated manner.
Specifically, the implementation process of step S21 may refer to step S11 in the first embodiment, and is not described herein again.
Step S22, acquiring a plurality of crowd conditions needing to be calculated from the relational database, and parsing and converting each crowd condition into a query statement executable by the distributed search engine.
Specifically, the implementation process of step S22 may refer to step S12 in the first embodiment, and details are not repeated here.
Step S23, starting multiple threads through the distributed computing engine, performing index-based data queries for the plurality of query statements in parallel in the distributed search engine, and storing the queried data into the Hive table.
Specifically, the implementation process of step S23 may refer to step S13 of the first embodiment, and details are not repeated here.
Step S24, comparing the crowd IDs stored in the Hive table with the crowd IDs in the relational database whose state is the calculating state, and judging whether any crowd ID is missing; if so, executing step S25, otherwise, executing step S26.
Specifically, after all the crowd conditions to be calculated in the batch have been calculated, a missing-calculation check is started: all the crowd IDs whose calculation state is the calculating state are obtained from the MySQL database and compared with the crowd IDs stored in the Hive table. If a crowd ID exists in the MySQL database but not in the Hive table, it is determined to be a missing crowd ID.
Step S25, updating the state of the missing crowd ID to the waiting calculation state.
Specifically, the number of missing crowd IDs is counted, and an alarm is issued when this number is judged to exceed 50% of the total number of crowd IDs to be calculated in the batch; the state of the missing crowd IDs in the MySQL database is set to the waiting calculation state, so that they can be calculated again in the next batch.
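The missing-calculation check reduces to a set difference plus a threshold test, as in the sketch below. The function and parameter names are illustrative, and the sketch assumes that every missing crowd ID is reset to the waiting calculation state whether or not the alarm fires.

```python
WAITING = "1"  # code of the waiting calculation state


def check_missing(calculating_ids, hive_ids, batch_total, alarm_ratio=0.5):
    """Return the missing crowd IDs and whether an alarm should be raised."""
    stored = set(hive_ids)
    missing = sorted(cid for cid in set(calculating_ids) if cid not in stored)
    # Alarm only when strictly more than `alarm_ratio` of the batch is missing.
    alarm = batch_total > 0 and len(missing) > alarm_ratio * batch_total
    return missing, alarm


def reset_missing(states, missing):
    """Put the missing crowd IDs back into the waiting calculation state."""
    for crowd_id in missing:
        states[crowd_id] = WAITING
    return states
```

With a batch of four crowds, two missing IDs are exactly 50% and do not trigger the alarm, whereas three missing IDs do.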
Step S26, after receiving the data dump instruction, dumping the data stored in the Hive table to the server indicated by the data dump instruction.
The data dump instruction may be input by a platform operator on a user terminal, and the server indicated in the data dump instruction may be a Redis server or an FTP server. Here, the platform may be an e-commerce platform.
In this embodiment, for different service requirements, a platform operator may input a corresponding data dump instruction to dump the data stored in the Hive table to different servers, so as to provide different data services.
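One way to route a dump instruction to the right target is a small dispatch table, sketched below. The instruction format (a dictionary with a `target` key) and the output layouts are assumptions for illustration; the functions only render what would be sent (Redis `SADD` commands or tab-separated file lines) and do not contact any server.

```python
def to_redis_commands(partitions):
    """Render each non-empty crowd packet as one Redis SADD command."""
    return [
        ("SADD", f"crowd:{crowd_id}", *[str(uid) for uid in users])
        for crowd_id, users in sorted(partitions.items())
        if users
    ]


def to_ftp_lines(partitions):
    """Render crowd packets as tab-separated lines for a file upload."""
    return [
        f"{crowd_id}\t{uid}"
        for crowd_id, users in sorted(partitions.items())
        for uid in users
    ]


# Hypothetical mapping from the target named in the instruction to a renderer.
DUMPERS = {"redis": to_redis_commands, "ftp": to_ftp_lines}


def handle_dump(instruction, partitions):
    target = instruction["target"]
    if target not in DUMPERS:
        raise ValueError(f"unknown dump target: {target}")
    return DUMPERS[target](partitions)
```

Supporting a further kind of server then means registering one more function in the table, without touching the calculation code.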
It is noted that, after dumping the data stored in the Hive table to the server indicated by the data dumping instruction, the calculation state of the corresponding crowd ID is set as the calculation success state.
In addition, the above calculation states include the waiting calculation state, the calculating state and the successful calculation state, which may be represented in the MySQL database as "1", "2" and "3", respectively.
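The three states and the transitions between them that are mentioned in the two embodiments can be summarized in a small sketch. The enum and function names are illustrative; the allowed transitions are collected from the description (waiting to calculating when a batch is taken out, calculating back to waiting when a crowd is found missing, calculating to successful after the dump), and a crowd with zero coverage starts directly in the successful state.

```python
from enum import Enum


class CrowdState(Enum):
    WAITING = "1"      # waiting calculation state
    CALCULATING = "2"  # calculating state
    SUCCESS = "3"      # successful calculation state


# Transitions described in the embodiments.
ALLOWED = {
    CrowdState.WAITING: {CrowdState.CALCULATING},
    CrowdState.CALCULATING: {CrowdState.WAITING, CrowdState.SUCCESS},
    CrowdState.SUCCESS: set(),
}


def transition(current: CrowdState, target: CrowdState) -> CrowdState:
    """Return the new state, rejecting transitions that are not described."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current.name} -> {target.name}")
    return target
```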
In this embodiment, crowds are generated by combining the distributed computing engine with the distributed search engine, so that operations over the full data set are replaced by index-based queries; this reduces the amount of data operated on, speeds up crowd generation from the minute level to the second level, and saves computing resources. Multithreaded parallel computation started through the distributed computing engine makes it possible to compute multiple crowds quickly and in parallel. In addition, following a layered design, the crowd generation logic is decomposed into a receiving layer, an analysis layer, a calculation layer and a service layer that are decoupled from one another, which improves the extensibility of the code; complex and changeable business logic can be encapsulated in the service layer, and different services are provided through the service layer.
Example three
An embodiment of the present invention provides a crowd generating device, as shown in fig. 3, the device may include:
a receiving module 31, configured to receive a crowd ID and a corresponding crowd condition for generating a crowd, and store the crowd ID and the crowd condition in a relational database in an associated manner;
an analysis module 32, configured to acquire a plurality of crowd conditions needing to be calculated from the relational database, and parse and convert each crowd condition into a query statement executable by the distributed search engine;
and a computing module 33, configured to start multiple threads through the distributed computing engine, perform index-based data queries for the plurality of query statements in parallel in the distributed search engine, and store the queried data into the Hive table.
Further, the receiving module 31 specifically includes:
the receiving submodule is used for receiving the crowd ID and the corresponding crowd condition from the distributed message queue through Spark Streaming;
the query submodule is used for parsing and converting the received crowd condition into a query statement executable by the distributed search engine, and querying the distributed search engine to obtain the number of covered people corresponding to the crowd ID;
and the storage submodule is used for storing the crowd ID, the crowd condition and the number of covered people into the relational database in an associated manner, and setting the calculation state of the crowd ID according to the number of covered people.
Further, the storage submodule is specifically configured to:
judging whether the number of covered people corresponding to the crowd ID is zero;
if so, setting the state of the crowd ID as a successful calculation state;
and if not, setting the state of the crowd ID as a waiting calculation state.
Further, a user tag database and a corresponding tag index table are pre-stored in the distributed search engine, and the computing module 33 is specifically configured to:
for the plurality of query statements, generate and execute a plurality of crowd computing tasks in a multithreaded manner through the distributed computing engine;
wherein each crowd computing task is used for querying the user tag database according to the tag index value related to the corresponding query statement in the tag index table, to obtain the user data related to that query statement.
Further, the crowd condition is SQL condition, the relational database is MySQL, the distributed computing engine is Spark, and the distributed search engine is Elasticsearch.
Further, the computing module 33 is further specifically configured to:
compare the crowd IDs stored in the Hive table with the crowd IDs in the relational database whose state is the calculating state, judge whether any crowd ID is missing, and if so, update the state of the missing crowd ID to the waiting calculation state.
Further, the apparatus further comprises:
a service module 34, configured to dump the data stored in the Hive table to the server indicated by a data dump instruction after receiving the data dump instruction.
The crowd generating device provided by the embodiment of the invention belongs to the same inventive concept as the crowd generation method provided by the embodiments of the invention, can execute that crowd generation method, and has the functional modules and beneficial effects corresponding to executing the crowd generation method. For technical details that are not described in this embodiment, reference may be made to the crowd generation method provided in the embodiments of the present invention, and details are not repeated here.
In addition, an embodiment of the present invention further provides a computer device, where the computer device includes:
one or more processors;
storage means for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the steps of the crowd generation method as described in the embodiments above.
Another embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored, which when executed by a processor, implements the steps of the crowd generation method according to the above embodiment.
As will be appreciated by one of skill in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
Embodiments of the present invention have been described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the embodiments of the invention.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.