CN111782686A - User data query method and device, electronic equipment and storage medium - Google Patents

Info

Publication number
CN111782686A
Authority
CN
China
Prior art keywords
user
label
query
data
tag
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202010404565.1A
Other languages
Chinese (zh)
Inventor
王露珠
赵领杰
秦思源
冯浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sankuai Online Technology Co Ltd
Original Assignee
Beijing Sankuai Online Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sankuai Online Technology Co Ltd
Priority to CN202010404565.1A
Publication of CN111782686A
Legal status: Withdrawn

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2471 Distributed queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2453 Query optimisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Mathematical Physics (AREA)
  • Fuzzy Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of the application disclose a user data query method and device, electronic equipment and a storage medium. The method includes: obtaining a data query request, where the data query request includes an initial query condition represented by at least one user portrait tag; determining at least one tag code corresponding to each of the at least one user portrait tag, and converting the initial query condition into a target query condition represented by the at least one tag code, where a tag code is the encoded data of a user portrait tag; acquiring user data satisfying the target query condition from a search engine to obtain a query result, where the search engine stores the tag codes corresponding to each user in columnar storage; and outputting the query result. Because the query can run over all user data, the embodiments improve the accuracy of the query result while also improving query efficiency.

Description

User data query method and device, electronic equipment and storage medium
Technical Field
The embodiment of the application relates to the technical field of internet, in particular to a user data query method and device, electronic equipment and a storage medium.
Background
With the development of the internet, internet companies pay more and more attention to fine-grained operations, and segment and profile their user groups ever more finely to guide decision making and user outreach, thereby improving product performance. Extracting user group characteristics means taking a group defined by given screening conditions, for example high-spending, low-stickiness female white-collar users who use an iPhone, and returning characteristic information about that group, such as group size, behavior, interest preferences, search preferences and value attributes.
In the prior art, query computation on user group characteristics can be performed in three ways. The first is distributed query computation based on the big data tools Spark or Hadoop: the computation runs simultaneously on hundreds or even thousands of machines in a distributed cluster, and the computation time is at the minute level. The second is query computation based on Spark Job Server and bitmap storage: the data is stored in bitmap format to compress the data volume, and Spark Job Server is used on top of this to save cluster scheduling time. The third is query computation based on ES (Elasticsearch, a search engine based on inverted indexes) sampling: ES has difficulty handling PB-level data, so the data must first be compressed to the TB level by indicator extraction; TB-level data is still difficult for ES to handle, so the data is usually sampled, and ES performs fast query computation on the sampled data using its indexes, providing second-level computation on a sampled version of the data.
The first two ways obtain results by querying all user data, but their efficiency is low: the first takes minutes and the second takes 10-20 seconds, which still falls short of second-level response. The third way reaches second-level speed, but computation on sampled data is lossy, so the accuracy of the query result is low.
Disclosure of Invention
The embodiment of the application provides a user data query method, a user data query device, electronic equipment and a storage medium, which are beneficial to improving query efficiency and improving query result accuracy.
In order to solve the above problem, in a first aspect, an embodiment of the present application provides a method for querying user data, including:
obtaining a data query request, wherein the data query request comprises an initial query condition represented by at least one user portrait tag;
determining at least one label code corresponding to each user portrait label, and converting the initial query condition into a target query condition represented by the at least one label code, wherein the label code is the encoded data of the user portrait label;
acquiring user data meeting the target query condition from a search engine to obtain a query result, wherein the search engine stores tag codes corresponding to various users in a column storage mode;
and outputting the query result.
In a second aspect, an embodiment of the present application provides an apparatus for querying user data, including:
a query request acquisition module for acquiring a data query request, the data query request including an initial query condition represented by at least one user portrait tag;
the query condition conversion module is used for determining at least one label code corresponding to the at least one user portrait label respectively and converting the initial query condition into a target query condition represented by the at least one label code, wherein the label code is the coded data of the user portrait label;
the query module is used for acquiring the user data meeting the target query condition from a search engine to obtain a query result; the search engine stores tag codes corresponding to various users in a columnar storage mode;
and the query result output module is used for outputting the query result.
In a third aspect, an embodiment of the present application further provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the user data query method according to the embodiment of the present application when executing the computer program.
In a fourth aspect, the present application provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of the query method for user data according to the present application.
In the user data query method and device, electronic equipment and storage medium provided by the embodiments of the present application, a data query request including an initial query condition represented by at least one user portrait tag is obtained; at least one tag code corresponding to each of the at least one user portrait tag is determined, and the initial query condition is converted into a target query condition represented by the at least one tag code; user data satisfying the target query condition is acquired from a search engine to obtain a query result; and the query result is output. Because a tag code is the encoded data of a user portrait tag, the original data is compressed. Because the search engine stores the tag codes of all users in columnar storage, the data is compressed further. The data volume is therefore greatly reduced, so that queries can run over all user data, improving the accuracy of the query result while also improving query efficiency.
Drawings
To illustrate the technical solutions of the embodiments of the present application more clearly, the drawings needed in the description of the embodiments or the prior art are briefly described below. The drawings in the following description show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without inventive effort.
Fig. 1 is a flowchart of a user data query method according to a first embodiment of the present application;
Fig. 2 is a flowchart of a user data query method according to a second embodiment of the present application;
Fig. 3 is a schematic structural diagram of a user data query device according to a third embodiment of the present application;
Fig. 4 is a schematic structural diagram of an electronic device according to a fourth embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Example one
The user data query method provided by this embodiment is suitable for querying user group data. As shown in Fig. 1, the method includes steps 110 to 140.
Step 110, obtaining a data query request, where the data query request includes an initial query condition represented by at least one user portrait tag.
A user portrait tag is descriptive information that abstracts, classifies and summarizes a certain feature of a user group, such as male, female, Android or iPhone. User portrait tags may include user tags, behavior tags, consumption tags, content analysis tags and the like. User tags may include at least one of gender, age, income, occupation, education level, home city and the like; behavior tags include at least one of time period, frequency, duration, access path and the like; consumption tags include at least one of consumption habits, purchase intention, promotion sensitivity and the like; content analysis tags are obtained by analyzing the content a user usually browses, particularly content with long dwell time and many views, to determine the content the user is interested in, such as finance, entertainment, education, sports, fashion or science and technology.
A user can input an initial query condition represented by at least one user portrait tag through the operation interface of a client, and the client generates a data query request including the initial query condition. Alternatively, the user may select at least one user portrait tag from the user portrait tags displayed in the operation interface of the client; the client determines the initial query condition represented by the selected tag or tags and generates a data query request including the initial query condition. The client sends the data query request to the electronic equipment executing the user data query method, so that the electronic equipment obtains the data query request.
Step 120, determining at least one tag code corresponding to each of the at least one user portrait tag, and converting the initial query condition into a target query condition represented by the at least one tag code, where the tag code is encoded data of the user portrait tag.
All user portrait tags are encoded in advance as tag codes, so that the corresponding relation between the user portrait tags and the tag codes is obtained and stored. When a data query request is acquired, determining at least one label code corresponding to at least one user portrait label in the data query request according to the corresponding relation, and converting an initial query condition into a target query condition represented by the at least one label code.
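The conversion described in step 120 can be sketched as a simple lookup. This is a minimal sketch, not the application's implementation: the tag names, the integer codes, the `TAG_TO_CODE` table and the choice to treat the condition as a list of tags that must all match are assumptions made for illustration.

```python
# Hypothetical tag-to-code correspondence, built and stored in advance.
# Tag names and integer codes are illustrative only.
TAG_TO_CODE = {
    "gender:female": 1,
    "gender:male": 2,
    "device:iPhone": 3,
    "device:Android": 4,
}


def to_target_condition(initial_condition):
    """Replace every user portrait tag in the condition with its tag code."""
    missing = [tag for tag in initial_condition if tag not in TAG_TO_CODE]
    if missing:
        raise KeyError(f"no tag code stored for: {missing}")
    return [TAG_TO_CODE[tag] for tag in initial_condition]


# Initial query condition expressed with user portrait tags (all must match).
initial = ["gender:female", "device:iPhone"]
target = to_target_condition(initial)
print(target)  # [1, 3]
```

Rejecting unknown tags up front keeps a mistyped tag from silently turning into an empty query.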
Step 130, obtaining user data meeting the target query condition from a search engine to obtain a query result, wherein the search engine stores tag codes corresponding to each user in a column storage manner.
The search engine is ES (Elasticsearch), an open-source distributed search engine that exposes a RESTful web interface and is built on Apache Lucene. In ES, the tag codes corresponding to one user are stored as one document, and the search engine records, through an inverted index, the identifiers of the documents in which each tag code appears.
The query request can be a user quantity estimation request or a user group portrait request, if the query request is the user quantity estimation request, the number of documents meeting the target query condition can be determined through the inverted index, the number of the documents is the estimated user quantity, and therefore the number of the documents is used as a query result; and if the query request is a user group portrait request, determining the documents meeting the target query condition through the inverted index, and portraying the features of the user group through the label codes of the users in the documents to obtain a query result.
The following describes how ES searches data, taking three users A, B and C as an example. Assume their user portrait tags are: user A, gender male, loves food, loves travel; user B, gender female, loves travel, loves makeup; user C, gender female, loves fitness, loves makeup. Each record is one document in ES. Assuming the document numbers are 1, 2 and 3 respectively, ES parses every value in every record into words and builds an inverted index from words to documents. As shown in Table 1, "gender female" corresponds to documents 2 and 3, and "user A" corresponds to document 1.
Table 1. Inverted index from words to documents
Word             Document
User A           1
User B           2
User C           3
Gender female    [2, 3]
……               ……
If the data query request asks for the characteristics of female users, it is only necessary to find the documents indexed under the word "gender female" and count their user portrait tags, which gives the conclusion "loves makeup: 100%; loves fitness: 50%; loves travel: 50%". The inverted index (FST) is preferentially kept in memory; document IDs are retrieved through the inverted index, and further statistical computation is then performed on them. Combined conditions (that is, multiple query conditions, such as female users who love makeup) can be handled with skip lists or bitwise operations.
Data can be stored by row or by column. For statistical aggregation operations, ES uses columnar storage (doc values), which is built when the index is created and resides in the file system cache (if the file is too large, it is written to disk). Its compression mechanism can compress the data heavily, so document data can be accessed and computed on quickly.
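The three-user example can be reproduced with a toy inverted index. This is a minimal sketch of the idea only: it uses plain Python sets for the document lists, whereas the text mentions skip lists or bitwise operations for combining conditions, and the exact word strings are assumptions.

```python
from collections import defaultdict

# One document per user, as in the A/B/C example.
documents = {
    1: ["user A", "gender: male", "loves food", "loves travel"],
    2: ["user B", "gender: female", "loves travel", "loves makeup"],
    3: ["user C", "gender: female", "loves fitness", "loves makeup"],
}

# Build the inverted index: word -> set of document IDs.
inverted = defaultdict(set)
for doc_id, words in documents.items():
    for word in words:
        inverted[word].add(doc_id)


def matching_docs(*conditions):
    """Intersect the document sets of all conditions (logical AND)."""
    postings = [inverted.get(c, set()) for c in conditions]
    return set.intersection(*postings) if postings else set()


def tag_ratios(doc_ids, prefix="loves"):
    """Share of the matched documents that carry each preference tag."""
    counts = defaultdict(int)
    for doc_id in doc_ids:
        for word in documents[doc_id]:
            if word.startswith(prefix):
                counts[word] += 1
    return {word: n / len(doc_ids) for word, n in counts.items()}


female = matching_docs("gender: female")
ratios = tag_ratios(female)
print(sorted(female))  # [2, 3]
print(ratios["loves makeup"], ratios["loves fitness"], ratios["loves travel"])
```

Calling `matching_docs("gender: female", "loves fitness")` narrows the result to document 3, which is the combined-condition case.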
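The difference between the two layouts can be illustrated with a small sketch. It is conceptual only and does not reproduce how doc values are laid out or compressed internally; the field names and integer tag codes are hypothetical.

```python
# Row-oriented view: one record per document (user).
rows = [
    {"gender": 2, "device": 3},  # doc 0
    {"gender": 1, "device": 3},  # doc 1
    {"gender": 1, "device": 4},  # doc 2
]

# Column-oriented view: one array per field, indexed by document ID.
columns = {field: [row[field] for row in rows] for field in rows[0]}


def count_by_value(field, doc_ids):
    """Aggregate a single column over the matched documents only."""
    counts = {}
    column = columns[field]
    for doc_id in doc_ids:
        value = column[doc_id]
        counts[value] = counts.get(value, 0) + 1
    return counts


# Aggregate the "device" column for documents 1 and 2, e.g. the documents
# returned by an inverted-index lookup.
device_counts = count_by_value("device", [1, 2])
print(device_counts)  # {3: 1, 4: 1}
```

An aggregation touches only the column it needs, and a column of repeated small integers is far easier to compress than mixed row records.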
In an embodiment of the present application, the obtaining user data satisfying the target query condition from a search engine to obtain a query result includes: and if the data query request is a user quantity estimation request, determining documents meeting the target query condition from the inverted index of a search engine, counting the number of the documents, and taking the number of the documents as a query result.
The inverted index records the identifier of the document in which each tag code is located, so the documents satisfying the target query condition can be determined through the inverted index. One document corresponds to one user. If the data query request is a user quantity estimation request, the number of documents satisfying the target query condition is counted; this number is the number of users satisfying the target query condition and is used as the query result. Therefore, when the query request is a user quantity estimation request, the query result can be determined quickly through the inverted index of the search engine without querying a large amount of dispersedly stored user data, which improves query efficiency.
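A user quantity estimation request could be expressed as an Elasticsearch count query. The sketch below only builds the request body with the standard `bool`/`filter`/`term` query clauses. The field name `tag_codes` and the idea that each user document holds its codes in a single field are assumptions, since the application does not describe the index mapping.

```python
import json


def build_count_body(tag_codes, field="tag_codes"):
    """Build a bool/filter query that requires every tag code (logical AND)."""
    return {
        "query": {
            "bool": {
                "filter": [{"term": {field: code}} for code in tag_codes]
            }
        }
    }


# Target query condition: users carrying tag code 1 and tag code 3.
body = build_count_body([1, 3])
print(json.dumps(body, indent=2))
```

A body of this shape could be sent to the count API of the index holding the user documents. Filter clauses skip relevance scoring, which suits pure counting.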
In another embodiment of the present application, the obtaining user data satisfying the target query condition from a search engine to obtain a query result includes: if the data query request is a user group portrait request, determining documents meeting the target query condition from an inverted index of a search engine; and acquiring tag codes corresponding to the users from the document, performing user group portrait on each tag code, determining a user portrait tag corresponding to each tag code, and taking a user group portrait result corresponding to the user portrait tag as a query result.
The inverted index records the identifier of the document in which each tag code is located, so the documents satisfying the target query condition can be determined through the inverted index. One document corresponds to one user. If the data query request is a user group portrait request, then after the documents satisfying the target query condition are determined, the tag codes corresponding to the user are obtained from each determined document and user group portrait is performed on each tag code, that is, the statistical characteristics corresponding to each tag code are determined. The tag codes are then mapped back to user portrait tags according to the stored correspondence between user portrait tags and tag codes, and the user group portrait result corresponding to the user portrait tags is used as the query result. Documents satisfying the target query condition can be determined quickly through the inverted index of the search engine, and user group portrait is performed on those documents rather than on a large amount of dispersedly stored user data, which improves query efficiency.
In an embodiment of the present application, performing user group portrait on each tag code includes: counting, for each tag code, the user proportion, the target group index and the gain, where the gain is a statistical metric associated with the user proportion and the target group index; sorting the tag codes in descending order of gain; and using the user proportion, target group index and gain of the sorted tag codes as the user group portrait result.
User group portrait on each tag code is computed according to preset statistical metrics, which may include the user proportion, the Target Group Index (TGI) and the gain. The user proportion is the share of the dimension's effective users who carry the current tag code, that is, user proportion = 100 × (number of users with the current tag code) / (effective number of users of the dimension). The target group index is the ratio of the user proportion of the current tag code in the target group to its user proportion in the reference group, expressed on a scale where 100 means the two proportions are equal, and describes tendency. The gain is defined as TA × log(TGI/100), where TA is the concentration of the current tag in the target group, that is, the user proportion, and the logarithmic term expresses, from an information perspective, the information difference between the two groups. The effective number of users of a dimension is the number of users covered by the category of the current tag. For example, if the current tag is "male", it is the number of all users covered by the gender tags. If the values in a tag category (that is, the individual tags) are mutually exclusive, it is the sum of the user counts of all tags in the category; if they are not mutually exclusive, a category-level tag can be set, and the number of users carrying that category-level tag is the effective number of users of the dimension. The target group is the group formed by the users corresponding to the determined documents; the reference group is the group used for comparison and may be the group formed by all users.
The user proportion gives a share for each tag code, that is, for each portrait dimension, and can be presented as a table or chart, but it cannot express tendency. For example, the app most used by both high spenders and low spenders is WeChat, and most of the money spent in the preset app goes to takeout. TGI describes tendency. For example, if female users account for 50% of users nationwide and 80% of the users of the preset app, the proportion of female users in the preset app is 60% higher than nationwide (80/50 = 1.6), that is, the preset app's users skew female. Because the user proportion considers only volume and TGI considers only tendency, neither metric alone gives a comprehensive, high-level picture of a group's core characteristics. When there are tens of thousands of tag dimensions, sorting through the computed data by eye takes great effort and information may be missed. The gain combines volume and tendency and so addresses the limitation that the user proportion and TGI each consider only one of them. When TGI is less than 100, the gain is 0, indicating no tendency; when both the user proportion and TGI are high, the gain is high. Ranking by user proportion pushes undifferentiated tags to the top, but in the gain formula the TGI term of an undifferentiated dimension is very small, so its gain is small. Ranking by TGI pushes small-volume features to the top, but under the gain metric the user proportion in that case is very small. The gain therefore combines the strengths of both metrics and gives a global ranking of tag importance.
The gain index can rank the most important labels at the top and also can give the ranking of the importance of label categories.
When user group portrait is performed on each tag code, the user proportion, target group index and gain of each tag code are counted. Because the gain takes both volume and tendency into account, the tag codes are sorted in descending order of gain, and the user proportion, target group index and gain of the sorted tag codes are used as the user group portrait result. When the result is displayed, tags with high gain are shown first, so the user can see the importance of each tag at a glance.
It should be noted that the search engine stores the tag codes corresponding to each user in columnar storage. Because user quantity estimation and user group portrait do not need to know which specific user a set of user portrait tags belongs to, the search engine only needs to keep the tag codes of different users apart and does not need to store user identifiers, which further reduces the data volume. The number of ES nodes should be greater than: total data volume to be computed / (amount a single node computes in 1 second × number of cores of a single node × tolerable query time of one computation). The data can be stored in ES in shards, and the size of each shard can be set to about 10 GB, so that oversized shards do not degrade query performance.
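The three metrics and the gain-based ranking can be sketched as follows. This is a minimal sketch under stated assumptions: the logarithm base is not given in the text, so the natural logarithm is used; TGI is scaled so that 100 means no difference; and the tag codes and user counts in `stats` are made-up sample values.

```python
import math


def user_share(tag_users, dimension_users):
    """User proportion in percent: 100 * tag users / dimension effective users."""
    return 100.0 * tag_users / dimension_users


def tgi(target_share, reference_share):
    """Target group index, scaled so that 100 means no difference."""
    return 100.0 * target_share / reference_share


def gain(target_share, tgi_value):
    """gain = TA * log(TGI / 100); 0 when TGI is below 100 (no tendency)."""
    if tgi_value < 100:
        return 0.0
    # Assumption: natural logarithm; the text does not state the base.
    return target_share * math.log(tgi_value / 100.0)


def profile(stats):
    """Compute all three metrics per tag code and sort by gain, descending."""
    results = []
    for code, (t_users, t_dim, r_users, r_dim) in stats.items():
        ta = user_share(t_users, t_dim)
        index = tgi(ta, user_share(r_users, r_dim))
        results.append(
            {"code": code, "share": ta, "tgi": index, "gain": gain(ta, index)}
        )
    return sorted(results, key=lambda r: r["gain"], reverse=True)


# Hypothetical counts per tag code:
# (target users, target dimension users, reference users, reference dimension users)
stats = {
    101: (80, 100, 500, 1000),  # 80% vs 50%: large volume, clear tendency
    102: (20, 100, 500, 1000),  # 20% vs 50%: TGI below 100
    103: (5, 100, 10, 1000),    # 5% vs 1%: strong tendency, small volume
}
ranked = profile(stats)
for row in ranked:
    print(row)
```

With these sample values, code 101 ranks first because it is both large and skewed, code 103 follows despite its higher TGI because its volume is small, and code 102 ends with a gain of 0.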
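The sizing rule for the number of ES nodes can be written as a small calculation. This is a sketch of one reading of the rule: it assumes the per-second computation figure is a per-core throughput, and every number in the example is hypothetical rather than taken from the application.

```python
import math


def min_es_nodes(total_gb, gb_per_core_second, cores_per_node, tolerable_seconds):
    """Smallest node count strictly greater than total / per-node capacity."""
    capacity_per_node = gb_per_core_second * cores_per_node * tolerable_seconds
    return math.floor(total_gb / capacity_per_node) + 1


def shard_count(total_gb, shard_gb=10):
    """Number of shards needed when each shard holds about shard_gb of data."""
    return math.ceil(total_gb / shard_gb)


# Hypothetical figures: 2000 GB of encoded data, 0.5 GB per core per second,
# 16 cores per node, and a tolerable query time of 2 seconds.
nodes = min_es_nodes(2000, 0.5, 16, 2)
shards = shard_count(2000)
print(nodes, shards)  # 126 200
```

Each node can process 0.5 × 16 × 2 = 16 GB within the tolerable time, so 2000 GB needs more than 125 nodes.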
Step 140, outputting the query result.
The query result is output so that it can be shown on a display screen; for example, the query result is output to the client and displayed by the client.
In the user data query method provided by this embodiment of the application, a data query request including an initial query condition represented by at least one user portrait tag is obtained; at least one tag code corresponding to each of the at least one user portrait tag is determined, and the initial query condition is converted into a target query condition represented by the at least one tag code; user data satisfying the target query condition is acquired from a search engine to obtain a query result; and the query result is output. Because a tag code is the encoded data of a user portrait tag, the original data is compressed. Because the search engine stores the tag codes of all users in columnar storage, the data is compressed further. The data volume is therefore greatly reduced, so that queries can run over all user data, improving the accuracy of the query result while also improving query efficiency.
Example two
The user data query method provided by this embodiment is suitable for querying user group data. As shown in Fig. 2, the method includes steps 210 to 290.
Step 210, obtaining the full set of user portrait tags.
The user portrait tags corresponding to all users can be obtained and integrated to obtain the full set of user portrait tags.
The user profile tags may be tags corresponding to tag categories, i.e., one or more user profile tags may belong to a category, such that a user may first select a tag category to use and determine a user profile tag to use under the tag category when determining initial query conditions at the client.
In an embodiment of the present application, before the obtaining the full amount of user portrait tags, the method further includes: acquiring full user data; and determining user portrait tags corresponding to the users according to the full user data, and integrating the user portrait tags corresponding to the same users into a table.
The full-scale user data includes behavior data and attribute data of all users, the behavior data includes search data, page access data and the like, and the attribute data includes category information, city information and the like.
Because data association and deduplication are time-consuming, they are performed during preprocessing; their main purpose is to determine the user portrait tags of each user from the full user data. The full user data is obtained from all databases that store user data, and its volume is at the PB level. It is cleaned according to per-user indicators and converted into the user portrait tags of each user, such as "female", "takeout enthusiast" and "searches for hot pot". After conversion the data volume is at the TB level, and the user portrait tags of the same user are integrated into one table. When integrating, the data of the same user can be identified by a user identifier or a device identifier.
By converting the full amount of user data into the user portrait tags corresponding to the users, the data amount is greatly reduced, and by integrating the user portrait tags corresponding to the same user into one table, the follow-up processing can be facilitated, and the follow-up processing speed is improved.
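The integration step, in which the tags of the same user are merged into one table with duplicates removed, can be sketched as below. The record layout, the user identifiers and the tag strings are hypothetical; a real pipeline at PB scale would run this as a distributed job rather than in memory.

```python
from collections import defaultdict

# Hypothetical (user identifier, user portrait tag) records derived from
# different source databases.
records = [
    ("u1", "female"),
    ("u1", "takeout enthusiast"),
    ("u2", "male"),
    ("u1", "female"),  # duplicate coming from another source
    ("u2", "searches for hot pot"),
]


def integrate(records):
    """Merge the tags of the same user into one row, dropping duplicates."""
    table = defaultdict(list)
    for user_id, tag in records:
        if tag not in table[user_id]:
            table[user_id].append(tag)
    return dict(table)


user_table = integrate(records)
print(user_table)
```

Doing the association and deduplication once here is what lets the later query stage get by with counting alone.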
Step 220, encoding the user portrait tags to obtain the tag codes corresponding to the user portrait tags.
Encoding the user portrait tags compresses the data further and reduces the data volume. The encoding may be bitmap encoding or conversion of each user portrait tag to a number. Encoding reduces the amount of data stored on disk and keeps the in-memory index as small as possible, which speeds up queries.
In an embodiment of the application, the encoding the user portrait tag to obtain a tag code corresponding to the user portrait tag includes: respectively carrying out discretization processing on the full amount of user portrait labels to obtain discrete labels; and coding the discrete label to obtain a label code corresponding to the user portrait label.
Some user portrait tags are continuous data, for example a takeaway purchase ratio of 0.0666, and cannot be enumerated. Because user portrait tags must be discrete when they are encoded, the full set of user portrait tags has to be discretized. During discretization, only continuous user portrait tags are processed; tags that are already discrete need no processing. Discretization is done mainly by segmentation, and the segmentation can be determined according to business requirements. For example, a user portrait tag such as "takeaway purchase ratio" can be divided into three segments: low below 0.3, medium in the range 0.3-0.7, and high above 0.7. Discretizing the user portrait tags yields discrete labels, each of which can be enumerated.
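The segmentation described above can be sketched in a few lines. This is only an illustration: the function name `discretize`, the label format, and the handling of values that fall exactly on a boundary are assumptions, since the text gives only the three ranges.

```python
from bisect import bisect_right

# Segment boundaries from the "takeaway purchase ratio" example in the text.
# Assumption: a value exactly on a boundary goes to the upper segment.
BOUNDARIES = [0.3, 0.7]
SEGMENTS = ["low", "medium", "high"]

def discretize(tag_name, value, boundaries=BOUNDARIES, segments=SEGMENTS):
    """Map a continuous tag value to an enumerable discrete label."""
    segment = segments[bisect_right(boundaries, value)]
    return f"{tag_name}={segment}"

print(discretize("takeaway_purchase_ratio", 0.0666))  # takeaway_purchase_ratio=low
```

Business-specific boundaries can be passed per tag, so each continuous tag maps to a small, enumerable set of discrete labels.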
After the user portrait label is discretized to obtain the discrete label, the discrete label is encoded to obtain the label code corresponding to the user portrait label, so that the user portrait label is further compressed, and the data volume is further reduced.
The main purpose of integrating the data and discretizing the user portrait labels is that, during subsequent data queries in the ES search engine, only counting operations are needed and no association or deduplication of data has to be performed. This keeps the computation in ES simple and makes real-time calculation achievable.
In an embodiment of the present application, the encoding the discrete tag to obtain a tag code corresponding to a user portrait tag includes: carrying out bitmap coding on the discrete label to obtain a label code corresponding to the user portrait label; or distributing a digital identifier for the discrete label, and using the digital identifier as a label code corresponding to the user portrait label.
Bitmap encoding converts the discrete labels into a bitmap in which each discrete label occupies 1 bit.
A discrete label can be encoded by bitmap encoding or by digital encoding. Bitmap encoding converts the discrete label into a bitmap. Digital encoding assigns a digital identifier to each discrete label, and the digital identifier of a discrete label is used as the label code of the user portrait label corresponding to that discrete label. Under ideal conditions, bitmap encoding can achieve compression of tens of times. However, labels in real scenarios are very sparse, so the bitmaps contain many empty bits, and discrete labels converted into bitmaps are not well suited to columnar storage. Since digital encoding achieves a compression effect similar to that of bitmap encoding, digital encoding may be preferred.
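To make the bitmap option concrete, here is a minimal sketch in which a Python integer serves as the bitmap. The helper names `build_positions` and `to_bitmap` and the sample labels are hypothetical, and the order in which bit positions are assigned is arbitrary.

```python
def build_positions(discrete_labels):
    """Assign each discrete label a fixed bit position (order is arbitrary here)."""
    return {label: pos for pos, label in enumerate(discrete_labels)}

def to_bitmap(user_labels, positions):
    """Set one bit per discrete label that the user carries."""
    bitmap = 0
    for label in user_labels:
        bitmap |= 1 << positions[label]
    return bitmap

positions = build_positions(
    ["female", "takeaway_purchase_ratio=high", "searched_hot_pot", "city=beijing"]
)
bitmap = to_bitmap(["female", "searched_hot_pot"], positions)
print(format(bitmap, "04b"))  # bits 0 and 2 are set
```

With thousands of possible labels and only a handful per user, most bits in such a bitmap are zero. This is the sparsity that the text gives as a reason for preferring digital encoding.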
In one embodiment of the present application, the assigning a digital identifier to the discrete tag includes: and allocating the numerical identifications from small to large for the discrete labels according to the sequence of the number of the users covered by the discrete labels from high to low.
The discrete labels are translated into digital identifiers based on the number of users each discrete label covers: the more users a discrete label covers, the smaller its digital identifier. This achieves compression of several times. Experiments show that the compression effect of this scheme is similar to that of bitmap encoding, and the scheme is better suited to use in ES.
The original full user data is at the PB level. After data integration and discretization, the discrete labels are at the TB level, and after encoding, the encoded discrete labels are at the GB level. The data is thus compressed by several orders of magnitude, which solves the problem of the large data volume.
The integration, discretization and encoding of the data can be computed with Spark to improve the calculation speed.
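A minimal sketch of coverage-ordered digital encoding follows. The function name `assign_codes`, the starting identifier 1, and the alphabetical tie-break between labels with equal coverage are assumptions not stated in the text.

```python
from collections import Counter

def assign_codes(user_labels):
    """Give smaller numeric identifiers to labels that cover more users.

    user_labels: dict mapping user id -> list of discrete labels.
    Ties are broken alphabetically (an assumption, for determinism).
    """
    coverage = Counter(
        label for labels in user_labels.values() for label in set(labels)
    )
    ordered = sorted(coverage, key=lambda label: (-coverage[label], label))
    return {label: code for code, label in enumerate(ordered, start=1)}

users = {
    "u1": ["female", "takeaway_purchase_ratio=high"],
    "u2": ["female"],
    "u3": ["female", "takeaway_purchase_ratio=high", "searched_hot_pot"],
}
codes = assign_codes(users)
print(codes)
```

Because the most frequent labels receive the smallest numbers, the values stored most often are also the shortest, which is one plausible reason the scheme compresses well.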
Step 230, converting the user portrait labels corresponding to the user into label codes corresponding to the user according to the label codes corresponding to the full amount of user portrait labels.
After the label codes corresponding to the full amount of user portrait labels are obtained, the user portrait labels of all the users which are integrated together in advance can be converted, and the user portrait label corresponding to each user can be converted into the label code corresponding to the user.
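Step 230 amounts to a dictionary lookup per user. In the sketch below, `encode_users` and the sample data are hypothetical, and sorting each user's codes is an optional choice made only to keep the output stable.

```python
def encode_users(user_labels, label_codes):
    """Replace each user's portrait labels with the corresponding label codes."""
    return {
        user_id: sorted(label_codes[label] for label in labels)
        for user_id, labels in user_labels.items()
    }

label_codes = {"female": 1, "takeaway_purchase_ratio=high": 2, "searched_hot_pot": 3}
users = {
    "u1": ["takeaway_purchase_ratio=high", "female"],
    "u2": ["searched_hot_pot"],
}
encoded = encode_users(users, label_codes)
print(encoded)  # {'u1': [1, 2], 'u2': [3]}
```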
Step 240, storing the label codes corresponding to the users in the HDFS in columnar form.
In a columnar database, data is stored in logical storage units organized by column, and the data of one column is stored contiguously on the storage medium. HDFS (Hadoop Distributed File System) is a distributed file system designed to run on commodity hardware. HDFS is highly fault-tolerant and suitable for deployment on inexpensive machines; it provides high-throughput data access and is well suited to applications with large data sets.
The label codes corresponding to the users are stored in the HDFS in columnar form so that they can then be written into the search engine in columnar form.
Step 250, writing the label codes corresponding to the users in the HDFS into a search engine, and creating an inverted index.
The label codes corresponding to the users in the HDFS are written into the search engine in columnar form, and an inverted index is created during the writing process. The inverted index records, for each user portrait label, the documents in which that label is stored.
ES builds the index while data is being written, so writing TB-level data takes a long time. Integration, discretization and encoding reduce the data to the GB level, which alleviates the difficulty of writing data into ES and shortens the write time. The disk of the electronic device that executes the user data query method may be a PCIe SSD to obtain better read-write performance.
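The idea of the inverted index can be illustrated with an in-memory model. This is not how the search engine implements it internally; the function name `build_inverted_index` and the assumption that one document corresponds to one user are made here for illustration only.

```python
from collections import defaultdict

def build_inverted_index(user_codes):
    """Map each label code to the set of documents (here: user ids) containing it."""
    index = defaultdict(set)
    for doc_id, codes in user_codes.items():
        for code in codes:
            index[code].add(doc_id)
    return dict(index)

user_codes = {"u1": [1, 2], "u2": [1], "u3": [1, 2, 3]}
inverted_index = build_inverted_index(user_codes)
print(inverted_index[2])  # the documents that contain label code 2
```

Looking up a label code returns the matching documents directly, without scanning every user record.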
Step 260, obtaining a data query request, the data query request including an initial query condition represented by at least one user portrait tag.
Step 270, determining at least one label code corresponding to each of the at least one user portrait label, and converting the initial query condition into a target query condition represented by the at least one label code.
Step 280, obtaining the user data meeting the target query condition from the search engine to obtain the query result.
Step 290, outputting the query result.
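Steps 260 to 290 can be sketched end to end as follows. The sketch assumes that the initial query condition is a plain list of user portrait tags combined with AND, and that the query is a user-count estimate; the text does not fix the condition syntax, and all function names here are hypothetical.

```python
def to_target_condition(initial_condition, label_codes):
    """Step 270: replace each user portrait tag in the condition with its label code."""
    return [label_codes[tag] for tag in initial_condition]

def count_matching_users(target_condition, inverted_index):
    """Step 280: intersect the posting sets and count the matching documents."""
    postings = [inverted_index.get(code, set()) for code in target_condition]
    if not postings:
        return 0
    return len(set.intersection(*postings))

label_codes = {"female": 1, "takeaway_purchase_ratio=high": 2, "searched_hot_pot": 3}
inverted_index = {1: {"u1", "u2", "u3"}, 2: {"u1", "u3"}, 3: {"u3"}}

request = ["female", "takeaway_purchase_ratio=high"]       # step 260
target = to_target_condition(request, label_codes)         # step 270
result = count_matching_users(target, inverted_index)      # step 280
print(result)                                              # step 290
```

Only set intersection and counting are involved at query time, which matches the text's point that no association or deduplication is needed once the data has been preprocessed.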
It should be noted that, when the amount of data is large, the number of cluster nodes and the number of shards (the minimum unit blocks of data) may also be tuned so that query efficiency is improved through greater parallelism. Because the index acts as a filter, if the crowd defined by the user is small, the amount of computation is correspondingly small, since aggregation is performed only on the documents the index returns. If the defined crowd is large, a large number of documents or even all documents must be fetched for computation and the amount of computation is large; in that case query efficiency can be improved by tuning the number of cluster nodes and shards.
With the user data query method provided by this embodiment, the magnitude of the data is reduced through integration, discretization and encoding, and the compression capability of the search engine's columnar storage is exploited, so that data can be retrieved directly from the cache at query time. This avoids the slow retrieval that occurs when a large data volume exceeds the cache limit and spills to disk, and thus improves retrieval speed. Because only the label codes corresponding to the users are stored in the search engine, aggregation requires only counting operations, which greatly reduces the computation and further improves query efficiency. Query latency reaches the level of seconds, achieving the goal of real-time query.
EXAMPLE III
As shown in fig. 3, the apparatus 300 for querying user data according to this embodiment includes:
a query request obtaining module 310, configured to obtain a data query request, where the data query request includes an initial query condition represented by at least one user portrait tag;
a query condition conversion module 320, configured to determine at least one tag code corresponding to each of the at least one user portrait tag, and convert the initial query condition into a target query condition represented by the at least one tag code, where the tag code is encoded data of the user portrait tag;
the query module 330 is configured to obtain user data meeting the target query condition from a search engine to obtain a query result; the search engine stores tag codes corresponding to various users in a columnar storage mode;
and the query result output module 340 is configured to output the query result.
Optionally, the apparatus further comprises:
a full label acquisition module, configured to acquire the full set of user portrait labels;
the label coding module is used for coding the user portrait label to obtain a label code corresponding to the user portrait label;
the user portrait label coding determining module is used for respectively converting the user portrait labels corresponding to the user into label codes corresponding to the user according to the label codes corresponding to the full amount of user portrait labels;
the HDFS storage module is used for storing the label codes corresponding to the users into the HDFS in a column storage mode;
and the search engine writing module is used for writing the label code corresponding to the user in the HDFS into a search engine and creating an inverted index.
Optionally, the tag encoding module includes:
the discretization processing unit is used for performing discretization processing on the full amount of user portrait labels respectively to obtain discrete labels;
and the label coding unit is used for coding the discrete label to obtain a label code corresponding to the user portrait label.
Optionally, the tag encoding unit includes:
the bitmap coding subunit is used for carrying out bitmap coding on the discrete label to obtain a label code corresponding to the user portrait label; or
the digital coding subunit is used for distributing a digital identifier to the discrete label and using the digital identifier as a label code corresponding to the user portrait label.
Optionally, the digital coding subunit is specifically configured to:
and allocating the numerical identifications from small to large for the discrete labels according to the sequence of the number of the users covered by the discrete labels from high to low.
Optionally, the apparatus further comprises:
a full data acquisition module, configured to acquire the full user data;
and the label integration module is used for determining the user portrait labels corresponding to the users according to the full amount of user data and integrating the user portrait labels corresponding to the same user into one table.
Optionally, the query module includes:
and the user quantity estimation unit is used for determining the documents meeting the target query condition from the inverted index of the search engine if the data query request is the user quantity estimation request, counting the number of the documents and taking the number of the documents as a query result.
Optionally, the query module includes:
the document determining unit is used for determining documents meeting the target query condition from the inverted index of a search engine if the data query request is a user group portrait request;
and the user group portrait unit is used for acquiring the label codes corresponding to the users from the document, performing user group portrait on each label code, determining the user portrait label corresponding to each label code, and taking the user group portrait result corresponding to the user portrait label as a query result.
Optionally, the user group portrait unit is specifically configured to:
respectively counting a user ratio, a target group index and a gain degree corresponding to each label code, wherein the gain degree is a statistical metric associated with the user ratio and the target group index;
sorting the label codes in descending order of gain degree;
and using the user proportion, the target group index and the gain degree corresponding to the sorted label codes as the user group portrait result.
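The three statistics listed above could be computed roughly as in the sketch below. Two points are assumptions rather than content of this application: the target group index (TGI) is computed with its commonly used definition (the label's share within the matched group divided by its share among all users, times 100), and the gain degree formula is purely hypothetical, because the text says only that it is associated with the user ratio and the target group index.

```python
from collections import Counter

def group_portrait(matched_docs, all_docs):
    """matched_docs / all_docs: dict of doc id -> list of label codes.

    Assumes matched_docs is a subset of all_docs.
    """
    target_counts = Counter(c for codes in matched_docs.values() for c in set(codes))
    overall_counts = Counter(c for codes in all_docs.values() for c in set(codes))
    rows = []
    for code, count in target_counts.items():
        user_ratio = count / len(matched_docs)
        overall_ratio = overall_counts[code] / len(all_docs)
        tgi = user_ratio / overall_ratio * 100   # common TGI definition (assumption)
        gain = user_ratio * tgi / 100            # hypothetical gain degree formula
        rows.append({"code": code, "user_ratio": user_ratio, "tgi": tgi, "gain": gain})
    rows.sort(key=lambda row: row["gain"], reverse=True)  # descending gain degree
    return rows

all_docs = {"u1": [1, 2], "u2": [1], "u3": [1, 2, 3], "u4": [3]}
matched_docs = {"u1": [1, 2], "u3": [1, 2, 3]}
portrait = group_portrait(matched_docs, all_docs)
print([row["code"] for row in portrait])
```

Each label code in the sorted result would then be mapped back to its user portrait label before the result is output.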
The user data query device provided in the embodiment of the present application is configured to implement each step of the user data query method described in the embodiment of the present application, and specific implementation of each module of the device refers to the corresponding step, which is not described herein again.
In the user data query device provided by this embodiment of the application, the query request obtaining module obtains a data query request that includes an initial query condition represented by at least one user portrait label. The query condition conversion module determines at least one label code corresponding to each of the at least one user portrait label and converts the initial query condition into a target query condition represented by the at least one label code. The query module obtains user data meeting the target query condition from a search engine to obtain a query result, and the query result output module outputs the query result. Because the label codes are encoded data of the user portrait labels, the original data is compressed. Because the search engine stores the label codes corresponding to all users in columnar form, the data is compressed further by the columnar storage and the data volume is greatly reduced. The query can therefore be carried out on the data of all users, which improves the accuracy of the query result while also improving query efficiency.
EXAMPLE IV
Embodiments of the present application also provide an electronic device, as shown in fig. 4, the electronic device 400 may include one or more processors 410 and one or more memories 420 connected to the processors 410. Electronic device 400 may also include input interface 430 and output interface 440 for communicating with another apparatus or system. Program code executed by processor 410 may be stored in memory 420.
The processor 410 in the electronic device 400 calls the program code stored in the memory 420 to perform the query method of the user data in the above-described embodiment.
The above elements in the above electronic device may be connected to each other by a bus, such as one of a data bus, an address bus, a control bus, an expansion bus, and a local bus, or any combination thereof.
Embodiments of the present application also provide a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of the query method for user data according to the embodiments of the present application.
The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. For the device embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, refer to the partial description of the method embodiment.
The above detailed description is given to a method, an apparatus, an electronic device, and a storage medium for querying user data provided in the embodiments of the present application, and a specific example is applied in the present application to explain the principle and the implementation of the present application, and the description of the above embodiments is only used to help understand the method and the core idea of the present application; meanwhile, for a person skilled in the art, according to the idea of the present application, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present application.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.

Claims (12)

1. A method for querying user data, comprising:
obtaining a data query request, wherein the data query request comprises an initial query condition represented by at least one user portrait tag;
determining at least one label code corresponding to each user portrait label, and converting the initial query condition into a target query condition represented by the at least one label code, wherein the label code is the encoded data of the user portrait label;
acquiring user data meeting the target query condition from a search engine to obtain a query result, wherein the search engine stores tag codes corresponding to various users in a column storage mode;
and outputting the query result.
2. The method of claim 1, prior to said determining at least one label encoding to which said at least one user portrait label respectively corresponds, further comprising:
obtaining a full amount of user portrait tags;
encoding the user portrait label to obtain a label code corresponding to the user portrait label;
respectively converting the user portrait labels corresponding to the user into label codes corresponding to the user according to the label codes corresponding to the full amount of user portrait labels;
storing label codes corresponding to users into a Hadoop Distributed File System (HDFS) in a columnar storage mode;
and writing the label code corresponding to the user in the HDFS into a search engine, and creating an inverted index.
3. The method of claim 2, wherein said encoding the user portrait tag to obtain a tag code corresponding to the user portrait tag comprises:
respectively carrying out discretization processing on the full amount of user portrait labels to obtain discrete labels;
and coding the discrete label to obtain a label code corresponding to the user portrait label.
4. The method of claim 2, wherein said encoding the discrete tag to obtain a tag code corresponding to a user portrait tag comprises:
carrying out bitmap coding on the discrete label to obtain a label code corresponding to the user portrait label; or
distributing a digital identifier for the discrete label, and using the digital identifier as a label code corresponding to the user portrait label.
5. The method of claim 4, wherein said assigning a digital identification to said discrete tag comprises:
and allocating the numerical identifications from small to large for the discrete labels according to the sequence of the number of the users covered by the discrete labels from high to low.
6. The method of claim 2, further comprising, prior to said obtaining a full amount of user portrait tags:
acquiring full user data;
and determining user portrait tags corresponding to the users according to the full user data, and integrating the user portrait tags corresponding to the same users into a table.
7. The method according to any one of claims 1 to 6, wherein the obtaining user data satisfying the target query condition from a search engine to obtain a query result comprises:
and if the data query request is a user quantity estimation request, determining documents meeting the target query condition from the inverted index of a search engine, counting the number of the documents, and taking the number of the documents as a query result.
8. The method according to any one of claims 1 to 6, wherein the obtaining user data satisfying the target query condition from a search engine to obtain a query result comprises:
if the data query request is a user group portrait request, determining documents meeting the target query condition from an inverted index of a search engine;
and acquiring tag codes corresponding to the users from the document, performing user group portrait on each tag code, determining a user portrait tag corresponding to each tag code, and taking a user group portrait result corresponding to the user portrait tag as a query result.
9. The method of claim 8, wherein said performing user group portrait on each tag code comprises:
respectively counting a user ratio, a target group index and a gain degree corresponding to each label code, wherein the gain degree is a statistical metric associated with the user ratio and the target group index;
sorting the label codes in descending order of gain degree;
and using the user proportion, the target group index and the gain degree corresponding to the sorted label codes as the user group portrait result.
10. An apparatus for querying user data, comprising:
a query request acquisition module for acquiring a data query request, the data query request including an initial query condition represented by at least one user portrait tag;
the query condition conversion module is used for determining at least one label code corresponding to the at least one user portrait label respectively and converting the initial query condition into a target query condition represented by the at least one label code, wherein the label code is the coded data of the user portrait label;
the query module is used for acquiring the user data meeting the target query condition from a search engine to obtain a query result; the search engine stores tag codes corresponding to various users in a columnar storage mode;
and the query result output module is used for outputting the query result.
11. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the method of querying user data according to any one of claims 1 to 9 when executing the computer program.
12. A computer-readable storage medium, on which a computer program is stored, which program, when being executed by a processor, carries out the steps of the method for querying user data according to any one of claims 1 to 9.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010404565.1A CN111782686A (en) 2020-05-13 2020-05-13 User data query method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN111782686A true CN111782686A (en) 2020-10-16

Family

ID=72753921

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010404565.1A Withdrawn CN111782686A (en) 2020-05-13 2020-05-13 User data query method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111782686A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112182021A (en) * 2020-11-03 2021-01-05 浙江大搜车软件技术有限公司 User data query method, device and system
CN112214497A (en) * 2020-10-28 2021-01-12 上海豹云网络信息服务有限公司 Label processing method and device and computer system
CN113297617A (en) * 2021-05-26 2021-08-24 杭州安恒信息技术股份有限公司 Authority data acquisition method and device, computer equipment and storage medium
CN113760915A (en) * 2021-09-07 2021-12-07 百果园技术(新加坡)有限公司 Data processing method, device, equipment and medium
CN114238312A (en) * 2021-11-26 2022-03-25 上海维智卓新信息科技有限公司 User portrait determination method and device based on bitmap calculation

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108415978A (en) * 2018-02-09 2018-08-17 北京腾云天下科技有限公司 User tag storage method, user's portrait computational methods and computing device
CN110020086A (en) * 2017-12-22 2019-07-16 中国移动通信集团浙江有限公司 A kind of user draws a portrait querying method and device
CN110955646A (en) * 2019-11-29 2020-04-03 北京达佳互联信息技术有限公司 Data storage and query method, device, equipment and medium

Similar Documents

Publication Publication Date Title
CN111782686A (en) User data query method and device, electronic equipment and storage medium
CN108256119B (en) Resource recommendation model construction method and resource recommendation method based on model
CN109992645B (en) Data management system and method based on text data
WO2021068610A1 (en) Resource recommendation method and apparatus, electronic device and storage medium
US8990241B2 (en) System and method for recommending queries related to trending topics based on a received query
WO2017097231A1 (en) Topic processing method and device
CN102521233B (en) Adapting to image searching database
CN111008321B (en) Logistic regression recommendation-based method, device, computing equipment and readable storage medium
Shi et al. Learning-to-rank for real-time high-precision hashtag recommendation for streaming news
CN108021651B (en) Network public opinion risk assessment method and device
CN110647512B (en) Data storage and analysis method, device, equipment and readable medium
CN105426514A (en) Personalized mobile APP recommendation method
CN108846021B (en) Mass small file storage method based on user access preference model
CN111723260B (en) Recommended content acquisition method and device, electronic equipment and readable storage medium
Sisodia et al. Fast prediction of web user browsing behaviours using most interesting patterns
US9552415B2 (en) Category classification processing device and method
CN111310032A (en) Resource recommendation method and device, computer equipment and readable storage medium
CN110795613A (en) Commodity searching method, device and system and electronic equipment
WO2017201905A1 (en) Data distribution method and device, and storage medium
CN110874366A (en) Data processing and query method and device
CN108009847A (en) The method for taking out shop embedding feature extractions under scene
CN112100177A (en) Data storage method and device, computer equipment and storage medium
KR101823463B1 (en) Apparatus for providing researcher searching service and method thereof
CN107315807B (en) Talent recommendation method and device
JP6260678B2 (en) Information processing apparatus, information processing method, and information processing program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20201016

WW01 Invention patent application withdrawn after publication