WO2017121272A1 - Method and apparatus for processing user behavior data - Google Patents

Method and apparatus for processing user behavior data

Info

Publication number
WO2017121272A1
Authority
WO
WIPO (PCT)
Prior art keywords
search
data set
dimension
user
search term
Prior art date
Application number
PCT/CN2017/070150
Other languages
English (en)
French (fr)
Inventor
周强
Original Assignee
阿里巴巴集团控股有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 阿里巴巴集团控股有限公司
Publication of WO2017121272A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/955: Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/951: Indexing; Web crawling techniques
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/958: Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking

Definitions

  • The present invention relates to the field of computers, and in particular to a method and an apparatus for processing user behavior data.
  • Users of Internet products (for example, shoppers on a portal website) generate a large amount of structured data.
  • Merchants often use this structured data for crowd targeting in order to analyze user interests. For example, DMP tag-based crowd targeting uses users' basic information and basic behaviors to tag them, and then pushes advertisements or applications to the targeted user group.
  • Embodiments of the invention provide a method and an apparatus for processing user behavior data, so as to at least solve the technical problem that crowd targeting based on structured data alone yields insufficiently accurate results.
  • A method for processing user behavior data, including: acquiring user behavior data, where the user behavior data includes an access data set generated after a plurality of users access a target object, and the access data set includes at least data sets in the following three dimensions: a keyword set, an attribute information set, and a classification information set;
  • determining each user's preference score for the retrieval items contained in the data set in each dimension, where the data set in each dimension contains at least one retrieval item;
  • after a search term to be located is acquired, querying according to the search term to obtain a plurality of positioning retrieval items that correspond to the search term, and obtaining, for each positioning retrieval item, the weight value of the data set in each dimension;
  • calculating, from the preference scores of the retrieval items contained in the data set in each dimension and the weight values of the data set in each dimension corresponding to each positioning retrieval item, a behavior weight value determined by the coupling relationship between each user and the search term;
  • determining, according to the behavior weight value determined by the coupling relationship between each user and the search term, the user group targeted by the search term to be located.
  • An apparatus for processing user behavior data, including: a first acquiring unit, configured to acquire user behavior data, where the user behavior data includes an access data set generated after a plurality of users access a target object, and the access data set includes at least data sets in the following three dimensions: a keyword set, an attribute information set, and a classification information set;
  • a first determining unit, configured to determine each user's preference score for the retrieval items contained in the data set in each dimension, where the data set in each dimension contains at least one retrieval item;
  • a second acquiring unit, configured to, after a search term to be located is acquired, query according to the search term to obtain a plurality of positioning retrieval items that correspond to the search term, and obtain, for each positioning retrieval item, the weight value of the data set in each dimension;
  • a third acquiring unit, configured to calculate, from the preference scores of the retrieval items contained in the data set in each dimension and the weight values of the data set in each dimension corresponding to each positioning retrieval item, a behavior weight value determined by the coupling relationship between each user and the search term;
  • a second determining unit, configured to determine, according to the behavior weight value determined by the coupling relationship between each user and the search term, the user group targeted by the search term to be located.
  • In the embodiments of the invention, user behavior data is acquired, where the user behavior data includes an access data set generated after a plurality of users access a target object, and the access data set includes at least data sets in the following three dimensions: a keyword set, an attribute information set, and a classification information set; each user's preference score for the retrieval items contained in the data set in each dimension is determined, where the data set in each dimension contains at least one retrieval item; after a search term to be located is acquired, a plurality of positioning retrieval items corresponding to the search term are obtained by query, together with the weight value of the data set in each dimension for each positioning retrieval item; the behavior weight value determined by the coupling relationship between each user and the search term is calculated from the preference scores and the weight values; and the user group targeted by the search term to be located is determined according to that behavior weight value.
  • FIG. 1 is a block diagram of the hardware structure of a computer terminal for a method of processing user behavior data according to an embodiment of the present invention.
  • FIG. 2 is a flowchart of a method for processing user behavior data according to an embodiment of the present invention
  • FIG. 3 is a schematic diagram of an alternative method of processing user behavior data according to an embodiment of the present invention.
  • FIG. 4 is a schematic diagram of an alternative method of processing user behavior data according to an embodiment of the present invention.
  • FIG. 5 is a schematic structural diagram of a device for processing user behavior data according to an embodiment of the present invention.
  • FIG. 6 is a schematic structural diagram of an apparatus for processing user behavior data according to an embodiment of the present invention.
  • FIG. 7 is a schematic structural diagram of an apparatus for processing user behavior data according to an embodiment of the present invention.
  • FIG. 8 is a schematic structural diagram of an optional apparatus for processing user behavior data according to an embodiment of the present invention.
  • FIG. 9 is a block diagram of the hardware structure of a computer terminal for a method of processing user behavior data according to an embodiment of the present invention.
  • ETL is an abbreviation of English Extract-Transform-Load, which describes the process of extracting, transforming, and loading data from the source to the destination.
  • the term ETL is more commonly used in data warehousing, but its objects are not limited to data warehousing.
  • ETL is an important part of building a data warehouse. Users extract the required data from the data source, clean it, and finally load it into the data warehouse according to a predefined data warehouse model.
  • LR: logistic regression, a commonly used linear classifier.
  • SVM: support vector machine.
  • Lucene is a subproject of the Jakarta project of the Apache Software Foundation. It is an open-source full-text search toolkit. It is not a complete full-text search engine but a full-text search engine architecture that provides a complete query engine and index engine, and part of a text analysis engine (for two Western languages, English and German).
  • An embodiment of a method for processing user behavior data is also provided. It should be noted that the steps illustrated in the flowcharts of the figures may be performed in a computer system, such as one executing a set of computer-executable instructions, and that, although a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed in a different order.
  • FIG. 1 is a block diagram of the hardware structure of a computer terminal for a method of processing user behavior data according to an embodiment of the present invention.
  • The computer terminal 10 may include one or more processors 102 (only one is shown; a processor 102 may include, but is not limited to, a processing device such as a microcontroller (MCU) or a programmable logic device (FPGA)), a memory 104 for storing data, and a transmission module 106 for communication functions.
  • The computer terminal 10 may also include more or fewer components than those shown in FIG. 1, or have a configuration different from that shown in FIG. 1.
  • The memory 104 can be used to store software programs and modules of application software, such as the program instructions/modules corresponding to the method for processing user behavior data in the embodiment of the present invention. The processor 102 runs the software programs and modules stored in the memory 104, thereby performing various functional applications and data processing, that is, implementing the above method for processing user behavior data.
  • Memory 104 may include high speed random access memory, and may also include non-volatile memory such as one or more magnetic storage devices, flash memory, or other non-volatile solid state memory.
  • memory 104 may further include memory remotely located relative to processor 102, which may be coupled to computer terminal 10 via a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
  • the transmission module 106 is configured to receive or transmit data via a network.
  • the network specific examples described above may include a wireless network provided by a communication provider of the computer terminal 10.
  • The transmission module 106 includes a Network Interface Controller (NIC) that can connect to other network devices through a base station and thereby communicate with the Internet.
  • the transmission module 106 can be a Radio Frequency (RF) module for communicating with the Internet wirelessly.
  • Step S22: acquire user behavior data.
  • The user behavior data includes an access data set generated after a plurality of users access a target object, and the access data set includes at least data sets in the following three dimensions: a keyword set, an attribute information set, and a classification information set.
  • The user may be a visiting user (USER) of a portal website (such as a shopping website), and the target object may be a product (ITEM) on the portal website; an ITEM may be a commodity, a video, a piece of music, and so on.
  • Each access data set obtained by the website server can be described in three dimensions: the category CATEGORY (the classification information above), which describes the classification of the product ITEM; the attribute PROPERTY, which expresses the product ITEM's own characteristics; and the keyword KEYWORD, which represents the name of the product ITEM.
  • Each keyword can carry a term frequency or a TF-IDF weight. Note that, of the three dimensions describing a product ITEM, each ITEM can have only one category CATEGORY but can have multiple attributes PROPERTY.
  • The solution can statistically summarize users' raw behavior data through a targeted supervised learning algorithm (such as LR or SVM), and then decompose each USER's behavior on ITEMs into the above three dimensions.
  • The data specification of the product ITEM in this solution may be as shown in Table 1.
  • The data specification of the user USER's behavior may be as shown in Table 2.
  • Product classifications may include beauty and makeup, mother and baby, food, video, songs, and the like, and the user can operate on specific products under each classification.
  • For example, the user USER may click the "Stephen Chow Movie" index button under the movie category on a TB page. The target object selected by the user USER is then the "Stephen Chow Movie" product, which can be described in three dimensions (category, attribute, keyword): its category is movie, its attribute is video, and its keyword is Stephen Chow.
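As an illustration of the three-dimension description above, a product ITEM could be modeled roughly as follows. This is a minimal sketch in Python, not the patent's data specification: the class name `Item`, its field names, the identifier `item-001`, the placeholder weight, and the helper `retrieval_items` are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Item:
    """A target object (ITEM) described in three dimensions (hypothetical model)."""
    item_id: str
    category: str                                           # exactly one CATEGORY per ITEM
    properties: List[str] = field(default_factory=list)     # zero or more PROPERTY values
    keywords: Dict[str, float] = field(default_factory=dict)  # KEYWORD -> term frequency or TF-IDF weight


def retrieval_items(item: Item) -> List[Tuple[str, str]]:
    """Flatten an ITEM into (dimension, retrieval item) pairs."""
    pairs = [("CATEGORY", item.category)]
    pairs += [("PROPERTY", p) for p in item.properties]
    pairs += [("KEYWORD", k) for k in item.keywords]
    return pairs


# The "Stephen Chow Movie" example from the text; id and weight are placeholders.
movie = Item(
    item_id="item-001",
    category="movie",
    properties=["video"],
    keywords={"Stephen Chow": 1.0},
)
pairs = retrieval_items(movie)
```

Keeping `category` as a single string while `properties` is a list mirrors the constraint that an ITEM has one CATEGORY but may have several PROPERTY values.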
  • Step S24: determine each user's preference score for the retrieval items contained in the data set in each dimension.
  • The data set in each dimension contains at least one retrieval item.
  • Each dimension may include a plurality of retrieval items, which may be multiple attributes of that dimension. The user may operate on specific retrieval items, and the solution can then determine the user's preference score for each retrieval item according to the user's specific operations on it.
  • Take again the example in which the user USER accesses the shopping website TB.
  • The category CATEGORY of the "Stephen Chow Movie" product is "movie", and the category "movie" may include a first retrieval item "domestic movie", a second retrieval item "comedy movie", and so on.
  • The attribute PROPERTY of the "Stephen Chow Movie" product is "video", and the attribute "video" may include a third retrieval item "HD video".
  • The retrieval item under the keyword KEYWORD of the product may be the keyword itself.
  • The user USER can perform any operation on the plurality of retrieval items, such as the first, second, third, and fourth retrieval items. The solution can determine the user's preference score for each of these retrieval items based on the user's specific operation behavior on them (for example, the number of operations).
  • Step S26: after a search term to be located is acquired, query according to the search term to obtain a plurality of positioning retrieval items that correspond to the search term, and obtain, for each positioning retrieval item, the weight value of the data set in each dimension.
  • In step S26, the operator of the website wishes to achieve crowd targeting by means of a search term; that is, the operator wishes to select the one or more users who are interested in a search term A, in other words, to locate a user group according to the search term.
  • The operator can then push data to, or analyze, the located user group. For example, after a certain word is chosen as the search term for locating the interests of different consumer groups, advertisement information related to the search term can be pushed to the users in the same group.
  • Optionally, the operator of the website can input the search term to be located directly to the server, or can provide a text to the server, from which the server obtains the search term to be located through word segmentation and filtering.
  • The search term input by the operator can also be described in three dimensions, and each dimension may include a plurality of positioning retrieval items. Note that the items under each of the three dimensions describing the search term to be located are called "positioning retrieval items", whereas the items under each of the three dimensions describing the products accessed by users are called "retrieval items".
  • The solution may expand the search term, by query, into a plurality of positioning retrieval items TERM corresponding to it; these positioning retrieval items TERM fall within the three dimensions used to describe the search term.
  • The solution can obtain the weight value in each dimension corresponding to each positioning retrieval item TERM by a preset algorithm. Note that the operator's goal is to group the users who are interested in the search term.
  • For example, the operator of the shopping website TB may input a text TXT to the website server, and the data processing terminal may perform word segmentation on the text TXT.
  • After the data processing terminal queries the plurality of positioning retrieval items TERM corresponding to "Stephen Chow Movie", the weight value in each dimension corresponding to each positioning retrieval item TERM can be obtained by a preset algorithm.
  • The text TXT input by the website service provider may be text describing a related product of the website, and the solution may perform word segmentation on it to obtain the above search term.
  • Step S28: calculate, from the preference scores of the retrieval items contained in the data set in each dimension and the weight values of the data set in each dimension corresponding to each positioning retrieval item, the behavior weight value determined by the coupling relationship between each user and the search term.
  • The solution may calculate this behavior weight value from the preference scores obtained in step S24 and the weight values obtained in step S26. Note that the behavior weight value represents each user's degree of interest in the search term to be located that was input by the website operator.
  • A coupling relationship between a user and the search term can arise from the user's operations (clicking, browsing, downloading, and so on) on the search term in the website. For example, when the user clicks on the search term, a first coupling relationship may be generated.
  • The first coupling relationship may represent the user's degree of interest in the search term: the more the user clicks, the stronger the first coupling relationship, the larger the behavior weight value determined from it, and the greater the user's interest in the search term.
  • Take again the example in which the user USER accesses the shopping website TB. Based on the search term "Stephen Chow Movie" input by the website operator, the data processing terminal of the website server can query a plurality of positioning retrieval items corresponding to "Stephen Chow Movie" and calculate a first weight value of each positioning retrieval item for its associated dimension. It can then obtain the user USER's preference score for each retrieval item of the product "Stephen Chow Movie" on the TB website, and calculate the behavior weight value from the first weight values and the preference scores.
  • The behavior weight value can be used to characterize the user's interest in "Stephen Chow Movie".
  • Step S30: determine, according to the behavior weight value determined by the coupling relationship between each user and the search term, the user group targeted by the search term to be located.
  • The solution may select, according to the size of each user's behavior weight value, the users that meet a predetermined condition, and determine those users as the user group associated with the search term.
  • Alternatively, the users whose behavior weight value is greater than 0 may be determined as the user group. After the user group of the search term has been determined, the operator may push relevant advertisement information to each user in the group.
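The selection in step S30 can be sketched as a simple filter over per-user behavior weight values. The function name `select_user_group` and the `threshold` and `top_n` parameters are hypothetical; the text only states that users meeting a predetermined condition, or users with a weight greater than 0, are chosen.

```python
from typing import Dict, List, Optional


def select_user_group(
    behavior_weights: Dict[str, float],
    threshold: float = 0.0,
    top_n: Optional[int] = None,
) -> List[str]:
    """Return users whose behavior weight exceeds `threshold`, strongest first.

    `top_n` is an assumed extra condition that caps the size of the group.
    """
    selected = [(user, w) for user, w in behavior_weights.items() if w > threshold]
    selected.sort(key=lambda pair: pair[1], reverse=True)
    if top_n is not None:
        selected = selected[:top_n]
    return [user for user, _ in selected]


weights = {"u1": 0.0, "u2": 0.7, "u3": 0.2}   # illustrative values only
group = select_user_group(weights)
best = select_user_group(weights, top_n=1)
```

With the default threshold of 0 this reproduces the "weight greater than 0" rule; a stricter threshold or a size cap stands in for other predetermined conditions.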
  • In this embodiment, the solution acquires user behavior data, where the user behavior data includes an access data set generated after a plurality of users access a target object, and the access data set includes at least data sets in the following three dimensions: a keyword set, an attribute information set, and a classification information set. It then determines each user's preference score for the retrieval items contained in the data set in each dimension, where the data set in each dimension contains at least one retrieval item. Next, after a search term to be located is acquired, it queries a plurality of positioning retrieval items corresponding to the search term and obtains, for each positioning retrieval item, the weight value of the data set in each dimension. It then calculates, from the preference scores and the weight values, the behavior weight value determined by the coupling relationship between each user and the search term. Finally, it determines, according to that behavior weight value, the user group targeted by the search term to be located.
  • The solution generates users' preference scores for the retrieval items of products from the users' behavior data, generates the first weight value of each positioning retrieval item for its dimension from the search term input by the operator, and finally generates each user's behavior weight value from the preference scores and the first weight values. The behavior weight value directly shows how interested the corresponding user is in the search term, and users are grouped accordingly. Compared with the prior art, the solution makes effective use of the text data generated on the website server, and the crowd targeting result is more accurate than that of existing techniques that analyze structured data.
  • Therefore, the solution of the first embodiment provided by the present application solves the technical problem that crowd targeting based on structured data alone yields insufficiently accurate results.
  • In step S24, the step of determining each user's preference score for the retrieval items contained in the data set in each dimension may include:
  • Step S241: acquire at least one first retrieval item contained in the keyword set, at least one second retrieval item contained in the attribute information set, and at least one third retrieval item contained in the classification information set.
  • Step S242: count, for the data set in each dimension, the total number of accesses to each retrieval item and the number of times the user accessed each retrieval item.
  • Step S243: calculate, from the total number of accesses to each retrieval item and the number of times the user accessed it, the user's preference score for the retrieval items contained in the data set in each dimension.
  • Each document can include three fields: CATEGORY, PROPERTY, and KEYWORD.
  • Each field contains a number of terms, through which the user's preference score for each retrieval item can be described.
  • In step S243, the user's preference score tf(t, d) for a retrieval item contained in the data set in any one dimension may be calculated, from the total number of accesses to the retrieval items in the data set in each dimension and the number of times the user accessed them, by the following calculation formula:
  • where w_i is the weight value of the access behavior occurring in the data set in the i-th dimension; N_i is the number of accesses counted after the user performs the access behavior on the retrieval item t in the data set in the i-th dimension; n_i
  • The retrieval item t is any retrieval item in the data set, and the access behavior includes any one of the following types: click, favorite, and review.
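The formula for tf(t, d) did not survive in this text, so the following is only a sketch of one plausible reading, not the patent's formula. It assumes that i indexes the access behavior types (click, favorite, review), that N_i is the user's count for behavior i on retrieval item t, that n_i is the total count of behavior i on t across all users, and that the score is the weighted sum of N_i / n_i. The function name and the example weights are hypothetical.

```python
from typing import Dict


def preference_score(
    user_counts: Dict[str, int],
    total_counts: Dict[str, int],
    behavior_weights: Dict[str, float],
) -> float:
    """Assumed form: tf(t, d) = sum_i w_i * N_i / n_i over behavior types i."""
    score = 0.0
    for behavior, w in behavior_weights.items():
        n_user = user_counts.get(behavior, 0)     # N_i: this user's accesses to t
        n_total = total_counts.get(behavior, 0)   # n_i: all users' accesses to t (assumption)
        if n_total > 0:                           # skip behaviors nobody performed
            score += w * n_user / n_total
    return score


# Illustrative weights: a review is assumed to signal more interest than a click.
w = {"click": 1.0, "favorite": 2.0, "review": 3.0}
tf = preference_score(
    user_counts={"click": 4, "favorite": 1},
    total_counts={"click": 40, "favorite": 5, "review": 10},
    behavior_weights=w,
)
```

Dividing by the total count keeps a heavily visited retrieval item from dominating every user's profile; whether the patent normalizes this way is not recoverable from the text.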
  • In step S26, the step of querying, after the search term to be located is acquired, a plurality of positioning retrieval items corresponding to the search term and obtaining the weight value of the data set in each dimension for each positioning retrieval item may include:
  • Step S261: acquire the search term to be located, and query according to the search term to obtain a plurality of positioning retrieval items corresponding to it.
  • Step S262: determine, from the positioning retrieval items obtained by the query, the dimension relationship between the search term and the data set in each dimension.
  • Step S263: calculate, from that dimension relationship, the weight value of the data set in each dimension corresponding to each positioning retrieval item.
  • The solution may query according to the search term to be located that was input by the operator, to obtain a plurality of positioning retrieval items corresponding to it. For the three dimensions shared by the positioning retrieval items and the search term to be located, the solution may first determine the dimension relationship between the search term and the data set in each dimension, and then calculate from it the weight value of the data set in each dimension corresponding to each positioning retrieval item.
  • The dimension relationship between the search term and the data set in each dimension may be determined by the following calculation formula:
  • where A denotes the data sets, among the data sets in the three dimensions, that contain a given search term w, and B denotes the data sets, among the data sets in the three dimensions, that contain a given positioning retrieval item t.
  • In this way the solution can generate the relationship of the search term to the three dimensions of the ITEM.
  • When the operator inputs a search term for crowd targeting, the solution generates, through query expansion, the relationship of the search term to the three dimensions of the ITEM, that is, WORD-CATEGORY, WORD-PROPERTY, and KEYWORD-KEYWORD. The solution can use the Jaccard distance algorithm to take into account the co-occurrence of the search term with the other dimensions on ITEMs.
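The co-occurrence measure named above can be illustrated with the standard Jaccard coefficient |A ∩ B| / |A ∪ B|. The relation formula itself is missing from this text, so treating r(w, t) as the Jaccard similarity of the two ITEM sets is an assumption (the Jaccard distance would be 1 minus this value); the function name and the sample ITEM identifiers are hypothetical.

```python
from typing import Iterable


def jaccard(a: Iterable[str], b: Iterable[str]) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B|; defined as 0.0 when both sets are empty."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)


# A: ITEMs whose data contain search term w; B: ITEMs containing retrieval item t.
items_with_word = {"i1", "i2", "i3"}
items_with_term = {"i2", "i3", "i4"}
r_wt = jaccard(items_with_word, items_with_term)   # 2 shared of 4 distinct
```

A value near 1 means the search term and the retrieval item almost always appear on the same ITEMs, which is the kind of co-occurrence the query expansion relies on.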
  • the solution may calculate the weight value of the data set in each dimension corresponding to each positioning search item by using the following calculation formula:
  • r(w, t) is the dimension relationship of the data set corresponding to each dimension of the search term
  • w is the correlation between the search term w and the search term t
  • I(w) is the word frequency of the search term in the text.
  • each domain in the above document can be assigned a weight value.
  • Step S261 includes:
  • Step S2611: after a query keyword input by the querying user is received, determine that the input keyword is the search term to be located.
  • The querying user may be an operator who wants to perform crowd targeting; the solution may directly take the keyword input by the operator as the search term to be located.
  • Step S2612: after a text input by the querying user is received, perform word segmentation on the text, and take at least one keyword obtained by the word segmentation as the search term to be located.
  • The solution may segment and filter the text TXT, and then take at least one keyword obtained by the word segmentation as the search term to be located.
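The two input paths of steps S2611 and S2612 can be sketched as follows. Real word segmentation (especially for Chinese text) needs a dedicated segmenter; the regular-expression tokenizer and the small stop-word list here are stand-ins chosen only to keep the example self-contained, and the function name `search_terms` is hypothetical. The returned counts play the role of the term frequency of each search term in the text.

```python
import re
from collections import Counter

# Illustrative stop-word list, not taken from the patent.
STOPWORDS = {"a", "an", "the", "of", "and", "for", "in", "on"}


def search_terms(query: str, is_text: bool = False) -> Counter:
    """Return search terms with their frequencies.

    A keyword is used as-is (S2611); a text is tokenized and filtered (S2612).
    """
    if not is_text:
        return Counter([query.strip().lower()])
    tokens = re.findall(r"\w+", query.lower())
    return Counter(t for t in tokens if t not in STOPWORDS)


from_keyword = search_terms("Stephen Chow Movie")
from_text = search_terms("The movie of the year: a comedy movie", is_text=True)
```

Returning a `Counter` in both cases lets later steps treat a single keyword and a segmented text uniformly.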
  • In step S28, the step of calculating, from the preference scores of the retrieval items contained in the data set in each dimension and the weight values of the data set in each dimension corresponding to each positioning retrieval item, the behavior weight value determined by the coupling relationship between each user and the search term includes:
  • Step S281: obtain the IDF value idf(t) of the positioning retrieval item in the user behavior data.
  • Step S282: obtain the highest weight value coord(q, d) of the positioning retrieval items in the plurality of documents.
  • Step S283: normalize the retrieval items searched in the same document to obtain the normalized search term score queryNorm(q, d).
  • Step S284: normalize the weight values of the positioning retrieval item in the plurality of documents to obtain the normalized score norm(t.field) of the plurality of documents.
  • Step S285: obtain the behavior weight value Score(q, d) determined by the coupling relationship between each user and the search term by the following calculation formula.
  • $\mathrm{Score}(q,d) = \mathrm{coord}(q,d) \cdot \mathrm{queryNorm}(q,d) \cdot \sum_{t \in q} \mathrm{tf}(t,d) \cdot \mathrm{idf}^2(t) \cdot t.\mathrm{boost} \cdot \mathrm{norm}(t.\mathrm{field})$, where tf(t, d) is the preference score of the search item included in the data set in each dimension for the user, t.boost is the weight value of each positioning search item for the data set in each dimension, and f.boost is the weight value of the data set in each dimension.
  • the solution can be calculated by the following calculation formula.
  • the solution can calculate the highest weight value coord(q, d) of the location search item in multiple documents by using the following calculation formula:
  • the solution can calculate the normalized search term score queryNorm(q, d) by the following formula:
  • the solution may calculate a normalized score norm (t.field) of the plurality of documents by using a calculation formula as follows:
  • the domain is a collection of data on any dimension in the access data set.
  • The algorithm used in this scheme ignores the document weight (d.boost) and the overall query weight (q.boost), and each TERM has only one f.boost; that is, each TERM belongs to only one domain.
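The Score(q, d) formula of step S285 can be sketched in Python as follows. This is only an illustration under stated assumptions: the `QueryTerm` class, the nested-dictionary layout of the user document, and the parameter names are hypothetical, and coord, queryNorm, idf, and norm are passed in as precomputed values because their formulas are not reproduced in this text.

```python
from dataclasses import dataclass


@dataclass
class QueryTerm:
    """One positioning search item t of the query q (hypothetical layout)."""
    term: str
    field: str    # dimension: "CATEGORY", "PROPERTY" or "KEYWORD"
    boost: float  # t.boost, weight of the term in that dimension
    idf: float    # idf(t), supplied by the caller


def score(query, document, coord, query_norm, field_norm):
    """Score(q,d) = coord * queryNorm * sum over t in q of
    tf(t,d) * idf(t)^2 * t.boost * norm(t.field)."""
    total = 0.0
    for t in query:
        # tf(t, d): the user's preference score for term t, 0.0 if absent.
        tf = document.get(t.field, {}).get(t.term, 0.0)
        total += tf * t.idf ** 2 * t.boost * field_norm.get(t.field, 1.0)
    return coord * query_norm * total
```

A user document here is a mapping from field name to `{term: preference score}`; a term missing from the document simply contributes nothing to the sum.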
  • Step A: the data extraction and abstraction module imports the user behavior data into a data warehouse (such as ODPS or Hadoop) and performs an ETL process to generate offline data conforming to the data specification.
  • This embodiment abstracts two subjects. USER (user) is the subject being targeted; the crowd that is finally output is a subset of all USERs. A USER can have TAG attributes that describe the user's demographic characteristics, such as gender and age.
  • ITEM (item) is the object of the user's behavior, including but not limited to goods, videos, and music.
  • CATEGORY (category) indicates the classification of an ITEM. The relationship is many-to-one; that is, each ITEM has one and only one CATEGORY.
  • PROPERTY (attribute) is an attribute of the ITEM itself. The relationship is many-to-many.
  • For example, music as an ITEM can have multiple attributes such as composer, lyricist, singer, and style.
  • KEYWORD (keyword): each keyword can carry a word frequency or a TF-IDF weight. Note that, of the three dimensions, only KEYWORD is required; the other two need not be reflected in the data (CATEGORY is unique, PROPERTY may be empty).
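The USER and ITEM abstraction can be illustrated with two small Python data classes. The class and field names are hypothetical, chosen only to mirror the description: one CATEGORY per ITEM, several PROPERTY entries, and weighted KEYWORD entries.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    """An ITEM: the object of a user's behavior (goods, video, music, ...)."""
    item_id: str
    category: str  # many-to-one: exactly one CATEGORY per ITEM
    # PROPERTY: attribute name -> values, e.g. {"singer": ["..."]}; may be empty.
    properties: dict[str, list[str]] = field(default_factory=dict)
    # KEYWORD: keyword -> word frequency or TF-IDF weight.
    keywords: dict[str, float] = field(default_factory=dict)


@dataclass
class User:
    """A USER: the subject being targeted."""
    user_id: str
    # TAG attributes describing demographics, e.g. {"gender": "f", "age": "25-34"}.
    tags: dict[str, str] = field(default_factory=dict)
```

For example, `Item("song-1", "music", {"style": ["folk"]}, {"ballad": 0.8})` models a music ITEM with one attribute and one weighted keyword.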
  • The user document generation module decomposes the USER's behavior on an ITEM into the USER's preference scores for the three dimensions of the ITEM, namely USER-CATEGORY, USER-PROPERTY, and USER-KEYWORD.
  • The scheme can use a targeted supervised learning algorithm (such as LR or SVM) to statistically summarize the data and normalize it to the range 0-1.
  • Summarizing all preferences generates each user's own preference document (Document).
  • A document includes three fields: CATEGORY, PROPERTY, and KEYWORD. Each field contains a number of terms that describe the user's preference for, say, a category or a word. Because crowd selection results generally do not need to be real-time, and the data volume (millions to one billion) is far smaller than that of a text search system (100 million to 100 billion), the documents do not need an inverted index, and the implementation is simpler than a text search system.
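A per-user preference document with the three fields can be sketched as below. Note the substitution: instead of the supervised learning (LR, SVM) mentioned in the text, this sketch uses a plain weighted count divided by the per-field maximum to reach the 0-1 range. The behavior weights, the function name, and the input layout are all hypothetical.

```python
from collections import defaultdict

# Hypothetical weights per behavior type; the values are illustrative only.
BEHAVIOR_WEIGHTS = {"click": 1.0, "favorite": 2.0, "review": 3.0}


def build_preference_document(behaviors, items):
    """behaviors: list of (item_id, behavior_type) pairs for one USER.
    items: item_id -> {"category": str, "properties": [str], "keywords": [str]}.
    Returns {"CATEGORY": {...}, "PROPERTY": {...}, "KEYWORD": {...}} with
    every score scaled into the range 0-1."""
    raw = {name: defaultdict(float) for name in ("CATEGORY", "PROPERTY", "KEYWORD")}
    for item_id, behavior in behaviors:
        weight = BEHAVIOR_WEIGHTS[behavior]
        item = items[item_id]
        # Decompose one behavior on an ITEM into its three dimensions.
        raw["CATEGORY"][item["category"]] += weight
        for prop in item["properties"]:
            raw["PROPERTY"][prop] += weight
        for keyword in item["keywords"]:
            raw["KEYWORD"][keyword] += weight
    document = {}
    for field_name, scores in raw.items():
        top = max(scores.values(), default=0.0)
        # Max-normalization: the strongest term in each field scores 1.0.
        document[field_name] = (
            {term: value / top for term, value in scores.items()} if top else {}
        )
    return document
```

Because every user has a single document and no inverted index is kept, scoring a query amounts to a direct dictionary lookup per term.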
  • Step C: the keyword correlation calculation module calculates the relationship between search terms and the three dimensions of the ITEM, namely WORD-CATEGORY, WORD-PROPERTY, and KEYWORD-KEYWORD, and provides query expansion when keywords are input to select a crowd.
  • Step D: the tag definition generation module takes the text or keywords provided by the user as input (text must first go through word segmentation to obtain keywords) and obtains the corresponding positioning search items (terms) through query expansion.
  • The tag definition generation module finally generates the weight of each positioning search item in each dimension according to the relationship between the search terms and the three dimensions of the ITEM; the weight can be calculated simply as a weighted sum.
  • The result is the query-expanded tag definition, which is equivalent to the query (Query) in a search system.
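The query expansion and weighted summation of step D can be sketched as follows. The combination rule used here, weight(t) = Σ_w I(w) · r(w, t), is an assumption made for illustration: it is one plausible reading of "weighted summation", not a formula taken from the text. The function name and the layout of the relevance table are likewise hypothetical.

```python
from collections import defaultdict


def build_tag_definition(keyword_freq, relevance):
    """keyword_freq: keyword w -> I(w), its frequency in the input text.
    relevance: field -> {w -> {t -> r(w, t)}}, the precomputed relationship
    between input keywords and terms in CATEGORY, PROPERTY or KEYWORD.
    Returns field -> {t -> weight} with weight = sum over w of I(w) * r(w, t)."""
    query = {}
    for field_name, by_keyword in relevance.items():
        weights = defaultdict(float)
        for w, freq in keyword_freq.items():
            # Expand keyword w into every related term t of this field.
            for t, r in by_keyword.get(w, {}).items():
                weights[t] += freq * r
        query[field_name] = dict(weights)
    return query
```

The returned mapping plays the role of the Query: each field lists the expanded terms together with the boost they would carry into scoring.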
  • Step E: the scoring module generates a user behavior weight value using Lucene's search scoring algorithm, according to the weight of each positioning search item in each dimension and the USER's preference for the three dimensions of the ITEM. The user behavior weight value can be used to characterize the degree of interest in the ITEM.
  • the above scoring algorithm may be the BM25 algorithm.
  • The present invention provides a general solution that lets operators select a specific group of people simply by providing keywords, and that yields an interpretable definition of the crowd. This can improve product iteration efficiency, reduce development costs, achieve more accurate crowd targeting, and improve the effectiveness of the operator's advertising services.
  • The part of the technical solution of the present invention that is essential or that contributes to the prior art may be embodied in the form of a software product stored in a storage medium (such as a ROM/RAM, a magnetic disk, or an optical disc), which includes a number of instructions for causing a terminal device (which may be a mobile phone, a computer, a server, a network device, or the like) to perform the methods of the various embodiments of the present invention.
  • the device may include:
  • The first obtaining unit 50 is configured to acquire user behavior data, where the user behavior data includes access data sets generated after a plurality of users access the target object, and each access data set includes at least the data sets in the following three dimensions: a keyword set, an attribute information set, and a classification information set.
  • The user may be a visiting user USER of a portal website (such as a shopping website), and the target object may be a product ITEM on the portal website; the product ITEM may be goods, a video, music, and so on. When the user USER accesses a product ITEM on the portal website, a large number of access data sets (such as text data) are generated, and the website server can obtain the access data sets generated when users access the target object.
  • Each access data set obtained by the website server can be described using three dimensions: the category CATEGORY, that is, the classification information above, which describes the classification of the product ITEM; the attribute PROPERTY, which expresses the attributes of the product ITEM itself; and the keyword KEYWORD, which represents the name of the product ITEM. Each keyword can have a word frequency or a TF-IDF weight. Note that, of the three dimensions used to describe a product ITEM, each product ITEM can have only one category CATEGORY but can have multiple attributes PROPERTY.
  • the first determining unit 52 is configured to determine a preference score of the search item included in the data set corresponding to the user in each dimension, wherein the data set in each dimension includes at least one search item.
  • Each dimension may include a plurality of search items, which may be multiple attributes of that dimension. A user may perform specific operations on the search items in each dimension, and the solution can then determine the user's preference score for each search item according to those operations.
  • The second obtaining unit 54 is configured to, after acquiring the search term to be located, query according to the search term to obtain a plurality of positioning search items corresponding to the search term, and obtain the weight value of each positioning search item for the data set in each dimension.
  • The operator of the website wishes to achieve crowd targeting through search terms; that is, the operator wishes to delineate one or more users who are interested in a search term A, locating a group of users according to the search term so that data push, analysis, and similar operations can then be performed on the located user group. For example, after a word is used as a search term to locate the interests of different consumer groups, advertising information related to the search term may be pushed to the users located in the same group. In an optional example, the operator of the website can input the search term to be located directly to the server, or provide a text to the server, and the server can obtain the search term to be located from the text through word segmentation.
  • The third obtaining unit 56 is configured to calculate the behavior weight value determined by the coupling relationship between each user and the search term, according to the preference scores of the search items included in the data set in each dimension and the weight value of each positioning search item for the data set in each dimension.
  • The coupling relationship between a user and a search term can be generated by the user's operations on the search term on the website (click, browse, download, and so on). For example, when the user clicks on the search term, a first coupling relationship is generated between the user's behavior and the search term, and this first coupling relationship can represent the user's degree of interest in the search term. The more the user clicks, the larger the first coupling relationship; the larger the behavior weight value determined according to the first coupling relationship, the greater the user's interest in the search term.
  • The second determining unit 58 is configured to determine the user group located by the search term to be located, according to the behavior weight value determined by the coupling relationship between each user and the search term.
  • The solution may select a plurality of users that meet a predetermined condition according to the size of the behavior weight value determined by the coupling relationship between each user and the search term, and then determine those users as the user group related to the search term. Preferably, in this embodiment, the users whose behavior weight value determined by the coupling relationship is greater than 0 may also be determined as the user group. Note that, after the user group of the search term is determined, the operator may push relevant advertisement information to each user in the user group.
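Selecting the user group from the behavior weight values can be sketched as a simple filter. The function name and the optional `top_n` cap are hypothetical; the default threshold of 0 follows the "greater than 0" condition mentioned above.

```python
def select_user_group(scores, threshold=0.0, top_n=None):
    """scores: user_id -> behavior weight value.
    Keeps users whose value is strictly greater than `threshold`, ordered
    from highest to lowest value; `top_n` optionally caps the group size."""
    ranked = sorted(
        ((user, value) for user, value in scores.items() if value > threshold),
        key=lambda pair: pair[1],
        reverse=True,
    )
    users = [user for user, _ in ranked]
    return users if top_n is None else users[:top_n]
```

Raising the threshold or setting `top_n` is one way an operator could trade crowd size against how strongly the selected users relate to the search term.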
  • The solution first acquires user behavior data, where the user behavior data includes access data sets generated after a plurality of users access the target object, and each access data set includes at least the data sets in the following three dimensions: a keyword set, an attribute information set, and a classification information set. The preference score of the search items included in the data set in each dimension is then determined for the user, where the data set in each dimension includes at least one search item. After the search term to be located is obtained, a plurality of positioning search items corresponding to the search term are obtained by querying according to the search term, and the weight value of each positioning search item for the data set in each dimension is obtained. The behavior weight value determined by the coupling relationship between each user and the search term is then calculated according to these preference scores and weight values. Finally, the user group located by the search term to be located is determined according to that behavior weight value.
  • It is easy to see that the solution can obtain users' behavior data from the website server, generate the users' preference scores for the search items of products from that data, generate the first weight value of each positioning search item for each dimension from the search term input by the operator, and finally generate each user's behavior weight value from the preference scores and the first weight values. The behavior weight value shows intuitively how interested a user is in the corresponding search term, so that users can be grouped. Compared with the prior art, which locates crowds by analyzing structured data, the solution makes effective use of the text data generated by the website server, and the crowd positioning results it produces are more accurate. The solution of Embodiment 2 provided by the present application therefore solves the technical problem that crowd targeting achieved purely through structured data yields insufficiently accurate positioning results.
  • The first determining unit 52 includes: a first obtaining module 521, configured to respectively acquire at least one first search item included in the keyword set, at least one second search item included in the attribute information set, and at least one third search item included in the classification information set; a statistics module 523, configured to separately count the number of accesses to the search items in the data set in each dimension and the number of times the user accessed the search items in the data set in each dimension; and a first calculation module 524, configured to calculate, from these two counts, the preference score of the search items included in the data set in each dimension for the user.
  • the first calculation module 524 includes: a sub-calculation module, configured to calculate, by using a calculation formula, a preference score of a search item included in a data set corresponding to a user in any one dimension.
  • In the formula, tf(t, d) is the preference score; w_i is the weight value of the access behavior occurring in the data set in the i-th dimension; N_i is the number of accesses counted after the user performs the access behavior on search item t in the data set in the i-th dimension; n_i is the number of accesses to search item t in the data set in the i-th dimension; search item t is any one of the search items in the data set; and the access behavior includes any one of the following types: click, favorite, and review.
  • The second obtaining unit 54 includes: a second obtaining module 541, configured to acquire the search term to be located and query according to the search term to obtain a plurality of positioning search items corresponding to the search term; a first determining module 542, configured to determine, according to the plurality of positioning search items obtained by the query, the dimensional relationship between the search term and the data set in each dimension; and a second calculating module 543, configured to calculate the weight value of each positioning search item for the data set in each dimension according to that dimensional relationship.
  • the foregoing apparatus further includes: a first calculating unit, configured to determine, by using a calculation formula, a dimensional relationship of the data set corresponding to each dimension of the search term: Where A represents a data set containing any one of the search terms in the data set in three dimensions, and B represents a data set containing any one of the positioned search terms t in the data set in the three dimensions.
  • the foregoing apparatus further includes: a second calculating unit, configured to calculate a weight value of the data set in each dimension corresponding to each positioning search item by using a calculation formula: Where r(w, t) is the dimensional relationship of the search term corresponding to the data set on each dimension, w is the correlation between the search term w and the search term t, and I(w) is the word frequency of the search term in the text.
  • the second obtaining module 541 includes: a second determining module, configured to: after receiving the querying the keyword input by the user, determining that the input keyword is a search term to be located;
  • the first processing module is configured to perform word segmentation processing on the text after receiving the text input by the querying user, and the at least one keyword obtained by the word segmentation processing is the search term to be located.
  • the foregoing apparatus further includes: a third calculating unit, configured to calculate, by using the following calculation formula, an IDF value idf(t) of the positioning retrieval item in the user behavior data:
  • the foregoing apparatus further includes: a fourth calculating unit, configured to calculate, by using the following calculation formula, a highest weight value coord(q, d) of the positioning retrieval item in the plurality of documents:
  • the apparatus further includes: a fifth calculating unit, configured to calculate a normalized search term score queryNorm(q, d) by using the following formula:
  • the foregoing apparatus further includes: a sixth calculating unit, configured to calculate a normalized score norm (t.field) of the plurality of documents by using the following calculation formula:
  • the domain is a collection of data on any dimension in the access data set.
  • Embodiments of the present invention may provide a computer terminal, which may be any one of computer terminal groups.
  • the computer terminal may be located in at least one network device of the plurality of network devices of the computer network.
  • The computer terminal may execute program code for the following steps of the method for processing user behavior data: acquiring user behavior data, where the user behavior data includes access data sets generated after a plurality of users access a target object, and each access data set includes at least the data sets in the following three dimensions: a keyword set, an attribute information set, and a classification information set; determining a preference score of the search items included in the data set in each dimension for a user, where the data set in each dimension includes at least one search item; after obtaining a search term to be located, querying according to the search term to obtain a plurality of positioning search items corresponding to the search term, and obtaining a weight value of each positioning search item for the data set in each dimension; calculating a behavior weight value determined by the coupling relationship between each user and the search term according to the preference scores of the search items included in the data set in each dimension and the weight value of each positioning search item for the data set in each dimension; and determining the user group located by the search term to be located according to the behavior weight value determined by the coupling relationship between each user and the search term.
  • FIG. 9 is a structural block diagram of a computer terminal according to an embodiment of the present invention.
  • The computer terminal A may include one or more processors (only one is shown in the figure) and a memory.
  • The memory can be used to store software programs and modules, such as the program instructions/modules corresponding to the method and apparatus for processing user behavior data in the embodiments of the present invention. The processor executes various functional applications and data processing by running the software programs and modules stored in the memory, thereby implementing the above method for processing user behavior data.
  • the memory may include a high speed random access memory, and may also include non-volatile memory such as one or more magnetic storage devices, flash memory, or other non-volatile solid state memory.
  • The memory can further include memory located remotely from the processor, which can be connected to terminal A via a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
  • The processor may call, through the transmission module, the information and application programs stored in the memory to perform the following steps: acquiring user behavior data, where the user behavior data includes access data sets generated after a plurality of users access a target object, and each access data set includes at least the data sets in the following three dimensions: a keyword set, an attribute information set, and a classification information set; determining a preference score of the search items included in the data set in each dimension for a user, where the data set in each dimension includes at least one search item; after obtaining a search term to be located, querying according to the search term to obtain a plurality of positioning search items corresponding to the search term, and obtaining a weight value of each positioning search item for the data set in each dimension; calculating a behavior weight value determined by the coupling relationship between each user and the search term according to the preference scores of the search items included in the data set in each dimension and the weight value of each positioning search item for the data set in each dimension; and determining the user group located by the search term to be located according to the behavior weight value determined by the coupling relationship between each user and the search term.
  • The processor may further execute program code for the following steps: respectively acquiring at least one first search item included in the keyword set, at least one second search item included in the attribute information set, and at least one third search item included in the classification information set; separately counting the number of accesses to the search items in the data set in each dimension and the number of times the user accessed the search items in the data set in each dimension; and calculating, from these two counts, the preference score of the search items included in the data set in each dimension for the user.
  • The processor may further execute the following program code: calculating, by using a calculation formula, the preference score tf(t, d) of the search items included in the data set in any one dimension for the user. In the formula, w_i is the weight value of the access behavior occurring in the data set in the i-th dimension; N_i is the number of accesses counted after the user performs the access behavior on search item t in the data set in the i-th dimension; n_i is the number of accesses to search item t in the data set in the i-th dimension; search item t is any one of the search items in the data set; and the access behavior includes any one of the following types: click, favorite, and review.
  • the processor may further execute the following program code: obtain a search term to be located, and obtain a plurality of positioning search items corresponding to the search word according to the search term query; and multiple positioning searches according to the query. Item, determining a dimension relationship of the data set corresponding to each dimension on the search term; calculating a weight value of the data set on each dimension corresponding to each of the positioned search terms according to the dimension relationship of the data set corresponding to each dimension of the search term .
  • the foregoing processor may further execute the following program code: Where A represents a data set containing any one of the search terms in the data set in three dimensions, and B represents a data set containing any one of the positioned search terms t in the data set in the three dimensions.
  • the foregoing processor may further execute the following program code: Where r(w, t) is the dimensional relationship of the search term corresponding to the data set on each dimension, w is the correlation between the search term w and the search term t, and I(w) is the word frequency of the search term in the text.
  • The foregoing processor may further execute the following program code: after receiving a keyword input by the querying user, determining that the input keyword is the search term to be located; or, after receiving text input by the querying user, performing word segmentation on the text, where at least one keyword obtained by the word segmentation is the search term to be located.
  • The foregoing processor may further execute the following program code: obtaining an IDF value idf(t) of the positioning search item in the user behavior data; and obtaining a highest weight value coord(q, d) of the positioning search item in the plurality of documents.
  • the foregoing processor may further execute the following program code: the IDF value idf(t) of the location retrieval item in the user behavior data is calculated by using the following calculation formula:
  • the foregoing processor may further execute program code of the following steps: calculating, by using the following calculation formula, a highest weight value coord(q, d) of the positioning retrieval item in the plurality of documents:
  • the foregoing processor may further execute the following program code: the normalized search term score queryNorm(q, d) is calculated by the following calculation formula:
  • the foregoing processor may further execute the following program code: calculate a normalized score norm (t.field) of the plurality of documents by using the following calculation formula:
  • the domain is a collection of data on any dimension in the access data set.
  • a method for processing user behavior data is provided.
  • The access data set includes at least the data sets in the following three dimensions: a keyword set, an attribute information set, and a classification information set. A preference score of the search items included in the data set in each dimension is determined for the user, where the data set in each dimension includes at least one search item. After the search term to be located is obtained, a plurality of positioning search items corresponding to the search term are obtained by querying according to the search term, and a weight value of each positioning search item for the data set in each dimension is obtained. A behavior weight value determined by the coupling relationship between each user and the search term is calculated according to these preference scores and weight values, and the user group located by the search term to be located is determined according to that behavior weight value. This solves the technical problem that crowd targeting achieved purely through structured data yields insufficiently accurate positioning results.
  • FIG. 1 does not limit the structure of the above electronic device.
  • computer terminal 10 may also include more or fewer components (such as a network interface, display device, etc.) than shown in FIG. 1, or have a different configuration than that shown in FIG.
  • Embodiments of the present invention also provide a storage medium.
  • the foregoing storage medium may be used to save the program code executed by the processing method of the user behavior data provided in the first embodiment.
  • the foregoing storage medium may be located in any one of the computer terminal groups in the computer network, or in any one of the mobile terminal groups.
  • The storage medium is configured to store program code for performing the following steps: acquiring user behavior data, where the user behavior data includes access data sets generated after a plurality of users access a target object, and each access data set includes at least the data sets in the following three dimensions: a keyword set, an attribute information set, and a classification information set; determining a preference score of the search items included in the data set in each dimension for a user, where the data set in each dimension includes at least one search item; after obtaining a search term to be located, querying according to the search term to obtain a plurality of positioning search items corresponding to the search term, and obtaining a weight value of each positioning search item for the data set in each dimension; calculating a behavior weight value determined by the coupling relationship between each user and the search term according to the preference scores of the search items included in the data set in each dimension and the weight value of each positioning search item for the data set in each dimension; and determining the user group located by the search term to be located according to the behavior weight value determined by the coupling relationship between each user and the search term.
  • the disclosed technical contents may be implemented in other manners.
  • the device embodiments described above are merely illustrative.
  • the division of a unit is only a logical function division.
  • Multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed.
  • the mutual coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interface, unit or module, and may be electrical or otherwise.
  • the units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
  • each functional unit in each embodiment of the present invention may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit.
  • the above integrated unit can be implemented in the form of hardware or in the form of a software functional unit.
  • An integrated unit if implemented in the form of a software functional unit and sold or used as a standalone product, can be stored in a computer readable storage medium.
  • the technical solution of the present invention which is essential or contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium.
  • a number of instructions are included to cause a computer device (which may be a personal computer, server or network device, etc.) to perform all or part of the steps of the various embodiments of the present invention.
  • the foregoing storage medium includes: a U disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, or an optical disk, and the like.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

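A method and apparatus for processing user behavior data. The method includes: acquiring user behavior data (S22); determining the user's preference score for the search items contained in the data set in each dimension (S24); after acquiring a search word to be located, querying by the search word to obtain multiple positioning search items that correspond to the search word, and acquiring the weight value of each positioning search item with respect to the data set in each dimension (S26); calculating, from the preference scores of the search items contained in the data set in each dimension and the weight value of each positioning search item with respect to the data set in each dimension, the behavior weight value determined by the coupling relationship between each user and the search word (S28); and determining, from the behavior weight value determined by the coupling relationship between each user and the search word, the user group targeted by the search word to be located (S30). The method solves the technical problem that audience targeting based purely on structured data yields insufficiently accurate targeting results.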

Description

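Method and apparatus for processing user behavior data
This application claims priority to Chinese patent application No. 201610018733.7, filed on January 12, 2016 and entitled "Method and apparatus for processing user behavior data", the entire contents of which are incorporated herein by reference.
Technical Field
The present invention relates to the field of computers, and in particular to a method and apparatus for processing user behavior data.
Background
At present, users generate large amounts of structured data when using Internet products (for example, when shopping on a portal website). Merchants often use this structured data for audience targeting in order to infer users' interests. For example, the tag-based audience targeting technology of a DMP uses users' basic information and basic behavior to select and tag an audience, and then pushes advertisements or applications to the targeted user group.
It should be noted that users also generate large amounts of unstructured data (for example, text data) when using Internet products. Compared with the structured data above, user comments and titles in text data can reflect users' finer-grained interest preferences, and business information mined from text data is more valuable. In the related art, therefore, audience targeting based purely on structured data yields insufficiently accurate targeting results.
No effective solution has yet been proposed for the problem that audience targeting based purely on structured data yields insufficiently accurate targeting results.
Summary of the Invention
Embodiments of the present invention provide a method and apparatus for processing user behavior data, to at least solve the technical problem that audience targeting based purely on structured data yields insufficiently accurate targeting results.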
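According to one aspect of the embodiments of the present invention, a method for processing user behavior data is provided, including: acquiring user behavior data, wherein the user behavior data includes access data sets generated after multiple users access target objects, and the access data sets include at least data sets in the following three dimensions: a keyword set, an attribute information set, and a category information set; determining the user's preference score for the search items contained in the data set in each dimension, wherein the data set in each dimension contains at least one search item; after acquiring a search word to be located, querying by the search word to obtain multiple positioning search items that correspond to the search word, and acquiring the weight value of each positioning search item with respect to the data set in each dimension; calculating, from the preference scores of the search items contained in the data set in each dimension and the weight value of each positioning search item with respect to the data set in each dimension, the behavior weight value determined by the coupling relationship between each user and the search word; and determining, from the behavior weight value determined by the coupling relationship between each user and the search word, the user group targeted by the search word to be located.
According to another aspect of the embodiments of the present invention, an apparatus for processing user behavior data is also provided, including: a first acquiring unit, configured to acquire user behavior data, wherein the user behavior data includes access data sets generated after multiple users access target objects, and the access data sets include at least data sets in the following three dimensions: a keyword set, an attribute information set, and a category information set; a first determining unit, configured to determine the user's preference score for the search items contained in the data set in each dimension, wherein the data set in each dimension contains at least one search item; a second acquiring unit, configured to, after a search word to be located is acquired, query by the search word to obtain multiple positioning search items that correspond to the search word, and acquire the weight value of each positioning search item with respect to the data set in each dimension; a third acquiring unit, configured to calculate, from each user's preference scores for the search items contained in the data set in each dimension and the weight value of each positioning search item with respect to the data set in each dimension, the behavior weight value determined by the coupling relationship between each user and the search word; and a second determining unit, configured to determine, from the behavior weight value determined by the coupling relationship between each user and the search word, the user group targeted by the search word to be located.
In the embodiments of the present invention, the following is adopted: acquiring user behavior data, wherein the user behavior data includes access data sets generated after multiple users access target objects, and the access data sets include at least data sets in the following three dimensions: a keyword set, an attribute information set, and a category information set; determining the user's preference score for the search items contained in the data set in each dimension, wherein the data set in each dimension contains at least one search item; after acquiring a search word to be located, querying by the search word to obtain multiple positioning search items that correspond to the search word, and acquiring the weight value of each positioning search item with respect to the data set in each dimension; calculating, from the preference scores of the search items contained in the data set in each dimension and the weight value of each positioning search item with respect to the data set in each dimension, the behavior weight value determined by the coupling relationship between each user and the search word; and determining, from the behavior weight value determined by the coupling relationship between each user and the search word, the user group targeted by the search word to be located. This solves the technical problem that audience targeting based purely on structured data yields insufficiently accurate targeting results.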
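Brief Description of the Drawings
The drawings described here provide a further understanding of the present invention and form a part of this application. The exemplary embodiments of the present invention and their descriptions are used to explain the invention and do not unduly limit it. In the drawings:
FIG. 1 is a block diagram of the hardware structure of a computer terminal for a method for processing user behavior data according to an embodiment of the present invention;
FIG. 2 is a flowchart of a method for processing user behavior data according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of an optional method for processing user behavior data according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of an optional method for processing user behavior data according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of an apparatus for processing user behavior data according to an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of an optional apparatus for processing user behavior data according to an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of an optional apparatus for processing user behavior data according to an embodiment of the present invention;
FIG. 8 is a schematic structural diagram of an optional apparatus for processing user behavior data according to an embodiment of the present invention; and
FIG. 9 is a block diagram of the hardware structure of a computer terminal for a method for processing user behavior data according to an embodiment of the present invention.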
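Detailed Description
To help those skilled in the art better understand the solution of the present invention, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. The described embodiments are obviously only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art from these embodiments without creative effort fall within the scope of protection of the present invention.
It should be noted that the terms "first", "second", and so on in the description, the claims, and the drawings are used to distinguish similar objects and do not necessarily describe a particular order or sequence. It should be understood that data used in this way are interchangeable where appropriate, so that the embodiments described here can be implemented in orders other than those illustrated or described. In addition, the terms "include" and "have" and any variants of them are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device that includes a series of steps or units is not necessarily limited to the steps or units expressly listed, and may include other steps or units that are not expressly listed or that are inherent to the process, method, product, or device.
The technical terms used in this application are explained as follows:
ETL: short for Extract-Transform-Load, describing the process of extracting data from a source, transforming it, and loading it into a destination. The term is most common in data warehousing, but its scope is not limited to data warehouses. ETL is an important part of building a data warehouse: the user extracts the required data from the data sources, cleans it, and finally loads it into the data warehouse according to a predefined data warehouse model.
LR: short for logistic regression, a commonly used linear classifier.
SVM: a support vector machine (SVM) is a supervised learning model, commonly used for pattern recognition, classification, and regression analysis.
Lucene: a subproject of the Jakarta project of the Apache Software Foundation. It is an open-source full-text search engine toolkit. It is not a complete full-text search engine but a full-text search engine architecture, providing a complete query engine and indexing engine and a partial text analysis engine (for two Western languages, English and German).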
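Embodiment 1
According to an embodiment of the present invention, an embodiment of a method for processing user behavior data is provided. It should be noted that the steps shown in the flowcharts of the drawings may be executed in a computer system, for example as a set of computer-executable instructions, and that, although a logical order is shown in the flowcharts, in some cases the steps shown or described may be executed in a different order.
The method embodiment provided in Embodiment 1 of this application may be executed in a computer terminal or a similar computing device. Taking execution on a computer terminal as an example, FIG. 1 is a block diagram of the hardware structure of a computer terminal for a method for processing user behavior data according to an embodiment of the present invention. As shown in FIG. 1, computer terminal 10 may include one or more processors 102 (only one is shown; a processor 102 may include, but is not limited to, a processing device such as a microprocessor (MCU) or a programmable logic device (FPGA)), a memory 104 for storing data, and a transmission module 106 for communication. A person of ordinary skill in the art will understand that the structure shown in FIG. 1 is only illustrative and does not limit the structure of the electronic device. For example, computer terminal 10 may include more or fewer components than shown in FIG. 1, or have a configuration different from that shown in FIG. 1.
The memory 104 may be used to store the software programs and modules of application software, such as the program instructions/modules corresponding to the method for processing user behavior data in the embodiments of the present invention. By running the software programs and modules stored in the memory 104, the processor 102 executes various functional applications and data processing, that is, implements the method for processing user behavior data described above. The memory 104 may include high-speed random access memory and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory located remotely from the processor 102, which may be connected to computer terminal 10 over a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations of these.
The transmission module 106 is used to receive or send data over a network. A specific example of such a network is a wireless network provided by the communication provider of computer terminal 10. In one example, the transmission module 106 includes a network adapter (Network Interface Controller, NIC), which can connect to other network devices through a base station and thereby communicate with the Internet. In one example, the transmission module 106 may be a radio frequency (RF) module, which communicates with the Internet wirelessly.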
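In the operating environment above, this application provides the method for processing user behavior data shown in FIG. 2. FIG. 2 is a flowchart of the method for processing user behavior data according to Embodiment 1 of the present invention. The method may include:
Step S22: acquire user behavior data.
The user behavior data includes access data sets generated after multiple users access target objects, and the access data sets include at least data sets in the following three dimensions: a keyword set, an attribute information set, and a category information set.
In step S22, the user may be a visiting user USER of a portal website (for example, a shopping website), and the target object may be a product ITEM on the portal website; the product ITEM may be a commodity, a video, music, and so on. After USER clicks on, searches for, comments on, or bookmarks a product ITEM on the portal website, a large number of access data sets (for example, text data) are generated, and the website server can acquire the access data sets generated when users access target objects. It should be noted that every access data set acquired by the website server can be described in three dimensions: the category CATEGORY (the category information above), which describes the classification of the product ITEM; the attribute PROPERTY, which describes the product ITEM's own attributes; and the keyword KEYWORD, which describes the name of the product ITEM, where each keyword may carry a term-frequency or TF-IDF weight. It should also be noted that, among the three dimensions used to describe a product ITEM, each ITEM can have only one CATEGORY, while each ITEM can have multiple PROPERTY values.
It should be noted that the solution may use a supervised learning algorithm with a target (for example, LR or SVM) to aggregate the users' raw behavior data, and then decompose USER's behavior toward ITEM products into the three dimensions above. Optionally, the data specification for products ITEM may be as in Table 1 below, and the data specification for user USER behavior may be as in Table 2 below.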
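Table 1:
Column name    Field description
item_id        Item ID
category       Category
keywords       Keywords
description    Description
properties     Attributes
Table 2:
Column name    Field description
user_id        User ID
item_id        Item ID
bhv_type       Behavior type
count          Number of items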
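The two data specifications in Table 1 and Table 2 can be pictured as simple record types. The following is only an illustrative sketch: the class names `Item` and `Behavior`, the Python types, and the behavior labels are assumptions; the source gives only the column names.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Item:
    """One row of Table 1 (product ITEM). Field types are assumed."""
    item_id: str
    category: str                                         # exactly one category per item
    keywords: List[str] = field(default_factory=list)
    description: str = ""
    properties: List[str] = field(default_factory=list)   # an item may have several


@dataclass
class Behavior:
    """One row of Table 2 (user USER behavior). Field types are assumed."""
    user_id: str
    item_id: str
    bhv_type: str    # hypothetical labels such as "click", "collect", "comment"
    count: int = 1


# The running example from the text: the "Stephen Chow movies" product.
movie = Item("i1", "movie", ["Stephen Chow movies"], "", ["video"])
event = Behavior("u1", "i1", "click", 3)
```

Keeping `category` a single string and `properties` a list mirrors the statement that each ITEM has exactly one CATEGORY but may have several PROPERTY values.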
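Take a user USER visiting shopping website TB as an example. Website TB carries many products, which may be classified into categories such as beauty, mother and baby, food, video, and songs, and the user can operate on specific products under a category. For example, USER may click the "Stephen Chow movies" index button under the movie category on a TB page; the target object that USER operates on is then the product "Stephen Chow movies". This product can be described in three dimensions (category, attribute, keyword): its category is movie, its attribute is video, and its keyword is "Stephen Chow movies".
Step S24: determine the user's preference score for the search items contained in the data set in each dimension.
The data set in each dimension contains at least one search item.
In step S24, each of the three dimensions used to describe a product ITEM may include multiple search items, which may be multiple attributes of that dimension. The user can operate on specific search items under each dimension, and the solution can then determine the user's preference score for each search item from the user's specific operations on it.
Continuing the example of USER visiting shopping website TB: among the three dimensions of the target object "Stephen Chow movies" that USER selected on the TB page, the category CATEGORY of the product is "movie", which may include a first search item "domestic movies", a second search item "comedy movies", and so on; the attribute PROPERTY of the product is "video", which may include a third search item "high-definition video" and a fourth search item "standard-definition video". It should be noted that the search item of a product's keyword may be the keyword itself. USER may perform any operation on the first, second, third, and fourth search items, and the solution can determine USER's preference score for each of these search items from USER's specific operation behavior on them (for example, the number of operations).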
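Step S26: after acquiring a search word to be located, query by the search word to obtain multiple positioning search items that correspond to the search word, and acquire the weight value of each positioning search item with respect to the data set in each dimension.
In step S26, the operator of the website may wish to perform audience targeting by search word, that is, to select any one or more users who are interested in a search word A and group them into a user group, so that data can then be pushed to or analyzed for that group. For example, after a word has been used as a search word to locate the interests of different consumer groups, advertisements related to the search word can be pushed to the users placed in the same group. In an optional example, the operator of the website may enter the search word to be located directly into the server, or may provide the server with a piece of text, from which the server obtains the search word to be located through word segmentation and filtering.
It should be noted that the search word entered by the operator can also be described in three dimensions, and each dimension can also include multiple positioning search items. The attributes of each dimension describing the search word to be located are "positioning search items", whereas the attributes of each dimension of the products accessed by visiting users are "search items"; the two are different. After receiving the search word entered by the operator, the solution can expand it by query into multiple positioning search items TERM that correspond to the search word; these positioning search items TERM may be contained in the three dimensions used to describe the search word. The solution can use a preset algorithm to obtain the weight value of each positioning search item TERM in each dimension. It should be noted that the operator wishes to group the users who are interested in the search word.
Continuing the example of USER visiting shopping website TB: after the website server has collected a large amount of user behavior data, the operator of shopping website TB can enter a text TXT into the website server. The data processing terminal can perform word segmentation and filtering on the text TXT to generate the search word "Stephen Chow movies". The data processing terminal stores in advance the three dimensions used to describe "Stephen Chow movies", and each dimension stores multiple positioning search items TERM. After querying the positioning search items TERM that correspond to "Stephen Chow movies", the data processing terminal can use a preset algorithm to obtain the weight value of each positioning search item TERM with respect to each dimension. It should be noted that the text TXT entered by the website operator may be text describing products of the website, and the solution can obtain the search word by performing word segmentation and filtering on that text.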
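Step S28: calculate, from the preference scores of the search items contained in the data set in each dimension and the weight value of each positioning search item with respect to the data set in each dimension, the behavior weight value determined by the coupling relationship between each user and the search word.
In step S28, the solution can calculate the behavior weight value determined by the coupling relationship between each user and the search word from the preference scores obtained in step S24 and the weight values obtained in step S26. It should be noted that the behavior weight value can be used to represent how interested each user is in the search word to be located that the website operator entered.
It should be noted that, when a user visits the portal website, operations on a search word on the website (clicking, browsing, downloading, and so on) create a coupling relationship between the user and the search word. For example, when the user clicks on a search word, a first coupling relationship arises between the user's behavior and the search word. The first coupling relationship can represent the user's degree of interest in the search word: the more times the user clicks, the stronger the first coupling relationship, the larger the behavior weight value determined from it, and the greater the user's interest in the search word.
Continuing the example of USER visiting shopping website TB: the data processing terminal of the website server can query the positioning search items that correspond to the search word "Stephen Chow movies" entered by the website operator, then calculate a first weight value of each positioning search item with respect to the dimension to which it belongs, then obtain USER's preference score for each search item of the product "Stephen Chow movies" on website TB, and then calculate USER's behavior weight value for "Stephen Chow movies" from the first weight values and the preference scores. This behavior weight value can represent USER's degree of interest in "Stephen Chow movies".
Step S30: determine, from the behavior weight value determined by the coupling relationship between each user and the search word, the user group targeted by the search word to be located.
In step S30, the solution can select multiple users who meet a predetermined condition according to the magnitude of the behavior weight value determined by the coupling relationship between each user and the search word, and then determine those users as the user group related to the search word. Preferably, this embodiment may also determine the users whose weight value determined by the coupling relationship is greater than 0 as the user group. It should be noted that, after the user group of the search word has been determined, the operator can push related advertisements to each user in the group.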
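The selection in step S30 can be sketched in a few lines. This is a minimal illustration, not the patent's implementation: the function name `select_user_group` and the sorting of the result are assumptions; the source states only that users meeting a predetermined condition are chosen, preferably those whose behavior weight value is greater than 0.

```python
def select_user_group(scores, threshold=0.0):
    """Return user IDs whose behavior weight value exceeds the threshold.

    scores: dict mapping user_id -> behavior weight value for one search word.
    The result is ordered from most to least interested (an added convenience).
    """
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [user for user, value in ranked if value > threshold]


group = select_user_group({"u1": 0.8, "u2": 0.0, "u3": 0.3})
```

Raising `threshold` above 0 would give a smaller, more strongly interested group, which is one possible reading of the "predetermined condition".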
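In the solution disclosed in Embodiment 1 of this application, to target the audience interested in a product, the solution first acquires user behavior data, wherein the user behavior data includes access data sets generated after multiple users access target objects, and the access data sets include at least data sets in the following three dimensions: a keyword set, an attribute information set, and a category information set. It then determines the user's preference score for the search items contained in the data set in each dimension, wherein the data set in each dimension contains at least one search item. Next, after acquiring a search word to be located, it queries by the search word to obtain multiple positioning search items that correspond to the search word, and acquires the weight value of each positioning search item with respect to the data set in each dimension. Next, it calculates, from the preference scores and the weight values, the behavior weight value determined by the coupling relationship between each user and the search word. Finally, it determines, from that behavior weight value, the user group targeted by the search word to be located. It is easy to see that the solution obtains users' behavior data from the website server, generates from it each user's preference scores for the search items of products, then generates, from the search word entered by the operator, a first weight value of each positioning search item of the search word with respect to the dimension to which it belongs, and finally generates each user's behavior weight value from the preference scores and the first weight values. The behavior weight value shows directly how interested a user is in the search word, so users can be grouped accordingly. Compared with the prior art, the solution makes effective use of the text data generated by the website server and, compared with existing techniques that target audiences by analyzing structured data, produces more accurate audience targeting results. The solution of Embodiment 1 thus solves the technical problem that audience targeting based purely on structured data yields insufficiently accurate targeting results.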
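In an optional embodiment provided by this application, step S24, determining the user's preference score for the search items contained in the data set in each dimension, may include:
Step S241: acquire at least one first search item contained in the keyword set, at least one second search item contained in the attribute information set, and at least one third search item contained in the category information set.
Step S242: count, for the search items in the data set in each dimension, the per-capita number of accesses and the number of times the user has accessed them.
Step S243: calculate, from the per-capita number of accesses to the search items in the data set in each dimension and the number of times the user has accessed them, the user's preference score for the search items contained in the data set in each dimension.
In steps S241 to S243, the solution acquires every search item in each of the three dimensions of a product, calculates the user's preference score for every search item in every dimension from the user's number of accesses to the search item and the per-capita number of accesses to it, and then forms a document (Document). As in a search engine, each document may include three fields: CATEGORY, PROPERTY, and KEYWORD. Each field contains several search items (terms), and the document records the user's preference score for each search item. Because the results of audience targeting generally do not need to be real-time, and the data volume (millions to one billion) is far smaller than that of a text search system (hundreds of millions to hundreds of billions), the documents do not need an inverted index, and the implementation is simpler than that of a text search system.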
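The per-user document described above can be sketched as a nested mapping: one document per user, three fields per document, and a preference score per term. The function name `build_user_documents` and the input tuple layout are assumptions made for illustration only.

```python
from collections import defaultdict

FIELDS = ("CATEGORY", "PROPERTY", "KEYWORD")


def build_user_documents(preferences):
    """Group (user_id, field, term, score) tuples into one document per user.

    Each document has the three fields named in the text; each field maps a
    search item (term) to the user's preference score for it. No inverted
    index is built, in line with the remark that none is needed.
    """
    docs = defaultdict(lambda: {name: {} for name in FIELDS})
    for user_id, field_name, term, score in preferences:
        docs[user_id][field_name][term] = score
    return dict(docs)


docs = build_user_documents([
    ("u1", "CATEGORY", "movie", 0.9),
    ("u1", "KEYWORD", "Stephen Chow movies", 0.7),
    ("u2", "PROPERTY", "high-definition video", 0.4),
])
```

Scoring a query then amounts to a linear scan over these documents, which is workable at the data volumes the text mentions.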
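In an optional embodiment provided by this application, in step S243 the user's preference score tf(t,d) for a search item contained in the data set in any one dimension may be calculated, from the per-capita number of accesses to the search items in the data set in each dimension and the number of times the user has accessed them, by the following formula:
Preference score: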
Figure PCTCN2017070150-appb-000001
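where
wi is the weight value of an access behavior occurring in the data set in the i-th dimension; Ni is the counted number of accesses after the user performs the access behavior on search item t in the data set in the i-th dimension; ni is the per-capita number of accesses to search item t in the data set in the i-th dimension; and search item t is any search item in the data set. The access behavior includes any one of the following types: click, bookmark, and review.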
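The formula itself is contained in a figure that is not reproduced in this text, so the exact expression is unknown here. Purely as an illustration of how the three defined quantities could combine, the sketch below assumes the form tf(t,d) = Σ_i wi · Ni / ni, that is, the user's count relative to the per-capita count, weighted per access behavior. That form, the function name `preference_score`, and the example weights are all assumptions, not the patent's formula.

```python
def preference_score(user_counts, avg_counts, behavior_weights):
    """Assumed form: sum over access behaviors i of w_i * N_i / n_i.

    user_counts:      {behavior: N_i}, the user's access count for search item t
    avg_counts:       {behavior: n_i}, the per-capita access count for search item t
    behavior_weights: {behavior: w_i}, the weight of that access behavior
    """
    score = 0.0
    for behavior, n_user in user_counts.items():
        n_avg = avg_counts.get(behavior, 0.0)
        if n_avg <= 0:
            continue  # no population baseline for this behavior; skip it
        score += behavior_weights.get(behavior, 1.0) * n_user / n_avg
    return score


score = preference_score(
    {"click": 4, "bookmark": 1},
    {"click": 2.0, "bookmark": 0.5},
    {"click": 1.0, "bookmark": 2.0},
)
```

Dividing by the per-capita count means a user who clicks a popular search item as often as everyone else scores no higher than a user who clicks a niche item as often as its few visitors do.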
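In an optional embodiment provided by this application, step S26, after acquiring a search word to be located, querying by the search word to obtain multiple positioning search items that correspond to the search word, and acquiring the weight value of each positioning search item with respect to the data set in each dimension, may include:
Step S261: acquire the search word to be located, and query by the search word to obtain multiple positioning search items that correspond to the search word.
Step S262: determine, from the positioning search items obtained by the query, the dimension relationship of the search word with respect to the data set in each dimension.
Step S263: calculate, from the dimension relationship of the search word with respect to the data set in each dimension, the weight value of each positioning search item with respect to the data set in each dimension.
In steps S261 to S263, the solution can query by the search word to be located that the operator entered, to obtain the multiple positioning search items that correspond to it. It should be noted that these positioning search items exist in the three dimensions used to describe the search word to be located. The solution can first determine the dimension relationship of the search word with respect to the data set in each dimension, and then calculate from that relationship the weight value of each positioning search item with respect to the data set in each dimension.
In an optional embodiment provided by this application, in step S262 the dimension relationship of the search word with respect to the data set in each dimension may be determined by the following formula: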
Figure PCTCN2017070150-appb-000002
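where
A denotes the data sets, among the data sets in the three dimensions, that contain any one search word, and B denotes the data sets, among the data sets in the three dimensions, that contain any one positioning search item t.
With the formula above, the solution can generate the relationship between the search word and the three dimensions of ITEM. When the operator enters a search word for audience targeting, the solution generates, by query expansion, the relationships from the search word to the three dimensions of ITEM, namely WORD-CATEGORY, WORD-PROPERTY, and KEYWORD-KEYWORD. The solution may use the Jaccard distance algorithm to measure the co-occurrence, on ITEM, of the search word with the other dimensions.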
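The co-occurrence measure named above can be illustrated with the standard Jaccard similarity |A ∩ B| / |A ∪ B|, where A is taken to be the set of items containing the search word and B the set of items containing the positioning search item t. The patent's exact formula is in a figure not reproduced here; whether it uses this similarity or the Jaccard distance (1 minus the similarity) is not recoverable from the text, so the sketch below is an assumption.

```python
def jaccard_similarity(a, b):
    """Standard Jaccard similarity of two collections of item IDs."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # two empty sets: define the similarity as 0
    return len(a & b) / len(union)


# Items whose text contains the search word, and items tagged with term t.
items_with_word = {"i1", "i2", "i3"}
items_with_term = {"i2", "i3", "i4"}
r = jaccard_similarity(items_with_word, items_with_term)
```

A value near 1 means the search word and the term almost always appear on the same items, so the term is a strong expansion candidate for that word.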
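In an optional embodiment provided by this application, in step S263 the weight value of each positioning search item with respect to the data set in each dimension may be calculated by the following formula: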
Figure PCTCN2017070150-appb-000003
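where r(w,t) is the dimension relationship of the search word with respect to the data set in each dimension, that is, the relevance between search word w and search item t, and I(w) is the frequency of the search word in the text.
It should be noted that, in the formula above, the weight can be calculated with a simple weighted sum, which yields the tag definition after query expansion. In this solution, every field in the documents described above can be assigned a weight value.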
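The weighted sum mentioned above can be sketched as follows. The formula is in a figure not reproduced here, so the form t.boost = Σ_w r(w,t) · I(w), summing over the search words w obtained from the operator's input, is an assumption, as are the function name `term_boosts` and the example values.

```python
def term_boosts(relations, word_freq):
    """Assumed weighted sum: boost(t) = sum over words w of r(w, t) * I(w).

    relations: {(word, term): r(w, t)}, relevance of word w to term t
    word_freq: {word: I(w)}, frequency of word w in the operator's text
    """
    boosts = {}
    for (word, term), relevance in relations.items():
        boosts[term] = boosts.get(term, 0.0) + relevance * word_freq.get(word, 0)
    return boosts


boosts = term_boosts(
    {
        ("chow", "comedy movies"): 0.5,
        ("movie", "comedy movies"): 0.25,
        ("movie", "high-definition video"): 0.5,
    },
    {"chow": 2, "movie": 1},
)
```

The resulting mapping from positioning search items to weights plays the role of the expanded query, with each weight used as t.boost during scoring.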
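In an optional embodiment provided by this application, acquiring the search word to be located in step S261 includes:
Step S2611: after receiving a keyword entered by the querying user, determine the entered keyword as the search word to be located.
In step S2611, the querying user may be an operator who wishes to perform audience targeting. After the operator enters a keyword, the solution can directly determine the entered keyword as the search word to be located.
Step S2612: after receiving text entered by the querying user, perform word segmentation on the text, and take at least one keyword obtained by the word segmentation as the search word to be located.
In step S2612, if the operator enters a text TXT, the solution can perform word segmentation and filtering on the text TXT and then take at least one keyword obtained by the word segmentation as the search word to be located.
It should be noted that steps S2611 and S2612 are two parallel alternatives: in this solution, the operator may enter either a keyword or text.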
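The two input paths can be sketched as a single function. This is a simplified illustration: the regular-expression tokenizer and the small stop-word list stand in for the word segmentation and filtering step, which the source does not specify (Chinese text would need a real word segmenter); the function name `search_words` and the `is_text` flag are hypothetical.

```python
import re

# Hypothetical stop-word list used for the "filtering" part of the step.
STOPWORDS = {"the", "a", "an", "of", "and", "for"}


def search_words(query, is_text):
    """Return the search words to be located for a keyword or a text input."""
    if not is_text:
        # Step S2611: the entered keyword is used as the search word directly.
        return [query.strip()]
    # Step S2612: segment the text, then filter out uninformative tokens.
    tokens = re.findall(r"\w+", query.lower())
    return [token for token in tokens if token not in STOPWORDS]


from_keyword = search_words("Stephen Chow movies", is_text=False)
from_text = search_words("The movies of Stephen Chow", is_text=True)
```

Either path yields a list of search words, so the later query expansion does not need to know which kind of input the operator supplied.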
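In an optional embodiment provided by this application, step S28, calculating, from the preference scores of the search items contained in the data set in each dimension and the weight value of each positioning search item with respect to the data set in each dimension, the behavior weight value determined by the coupling relationship between each user and the search word, includes:
Step S281: obtain the IDF value idf(t) of the positioning search item in the user behavior data.
Step S282: obtain the highest weight value coord(q,d) of the positioning search items across multiple documents.
Step S283: normalize the search words queried in the same document, to obtain the normalized search word score queryNorm(q,d).
Step S284: normalize the weight values of the positioning search items across multiple documents, to obtain the normalized score norm(t.field) of the multiple documents.
Step S285: obtain the behavior weight value Score(q,d) determined by the coupling relationship between each user and the search word by the following formula:
$$\mathrm{Score}(q,d)=\mathrm{coord}(q,d)\cdot\mathrm{queryNorm}(q,d)\cdot\sum_{t\in q}\Big[\mathrm{tf}(t,d)\cdot\mathrm{idf}^{2}(t)\cdot t.\mathrm{boost}\cdot\mathrm{norm}(t.\mathrm{field})\Big]$$
where tf(t,d) is the user's preference score for the search items contained in the data set in each dimension, t.boost is the weight value of each positioning search item with respect to the data set in each dimension, and f.boost is the weight value of the data set in each dimension.
In an optional embodiment provided by this application, the IDF value idf(t) of the positioning search item in the user behavior data may be calculated by the following formula: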
Figure PCTCN2017070150-appb-000004
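In an optional embodiment provided by this application, the highest weight value coord(q,d) of the positioning search items across multiple documents may be calculated by the following formula: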
Figure PCTCN2017070150-appb-000005
In an optional embodiment provided by this application, the normalized search word score queryNorm(q,d) may be calculated by the following formula:
Figure PCTCN2017070150-appb-000006
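In an optional embodiment provided by this application, the normalized score norm(t.field) of the multiple documents may be calculated by the following formula: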
Figure PCTCN2017070150-appb-000007
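where a field is the data set in any one dimension of the access data sets.
It should be noted that, unlike the standard search scoring algorithm, the algorithm used in this solution ignores the document weight d.boost and the overall query weight q.boost; moreover, each TERM has only one f.boost, that is, each TERM corresponds to exactly one field.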
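The scoring step can be sketched as below. The patent's formulas for idf, coord, queryNorm, and norm are in figures not reproduced here, so this sketch substitutes the definitions of Lucene's classic TF-IDF practical scoring function (idf = 1 + ln(numDocs / (docFreq + 1)), coord = matched terms / query terms, queryNorm = 1 / sqrt(sum of squared weights)); that substitution is an assumption. In line with the text, d.boost and q.boost are omitted, tf(t,d) is the user's preference score, and norm(t.field) is passed in per term, defaulting to 1.0. All function and parameter names are hypothetical.

```python
import math


def idf(doc_freq, num_docs):
    """Classic Lucene-style inverse document frequency (assumed definition)."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def coord(overlap, max_overlap):
    """Fraction of the query's terms that are present in the user document."""
    return overlap / max_overlap if max_overlap else 0.0


def query_norm(query, idfs):
    """1 / sqrt(sum of (idf(t) * t.boost)^2); q.boost is ignored as in the text."""
    total = sum((idfs[t] * boost) ** 2 for t, boost in query.items())
    return 1.0 / math.sqrt(total) if total > 0 else 0.0


def score(query, doc, idfs, field_norm):
    """Behavior weight value Score(q, d) for one user document.

    query:      {term: t.boost}        expanded query (positioning search items)
    doc:        {term: tf(t, d)}       the user's preference scores
    idfs:       {term: idf(t)}
    field_norm: {term: norm(t.field)}  defaults to 1.0 when a term is absent
    """
    matched = [t for t in query if doc.get(t, 0.0) > 0]
    total = sum(
        doc[t] * idfs[t] ** 2 * query[t] * field_norm.get(t, 1.0) for t in matched
    )
    return coord(len(matched), len(query)) * query_norm(query, idfs) * total


query = {"comedy movies": 2.0, "high-definition video": 1.0}
idfs = {"comedy movies": 1.0, "high-definition video": 1.0}
interested = score(query, {"comedy movies": 0.5}, idfs, {})
uninterested = score(query, {"domestic movies": 0.9}, idfs, {})
```

A user document that shares no term with the expanded query scores exactly 0, which matches the preferred rule of keeping only users whose behavior weight value is greater than 0.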
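An optional embodiment of this application is described below with reference to FIG. 3 and FIG. 4. The embodiment may include the following steps:
Step A: the data extraction and abstraction module imports the user behavior data into a data warehouse, such as ODPS or Hadoop, runs the ETL process, and produces offline data that conforms to the data specification.
In step A, two subjects need to be abstracted. USER is the subject of audience selection: the audience finally produced is a subset of all USERs, and a USER may have a TAG attribute describing the user's demographic characteristics, such as gender and age. ITEM is the object on which users act, including but not limited to commodities, videos, and music. Each ITEM is described in three dimensions. CATEGORY is the classification of the ITEM; this is a many-to-one relationship, that is, each ITEM has one and only one CATEGORY. PROPERTY is the ITEM's own attributes; this is a many-to-many relationship, for example music as an ITEM can have a composer, a lyricist, a singer, a style, and other attributes. KEYWORD is the descriptive information of the ITEM, where each keyword may carry a term-frequency or TF-IDF weight. It should be noted that, of the three dimensions, only KEYWORD is mandatory; the others need not appear in the data (CATEGORY unique, PROPERTY empty).
Step B: the user document generation module decomposes USER's behavior toward ITEM into USER's preference scores in the three dimensions of ITEM, namely USER-CATEGORY, USER-PROPERTY, and USER-KEYWORD. The solution may use a supervised learning algorithm with a target (for example, LR or SVM) to aggregate the data and then normalize the result to the range 0 to 1. All the preferences together form each user's own preference document (Document). Referring to FIG. 4, as in a search engine, a document includes three fields: CATEGORY, PROPERTY, and KEYWORD. Each field contains several search items (terms), describing the user's preference score for a particular category or a particular word. Because the results of audience selection generally do not need to be real-time, and the data volume (millions to one billion) is far smaller than that of a text search system (hundreds of millions to hundreds of billions), the documents do not need an inverted index, and the implementation is simpler than that of a text search system.
Step C: the keyword relevance calculation module calculates the relationships from the search word to the three dimensions of ITEM, namely WORD-CATEGORY, WORD-PROPERTY, and KEYWORD-KEYWORD, and provides query expansion when a keyword is entered for audience selection.
Step D: the tag definition generation module takes the text or keyword provided by the user as input. If text is provided, the system first performs word segmentation and filtering to obtain keywords, and then expands them by query into the corresponding positioning search items (terms). Based on the relationships from the search word to the three dimensions of ITEM, the tag definition generation module finally produces the weight of each positioning search item in each dimension; the weight can be calculated with a simple weighted sum. The result is the tag definition after query expansion, which is equivalent to the query (Query) in a search system.
Step E: the scoring module uses Lucene's search scoring algorithm to generate the user behavior weight value from the weight of each positioning search item in each dimension and USER's preference scores in the three dimensions of ITEM. The user behavior weight value can represent how interested the user is in the ITEM. It should be noted that the scoring algorithm may be the BM25 algorithm.
In summary, the present invention provides a general solution: the operator only needs to provide keywords to select a specific audience, and an interpretable audience definition can be provided. This can improve product iteration efficiency and reduce development costs, so that more precise audience targeting can be achieved and the effect of the operator's advertising service improved.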
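It should be noted that, for simplicity of description, each of the foregoing method embodiments is expressed as a series of combined actions. Those skilled in the art should understand, however, that the present invention is not limited by the order of actions described, because according to the present invention some steps may be performed in another order or simultaneously. Those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and that the actions and modules involved are not necessarily required by the present invention.
From the description of the implementations above, those skilled in the art can clearly understand that the methods of the foregoing embodiments can be implemented by software plus a necessary general-purpose hardware platform, or by hardware, but in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disc) and includes several instructions that cause a terminal device (which may be a mobile phone, a computer, a server, a network device, or the like) to execute the methods of the embodiments of the present invention.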
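Embodiment 2
According to an embodiment of the present invention, an apparatus for processing user behavior data, for implementing the method for processing user behavior data described above, is also provided. As shown in FIG. 5, the apparatus may include:
A first acquiring unit 50, configured to acquire user behavior data, wherein the user behavior data includes access data sets generated after multiple users access target objects, and the access data sets include at least data sets in the following three dimensions: a keyword set, an attribute information set, and a category information set.
The user may be a visiting user USER of a portal website (for example, a shopping website), and the target object may be a product ITEM on the portal website; the product ITEM may be a commodity, a video, music, and so on. After USER clicks on, searches for, comments on, or bookmarks a product ITEM on the portal website, a large number of access data sets (for example, text data) are generated, and the website server can acquire the access data sets generated when users access target objects. It should be noted that every access data set acquired by the website server can be described in three dimensions: the category CATEGORY (the category information above), which describes the classification of the product ITEM; the attribute PROPERTY, which describes the product ITEM's own attributes; and the keyword KEYWORD, which describes the name of the product ITEM, where each keyword may carry a term-frequency or TF-IDF weight. It should also be noted that, among the three dimensions used to describe a product ITEM, each ITEM can have only one CATEGORY, while each ITEM can have multiple PROPERTY values.
A first determining unit 52, configured to determine the user's preference score for the search items contained in the data set in each dimension, wherein the data set in each dimension contains at least one search item.
Each of the three dimensions used to describe a product ITEM may include multiple search items, which may be multiple attributes of that dimension. The user can operate on specific search items under each dimension, and the solution can then determine the user's preference score for each search item from the user's specific operations on it.
A second acquiring unit 54, configured to, after a search word to be located is acquired, query by the search word to obtain multiple positioning search items that correspond to the search word, and acquire the weight value of each positioning search item with respect to the data set in each dimension.
The operator of the website may wish to perform audience targeting by search word, that is, to select any one or more users who are interested in a search word A and group them into a user group, so that data can then be pushed to or analyzed for that group. For example, after a word has been used as a search word to locate the interests of different consumer groups, advertisements related to the search word can be pushed to the users placed in the same group. In an optional example, the operator of the website may enter the search word to be located directly into the server, or may provide the server with a piece of text, from which the server obtains the search word to be located through word segmentation and filtering.
A third acquiring unit 56, configured to calculate, from the preference scores of the search items contained in the data set in each dimension and the weight value of each positioning search item with respect to the data set in each dimension, the behavior weight value determined by the coupling relationship between each user and the search word.
When a user visits the portal website, operations on a search word on the website (clicking, browsing, downloading, and so on) create a coupling relationship between the user and the search word. For example, when the user clicks on a search word, a first coupling relationship arises between the user's behavior and the search word. The first coupling relationship can represent the user's degree of interest in the search word: the more times the user clicks, the stronger the first coupling relationship, the larger the behavior weight value determined from it, and the greater the user's interest in the search word.
A second determining unit 58, configured to determine, from the behavior weight value determined by the coupling relationship between each user and the search word, the user group targeted by the search word to be located.
The solution can select multiple users who meet a predetermined condition according to the magnitude of the behavior weight value determined by the coupling relationship between each user and the search word, and then determine those users as the user group related to the search word. Preferably, this embodiment may also determine the users whose weight value determined by the coupling relationship is greater than 0 as the user group. It should be noted that, after the user group of the search word has been determined, the operator can push related advertisements to each user in the group.
In the solution disclosed in Embodiment 2 of this application, to target the audience interested in a product, the solution first acquires user behavior data, wherein the user behavior data includes access data sets generated after multiple users access target objects, and the access data sets include at least data sets in the following three dimensions: a keyword set, an attribute information set, and a category information set. It then determines the user's preference score for the search items contained in the data set in each dimension, wherein the data set in each dimension contains at least one search item. Next, after acquiring a search word to be located, it queries by the search word to obtain multiple positioning search items that correspond to the search word, and acquires the weight value of each positioning search item with respect to the data set in each dimension. Next, it calculates, from the preference scores and the weight values, the behavior weight value determined by the coupling relationship between each user and the search word. Finally, it determines, from that behavior weight value, the user group targeted by the search word to be located. It is easy to see that the solution obtains users' behavior data from the website server, generates from it each user's preference scores for the search items of products, then generates, from the search word entered by the operator, a first weight value of each positioning search item of the search word with respect to the dimension to which it belongs, and finally generates each user's behavior weight value from the preference scores and the first weight values. The behavior weight value shows directly how interested a user is in the search word, so users can be grouped accordingly. Compared with the prior art, the solution makes effective use of the text data generated by the website server and, compared with existing techniques that target audiences by analyzing structured data, produces more accurate audience targeting results. The solution of Embodiment 2 thus solves the technical problem that audience targeting based purely on structured data yields insufficiently accurate targeting results.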
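In an optional embodiment provided by this application, as shown in FIG. 6, the first determining unit 52 includes: a first acquiring module 521, configured to acquire at least one first search item contained in the keyword set, at least one second search item contained in the attribute information set, and at least one third search item contained in the category information set; a statistics module 523, configured to count, for the search items in the data set in each dimension, the per-capita number of accesses and the number of times the user has accessed them; and a first calculation module 524, configured to calculate, from the per-capita number of accesses to the search items in the data set in each dimension and the number of times the user has accessed them, the user's preference score for the search items contained in the data set in each dimension.
In an optional embodiment provided by this application, the first calculation module 524 includes: a sub-calculation module, configured to calculate the user's preference score tf(t,d) for a search item contained in the data set in any one dimension by the following formula. Preference score: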
Figure PCTCN2017070150-appb-000008
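where wi is the weight value of an access behavior occurring in the data set in the i-th dimension; Ni is the counted number of accesses after the user performs the access behavior on search item t in the data set in the i-th dimension; ni is the per-capita number of accesses to search item t in the data set in the i-th dimension; and search item t is any search item in the data set. The access behavior includes any one of the following types: click, bookmark, and review.
In an optional embodiment provided by this application, as shown in FIG. 7, the second acquiring unit 54 includes: a second acquiring module 541, configured to acquire the search word to be located and query by the search word to obtain multiple positioning search items that correspond to the search word; a first determining module 542, configured to determine, from the positioning search items obtained by the query, the dimension relationship of the search word with respect to the data set in each dimension; and a second calculation module 543, configured to calculate, from the dimension relationship of the search word with respect to the data set in each dimension, the weight value of each positioning search item with respect to the data set in each dimension.
In an optional embodiment provided by this application, the apparatus further includes: a first calculation unit, configured to determine the dimension relationship of the search word with respect to the data set in each dimension by the following formula: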
Figure PCTCN2017070150-appb-000009
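where A denotes the data sets, among the data sets in the three dimensions, that contain any one search word, and B denotes the data sets, among the data sets in the three dimensions, that contain any one positioning search item t.
In an optional embodiment provided by this application, the apparatus further includes: a second calculation unit, configured to calculate the weight value of each positioning search item with respect to the data set in each dimension by the following formula: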
Figure PCTCN2017070150-appb-000010
Figure PCTCN2017070150-appb-000011
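where r(w,t) is the dimension relationship of the search word with respect to the data set in each dimension, that is, the relevance between search word w and search item t, and I(w) is the frequency of the search word in the text.
In an optional embodiment provided by this application, the second acquiring module 541 includes: a second determining module, configured to, after a keyword entered by the querying user is received, determine the entered keyword as the search word to be located; or a first processing module, configured to, after text entered by the querying user is received, perform word segmentation on the text and take at least one keyword obtained by the word segmentation as the search word to be located.
In an optional embodiment provided by this application, as shown in FIG. 8, the second determining unit 58 includes: a third acquiring module 581, configured to obtain the IDF value idf(t) of the positioning search item in the user behavior data; a fourth acquiring module 582, configured to obtain the highest weight value coord(q,d) of the positioning search items across multiple documents; a second processing module 583, configured to normalize the search words queried in the same document, to obtain the normalized search word score queryNorm(q,d); a third processing module 584, configured to normalize the weight values of the positioning search items across multiple documents, to obtain the normalized score norm(t.field) of the multiple documents; and a third calculation module 585, configured to obtain the behavior weight value Score(q,d) determined by the coupling relationship between each user and the search word by the following formula:
$$\mathrm{Score}(q,d)=\mathrm{coord}(q,d)\cdot\mathrm{queryNorm}(q,d)\cdot\sum_{t\in q}\Big[\mathrm{tf}(t,d)\cdot\mathrm{idf}^{2}(t)\cdot t.\mathrm{boost}\cdot\mathrm{norm}(t.\mathrm{field})\Big]$$
where tf(t,d) is the user's preference score for the search items contained in the data set in each dimension, t.boost is the weight value of each positioning search item with respect to the data set in each dimension, and f.boost is the weight value of the data set in each dimension.
In an optional embodiment provided by this application, the apparatus further includes: a third calculation unit, configured to calculate the IDF value idf(t) of the positioning search item in the user behavior data by the following formula: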
Figure PCTCN2017070150-appb-000012
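In an optional embodiment provided by this application, the apparatus further includes: a fourth calculation unit, configured to calculate the highest weight value coord(q,d) of the positioning search items across multiple documents by the following formula: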
Figure PCTCN2017070150-appb-000013
Figure PCTCN2017070150-appb-000014
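In an optional embodiment provided by this application, the apparatus further includes: a fifth calculation unit, configured to calculate the normalized search word score queryNorm(q,d) by the following formula: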
Figure PCTCN2017070150-appb-000015
Figure PCTCN2017070150-appb-000016
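In an optional embodiment provided by this application, the apparatus further includes: a sixth calculation unit, configured to calculate the normalized score norm(t.field) of the multiple documents by the following formula: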
Figure PCTCN2017070150-appb-000017
Figure PCTCN2017070150-appb-000018
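where a field is the data set in any one dimension of the access data sets.
Embodiment 3
An embodiment of the present invention may provide a computer terminal, which may be any computer terminal device in a group of computer terminals.
Optionally, in this embodiment, the computer terminal may be located in at least one of multiple network devices of a computer network.
In this embodiment, the computer terminal may execute program code for the following steps of the method for processing user behavior data: acquiring user behavior data, wherein the user behavior data includes access data sets generated after multiple users access target objects, and the access data sets include at least data sets in the following three dimensions: a keyword set, an attribute information set, and a category information set; determining the user's preference score for the search items contained in the data set in each dimension, wherein the data set in each dimension contains at least one search item; after acquiring a search word to be located, querying by the search word to obtain multiple positioning search items that correspond to the search word, and acquiring the weight value of each positioning search item with respect to the data set in each dimension; calculating, from the preference scores of the search items contained in the data set in each dimension and the weight value of each positioning search item with respect to the data set in each dimension, the behavior weight value determined by the coupling relationship between each user and the search word; and determining, from the behavior weight value determined by the coupling relationship between each user and the search word, the user group targeted by the search word to be located.
Optionally, FIG. 9 is a structural block diagram of a computer terminal according to an embodiment of the present invention. As shown in FIG. 9, computer terminal A may include one or more processors (only one is shown) and a memory.
The memory may be used to store software programs and modules, such as the program instructions/modules corresponding to the method and apparatus for processing user behavior data in the embodiments of the present invention. By running the software programs and modules stored in the memory, the processor executes various functional applications and data processing, that is, implements the method for processing user behavior data described above. The memory may include high-speed random access memory and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory may further include memory located remotely from the processor, which may be connected to terminal A over a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations of these.
The processor may call, through the transmission module, the information and application programs stored in the memory to execute the following steps: acquiring user behavior data, wherein the user behavior data includes access data sets generated after multiple users access target objects, and the access data sets include at least data sets in the following three dimensions: a keyword set, an attribute information set, and a category information set; determining the user's preference score for the search items contained in the data set in each dimension, wherein the data set in each dimension contains at least one search item; after acquiring a search word to be located, querying by the search word to obtain multiple positioning search items that correspond to the search word, and acquiring the weight value of each positioning search item with respect to the data set in each dimension; calculating, from the preference scores of the search items contained in the data set in each dimension and the weight value of each positioning search item with respect to the data set in each dimension, the behavior weight value determined by the coupling relationship between each user and the search word; and determining, from the behavior weight value determined by the coupling relationship between each user and the search word, the user group targeted by the search word to be located.
Optionally, the processor may also execute program code for the following steps: acquiring at least one first search item contained in the keyword set, at least one second search item contained in the attribute information set, and at least one third search item contained in the category information set; counting, for the search items in the data set in each dimension, the per-capita number of accesses and the number of times the user has accessed them; and calculating, from the per-capita number of accesses to the search items in the data set in each dimension and the number of times the user has accessed them, the user's preference score for the search items contained in the data set in each dimension.
Optionally, the processor may also execute program code for the following step: calculating the user's preference score tf(t,d) for a search item contained in the data set in any one dimension by the following formula. Preference score: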
Figure PCTCN2017070150-appb-000019
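where wi is the weight value of an access behavior occurring in the data set in the i-th dimension; Ni is the counted number of accesses after the user performs the access behavior on search item t in the data set in the i-th dimension; ni is the per-capita number of accesses to search item t in the data set in the i-th dimension; and search item t is any search item in the data set. The access behavior includes any one of the following types: click, bookmark, and review.
Optionally, the processor may also execute program code for the following steps: acquiring the search word to be located, and querying by the search word to obtain multiple positioning search items that correspond to the search word; determining, from the positioning search items obtained by the query, the dimension relationship of the search word with respect to the data set in each dimension; and calculating, from the dimension relationship of the search word with respect to the data set in each dimension, the weight value of each positioning search item with respect to the data set in each dimension.
Optionally, the processor may also execute program code for the following step: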
Figure PCTCN2017070150-appb-000020
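where A denotes the data sets, among the data sets in the three dimensions, that contain any one search word, and B denotes the data sets, among the data sets in the three dimensions, that contain any one positioning search item t.
Optionally, the processor may also execute program code for the following step: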
Figure PCTCN2017070150-appb-000021
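where r(w,t) is the dimension relationship of the search word with respect to the data set in each dimension, that is, the relevance between search word w and search item t, and I(w) is the frequency of the search word in the text.
Optionally, the processor may also execute program code for the following steps: after receiving a keyword entered by the querying user, determining the entered keyword as the search word to be located; or, after receiving text entered by the querying user, performing word segmentation on the text and taking at least one keyword obtained by the word segmentation as the search word to be located.
Optionally, the processor may also execute program code for the following steps: obtaining the IDF value idf(t) of the positioning search item in the user behavior data; obtaining the highest weight value coord(q,d) of the positioning search items across multiple documents; normalizing the search words queried in the same document, to obtain the normalized search word score queryNorm(q,d); normalizing the weight values of the positioning search items across multiple documents, to obtain the normalized score norm(t.field) of the multiple documents; and obtaining the behavior weight value Score(q,d) determined by the coupling relationship between each user and the search word by the following formula:
$$\mathrm{Score}(q,d)=\mathrm{coord}(q,d)\cdot\mathrm{queryNorm}(q,d)\cdot\sum_{t\in q}\Big[\mathrm{tf}(t,d)\cdot\mathrm{idf}^{2}(t)\cdot t.\mathrm{boost}\cdot\mathrm{norm}(t.\mathrm{field})\Big]$$
where tf(t,d) is the user's preference score for the search items contained in the data set in each dimension, t.boost is the weight value of each positioning search item with respect to the data set in each dimension, and f.boost is the weight value of the data set in each dimension.
Optionally, the processor may also execute program code for the following step: calculating the IDF value idf(t) of the positioning search item in the user behavior data by the following formula: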
Figure PCTCN2017070150-appb-000022
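Optionally, the processor may also execute program code for the following step: calculating the highest weight value coord(q,d) of the positioning search items across multiple documents by the following formula: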
Figure PCTCN2017070150-appb-000023
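Optionally, the processor may also execute program code for the following step: calculating the normalized search word score queryNorm(q,d) by the following formula: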
Figure PCTCN2017070150-appb-000024
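Optionally, the processor may also execute program code for the following step: calculating the normalized score norm(t.field) of the multiple documents by the following formula: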
Figure PCTCN2017070150-appb-000025
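where a field is the data set in any one dimension of the access data sets.
An embodiment of the present invention provides a method for processing user behavior data, by: acquiring user behavior data, wherein the user behavior data includes access data sets generated after multiple users access target objects, and the access data sets include at least data sets in the following three dimensions: a keyword set, an attribute information set, and a category information set; determining the user's preference score for the search items contained in the data set in each dimension, wherein the data set in each dimension contains at least one search item; after acquiring a search word to be located, querying by the search word to obtain multiple positioning search items that correspond to the search word, and acquiring the weight value of each positioning search item with respect to the data set in each dimension; calculating, from the preference scores of the search items contained in the data set in each dimension and the weight value of each positioning search item with respect to the data set in each dimension, the behavior weight value determined by the coupling relationship between each user and the search word; and determining, from the behavior weight value determined by the coupling relationship between each user and the search word, the user group targeted by the search word to be located. This solves the technical problem that audience targeting based purely on structured data yields insufficiently accurate targeting results.
A person of ordinary skill in the art will understand that the structures shown in the drawings of this application are only illustrative. The computer terminal may also be a terminal device such as a smartphone (for example, an Android phone or an iOS phone), a tablet computer, a palmtop computer, a mobile Internet device (MID), or a PAD. FIG. 1 does not limit the structure of the electronic device. For example, computer terminal 10 may include more or fewer components than shown in FIG. 1 (such as a network interface or a display device), or have a configuration different from that shown in FIG. 1.
A person of ordinary skill in the art will understand that all or some of the steps of the methods in the foregoing embodiments can be completed by a program instructing the hardware of a terminal device. The program may be stored in a computer-readable storage medium, which may include a flash drive, read-only memory (ROM), random access memory (RAM), a magnetic disk, an optical disc, and so on.
Embodiment 4
An embodiment of the present invention also provides a storage medium. Optionally, in this embodiment, the storage medium may be used to store the program code executed by the method for processing user behavior data provided in Embodiment 1.
Optionally, in this embodiment, the storage medium may be located in any computer terminal in a group of computer terminals in a computer network, or in any mobile terminal in a group of mobile terminals.
Optionally, in this embodiment, the storage medium is configured to store program code for executing the following steps: acquiring user behavior data, wherein the user behavior data includes access data sets generated after multiple users access target objects, and the access data sets include at least data sets in the following three dimensions: a keyword set, an attribute information set, and a category information set; determining the user's preference score for the search items contained in the data set in each dimension, wherein the data set in each dimension contains at least one search item; after acquiring a search word to be located, querying by the search word to obtain multiple positioning search items that correspond to the search word, and acquiring the weight value of each positioning search item with respect to the data set in each dimension; calculating, from the preference scores of the search items contained in the data set in each dimension and the weight value of each positioning search item with respect to the data set in each dimension, the behavior weight value determined by the coupling relationship between each user and the search word; and determining, from the behavior weight value determined by the coupling relationship between each user and the search word, the user group targeted by the search word to be located.
The serial numbers of the embodiments of the present invention above are for description only and do not indicate the relative merits of the embodiments.
In the foregoing embodiments of the present invention, the description of each embodiment has its own emphasis. For parts not described in detail in one embodiment, refer to the related descriptions of the other embodiments.
In the several embodiments provided in this application, it should be understood that the disclosed technical content may be implemented in other ways. The apparatus embodiments described above are only illustrative. For example, the division into units is only a division by logical function; in actual implementation there may be other ways of dividing, for example multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, units, or modules, and may be electrical or in another form.
Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or some of the steps of the methods of the embodiments of the present invention. The storage medium includes various media that can store program code, such as a USB flash drive, read-only memory (ROM), random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.
The above are only preferred implementations of the present invention. It should be pointed out that a person of ordinary skill in the art can make several improvements and refinements without departing from the principles of the present invention, and these improvements and refinements should also be regarded as falling within the scope of protection of the present invention.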

Claims (24)

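  1. A method for processing user behavior data, characterized by comprising:
    acquiring user behavior data, wherein the user behavior data includes access data sets generated after multiple users access target objects, and the access data sets include at least data sets in the following three dimensions: a keyword set, an attribute information set, and a category information set;
    determining the user's preference score for the search items contained in the data set in each dimension, wherein the data set in each dimension contains at least one search item;
    after acquiring a search word to be located, querying by the search word to obtain multiple positioning search items that correspond to the search word, and acquiring the weight value of each positioning search item with respect to the data set in each dimension;
    calculating, from the preference scores of the search items contained in the data set in each dimension and the weight value of each positioning search item with respect to the data set in each dimension, the behavior weight value determined by the coupling relationship between each user and the search word;
    determining, from the behavior weight value determined by the coupling relationship between each user and the search word, the user group targeted by the search word to be located.
  2. The method according to claim 1, characterized in that determining the user's preference score for the search items contained in the data set in each dimension comprises:
    acquiring at least one first search item contained in the keyword set, at least one second search item contained in the attribute information set, and at least one third search item contained in the category information set;
    counting, for the search items in the data set in each dimension, the per-capita number of accesses and the number of times the user has accessed them;
    calculating, from the per-capita number of accesses to the search items in the data set in each dimension and the number of times the user has accessed them, the user's preference score for the search items contained in the data set in each dimension.
  3. The method according to claim 2, characterized in that calculating, from the per-capita number of accesses to the search items in the data set in each dimension and the number of times the user has accessed them, the user's preference score for the search items contained in the data set in each dimension comprises:
    calculating the user's preference score tf(t,d) for a search item contained in the data set in any one dimension by the following formula:
    Preference score: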
    Figure PCTCN2017070150-appb-100001
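    where
    wi is the weight value of an access behavior occurring in the data set in the i-th dimension; Ni is the counted number of accesses after the user performs the access behavior on search item t in the data set in the i-th dimension; ni is the per-capita number of accesses to search item t in the data set in the i-th dimension; and search item t is any search item in the data set, wherein the access behavior includes any one of the following types: click, bookmark, and review.
  4. The method according to claim 3, characterized in that, after acquiring a search word to be located, querying by the search word to obtain multiple positioning search items that correspond to the search word, and acquiring the weight value of each positioning search item with respect to the data set in each dimension, comprises:
    acquiring the search word to be located, and querying by the search word to obtain multiple positioning search items that correspond to the search word;
    determining, from the positioning search items obtained by the query, the dimension relationship of the search word with respect to the data set in each dimension;
    calculating, from the dimension relationship of the search word with respect to the data set in each dimension, the weight value of each positioning search item with respect to the data set in each dimension.
  5. The method according to claim 4, characterized in that the dimension relationship of the search word with respect to the data set in each dimension is determined by the following formula: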
    Figure PCTCN2017070150-appb-100002
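    where
    A denotes the data sets, among the data sets in the three dimensions, that contain any one of the search words; B denotes the data sets, among the data sets in the three dimensions, that contain any one positioning search item t; and w is the relevance between the search word w and search item t.
  6. The method according to claim 5, characterized in that the weight value of each positioning search item with respect to the data set in each dimension is calculated by the following formula: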
    Figure PCTCN2017070150-appb-100003
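    where
    r(w,t) is the dimension relationship of the search word with respect to the data set in each dimension, w is the relevance between the search word w and search item t, and I(w) is the frequency of the search word in the text.
  7. The method according to claim 6, characterized in that acquiring the search word to be located comprises:
    after receiving a keyword entered by the querying user, determining the entered keyword as the search word to be located; or
    after receiving text entered by the querying user, performing word segmentation on the text, wherein at least one keyword obtained by the word segmentation is the search word to be located.
  8. The method according to claim 7, characterized in that, where the positioning search items are word segments in multiple documents, calculating, from the preference scores of the search items contained in the data set in each dimension and the weight value of each positioning search item with respect to the data set in each dimension, the behavior weight value determined by the coupling relationship between each user and the search word comprises:
    obtaining the IDF value idf(t) of the positioning search item in the user behavior data;
    obtaining the highest weight value coord(q,d) of the positioning search items across multiple documents;
    normalizing the search words queried in the same document, to obtain the normalized search word score queryNorm(q,d);
    normalizing the weight values of the positioning search items across the multiple documents, to obtain the normalized score norm(t.field) of the multiple documents;
    obtaining the behavior weight value Score(q,d) determined by the coupling relationship between each user and the search word by the following formula:
    $$\mathrm{Score}(q,d)=\mathrm{coord}(q,d)\cdot\mathrm{queryNorm}(q,d)\cdot\sum_{t\in q}\Big[\mathrm{tf}(t,d)\cdot\mathrm{idf}^{2}(t)\cdot t.\mathrm{boost}\cdot\mathrm{norm}(t.\mathrm{field})\Big]$$
    where tf(t,d) is the user's preference score for the search items contained in the data set in each dimension, and t.boost is the weight value of each positioning search item with respect to the data set in each dimension.
  9. The method according to claim 8, characterized in that the IDF value idf(t) of the positioning search item in the user behavior data is calculated by the following formula: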
    Figure PCTCN2017070150-appb-100004
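  10. The method according to claim 8, characterized in that the highest weight value coord(q,d) of the positioning search items across multiple documents is calculated by the following formula: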
    Figure PCTCN2017070150-appb-100005
  11. The method according to claim 8, characterized in that the normalized search word score queryNorm(q,d) is calculated by the following formula:
    Figure PCTCN2017070150-appb-100006
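  12. The method according to claim 8, characterized in that the normalized score norm(t.field) of the multiple documents is calculated by the following formula: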
    Figure PCTCN2017070150-appb-100007
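    where the field is the data set in any one dimension of the access data sets, and f.boost is the weight value of the data set in each dimension.
  13. An apparatus for processing user behavior data, characterized by comprising:
    a first acquiring unit, configured to acquire user behavior data, wherein the user behavior data includes access data sets generated after multiple users access target objects, and the access data sets include at least data sets in the following three dimensions: a keyword set, an attribute information set, and a category information set;
    a first determining unit, configured to determine the user's preference score for the search items contained in the data set in each dimension, wherein the data set in each dimension contains at least one search item;
    a second acquiring unit, configured to, after a search word to be located is acquired, query by the search word to obtain multiple positioning search items that correspond to the search word, and acquire the weight value of each positioning search item with respect to the data set in each dimension;
    a third acquiring unit, configured to calculate, from the preference scores of the search items contained in the data set in each dimension and the weight value of each positioning search item with respect to the data set in each dimension, the behavior weight value determined by the coupling relationship between each user and the search word;
    a second determining unit, configured to determine, from the behavior weight value determined by the coupling relationship between each user and the search word, the user group targeted by the search word to be located.
  14. The apparatus according to claim 13, characterized in that the first determining unit comprises:
    a first acquiring module, configured to acquire at least one first search item contained in the keyword set, at least one second search item contained in the attribute information set, and at least one third search item contained in the category information set;
    a statistics module, configured to count, for the search items in the data set in each dimension, the per-capita number of accesses and the number of times the user has accessed them;
    a first calculation module, configured to calculate, from the per-capita number of accesses to the search items in the data set in each dimension and the number of times the user has accessed them, the user's preference score for the search items contained in the data set in each dimension.
  15. The apparatus according to claim 14, characterized in that the first calculation module comprises:
    a sub-calculation module, configured to calculate the user's preference score tf(t,d) for a search item contained in the data set in any one dimension by the following formula:
    Preference score: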
    Figure PCTCN2017070150-appb-100008
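    where
    wi is the weight value of an access behavior occurring in the data set in the i-th dimension; Ni is the counted number of accesses after the user performs the access behavior on search item t in the data set in the i-th dimension; ni is the per-capita number of accesses to search item t in the data set in the i-th dimension; and search item t is any search item in the data set, wherein the access behavior includes any one of the following types: click, bookmark, and review.
  16. The apparatus according to claim 15, characterized in that the second acquiring unit comprises:
    a second acquiring module, configured to acquire the search word to be located and query by the search word to obtain multiple positioning search items that correspond to the search word;
    a first determining module, configured to determine, from the positioning search items obtained by the query, the dimension relationship of the search word with respect to the data set in each dimension;
    a second calculation module, configured to calculate, from the dimension relationship of the search word with respect to the data set in each dimension, the weight value of each positioning search item with respect to the data set in each dimension.
  17. The apparatus according to claim 16, characterized in that the apparatus further comprises:
    a first calculation unit, configured to determine the dimension relationship of the search word with respect to the data set in each dimension by the following formula: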
    Figure PCTCN2017070150-appb-100009
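    where
    A denotes the data sets, among the data sets in the three dimensions, that contain any one of the search words; B denotes the data sets, among the data sets in the three dimensions, that contain any one positioning search item t; and w is the relevance between the search word w and search item t.
  18. The apparatus according to claim 17, characterized in that the apparatus further comprises:
    a second calculation unit, configured to calculate the weight value of each positioning search item with respect to the data set in each dimension by the following formula: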
    Figure PCTCN2017070150-appb-100010
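    where
    r(w,t) is the dimension relationship of the search word with respect to the data set in each dimension, w is the relevance between the search word w and search item t, and I(w) is the frequency of the search word in the text.
  19. The apparatus according to claim 18, characterized in that the second acquiring module comprises:
    a second determining module, configured to, after a keyword entered by the querying user is received, determine the entered keyword as the search word to be located; or
    a first processing module, configured to, after text entered by the querying user is received, perform word segmentation on the text, wherein at least one keyword obtained by the word segmentation is the search word to be located.
  20. The apparatus according to claim 19, characterized in that the second determining unit comprises:
    a third acquiring module, configured to obtain the IDF value idf(t) of the positioning search item in the user behavior data;
    a fourth acquiring module, configured to obtain the highest weight value coord(q,d) of the positioning search items across multiple documents;
    a second processing module, configured to normalize the search words queried in the same document, to obtain the normalized search word score queryNorm(q,d);
    a third processing module, configured to normalize the weight values of the positioning search items across the multiple documents, to obtain the normalized score norm(t.field) of the multiple documents;
    a third calculation module, configured to obtain the behavior weight value Score(q,d) determined by the coupling relationship between each user and the search word by the following formula:
    $$\mathrm{Score}(q,d)=\mathrm{coord}(q,d)\cdot\mathrm{queryNorm}(q,d)\cdot\sum_{t\in q}\Big[\mathrm{tf}(t,d)\cdot\mathrm{idf}^{2}(t)\cdot t.\mathrm{boost}\cdot\mathrm{norm}(t.\mathrm{field})\Big]$$
    where tf(t,d) is the user's preference score for the search items contained in the data set in each dimension, and t.boost is the weight value of each positioning search item with respect to the data set in each dimension.
  21. The apparatus according to claim 20, characterized in that the apparatus further comprises:
    a third calculation unit, configured to calculate the IDF value idf(t) of the positioning search item in the user behavior data by the following formula: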
    Figure PCTCN2017070150-appb-100011
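  22. The apparatus according to claim 20, characterized in that the apparatus further comprises:
    a fourth calculation unit, configured to calculate the highest weight value coord(q,d) of the positioning search items across multiple documents by the following formula: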
    Figure PCTCN2017070150-appb-100012
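  23. The apparatus according to claim 20, characterized in that the apparatus further comprises:
    a fifth calculation unit, configured to calculate the normalized search word score queryNorm(q,d) by the following formula: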
    Figure PCTCN2017070150-appb-100013
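  24. The apparatus according to claim 20, characterized in that the apparatus further comprises:
    a sixth calculation unit, configured to calculate the normalized score norm(t.field) of the multiple documents by the following formula: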
    Figure PCTCN2017070150-appb-100014
    where the field is the data set in any one dimension of the access data sets, and f.boost is the weight value of the data set in each dimension.
PCT/CN2017/070150 2016-01-12 2017-01-04 用户行为数据的处理方法及装置 WO2017121272A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201610018733.7 2016-01-12
CN201610018733.7A CN106959971B (zh) 2016-01-12 2016-01-12 用户行为数据的处理方法及装置

Publications (1)

Publication Number Publication Date
WO2017121272A1 true WO2017121272A1 (zh) 2017-07-20

Family

ID=59310849

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/070150 WO2017121272A1 (zh) 2016-01-12 2017-01-04 用户行为数据的处理方法及装置

Country Status (2)

Country Link
CN (1) CN106959971B (zh)
WO (1) WO2017121272A1 (zh)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111127059A (zh) * 2018-10-31 2020-05-08 北京国双科技有限公司 用户质量的分析方法及装置
CN111143516A (zh) * 2019-12-30 2020-05-12 广州探途网络技术有限公司 一种文章搜索结果展示方法及相关装置
CN111563769A (zh) * 2020-04-26 2020-08-21 北京深演智能科技股份有限公司 数据处理方法、装置、非易失性存储介质和处理器
CN113052646A (zh) * 2019-12-27 2021-06-29 阿里巴巴集团控股有限公司 数据处理系统、方法、装置及电子设备
CN113064927A (zh) * 2021-03-24 2021-07-02 深圳市道通科技股份有限公司 客户筛选方法、装置、电子设备及计算机可读存储介质
CN116628317A (zh) * 2023-04-19 2023-08-22 上海顺多网络科技有限公司 一种使用少量信息定向用户群体偏好分析的方法

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108415903B (zh) * 2018-03-12 2021-09-07 武汉斗鱼网络科技有限公司 判断搜索意图识别有效性的评价方法、存储介质和设备
CN110827080A (zh) * 2019-11-04 2020-02-21 恩亿科(北京)数据科技有限公司 一种定向推送方法及装置
CN111368552B (zh) * 2020-02-26 2023-09-26 北京市公安局 一种面向特定领域的网络用户群组划分方法及装置
CN111966948B (zh) * 2020-09-25 2023-08-01 北京百度网讯科技有限公司 信息投放方法、装置、设备及存储介质

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102760138A (zh) * 2011-04-26 2012-10-31 北京百度网讯科技有限公司 用户网络行为的分类方法和装置及对应的搜索方法和装置
US20130073546A1 (en) * 2011-09-16 2013-03-21 Microsoft Corporation Indexing Semantic User Profiles for Targeted Advertising
CN103838756A (zh) * 2012-11-23 2014-06-04 阿里巴巴集团控股有限公司 一种确定推送信息的方法及装置
CN104090888A (zh) * 2013-12-10 2014-10-08 深圳市腾讯计算机系统有限公司 一种用户行为数据的分析方法和装置

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8843393B2 (en) * 2008-11-18 2014-09-23 Doapp, Inc. Method and system for improved mobile device advertisement
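CN103632294A (zh) * 2013-12-20 2014-03-12 互动通天图信息技术有限公司 User data integration method based on media and third-party data platforms
CN104021209A (zh) * 2014-06-19 2014-09-03 北京博雅立方科技有限公司 Method for compiling statistics on keyword placement effectiveness, and browsing client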

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
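CN111127059A (zh) * 2018-10-31 2020-05-08 北京国双科技有限公司 Method and apparatus for analyzing user quality
CN111127059B (zh) * 2018-10-31 2023-04-18 北京国双科技有限公司 Method and apparatus for analyzing user quality
CN113052646A (zh) * 2019-12-27 2021-06-29 阿里巴巴集团控股有限公司 Data processing system, method, and apparatus, and electronic device
CN111143516A (zh) * 2019-12-30 2020-05-12 广州探途网络技术有限公司 Article search result display method and related apparatus
CN111563769A (zh) * 2020-04-26 2020-08-21 北京深演智能科技股份有限公司 Data processing method and apparatus, non-volatile storage medium, and processor
CN111563769B (zh) * 2020-04-26 2024-01-26 北京深演智能科技股份有限公司 Data processing method and apparatus, non-volatile storage medium, and processor
CN113064927A (zh) * 2021-03-24 2021-07-02 深圳市道通科技股份有限公司 Customer screening method and apparatus, electronic device, and computer-readable storage medium
CN116628317A (zh) * 2023-04-19 2023-08-22 上海顺多网络科技有限公司 Method for targeted user-group preference analysis using a small amount of information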

Also Published As

Publication number Publication date
CN106959971B (zh) 2021-07-06
CN106959971A (zh) 2017-07-18

Similar Documents

Publication Publication Date Title
WO2017121272A1 (zh) Method and apparatus for processing user behavior data
Wen et al. A hybrid approach for personalized recommendation of news on the Web
EP2823410B1 (en) Entity augmentation service from latent relational data
Jiang et al. Mining search and browse logs for web search: A survey
US20150262069A1 (en) Automatic topic and interest based content recommendation system for mobile devices
US20070214133A1 (en) Methods for filtering data and filling in missing data using nonlinear inference
US20060155751A1 (en) System and method for document analysis, processing and information extraction
US20070214131A1 (en) Re-ranking search results based on query log
KR20150031234A (ko) Updating a search index used to facilitate application searches
Xu et al. Web content mining
CN103400286A (zh) Recommendation system and method for item feature labeling based on user behavior
Serrano Neural networks in big data and Web search
KR20140026932A (ko) System and method for providing customized shopping information through user tendency analysis
Misztal-Radecka et al. Meta-User2Vec model for addressing the user and item cold-start problem in recommender systems
Park et al. Keyword extraction for blogs based on content richness
Sharma et al. Web page ranking using web mining techniques: a comprehensive survey
Joorabchi et al. Towards linking libraries and Wikipedia: automatic subject indexing of library records with Wikipedia concepts
US10387934B1 (en) Method medium and system for category prediction for a changed shopping mission
Farina et al. Interest identification from browser tab titles: A systematic literature review
Pitsilis et al. Harnessing the power of social bookmarking for improving tag-based recommendations
Goyal et al. A robust approach for finding conceptually related queries using feature selection and tripartite graph structure
Xia et al. Aspnet: aspect extraction by bootstrapping generalization and propagation using an aspect network
Jelodar et al. Natural language processing via lda topic model in recommendation systems
Shao et al. Active blocking scheme learning for entity resolution
Hong et al. Semantic tag recommendation based on associated words exploiting the interwiki links of Wikipedia

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 17738086; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 EP: PCT application non-entry in European phase (Ref document number: 17738086; Country of ref document: EP; Kind code of ref document: A1)