CN113204662A - Method and device for predicting user group based on shooting and searching behaviors and computer equipment

Info

Publication number
CN113204662A
Authority
CN
China
Prior art keywords
users
seed
user
group
shooting
Prior art date
Legal status
Pending
Application number
CN202110485570.4A
Other languages
Chinese (zh)
Inventor
崔寅生
王伟戌
陶扬
韩均雷
王辰成
李雨桐
潘东
Current Assignee
Beijing Baige Feichi Technology Co ltd
Original Assignee
Zuoyebang Education Technology Beijing Co Ltd
Application filed by Zuoyebang Education Technology Beijing Co Ltd
Priority to CN202110485570.4A
Publication of CN113204662A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/53 Querying
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention belongs to the technical field of data processing and provides a method, a device and computer equipment for predicting the group to which a user belongs based on shooting and searching behaviors. The method comprises the following steps: dividing users into different sets, so that the users in the same set have the same group information, wherein the group information is related to the shooting and searching behaviors of the users; for the users in each set, screening out seed users whose group information has a confidence greater than a first preset value according to the similarity level of the shooting and searching behaviors of the users; and for non-seed users, calculating the similarity between the shooting and searching behaviors of the non-seed users and those of the seed users of each class, and predicting the group of the non-seed users according to the similarity, wherein the non-seed users comprise users missing the group information and users in the set whose group information has a confidence not greater than the first preset value. The invention can predict the group of a non-seed user more accurately and thereby improve the prediction precision.

Description

Method and device for predicting user group based on shooting and searching behaviors and computer equipment
Technical Field
The invention belongs to the technical field of data processing, is particularly suitable for the field of education, and more particularly relates to a method and a device for predicting a group to which a user belongs based on a shooting and searching behavior, and computer equipment.
Background
With the advent of the big data age, data began to grow explosively. In order to solve the problem of information overload, recommendation systems are widely applied to the fields of online services such as electronic commerce, content sharing, social networks, forums and the like. Thus, recommendation systems need to recommend for different groups of people. In addition to individual-oriented recommendation systems, user group-oriented recommendation systems are also needed.
At present, many electronic education products provide a shooting and searching function: a user photographs a question on paper with the camera of a terminal, and after the terminal finishes shooting and displays the picture, the user searches for the answer to the question through a selection frame displayed in the terminal interface, which completes the shooting and searching operation. However, in the prior art, the group classification of the users of an electronic education product APP is not accurate enough, so a recommendation service customized for each user group cannot be realized, and many problems remain worth studying in applying the large amount of data generated by shooting and searching behaviors and in data prediction. In addition, existing user group prediction suffers from technical problems such as low prediction precision and low data processing speed caused by the large amount of user data.
Therefore, it is necessary to provide a method for predicting the group to which a user belongs based on shooting and searching behaviors to solve the above problems.
Disclosure of Invention
(I) technical problem to be solved
The invention aims to solve the technical problems that the group classification of users in the APP of the existing education products is not accurate, the prediction precision of the existing method is low, the data processing speed is low due to large user data quantity, and the like.
(II) technical scheme
In order to solve the above technical problem, a first aspect of the present invention provides a method for predicting the group to which a user belongs based on a shooting and searching behavior, where the shooting and searching behavior is a behavior of initiating a photo search request and obtaining a search result, and the method includes the following steps: dividing users into different sets, so that the users in the same set have the same group information, wherein the group information is related to the shooting and searching behaviors of the users; screening out, in each set, seed users whose group information has a confidence greater than a first preset value according to the similarity level of the shooting and searching behaviors of the users; and for non-seed users, calculating the similarity between the shooting and searching behaviors of the non-seed users and those of the seed users of each class, and predicting the group of the non-seed users according to the similarity, wherein the non-seed users comprise users missing the group information and users in the set whose group information has a confidence not greater than the first preset value.
According to a preferred embodiment of the present invention, before screening the seed users, the method further comprises: labeling the search result of the shooting and searching behavior, and converting the labeled search result into a feature vector that represents the shooting and searching behavior of the user; the similarity of the shooting and searching behaviors of the users is subsequently calculated based on the feature vectors.
According to a preferred embodiment of the present invention, the tagging the search result of the shooting and searching behavior, and converting the tagged search result into a feature vector, includes: acquiring historical shooting and searching behaviors of a user, and characterizing the historical shooting and searching behaviors as a label sequence according to a corresponding search result, wherein each label in the label sequence represents at least one characteristic of the search result; converting the tag sequence into a vector sequence; normalizing the sequence of vectors to the feature vector.
Optionally, the method further comprises characterizing a user by the feature vector of the user's shooting and searching behavior: the name of the feature vector is defined as the user identifier, and the length of the feature vector is defined as the feature expression of the user's shooting and searching behavior.
Optionally, characterizing the search results corresponding to the historical shooting and searching behaviors as a tag sequence includes: labeling the search results with tags; and carrying out de-duplication processing on the labeled search result data.
Optionally, the shooting and searching behavior refers to a behavior of initiating a photo-based search request to obtain a search result.
Optionally, the photo is a photo of a whole page; the tag sequence comprises: test questions and page numbers.
Optionally, the group information comprises at least one of: the region, school, year, class, group to which the user belongs.
Optionally, the tag generated in the step of tagging the search result of the shooting and searching behavior includes at least one of the following tags: teaching materials, test paper, books, problem books, page numbers and test questions.
According to a preferred embodiment of the present invention, the step of screening out the seed users comprises: clustering the feature vectors; and taking the users in the maximum class obtained after clustering in the set as the seed users.
Optionally, the clustering process is performed on the users in each set by using a community discovery algorithm inside the set.
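The screening of seed users by clustering can be illustrated with a minimal sketch. It is not the community discovery algorithm of the embodiment: as a simplified stand-in, users inside one set are linked when the cosine similarity of their feature vectors reaches a hypothetical threshold, and the largest connected component is taken as the seed users. The names `cosine`, `largest_cluster` and `threshold`, and the sample vectors, are assumptions made for illustration.

```python
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sum(a * a for a in u) ** 0.5
    norm_v = sum(b * b for b in v) ** 0.5
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def largest_cluster(vectors, threshold=0.8):
    """Return the user IDs of the largest cluster inside one set.

    vectors: dict mapping user ID -> feature vector.
    Stand-in clustering: connected components of the graph whose edges join
    users with cosine similarity >= threshold (assumed value).
    """
    ids = list(vectors)
    parent = {u: u for u in ids}

    def find(u):
        # Union-find lookup with path halving.
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    for i, u in enumerate(ids):
        for v in ids[i + 1:]:
            if cosine(vectors[u], vectors[v]) >= threshold:
                parent[find(u)] = find(v)

    clusters = defaultdict(set)
    for u in ids:
        clusters[find(u)].add(u)
    return max(clusters.values(), key=len)


# Hypothetical set of users sharing the same original group information.
one_set = {
    "u1": [1.0, 0.0, 0.0],
    "u2": [0.9, 0.1, 0.0],
    "u3": [0.95, 0.05, 0.0],
    "u4": [0.0, 0.0, 1.0],  # behaves differently from the rest of the set
}
seed_users = largest_cluster(one_set)
```

In this example `u4` is left out of the seed users, which mirrors the idea that users whose behavior disagrees with the rest of their registered group have low-confidence group information.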
According to a preferred embodiment of the present invention, after calculating the similarity between the shooting and searching behaviors of the non-seed user and those of the seed users of each class, the method further includes: screening out seed users whose similarity to the non-seed user is within a preset range; and, when the group of the non-seed user is predicted, predicting the group or group characteristics of the non-seed user according to the group information of the screened seed users.
According to a preferred embodiment of the invention, a user space is defined in which each user is taken as a vertex, the similarity relation between adjacent users is taken as an edge, and the similarity of the feature vectors of the shooting and searching behaviors of the adjacent users is taken as the weight of the edge. When seed users whose group information has a confidence greater than the first preset value are screened out, the users in each set are clustered by using the Louvain community discovery algorithm to obtain the seed users. Calculating the similarity between the shooting and searching behaviors of the non-seed user and those of the seed users of each class comprises: in the user space, calculating the distance between the non-seed user and the seed users of each class as the similarity. Screening out seed users whose similarity to the non-seed user is within a preset range comprises: calculating the average distance between any two seed users in the largest class of each set, and screening out the seed users in each set whose distance to the non-seed user is smaller than the average distance.
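The average-distance screening can be sketched as follows. The sketch assumes Euclidean distance between feature vectors, which the text does not specify; the function names and the sample coordinates are hypothetical.

```python
import math
from itertools import combinations


def mean_pairwise_distance(seed_vectors):
    """Average distance between any two seed users of one largest class."""
    pairs = list(combinations(seed_vectors.values(), 2))
    if not pairs:
        return 0.0
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)


def nearby_seeds(query, seed_vectors):
    """Seed users whose distance to the non-seed user is below the average
    pairwise distance of the seed users themselves."""
    radius = mean_pairwise_distance(seed_vectors)
    return [uid for uid, vec in seed_vectors.items()
            if math.dist(query, vec) < radius]


# Hypothetical seed users of one set and one non-seed user.
seeds = {"s1": [0.0, 0.0], "s2": [3.0, 4.0], "s3": [6.0, 8.0]}
non_seed = [1.0, 1.0]

radius = mean_pairwise_distance(seeds)   # (5 + 10 + 5) / 3
matched = nearby_seeds(non_seed, seeds)  # s1 and s2 fall inside the radius
```

Using the seed users' own average spacing as the radius avoids a hand-tuned global threshold: a tightly packed class yields a small radius, a loose class a larger one.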
According to a preferred embodiment of the present invention, the predicting the group to which the non-seed user belongs according to the original group information of the screened seed user includes: respectively counting the number of the screened seed users according to the original group information, and taking the original group with the maximum number of the corresponding seed users as a predicted group to which the non-seed user belongs; and when the number of the users contained in the cluster to which the non-seed user belongs is within a preset range and the cluster contains the seed user, predicting the group to which the seed user belongs as the group to which the non-seed user belongs.
According to the preferred embodiment of the present invention, before counting the number of the screened seed users according to their original group information, the method further comprises: and screening according to the region information of the screened seed users, and removing the seed users which are not in the same region with the non-seed users.
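The region filtering followed by counting can be illustrated with a short sketch. The record layout (`region` and `group` keys), the function name `predict_group` and the sample data are assumptions for illustration only.

```python
from collections import Counter


def predict_group(non_seed_region, matched_seeds):
    """Predict a group by majority vote over the screened seed users.

    matched_seeds: list of dicts with 'region' and 'group' keys (assumed layout).
    Seed users from a different region than the non-seed user are removed first.
    Returns None when no seed user remains.
    """
    same_region = [s for s in matched_seeds if s["region"] == non_seed_region]
    if not same_region:
        return None
    counts = Counter(s["group"] for s in same_region)
    return counts.most_common(1)[0][0]


# Hypothetical screened seed users for one non-seed user located in Beijing.
matched_seeds = [
    {"user": "s1", "region": "Beijing", "group": "School A / Grade 12"},
    {"user": "s2", "region": "Beijing", "group": "School A / Grade 12"},
    {"user": "s3", "region": "Beijing", "group": "School B / Grade 12"},
    {"user": "s4", "region": "Shanghai", "group": "School C / Grade 12"},
    {"user": "s5", "region": "Shanghai", "group": "School C / Grade 12"},
    {"user": "s6", "region": "Shanghai", "group": "School C / Grade 12"},
]
predicted = predict_group("Beijing", matched_seeds)
```

Without the region filter, the three Shanghai seed users would outvote the Beijing ones; with it, the prediction is the most frequent group among the seed users in the same region.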
A second aspect of the present invention provides a prediction apparatus for predicting the group to which a user belongs based on a shooting and searching behavior, where the shooting and searching behavior is a behavior of initiating a photo search request and obtaining a search result, the prediction apparatus including: a grouping module, used for dividing users into different sets, so that the users in the same set have the same group information, the group information being related to the shooting and searching behaviors of the users; a screening module, used for screening out, in each set, seed users whose group information has a confidence greater than a first preset value according to the similarity level of the shooting and searching behaviors of the users; and a prediction module, used for calculating, for non-seed users, the similarity between the shooting and searching behaviors of the non-seed users and those of the seed users of each class, and predicting the group of the non-seed users according to the similarity, wherein the non-seed users comprise users missing the group information and users in the set whose group information has a confidence not greater than the first preset value.
A third aspect of the present invention provides a computer device, comprising a processor and a memory, wherein the memory is used for storing a computer executable program, and when the computer executable program is executed by the processor, the processor executes the method for predicting the group to which the user belongs based on the shooting and searching behavior.
A fourth aspect of the present invention provides a computer program product storing a computer executable program which, when executed, implements the method for predicting the group to which a user belongs based on the shooting and searching behavior.
(III) advantageous effects
Compared with the prior art, the method screens out, in each set, the seed users whose group information has a confidence greater than a first preset value based on the similarity level of the shooting and searching behaviors of the users. For the non-seed users whose group is to be predicted, the similarity between the shooting and searching behaviors of the non-seed users and those of the seed users of each class is calculated, and the group to which the non-seed users belong is then predicted according to the similarity, wherein the non-seed users comprise users lacking group information and users in the set whose group information has a confidence not greater than the first preset value. Therefore, the group to which a non-seed user belongs can be predicted more accurately, the prediction precision can be improved, and the algorithm is simple and efficient.
Furthermore, the search results of the shooting and searching behaviors are converted into vectors that represent the users, so that more accurate shooting and searching behavior data can be obtained and the group information of the users can be represented more accurately. By removing non-seed users from each original group, a more accurate seed user set (i.e., user group category) can be obtained, which improves the prediction accuracy. A feature search engine is used to search the seed user database: the user feature vector of a non-seed user is compared with the user feature vectors of the seed users in the database to screen out the seed users whose similarity is within a preset range, and the original group with the largest number of corresponding seed users is taken as the predicted group of the non-seed user. This effectively avoids the low data processing speed caused by a large amount of user data and allows the group to which the non-seed user belongs to be predicted more accurately.
Drawings
Fig. 1 is a flowchart of an example of a method for predicting a group to which a user belongs based on a shooting and searching behavior according to embodiment 1 of the present invention;
Fig. 2 is a flowchart of another example of the method for predicting a group to which a user belongs based on a shooting and searching behavior according to embodiment 1 of the present invention;
Fig. 3 is a diagram showing an example of clustering to obtain a seed user set in the method of embodiment 1 of the present invention;
Fig. 4 is a flowchart of still another example of the method for predicting a group to which a user belongs based on a shooting and searching behavior according to embodiment 1 of the present invention;
Fig. 5 is a schematic diagram of an example of a prediction apparatus for predicting a group to which a user belongs based on a shooting and searching behavior according to embodiment 2 of the present invention;
Fig. 6 is a schematic diagram of another example of the prediction apparatus for predicting a group to which a user belongs based on a shooting and searching behavior according to embodiment 2 of the present invention;
Fig. 7 is a schematic diagram of still another example of the prediction apparatus for predicting a group to which a user belongs based on a shooting and searching behavior according to embodiment 2 of the present invention;
Fig. 8 is a schematic structural diagram of a computer device of one embodiment of the present invention;
Fig. 9 is a schematic diagram of a computer program product of an embodiment of the invention.
Detailed Description
In describing particular embodiments, specific details of structures, properties, effects, or other features are set forth in order to provide a thorough understanding of the embodiments by one skilled in the art. However, it is not excluded that a person skilled in the art may implement the invention in a specific case without the above-described structures, performances, effects or other features.
The flow chart in the drawings is only an exemplary flow demonstration, and does not represent that all the contents, operations and steps in the flow chart are necessarily included in the scheme of the invention, nor does it represent that the execution is necessarily performed in the order shown in the drawings. For example, some operations/steps in the flowcharts may be divided, some operations/steps may be combined or partially combined, and the like, and the execution order shown in the flowcharts may be changed according to actual situations without departing from the gist of the present invention.
The block diagrams in the figures generally represent functional entities and do not necessarily correspond to physically separate entities. I.e. these functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different network and/or processing unit devices and/or microcontroller devices.
The same reference numerals denote the same or similar elements, components, or parts throughout the drawings, and thus a repetitive description thereof may be omitted hereinafter. It will be further understood that, although the terms first, second, third, etc. may be used herein to describe various elements, components, or sections, these elements, components, or sections should not be limited by these terms. That is, these terms are used only to distinguish one from another. For example, a first device may also be referred to as a second device without departing from the spirit of the present invention. Furthermore, the term "and/or" is intended to include all combinations of any one or more of the listed items.
The invention provides a method for predicting a user group based on a shooting and searching behavior. A feature search engine is used to build a library of all seed users; the user feature vector of a user to be predicted is compared with the user feature vectors of the seed users to screen out the seed users whose similarity is within a preset range, and the group with the largest number of corresponding seed users is taken as the predicted user group. This effectively avoids the low data processing speed caused by a large amount of user data and allows the group of a non-seed user to be predicted more accurately. The users to be predicted are users lacking the group information and users whose group information has low confidence.
It should be noted that the above feature search engine refers to a tool that retrieves information from a user database according to user requirements, using a specific algorithm and strategy, and returns the information to the user. For example, the faiss tool is used as the feature search engine: when the user database is built, the ID and the feature vector of each library element are input; when the feature search engine is queried, the returned information can be controlled by setting parameters, for example returning the IDs of the elements whose similarity value is smaller than a specific value together with the corresponding similarity values. In the present invention, the library elements include at least one of the following: tags, identification codes (i.e. element IDs) corresponding to the tags, vectors corresponding to the identification codes, user IDs (i.e. user accounts or user identification codes) of the seed users of each class, and the like. The tags include teaching materials, test papers, books, problem books, page numbers, test questions, and the like; an identification code is a tag code or tag ID (i.e., element ID) representing a tag.
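The build-then-search pattern described for the feature search engine can be sketched without any external library. The class below is a brute-force stand-in, not the faiss API: it stores an ID and a feature vector per library element and returns the IDs and distances of the elements that fall within a given radius of the query. The class name `SeedIndex`, its methods and the sample data are hypothetical.

```python
import math


class SeedIndex:
    """Brute-force stand-in for a feature search engine (illustration only)."""

    def __init__(self):
        self._ids = []
        self._vectors = []

    def add(self, element_id, vector):
        """Library building: register one element ID with its feature vector."""
        self._ids.append(element_id)
        self._vectors.append(list(vector))

    def range_search(self, query, radius):
        """Return (distance, element_id) pairs with distance < radius,
        sorted from nearest to farthest."""
        hits = []
        for element_id, vector in zip(self._ids, self._vectors):
            distance = math.dist(query, vector)
            if distance < radius:
                hits.append((distance, element_id))
        return sorted(hits)


# Build the library from hypothetical seed users, then query it.
index = SeedIndex()
index.add("seed_a", [0.0, 0.0])
index.add("seed_b", [1.0, 0.0])
index.add("seed_c", [5.0, 5.0])

hits = index.range_search([0.0, 0.0], radius=2.0)
hit_ids = [element_id for _, element_id in hits]
```

A dedicated vector search library replaces the linear scan with an indexed search, which is what keeps query time manageable when the number of seed users is large.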
In order that the objects, technical solutions and advantages of the present invention will become more apparent, the present invention will be further described in detail with reference to the accompanying drawings in conjunction with the following specific embodiments.
Fig. 1 is a flowchart of an example of a method for predicting a group to which a user belongs based on a shooting and searching behavior according to embodiment 1 of the present invention. As shown in fig. 1, the present invention provides a method for predicting a group to which a user belongs based on a shooting and searching behavior, the method comprising:
step S101, dividing users into different sets, and enabling the users in the same set to have the same group information, wherein the group information is related to the shooting and searching behaviors of the users.
The technical scheme of the invention is mainly based on the observation that users with the same or similar shooting and searching behaviors generally belong to the same group or have the same group characteristics. Seed users whose group information has higher confidence can therefore be used to predict the group to which other users (e.g. users with missing group information, or with group information of low confidence) belong, or the group characteristics they have.
It should be noted that, in the present invention, group characteristics refer to the characteristics common to a certain group (i.e. a certain set of seed users). For example, the group "senior three students in Beijing" has at least the region characteristic "Beijing" and the grade characteristic "senior three".
Based on the principle, the users are divided into different sets in the step, so that the users in the same set have the same group information, and an alternative set for selecting seed users is established, so that the seed users with higher confidence coefficient can be screened subsequently.
Step S102, screening out seed users whose group information has a confidence greater than a first preset value according to the similarity level of the shooting and searching behaviors of the users.
For each set, the users whose group information has higher confidence are screened out as seed users according to the similarity level of the shooting and searching behaviors of the users. Specifically, users whose group information has a confidence greater than the first preset value may be screened out. The specific value of the first preset value can be determined by those skilled in the art according to practical situations; for example, the first preset value is determined according to the requirements on precision and recall, so as to ensure that the group information of the seed users is credible.
Optionally, the users in the largest class obtained by clustering within a set have higher confidence, and these users are called seed users.
Step S103, for non-seed users, calculating the similarity between the shooting and searching behaviors of the non-seed users and those of the seed users of each class, and predicting the group of the non-seed users according to the similarity, wherein the non-seed users comprise users missing the group information and users in the set whose group information has a confidence not greater than the first preset value.
In this step, the similarity between the shooting and searching behaviors of a non-seed user and those of the seed users of each set is calculated. If the similarity between the non-seed user and the seed users of a certain set is higher, the group information of the seed users of that set can be used to predict the group of the non-seed user.
Based on the similarity of users' shooting and searching behaviors, the method can accurately predict the group of users whose group information is unknown or has low confidence, which in turn helps to provide more accurate services or service pushes for the users.
It should be noted that, in the present invention, the shooting and searching behavior refers to a behavior of initiating a photo search request and obtaining a search result. There are various usage scenarios; the scenario of searching for questions by photo in an educational service product with a photo search function is taken as an example and described in detail below.
Fig. 2 is a flowchart showing another example of the method for predicting a group to which a user belongs based on a shooting and searching behavior in embodiment 1 of the present invention.
As shown in fig. 2, in step S201, users are divided into different sets, and users in the same set have the same group information, which is related to the shooting and searching behavior of the user.
In this example, the APP registration information of the users in the educational service product may be retrieved, the school information or grade information in the APP registration information is identified, and this school or grade group information is used to divide the users into a plurality of sets (i.e., seed user candidate sets from which seed users with higher confidence are selected).
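The division of users into candidate sets can be sketched as a simple grouping by registration fields. The field names (`user_id`, `school`, `grade`), the function name `split_into_sets` and the sample records are assumptions for illustration; users with a missing field are kept aside, since they later count as non-seed users.

```python
from collections import defaultdict


def split_into_sets(registrations, keys=("school", "grade")):
    """Group user IDs by the chosen registration fields.

    Returns (sets, missing): sets maps a tuple of field values to the user IDs
    that share it; missing lists users lacking any of the fields.
    """
    sets = defaultdict(list)
    missing = []
    for reg in registrations:
        values = tuple(reg.get(k) for k in keys)
        if any(v is None for v in values):
            missing.append(reg["user_id"])
        else:
            sets[values].append(reg["user_id"])
    return dict(sets), missing


# Hypothetical APP registration records.
registrations = [
    {"user_id": "u1", "school": "School A", "grade": "Grade 12"},
    {"user_id": "u2", "school": "School A", "grade": "Grade 12"},
    {"user_id": "u3", "school": "School B", "grade": "Grade 11"},
    {"user_id": "u4", "school": None, "grade": "Grade 12"},
]
candidate_sets, unlabeled = split_into_sets(registrations)
```

Changing `keys` (for example to a region or class field) yields the candidate sets for a different prediction target.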
It should be noted that the set (i.e., the seed user candidate set) of the present embodiment is a temporary user set established for selecting seed users according to the prediction target. For example, if the goal is to predict whether a user belongs to the group using the People's Education Press edition or the Beijing Normal University Press edition of the teaching materials, the group information used when dividing the users into a plurality of sets may include at least one of a region, a school, a grade, and a class. One user can have several kinds of group information at the same time. For example, according to actual needs, the grade of the user can be predicted, but the prediction is not limited to this: the region and school can also be predicted, as can users of the same class who may have common needs (such as the need for make-up lessons), and the like.
Specifically, the APP registration information may further include region information, school stage information, a user account, and the like. Preferably, the APP registration information is, for example, the registration information of a user for an education service product APP having a photo-based question search function.
It should be noted that, since the APP registration information may include false information, some of the group information is unreliable; the group information in the APP registration information is therefore referred to as original group information. In other words, this step divides the users with original group information into different sets, each set sharing some kind of group information. The original group information refers to group information that already exists but contains untrusted content.
More specifically, the users in the same set have the same original group information. For example, the original group information may include at least one of a school, a grade, a class, and a group. It may include, for example, schools and grades, but is not limited thereto, and in other examples may also include regions, school stages, and the like.
It should be noted that the above description is only given as a preferred example, and the present invention is not limited thereto.
In the application scenario, historical shooting and searching behavior data of a total number of users are acquired from the education service product APP, and are subjected to labeling processing, where the historical shooting and searching behavior data may include search question request data and/or search question result data, and the total number of users includes the plurality of sets (i.e., alternative sets of seed users) in step S201.
Optionally, the search question request data includes a photo search question request, and the photo search question request is a photo search question request for taking a picture of a whole page. The users in the same group use the same teaching materials and teaching auxiliary materials with high probability, and the group to which the user to be predicted belongs can be quickly calculated based on the whole page photographing. Based on this assumption, the more the photograph contains content, the higher the prediction speed and accuracy, and the like. In some examples, a trained machine learning model may be used to detect whether a captured image in each search question request data is a full page test paper or a full page book, or is an image containing more than a certain number of questions.
Specifically, when the searched image is a whole-page test paper or a whole-page book or an image including more than a specific number of titles, the searched image corresponding to the client may be screened out, and request normalization may be performed.
In some embodiments, the request normalization is performed on the captured image to determine whether the search behavior data of different users are the same request. When the search behavior data of two or more users are the same request, determining the number of times of the same request of each user in a specific time period, and calculating the probability that each user belongs to the same set (user group) for performing deduplication processing on the labeled search question result data subsequently.
For example, if two users hit the same request once in their captured-image records, they belong to the same class with probability p. If they hit the same request n times, the probability that they belong to the same class is 1 - (1 - p)^n, so the larger n is, the higher the probability that they belong to the same class.
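As a minimal sketch of the calculation above, the following function evaluates 1 - (1 - p)^n. The function name and the value p = 0.3 are hypothetical, and the formula implicitly treats the n shared requests as independent events.

```python
def same_class_probability(p: float, n: int) -> float:
    """Probability that two users are in the same class after n shared requests.

    p is the probability contributed by a single shared request; the n shared
    requests are treated as independent events (an assumption of the formula).
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if n < 0:
        raise ValueError("n must be non-negative")
    return 1.0 - (1.0 - p) ** n


# Hypothetical value p = 0.3: the probability grows as n increases.
probabilities = [same_class_probability(0.3, n) for n in (1, 2, 5, 10)]
```

The list `probabilities` is strictly increasing, which matches the statement that a larger n gives a higher probability of the same class.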
Therefore, captured images of whole-page test papers or whole-page books are selected as search requests, and the captured images (i.e., the search requests) or the search results are labeled with teaching materials, test papers, books, problem books, page numbers, test questions and the like. Vector conversion is then performed on the labeled search results to represent the shooting and searching behavior of each user. In this way, more accurate shooting and searching behavior feature data can be obtained, so that the group information of a user can be predicted more accurately based on the similarity of shooting and searching behaviors.
It should be noted that the same question appears in a great many different teaching materials, exercise books and test papers, so an identical request from different users (searching for the same question) does not express the similarity between two users particularly well. When screening search request scenarios, this embodiment therefore preferentially uses whole-page captured images rather than images of a single question, so that the group information of each user can be determined more accurately.
The above selection of a whole-page test paper or a whole-page book is merely a preferred example, and the present invention is not limited thereto. In other examples, a search may also be performed using half-page images, images in which the number of questions exceeds a certain number, or the like.
Note that the labeling processing is processing for labeling (or identifying) the search question request data and/or the search question result data, and the labeling processing (i.e., labeling processing or identifying processing) will be specifically described below with reference to fig. 2.
As shown in fig. 2, the method of the present invention further comprises step S202: labeling the search result of the shooting and searching behavior, and converting the labeled search result into a feature vector. Subsequently, in steps S203 and S204, the similarity of the users' shooting and searching behaviors is calculated based on the feature vectors.
It should be noted that the labeling of the search result in step S202 only needs to be completed before the similarity of the shooting and searching behaviors is calculated; the specific point at which it is performed is not limited.
In some embodiments, step S202 includes:
Step one: labeling the search result of the shooting and searching behavior to form a label sequence corresponding to the shooting and searching behavior.
In this step, the shooting and searching behavior history of the user is obtained, and this history may be expressed as a digital label sequence. Illustratively, this step performs tagging (i.e., processing for identifying tags) on the search results corresponding to the acquired historical shooting and searching behaviors to form a tag sequence, where each tag in the tag sequence represents at least one feature of the search result. The shooting and searching behavior refers to the behavior of initiating a photo-based search question request to obtain a search question result.
Step two: converting the label sequence into a vector sequence. This step vectorizes the label sequence so that it becomes a vector sequence.
Step three: normalizing the vector sequence into a feature vector (in this example, a feature vector of the shooting and searching behavior) to represent the shooting and searching behavior of each user. In this step, a single vector is computed from the vector sequence (and is subsequently used as the feature vector of the user's shooting and searching behavior); its dimensionality is equal to that of the vectors in the vector sequence.
Converting the tag sequence into a vector sequence means vectorizing each tag in the tag sequence to form one vector per tag, that is, a vector sequence. Normalization means converting this vector sequence into a single vector, namely the feature vector of the shooting and searching behavior. The feature vector (e.g., a feature vector of a specific dimension or a feature vector matrix) is used for calculating the similarity of shooting and searching behaviors between two users.
For ease of understanding, a specific example is given below. If the user has searched 10 times, a result is returned for each of the 10 searches. Each returned result may be characterized by a tid or a pid, where tid is the question id and pid is the picture id of the page. The tid and pid characterizing the shooting and searching behavior are then vectorized. Since the user searched 10 times, the user is equivalent to a tid/pid sequence of length 10. Because each tid and pid has been vectorized, the vectorized sequence can be reduced to a single vector by a series of calculations (i.e., normalization), so that the user is represented by one vector. In summary, in the photo search question scenario, the overall idea of step S202 is to vectorize the user, or rather the user's search behavior: each shooting and searching behavior of the user is vectorized, and a normalization calculation is then performed on the resulting vector sequence to form one vector. The behavior sequence thus stands for the user, the user is vectorized, and the user group can be conveniently predicted based on the similarity of shooting and searching behaviors.
The labels generated in the step of labeling the search result of the shooting and searching behavior may include at least one of the following: teaching materials, test papers, books, problem books, page numbers and test questions. Furthermore, these labels are marked to form a teaching material identification, a test paper identification, a book identification, a problem book identification, a page number identification and a test question identification.
Optionally, the search results corresponding to the historical shooting and searching behaviors of each user within a specific time period are labeled to form a label sequence corresponding to the historical search results of that user, yielding labeled shooting and searching behavior result feature data (i.e., a label sequence including at least one of the above identifications). The label sequence is then subjected to vector conversion to generate a shooting and searching behavior feature vector, which represents the shooting and searching behavior of the user and is subsequently used for calculating the similarity of shooting and searching behaviors between two users.
For example, the tag sequence includes a plurality of information pieces corresponding to the captured image (i.e., captured request data), each of which is characterized by an identification code, wherein the information piece is, for example, a question without an answer to be retrieved, a question containing an answer, and the like.
For another example, the tag sequence includes a plurality of information pieces corresponding to the search result data, wherein the search result data includes a test question identifier (represented by TID) and/or a teaching material identifier (represented by PID). Specifically, within one month, the search results of user 1 include A1: 1+2=3, P2, A2: 15-2=13, A4, P5, ..., An: 10-2=8, and after the search results of user 1 are tagged, the generated tag sequence of user 1 is (A1, P2, A2, A4, P5, ..., An). In this example, a single row of the tag sequence is shown, but the specific implementation is not limited thereto, and the tag sequence may vary according to the acquisition time period or the search behavior of the user.
Converting the tagged search result into a feature vector includes performing vector conversion on the information pieces in the tag sequence of each user, so that each information piece in the tag sequence is converted into a vector of a specific dimension.
For vector conversion (or vectorization), in this example, a fastText method is used, and a skip-gram algorithm is adopted to establish a vector conversion model, and the vector conversion model is used to perform vector conversion on the tag sequence of the user.
Specifically, an embedding learning method is used and the vector conversion model is trained with a training data set, so that the hidden layer of the neural network can output a vector of fixed dimension for each label (identification code or label identification). The training data set comprises the label sequences of historical users (label sequences formed by labeling shooting and searching results), the vector corresponding to each identification code in the label sequences, and the feature vectors of the historical shooting and searching behaviors.
In the present example, vectorization coding may be performed by an average pooling method, that is, averaging vectors of a specific dimension of all labels (identification codes or label identifications) to obtain a feature vector of the search behavior for characterizing the search behavior of the user.
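The average pooling step can be sketched as follows. This is only an illustration: the tag names reuse the earlier A1/P2/A2 example, and the 3-dimensional vectors in `TAG_VECTORS` are made-up values standing in for the output of the trained vector conversion model.

```python
from typing import Dict, List, Sequence

Vector = List[float]


def average_pool(tag_sequence: Sequence[str], tag_vectors: Dict[str, Vector]) -> Vector:
    """Average the per-tag vectors of one user into a single feature vector."""
    vectors = [tag_vectors[tag] for tag in tag_sequence if tag in tag_vectors]
    if not vectors:
        raise ValueError("the tag sequence contains no known tag")
    dimension = len(vectors[0])
    # The pooled vector has the same dimension as each per-tag vector.
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dimension)]


# Hypothetical tag vectors; in the described example they would be produced by
# the fastText skip-gram vector conversion model.
TAG_VECTORS: Dict[str, Vector] = {
    "A1": [1.0, 0.0, 2.0],
    "P2": [3.0, 2.0, 0.0],
    "A2": [2.0, 4.0, 1.0],
}

user_vector = average_pool(["A1", "P2", "A2"], TAG_VECTORS)
```

Average pooling keeps the dimension fixed regardless of how many searches a user has made, which is what allows users with different history lengths to be compared directly.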
Optionally, the user may also be characterized as a feature vector of the user's (within a specific time period) search behavior, where a name of the feature vector is defined as a user identifier, and a length of the feature vector (i.e. a specific vector dimension) is defined as a feature expression of the user's search behavior.
Note that, in this example, the search result is labeled, but the labeling is not limited to this, and in other examples, labeling of the search behavior data or labeling of the search result data and the search behavior data is also included. The foregoing is described by way of alternative examples only and is not to be construed as limiting the invention.
Preferably, the search result data (and/or the search behavior data) after the labeling processing is subjected to deduplication processing, and the tag sequence corresponding to the deduplicated search result data (and/or search behavior data) is then subjected to vector conversion to generate a feature vector, such as the above-mentioned shooting and searching behavior feature vector.
Therefore, through step S202, the tag sequence corresponding to the shooting and searching behavior is obtained and vector conversion is performed to generate feature vectors such as the shooting and searching behavior feature vector. The shooting and searching behavior of the user can thus be represented more accurately, and more accurate shooting and searching behavior data can be obtained for use in subsequently calculating the similarity between users.
It should be noted that the above description is only given by way of example, and the present invention is not limited thereto.
Next, in step S203, seed users whose group information has a confidence greater than a first preset value are screened out according to the similarity level of the users' shooting and searching behaviors.
Optionally, clustering is performed using the feature vectors of the users' shooting and searching behaviors obtained in step S202 to form a plurality of classes (also called clusters), each class representing a subset of users with similar shooting and searching behaviors. Here, the users include all users in the candidate seed user sets of step S201; in other words, they are the full set of users. Because users of the same real group have similar shooting and searching behaviors, it may be determined that the users in the largest class of a given set (e.g., users whose registration information indicates the same class) have high confidence.
Specifically, clustering is performed using the feature vectors of the shooting and searching behaviors of all users to determine the confidence of the group information of each user, and seed users are screened out. A seed user is a user in a set who has original group information whose confidence is greater than a first preset value. For a user whose confidence is greater than the first preset value, the original group information is considered credible.
Further, the similarity level of the shooting and searching behaviors between every two users among all the users is calculated, that is, the similarity between the shooting and searching behavior feature vectors of any two users, and clustering processing is performed according to the calculated similarity to obtain a plurality of sets (namely, sets corresponding to the original groups).
Optionally, a community discovery algorithm is used to perform clustering processing on each set (or each set obtained in step S201) in the multiple sets obtained by the above method, specifically, perform clustering processing on users in each set.
Specifically, using the community discovery algorithm includes defining a user space in which each user is taken as a vertex (e.g., users 1 to 10 in fig. 3), the similarity relationship between adjacent users is taken as an edge (e.g., d1 to d10 in fig. 3), and the similarity between the shooting and searching behavior feature vectors of the adjacent users is taken as the weight of the edge, so as to form a user relationship network, as shown in fig. 3.
Fig. 3 is a diagram illustrating an example of clustering to obtain a seed user set in the method of embodiment 1 of the present invention.
As shown in fig. 3, the users in a set (e.g., the set represented by circles in the figure) are clustered using the Louvain community discovery algorithm to obtain the maximum class of the set (the relationship network formed by 7 users in total, i.e., users 1, 2, 3, 5, 7, 9, and 10 in the figure). The users in the maximum class obtained after the clustering process are used as seed users, and the maximum class itself (i.e., users 1, 2, 3, 5, 7, 9, and 10) is used as the seed user set. The other users (e.g., users 4, 6, and 8), whose original group information is untrue or of low confidence, are referred to as non-seed users.
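The idea of taking the largest class within one set as the seed users can be illustrated with the simplified sketch below. Note that it does not implement the Louvain algorithm: as a stand-in, it links users whose cosine similarity reaches a hypothetical threshold `min_similarity` and takes the largest connected component. The user ids and 2-dimensional vectors are made up for illustration.

```python
import math
from typing import Dict, List, Set

Vector = List[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def largest_class(vectors: Dict[str, Vector], min_similarity: float) -> Set[str]:
    """Largest connected component of the thresholded similarity graph.

    Simplified stand-in for community discovery, NOT the Louvain algorithm.
    """
    users = list(vectors)
    neighbors: Dict[str, Set[str]] = {u: set() for u in users}
    for i, u in enumerate(users):
        for v in users[i + 1:]:
            if cosine_similarity(vectors[u], vectors[v]) >= min_similarity:
                neighbors[u].add(v)
                neighbors[v].add(u)
    visited: Set[str] = set()
    best: Set[str] = set()
    for start in users:
        if start in visited:
            continue
        component: Set[str] = set()
        stack = [start]
        while stack:  # depth-first traversal of one component
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbors[node] - component)
        visited |= component
        if len(component) > len(best):
            best = component
    return best


# Hypothetical users who all registered the same group information.
ONE_SET: Dict[str, Vector] = {
    "u1": [1.0, 0.0], "u2": [0.9, 0.1], "u3": [0.95, 0.05],
    "u4": [0.0, 1.0], "u5": [0.1, 0.9],
}
seed_users = largest_class(ONE_SET, min_similarity=0.9)
non_seed_users = set(ONE_SET) - seed_users
```

In practice, a weighted community discovery algorithm such as Louvain would use the edge weights directly instead of a hard threshold; the thresholded version is chosen here only to keep the sketch short and dependency-free.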
Further, all sets are clustered in sequence to screen the seed users until all sets complete the screening of the seed users.
Thereby, the set of seed users and the non-seed users can be determined more accurately.
Preferably, a seed user database is established using the obtained seed users, the seed user sets and the user ID (i.e., user account number or user identification code) of each seed user, so that the seed user database can be queried using a feature search engine to predict the group or group features of the non-seed users.
Optionally, the seed user database further includes, for each seed user, the user ID and the feature vector characterized by the test paper identification information and/or the teaching material identification information.
It should be noted that the above description is only given by way of example, and the present invention is not limited thereto.
Next, in step S204, for each non-seed user, the similarity between the shooting and searching behaviors of the non-seed user and those of each seed user is calculated, and the group (or group feature) to which the non-seed user belongs is predicted according to the similarity, where the non-seed users include users lacking the group information and users whose group information in the set has a confidence not greater than the first preset value.
Specifically, the method includes the steps of obtaining search question request data and/or search question result data of a user to be predicted, wherein the search question result data comprises test question identification and/or teaching material identification.
Further, the user to be predicted is determined to be a non-seed user, and is characterized as a shooting and searching behavior feature vector through the test question identification and/or teaching material identification (namely, through the shooting and searching behavior).
In this example, the similarity between the shooting and searching behaviors of the user to be predicted (i.e., a non-seed user) and those of each seed user is calculated. Specifically, the distance between the shooting and searching behavior feature vector of the user to be predicted and the shooting and searching behavior feature vector of each seed user in the seed user database is calculated and used as the similarity. For example, in the user space, the distance between the non-seed user and each seed user is calculated as the similarity, so as to predict the user group to which the non-seed user belongs.
Specifically, a feature search engine is used to search the seed user database and screen out the seed users whose similarity to the user to be predicted (i.e., the non-seed user) is within a predetermined range. When predicting the group to which the non-seed user belongs, the group or group features of the non-seed user are predicted according to the group information of the screened seed users.
Further, screening out the seed users whose similarity with the non-seed user is within a predetermined range includes: calculating the average distance between any two seed users in the maximum class of each set, and screening out, in each set, the seed users whose distance from the non-seed user is smaller than that average distance.
The feature search engine searches the seed user database and returns a search result, wherein the search result is a list of seed users whose similarity falls within the predetermined range, and the seed user list comprises a specific number of seed users.
Then, the screened seed users (i.e., the seed user list) are counted separately for each set corresponding to the original group information, and the set with the largest number of screened seed users is taken as the predicted group or group feature of the non-seed user, or the group feature of the non-seed user is predicted to be the group information of those seed users.
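The screening and counting described above may be sketched as follows. This is a minimal illustration under stated assumptions: Euclidean distance is assumed as the distance measure (the description only says "distance"), and the group names and 2-dimensional vectors are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations
from typing import Dict, List, Optional

Vector = List[float]


def euclidean(a: Vector, b: Vector) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_pairwise_distance(vectors: List[Vector]) -> float:
    """Average distance between any two seed users of one set."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 0.0
    return sum(euclidean(a, b) for a, b in pairs) / len(pairs)


def predict_group(target: Vector, seed_sets: Dict[str, List[Vector]]) -> Optional[str]:
    """Return the group with the most seed users closer than its average distance."""
    votes: Counter = Counter()
    for group, seeds in seed_sets.items():
        threshold = mean_pairwise_distance(seeds)
        votes[group] = sum(1 for s in seeds if euclidean(target, s) < threshold)
    if not votes or max(votes.values()) == 0:
        return None  # no seed user is close enough in any set
    return votes.most_common(1)[0][0]


# Hypothetical seed user sets keyed by original group information.
SEED_SETS: Dict[str, List[Vector]] = {
    "class_A": [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]],
    "class_B": [[10.0, 10.0], [11.0, 10.0], [10.0, 11.0]],
}

predicted = predict_group([0.5, 0.5], SEED_SETS)
unpredictable = predict_group([50.0, 50.0], SEED_SETS)
```

Returning `None` when no seed user passes the screening is a design choice of this sketch; the description does not state how such a user is handled.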
Optionally, when the number of users included in the cluster (corresponding set) to which the non-seed user belongs is within a predetermined range and the cluster (corresponding set) includes a seed user, predicting the group to which the non-seed user belongs as the group to which the seed user belongs, or predicting the group characteristics of the non-seed user as the group information of the seed user.
Preferably, when the region information is strongly related to the group information (for example, when predicting the class of a user, the user's region and class are closely tied), the screened seed users can be filtered by region before they are counted according to the original group information: seed users that are not in the same region as the non-seed user are identified from the region information of the screened seed users and removed. Subsequently, the remaining seed users (namely, the seed user list after the seed users not in the same region have been removed) are counted separately for each set corresponding to the original group information, and the set with the largest number of corresponding seed users is taken as the predicted group or group feature of the non-seed user.
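The region-based filtering before counting can be illustrated with the sketch below. The `SeedUser` record, its field names, and the sample regions are hypothetical; the input list is assumed to be the seed users already screened by similarity.

```python
from collections import Counter
from typing import List, NamedTuple, Optional


class SeedUser(NamedTuple):
    user_id: str
    group: str   # original group information, e.g. a class
    region: str


def vote_with_region_filter(candidates: List[SeedUser], target_region: str) -> Optional[str]:
    """Drop seed users from other regions, then take the most frequent group."""
    same_region = [s for s in candidates if s.region == target_region]
    if not same_region:
        return None
    counts = Counter(s.group for s in same_region)
    return counts.most_common(1)[0][0]


# Hypothetical seed users already screened by similarity.
CANDIDATES = [
    SeedUser("s1", "class_A", "Beijing"),
    SeedUser("s2", "class_B", "Shanghai"),
    SeedUser("s3", "class_B", "Shanghai"),
    SeedUser("s4", "class_A", "Beijing"),
    SeedUser("s5", "class_B", "Shanghai"),
]

predicted_group = vote_with_region_filter(CANDIDATES, target_region="Beijing")
```

Without the filter, class_B would win the count three to two; after removing the seed users from another region, class_A is predicted for the non-seed user located in Beijing.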
In addition, the screening based on the region information may be performed before the similarity calculation. In this case, users in the set that are not in the same region are removed first. A feature search engine is then used to search the seed user database, the similarity between the feature vector of the non-seed user and the feature vectors of the seed users in the seed user database is calculated to screen out the seed users whose similarity is within a predetermined range, and the original group with the largest number of corresponding seed users is taken as the predicted group to which the non-seed user belongs. In this way, the slow data processing caused by a large volume of user data can be effectively avoided, the group to which the non-seed user belongs can be predicted more accurately, and the method is further optimized.
It should be noted that the above description is only given by way of example, and the present invention is not limited thereto.
Fig. 4 is a flowchart showing still another example of the method of predicting a group to which a user belongs based on shooting and searching behaviors in embodiment 1 of the present invention.
As shown in fig. 4, the method further includes a step S403 of denoising the seed users. Since steps S401, S402, and S404 are respectively the same as steps S201, S203, and S204 in fig. 1, their descriptions are omitted. Step S403 will be specifically described below.
In step S403, denoising processing is performed on the seed users to generate a more accurate seed user set.
Specifically, the user relationship network graph includes user vertices and edge weights: users are taken as vertices, the similarity relationship between users forms the edges, and the similarity between the feature vectors (in this example, the shooting and searching behavior feature vectors) of two users is the edge weight.
A user who has filled in group information is placed into a set divided according to the filled-in information and becomes a candidate seed user. However, the filled-in information may not be credible, and noise points such as randomly filled or expired information need to be filtered out, so a community discovery algorithm is adopted to filter the noise points. A seed user is a node in the largest cluster found by the community discovery algorithm; a user who does not fall in the largest cluster is not a seed user even though the information has been filled in.
Based on this, for the users in the same original group, most of the user vertices will be divided into the same community, and the noise points will be divided into other small communities, so as to determine the noise points (i.e. non-seed users), and remove the noise points to obtain the final seed user set.
In another example, after clustering users in the same original group, seed users in the largest class set that are more than a certain distance from the center point are removed, in other words, outliers (i.e., non-seed users) that are more than a certain distance from the center point are removed to generate a final seed user set.
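The removal of outliers by distance from the center point might look like the following sketch. It assumes the center point is the arithmetic mean of the feature vectors and that Euclidean distance is used; the threshold `max_distance` and the sample vectors are hypothetical.

```python
import math
from typing import Dict, List

Vector = List[float]


def centroid(vectors: List[Vector]) -> Vector:
    """Arithmetic mean of the vectors, assumed here to be the center point."""
    dimension = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dimension)]


def remove_outliers(users: Dict[str, Vector], max_distance: float) -> Dict[str, Vector]:
    """Keep only users within max_distance of the center point of the class."""
    center = centroid(list(users.values()))

    def distance(v: Vector) -> float:
        return math.sqrt(sum((x - c) ** 2 for x, c in zip(v, center)))

    return {uid: v for uid, v in users.items() if distance(v) <= max_distance}


# Hypothetical largest class of one original group; u5 lies far from the rest.
LARGEST_CLASS: Dict[str, Vector] = {
    "u1": [0.0, 0.0], "u2": [1.0, 0.0], "u3": [0.0, 1.0],
    "u4": [1.0, 1.0], "u5": [10.0, 10.0],
}

final_seed_users = remove_outliers(LARGEST_CLASS, max_distance=5.0)
```

Because the center point is computed with the outlier included, a single distant user shifts it somewhat; a more robust center (for example, a coordinate-wise median) could be substituted if that matters.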
In another example, a further determination is made according to the region information of each user in the maximum class, and the seed users whose region does not match the region of the maximum class are removed to generate a final seed user set.
Specifically, for example, the non-seed users include users whose class information or grade information is not authentic, users whose address of the school is not consistent with the area information, and the like.
Thus, by removing non-seed users from each original group, a more accurate set of seed users (i.e., user group categories) can be obtained, and prediction accuracy can be improved.
The above-described procedure of the method for predicting the group to which a user belongs based on shooting and searching behaviors is merely used for explaining the present invention, and the order and number of the steps are not particularly limited. In addition, a step in the method can be split into two or three steps, or some steps can be combined into one step, as adjusted according to the actual example.
Compared with the prior art, the method screens out, based on the similarity level of the users' shooting and searching behaviors, the seed users whose group information in each set has a confidence greater than a first preset value. For the non-seed users whose group is to be predicted, the similarity between the shooting and searching behaviors of the non-seed user and those of each seed user is calculated, and the group to which the non-seed user belongs is then predicted according to the similarity, where the non-seed users include users lacking group information and users whose group information in the set has a confidence not greater than the first preset value. Therefore, the group to which the non-seed user belongs can be predicted more accurately, the prediction precision can be further improved, and the algorithm is simple and efficient.
Furthermore, vector conversion is carried out by using the search result of the shooting and searching behavior and is used for representing the user vector, so that more accurate shooting and searching behavior data can be obtained, and the group information of the user can be represented more accurately; by removing non-seed users from each original group, a more accurate seed user set (i.e., user group category) can be obtained, and prediction accuracy can be improved; and searching in the seed user database by using a feature search engine, performing similarity calculation on the user feature vector of the non-seed user and the user feature vector of the seed user in the seed user database to screen out the seed user with the similarity within a preset range, and taking the original group with the maximum number of corresponding seed users as the predicted group to which the non-seed user belongs, so that the problem of low data processing speed caused by large user data volume can be effectively avoided, the group to which the non-seed user belongs can be predicted more accurately, the prediction accuracy can be further improved, and the method can be further optimized.
Example 2
Embodiments of the apparatus of the present invention are described below, which may be used to perform method embodiments of the present invention. The details described in the device embodiments of the invention should be regarded as complementary to the above-described method embodiments; reference is made to the above-described method embodiments for details not disclosed in the apparatus embodiments of the invention.
Referring to fig. 5 to 7, a prediction apparatus 500 for predicting a group to which a user belongs based on shooting and searching behaviors according to embodiment 2 of the present invention will be described.
According to the second aspect of the present invention, the present invention further provides a prediction apparatus 500 for predicting a group to which a user belongs based on a search behavior, where the search behavior refers to a behavior of initiating a photo search request and obtaining a search result.
Specifically, the prediction apparatus 500 includes: a grouping module 501, configured to divide users into different sets, so that users in the same set have the same group information, where the group information is related to the shooting and searching behavior of the user; a screening module 502, configured to screen out, according to the similarity level of the users' shooting and searching behaviors, the seed users whose group information in each set has a confidence greater than a first preset value; and a predicting module 503, configured to calculate, for each non-seed user, the similarity between the shooting and searching behaviors of the non-seed user and those of each seed user, and to predict the group to which the non-seed user belongs according to the similarity, where the non-seed users include users who lack the group information and users whose group information in the set has a confidence not greater than the first preset value.
As shown in fig. 6, the prediction apparatus 500 further includes a clustering module 601, where the clustering module 601 is configured to tag the search results of the shooting and searching behavior before the seed users are screened, and to convert the tagged search results into feature vectors that characterize the shooting and searching behavior of the users; the similarity of the users' shooting and searching behaviors is subsequently calculated based on the feature vectors.
Specifically, tagging the search result of the shooting and searching behavior, and converting the tagged search result into a feature vector, includes: acquiring historical shooting and searching behaviors of a user, and characterizing the historical shooting and searching behaviors as a label sequence according to a corresponding search result, wherein each label in the label sequence represents at least one characteristic of the search result; converting the tag sequence into a vector sequence; normalizing the sequence of vectors to the feature vector.
Optionally, the method further comprises: characterizing a user as the feature vector of the user's shooting and searching behavior, where the name of the feature vector is defined as the user identifier, and the length of the feature vector is defined as the feature expression of the user's shooting and searching behavior.
Optionally, characterizing the historical shooting and searching behavior as a tag sequence according to the corresponding search result, including: labeling the label of the search result; and carrying out duplicate removal processing on the marked search result data.
Optionally, the shooting and searching behavior refers to the behavior of initiating a photo-based search request to obtain a search result.
Optionally, the photo is a photo of a whole page; the tag sequence comprises: test questions and pages.
Optionally, the group information comprises at least one of: the region, school, year, class, group to which the user belongs.
Further, the tag generated in the step of tagging the search result of the shooting and searching behavior comprises at least one of the following tags: teaching materials, test paper, books, problem books, page numbers and test questions.
Further, the clustering module 601 is further configured to perform clustering processing on the feature vectors; and taking the users in the maximum class obtained after clustering in the set as the seed users.
Optionally, the clustering process is performed on the users in each set by using a community discovery algorithm inside the set.
Optionally, a community discovery algorithm is used in each set to cluster the users in the set, and the users in the largest class in the obtained set are the seed users.
Specifically, using the community discovery algorithm includes defining a user space in which each user is taken as a vertex, the similarity relationship between adjacent users is taken as an edge, and the similarity between the shooting and searching behavior feature vectors of the adjacent users is taken as the weight of the edge. When screening out the seed users whose group information has a confidence greater than a first preset value, the users in each set are clustered using the Louvain community discovery algorithm to obtain the seed users.
Preferably, a seed user database is established using the obtained seed users, the seed user sets and the user ID (i.e., user account number or user identification code) of each seed user, so that the seed user database can be queried using a feature search engine to predict the group or group features of the non-seed users.
Each seed user set corresponds to a user group, the seed users comprise school labels and grade labels, and the seed users are users with original group information with confidence degrees larger than a first preset value.
As shown in fig. 7, the system further includes a calculating module 701, where the calculating module 701 is configured to calculate similarity of the non-seed user and, for example, similarity of various seed users in a seed user database, screen out seed users whose similarity with the non-seed user is within a predetermined range, and when predicting a group to which the non-seed user belongs, predict the group to which the non-seed user belongs according to group information of the screened seed users.
Further, the calculating module 701 further includes: and calculating the distance between the searching behavior characteristic vector of the non-seed user and the searching behavior characteristic vectors of various sub-users as the similarity.
Specifically, screening out seed users with the similarity to the non-seed users within a preset range from the seed user database; and when the group of the non-seed user is predicted, predicting the group or group characteristics of the non-seed user according to the group information of the screened seed user.
Further, the screening out the seed users whose similarity with the non-seed users is within a predetermined range includes: and calculating the average distance between any two seed users in the maximum class in each set, and screening out the seed users of which the distance between the non-seed user and the seed user in each set is smaller than the average distance.
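The average-distance screening described above can be illustrated with a brute-force sketch. It assumes that the distance is one minus the cosine similarity of the feature vectors; the data layout (`seed_sets` as a mapping from set name to seed user vectors) and all names are hypothetical, and a feature search engine would replace the inner loop in practice.

```python
from itertools import combinations
from math import sqrt


def cosine_distance(a, b):
    """One minus cosine similarity; smaller means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return 1.0 - (dot / norm if norm else 0.0)


def average_pairwise_distance(seed_vectors):
    """Mean distance over every pair of seed users in one set."""
    pairs = list(combinations(seed_vectors.values(), 2))
    if not pairs:
        return 0.0
    return sum(cosine_distance(a, b) for a, b in pairs) / len(pairs)


def screen_seed_users(query_vector, seed_sets):
    """Return (set_name, user_id, distance) for every seed user that is
    closer to the non-seed user than the average distance in its own set."""
    hits = []
    for set_name, seed_vectors in seed_sets.items():
        threshold = average_pairwise_distance(seed_vectors)
        for user_id, vector in seed_vectors.items():
            distance = cosine_distance(query_vector, vector)
            if distance < threshold:
                hits.append((set_name, user_id, distance))
    return hits


# Two seed user sets (hypothetical toy vectors) and one non-seed user.
seed_sets = {
    "A": {"a1": [1.0, 0.0], "a2": [0.8, 0.6], "a3": [0.6, 0.8]},
    "B": {"b1": [0.0, 1.0], "b2": [-0.6, 0.8]},
}
hits = screen_seed_users([0.7, 0.7], seed_sets)
```

Here only `a2` and `a3` fall below the average distance of their own set, so only they are passed on to the group prediction.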
Then, the screened seed users are counted according to their original group information, and the original group with the largest number of corresponding seed users is taken as the predicted group to which the non-seed user belongs; and when the number of users contained in the cluster to which the non-seed user belongs is within a preset range and the cluster contains a seed user, the group to which that seed user belongs is predicted as the group to which the non-seed user belongs.
Preferably, before the screened seed users are counted according to their original group information, the method further comprises: filtering the screened seed users according to their region information, and removing the seed users that are not in the same region as the non-seed user.
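The region filtering and counting described above amount to a filtered majority vote, which may be sketched as follows. The record layout (dictionaries with `region` and `group` keys) and the sample values are assumptions made only for illustration.

```python
from collections import Counter


def predict_group(non_seed_region, screened_seeds):
    """Drop screened seed users from other regions, then return the original
    group held by the most remaining seed users (None if nothing remains)."""
    same_region = [s for s in screened_seeds if s["region"] == non_seed_region]
    if not same_region:
        return None
    votes = Counter(s["group"] for s in same_region)
    group, _count = votes.most_common(1)[0]
    return group


# Hypothetical screened seed users for one non-seed user located in Beijing.
screened = [
    {"region": "Beijing", "group": "School A, grade 3"},
    {"region": "Beijing", "group": "School A, grade 3"},
    {"region": "Beijing", "group": "School B, grade 3"},
    {"region": "Shanghai", "group": "School C, grade 3"},
    {"region": "Shanghai", "group": "School C, grade 3"},
    {"region": "Shanghai", "group": "School C, grade 3"},
]
predicted = predict_group("Beijing", screened)
```

Without the region filter the out-of-region group would win the vote with three seed users, which is the kind of noise the filtering step is meant to remove.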
The portions of embodiment 2 that are the same as those of embodiment 1 are not described again.
Those skilled in the art will appreciate that the modules in the above-described embodiments of the apparatus may be distributed as described in the apparatus, and may be correspondingly modified and distributed in one or more apparatuses other than the above-described embodiments. The modules of the above embodiments may be combined into one module, or further split into multiple sub-modules.
Compared with the prior art, the invention screens out, based on the similarity level of the users' shooting and searching behaviors, the seed users in each set whose group information has a confidence greater than a first preset value; for a non-seed user whose group is to be predicted, the similarity between the shooting and searching behaviors of the non-seed user and those of the seed users is calculated, and the group to which the non-seed user belongs is then predicted according to the similarity, wherein the non-seed users comprise users lacking group information and users whose group information in the set has a confidence not greater than the first preset value. Therefore, the group to which the non-seed user belongs can be predicted more accurately, the prediction precision can be further improved, and the algorithm is simple and efficient.
Furthermore, the search results of the shooting and searching behaviors are converted into vectors that represent the users, so that more accurate shooting and searching behavior data can be obtained and the group information of the users can be represented more accurately. By removing non-seed users from each original group, a more accurate seed user set (i.e., user group category) can be obtained and the prediction accuracy can be improved. By searching the seed user database with a feature search engine, calculating the similarity between the user feature vector of the non-seed user and the user feature vectors of the seed users in the database to screen out the seed users whose similarity is within a preset range, and taking the original group with the largest number of corresponding seed users as the predicted group of the non-seed user, the low data processing speed caused by a large volume of user data can be effectively avoided, the group to which the non-seed user belongs can be predicted more accurately, the prediction accuracy can be further improved, and the method is further optimized.
Example 3
This embodiment provides a method for predicting the class of a user based on shooting and searching behaviors. It builds on the method described above, takes class prediction for users of an online education APP as an example, and mainly comprises the following steps:
step 1, uniformly labeling the shooting and searching behaviors of the users, wherein labeling is defined as the normalization of the users' shooting and searching requirements, namely, searches by different users for the same question are regarded as the same requirement (even though different users may upload different shooting and searching requests for retrieval); because the same question appears in a great number of different textbooks, exercise books and test papers, the fact that different users share a requirement (searching for the same question) does not express the similarity between two users well, so the requirement scenarios are screened and only whole-page shooting and searching requirements are used, in which the user photographs and searches a whole page of test questions rather than a single question; if different users upload the same whole page of test questions, the probability that they use the same textbook, exercise book or test paper can reach more than 80 percent, and if several identical whole pages are shot by both users, the probability, calculated as independent events, rises very quickly and the confidence is very high; most of the resources hit by whole-page searching requirements come from textbooks, exercise books and test papers, the TIDs (test questions) and PIDs (pages) of these resources are uniformly labeled as KEYs, and after the full data has been normalized a finite set of KEYs formed by the TIDs and PIDs is obtained, namely the label set, so that a user's searching behavior can be represented as a label sequence in which each element represents a specific label KEY;
step 2, digitally encoding the KEYs, wherein the finite set formed by all the KEY codes can be regarded as a complete dictionary, the elements in the dictionary can be regarded as words, and a label sequence is an article formed by arranging the words in a particular order, so that each user can be regarded as an independent article and the digital codes in the dictionary can be vectorized by an embedding learning method; the embedding learning method uses the fastText tool with the skip-gram algorithm, the input is a plurality of lines of label sequences, each line being one article formed by arranging words in a particular order, the model is a shallow neural network, and after training is completed the hidden layer of the neural network outputs a fixed-dimension vector for each label (word);
step 3, once the labels have been vectorized, each article (label sequence) can in turn be encoded as a vector by average pooling, so that each user is finally represented by one vector, and users with similar searching behaviors have a large vector cosine similarity; in actual calculation, one minus the vector cosine similarity is used to characterize the shooting and searching behavior similarity between any two users, so that the more similar the shooting and searching behaviors are, the smaller this value is; for users in the same class, this value is small for any two users with high probability (because the textbooks, exercise books and test papers searched by users of the same class in the same time period are very similar);
step 4, a portion of the users fill in school and grade information in the Zuoyebang APP; a class is defined as the same grade of the same school, so these users can be regarded as having filled in class information; in practice, different users fill in different text for the same school name, so the names and region information of more than 300,000 primary schools, middle schools and high schools nationwide are collated, the school names are normalized, and the province, city and school information filled in by each user is mapped to a normalized school name;
step 5, defining users as vertices, the similarity relation between users as edges, and the vector similarity as the edge weight; in principle the shooting and searching behavior similarity between any user and all other users can be calculated, but Zuoyebang has monthly active users on the order of hundreds of millions, and calculating the similarity for every pair of users would require an enormous amount of computation; therefore the users who have filled in class information are placed in the corresponding class sets, which yields millions of class sets, and within each set the users are clustered using the Louvain community discovery algorithm, the largest cluster in the set is considered credible, and the users who are not in the largest cluster are removed from the set; the input of the Louvain algorithm is the vertices and edge weights of the graph, and the output is a set of communities formed by the vertices; in this scenario most of the vertices are assigned to one community and the noise points are assigned to other small communities;
step 6, the users located in the largest cluster are called seed users, and the users without a school label or removed as not credible are called unknown-class users; a library of all seed users is built with a feature search engine, the average distance (similarity measure) between any two users in each largest cluster is calculated, and the feature search engine is used to find, for each unknown-class user, the set of seed users whose distance to that user is smaller than the average distance; the feature search engine uses the faiss tool, the ID and feature vector of each element are input when the library is built, and at search time parameters can be set so that the engine returns the IDs of the library elements within a specified distance together with the specific distance (similarity) to each of them;
step 7, a set of similar seed users is obtained for each unknown-class user, the similar seed users from schools outside the user's locality are filtered out of the set as noise points based on the region information of the unknown-class user, votes are then counted according to the class labels of the remaining seed users, and the class with the most votes among the recalled seed users is taken as the label of the unknown-class user; in actual operation, the similar seed user set of a certain proportion of unknown-class users is empty (too few shooting and searching behaviors, or new users), and the prediction accuracy for this portion of users is low.
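By way of illustration only, the labeling and digital encoding of steps 1 and 2 can be sketched as below. The KEY format, the function names and the sample TID and PID values are hypothetical; the sketch also drops repeated hits within one user's history, which is one possible reading of the de-duplication mentioned in the claims.

```python
def make_key(tid, pid):
    """Combine a test-question ID and a page ID into one KEY label."""
    return f"TID{tid}_PID{pid}"


def build_dictionary(user_hits):
    """Assign a numeric code to every distinct KEY, in first-seen order."""
    dictionary = {}
    for hits in user_hits.values():
        for tid, pid in hits:
            dictionary.setdefault(make_key(tid, pid), len(dictionary))
    return dictionary


def encode_sequences(user_hits, dictionary):
    """Represent each user's whole-page search history as a sequence of
    numeric codes, dropping repeated hits while keeping the original order."""
    encoded = {}
    for user_id, hits in user_hits.items():
        seen, sequence = set(), []
        for tid, pid in hits:
            code = dictionary[make_key(tid, pid)]
            if code not in seen:
                seen.add(code)
                sequence.append(code)
        encoded[user_id] = sequence
    return encoded


# Hypothetical whole-page search hits as (TID, PID) pairs per user.
user_hits = {
    "u1": [(101, 12), (102, 12), (101, 12)],
    "u2": [(102, 12), (205, 40)],
}
dictionary = build_dictionary(user_hits)
sequences = encode_sequences(user_hits, dictionary)
```

Each encoded sequence plays the role of one article; written one per line, such sequences would form the input to the embedding training mentioned in step 2.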
The class prediction method, which extends class labels through the similarity of shooting and searching behaviors among users, requires relatively little computation, achieves high accuracy, and facilitates providing high-quality service to users subsequently.
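The average pooling and similarity measure of step 3 can be illustrated with the following minimal sketch. The three-dimensional label vectors are written by hand purely for illustration; in the embodiment they would come from the trained skip-gram model, and the label names and function names here are hypothetical.

```python
from math import sqrt

# Hypothetical label (KEY) embeddings, standing in for trained vectors.
KEY_VECTORS = {
    "TID101_PID12": [1.0, 0.0, 0.0],
    "TID102_PID12": [0.8, 0.2, 0.0],
    "TID103_PID13": [0.9, 0.1, 0.0],
    "TID900_PID77": [0.0, 0.0, 1.0],
}


def user_vector(label_sequence, key_vectors=KEY_VECTORS):
    """Average-pool the vectors of the labels in one user's label sequence."""
    known = [key_vectors[k] for k in label_sequence if k in key_vectors]
    if not known:
        raise ValueError("no known labels in sequence")
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def behavior_distance(a, b):
    """One minus cosine similarity: smaller means more similar behavior."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return 1.0 - (dot / norm if norm else 0.0)


user_a = user_vector(["TID101_PID12", "TID102_PID12"])
user_b = user_vector(["TID102_PID12", "TID103_PID13"])
user_c = user_vector(["TID900_PID77"])

d_ab = behavior_distance(user_a, user_b)  # overlapping pages, small value
d_ac = behavior_distance(user_a, user_c)  # unrelated pages, large value
```

Users who search overlapping pages obtain a value close to zero, while users with unrelated search histories obtain a value close to one, which is the property the seed user screening relies on.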
Example 4
In the following, embodiments of the computer apparatus of the present invention are described, which may be seen as specific physical embodiments for the above-described embodiments of the method and apparatus of the present invention. The details described in the computer device embodiment of the invention should be considered as additions to the method or apparatus embodiment described above; for details which are not disclosed in the embodiments of the computer device of the invention, reference may be made to the above-described embodiments of the method or apparatus.
Fig. 8 is a schematic structural diagram of a computer device according to an embodiment of the present invention. The computer device includes a processor and a memory; the memory stores a computer-executable program, and the processor performs the method of fig. 1 when the program is executed by the processor.
As shown in fig. 8, the computer device is in the form of a general purpose computing device. The processor can be one or more and can work together. The invention also does not exclude that distributed processing is performed, i.e. the processors may be distributed over different physical devices. The computer device of the present invention is not limited to a single entity, and may be a sum of a plurality of entity devices.
The memory stores a computer executable program, typically machine readable code. The computer readable program may be executed by the processor to enable a computer device to perform the method of the invention, or at least some of the steps of the method.
The memory may include volatile memory, such as Random Access Memory (RAM) and/or cache memory, and may also be non-volatile memory, such as read-only memory (ROM).
Optionally, in this embodiment, the computer device further includes an I/O interface, which is used for data exchange between the computer device and an external device. The I/O interface may be a local bus representing one or more of several types of bus structures, including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, and/or a memory storage device using any of a variety of bus architectures.
It should be understood that the computer device shown in fig. 8 is only one example of the present invention, and elements or components not shown in the above examples may also be included in the computer device of the present invention. For example, some computer devices also include display units such as display screens, and some computer devices also include human-computer interaction elements such as buttons, keyboards, and the like. The computer device can be considered to be covered by the present invention as long as the computer device can execute the computer readable program in the memory to implement the method of the present invention or at least part of the steps of the method.
FIG. 9 is a schematic diagram of a computer program product of an embodiment of the invention. As shown in fig. 9, a computer-executable program is stored in the computer program product, and when the computer-executable program is executed, the method of the present invention is implemented. The computer readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electromagnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., through the internet using an internet service provider).
From the above description of the embodiments, those skilled in the art will readily appreciate that the present invention can be implemented by hardware capable of executing a specific computer program, such as the system of the present invention, and electronic processing units, servers, clients, mobile phones, control units, processors, etc. included in the system. The invention may also be implemented by computer software for performing the method of the invention, e.g. control software executed by a microprocessor, an electronic control unit, a client, a server, etc. It should be noted that the computer software for executing the method of the present invention is not limited to be executed by one or a specific hardware entity, and can also be realized in a distributed manner by non-specific hardware. For computer software, the software product may be stored in a computer readable storage medium (which may be a CD-ROM, a usb disk, a removable hard disk, etc.) or may be distributed over a network, as long as it enables the computer device to perform the method according to the present invention.
While the foregoing detailed description has described the objects, aspects and advantages of the present invention in further detail, it should be appreciated that the present invention is not inherently related to any particular computer, virtual machine, or computer apparatus, as various general purpose devices may implement the present invention. The invention is not limited to the specific embodiments described above; all modifications, changes and equivalents that come within the spirit and scope of the invention are intended to be covered.

Claims (10)

1. A method for predicting a group to which a user belongs based on a shooting and searching behavior, wherein the shooting and searching behavior refers to a behavior of initiating a photo searching request and obtaining a searching result, and the method is characterized by comprising the following steps of:
dividing users into different sets, so that the users in the same set have the same group information, wherein the group information is related to the shooting and searching behaviors of the users;
screening out seed users with the confidence degrees of the group information in each set larger than a first preset value according to the similarity level of the shooting and searching behaviors of the users;
and for non-seed users, calculating the similarity between the shooting and searching behaviors of the non-seed users and those of the seed users, and predicting the group of the non-seed users according to the similarity, wherein the non-seed users comprise users missing the group information and users whose group information in the set has a confidence not greater than the first preset value.
2. The method of claim 1, wherein before the screening the seed user, the method further comprises: labeling the search result of the shooting and searching behavior, and converting the labeled search result into a characteristic vector to represent the shooting and searching behavior of the user;
and subsequently calculating the similarity of the shooting and searching behaviors of the users based on the feature vectors.
3. The method for predicting the group to which the user belongs based on the shooting and searching behavior as claimed in claim 2, wherein the labeling the search result of the shooting and searching behavior and converting the labeled search result into the feature vector comprises:
acquiring historical shooting and searching behaviors of a user, and characterizing the historical shooting and searching behaviors as a label sequence according to a corresponding search result, wherein each label in the label sequence represents at least one characteristic of the search result;
converting the tag sequence into a vector sequence;
normalizing the sequence of vectors to the feature vector;
optionally, the method further comprises: characterizing a user by the feature vector of the user's shooting and searching behavior, wherein the name of the feature vector is defined as the user identifier, and the length of the feature vector is defined as the feature expression of the user's shooting and searching behavior;
optionally, characterizing the historical shooting and searching behavior as a tag sequence according to the corresponding search result includes: labeling the search results; and de-duplicating the labeled search result data;
optionally, the shooting and searching behavior refers to a behavior of initiating a photo-based search request to obtain a search result;
optionally, the photo is a whole-page photo used for whole-page shooting and searching; the tag sequence comprises: test questions and pages;
optionally, the group information comprises at least one of the following: the region, school, grade, class and group of the user;
optionally, the tag generated in the step of tagging the search result of the shooting and searching behavior includes at least one of the following tags: textbooks, test papers, books, exercise books, page numbers and test questions.
4. The method of claim 2, wherein the step of screening out seed users comprises:
clustering the feature vectors; taking the users in the maximum class obtained after clustering in the set as the seed users;
optionally, the clustering process is performed on the users in each set by using a community discovery algorithm inside the set.
5. The method for predicting the group to which the user belongs based on the shooting and searching behavior according to any one of claims 1 to 4, wherein after the calculating the similarity of the shooting and searching behaviors of the non-seed user and the seed users, the method further comprises:
screening out seed users with the similarity to the non-seed users within a preset range;
and when the group of the non-seed user is predicted, predicting the group or group characteristics of the non-seed user according to the group information of the screened seed user.
6. The method for predicting the group to which the user belongs based on the shooting and searching behavior as claimed in claim 5, wherein a user space is defined, in the user space each user is taken as a vertex, the similarity relation between adjacent users is taken as an edge, and the similarity of the shooting and searching behavior feature vectors of the adjacent users is taken as the weight of the edge, and when the seed users whose group information has a confidence greater than the first preset value are screened out, the users in each set are clustered using a Louvain community discovery algorithm to obtain the seed users;
the calculating the similarity of the shooting and searching behaviors of the non-seed user and the seed users comprises: in the user space, calculating the distance between the non-seed user and each seed user as the similarity;
the screening out the seed users whose similarity to the non-seed user is within a preset range comprises:
calculating the average distance between any two seed users in the largest class in each set,
and screening out, in each set, the seed users whose distance to the non-seed user is smaller than the average distance.
7. The method of claim 5, wherein predicting the group to which the non-seed user belongs according to the original group information of the screened seed users comprises:
respectively counting the number of the screened seed users according to the original group information, and taking the original group with the maximum number of the corresponding seed users as a predicted group to which the non-seed user belongs;
and when the number of the users contained in the cluster to which the non-seed user belongs is within a preset range and the cluster contains the seed user, predicting the group to which the seed user belongs as the group to which the non-seed user belongs.
8. The method for predicting the group to which the user belongs based on the shooting and searching behavior of claim 7, wherein before counting the number of the screened seed users according to the original group information, the method further comprises: filtering the screened seed users according to their region information, and removing the seed users that are not in the same region as the non-seed user.
9. A prediction apparatus for predicting a group to which a user belongs based on a shooting and searching behavior, the shooting and searching behavior being a behavior of initiating a photo search request and obtaining a search result, the prediction apparatus comprising:
the grouping module is used for dividing users into different sets, so that the users in the same set have the same group information, and the group information is related to the shooting and searching behaviors of the users;
the screening module is used for screening out, according to the similarity level of the shooting and searching behaviors of the users in each set, the seed users whose group information has a confidence greater than a first preset value;
and the prediction module is used for calculating, for non-seed users, the similarity between the shooting and searching behaviors of the non-seed users and those of the seed users, and predicting the group of the non-seed users according to the similarity, wherein the non-seed users comprise users missing the group information and users whose group information in the set has a confidence not greater than the first preset value.
10. A computer device comprising a processor and a memory, the memory for storing a computer executable program, characterized in that:
when the computer program is executed by the processor, the processor performs the method of predicting the group to which a user belongs based on the shooting and searching behavior as claimed in any one of claims 1-8.
CN202110485570.4A 2021-04-30 2021-04-30 Method and device for predicting user group based on shooting and searching behaviors and computer equipment Pending CN113204662A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110485570.4A CN113204662A (en) 2021-04-30 2021-04-30 Method and device for predicting user group based on shooting and searching behaviors and computer equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110485570.4A CN113204662A (en) 2021-04-30 2021-04-30 Method and device for predicting user group based on shooting and searching behaviors and computer equipment

Publications (1)

Publication Number Publication Date
CN113204662A true CN113204662A (en) 2021-08-03

Family

ID=77028536

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110485570.4A Pending CN113204662A (en) 2021-04-30 2021-04-30 Method and device for predicting user group based on shooting and searching behaviors and computer equipment

Country Status (1)

Country Link
CN (1) CN113204662A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20230627

Address after: 6001, 6th Floor, No.1 Kaifeng Road, Shangdi Information Industry Base, Haidian District, Beijing, 100085

Applicant after: Beijing Baige Feichi Technology Co.,Ltd.

Address before: 100085 4002, 4th floor, No.1 Kaifa Road, Shangdi Information Industry base, Haidian District, Beijing

Applicant before: ZUOYEBANG EDUCATION TECHNOLOGY (BEIJING) CO.,LTD.
