WO2019109698A1 - Method and apparatus for determining a target user group - Google Patents

Method and apparatus for determining a target user group

Info

Publication number
WO2019109698A1
Authority
WO
WIPO (PCT)
Prior art keywords
user
address
word
seed
text information
Prior art date
Application number
PCT/CN2018/104939
Other languages
English (en)
French (fr)
Inventor
汪昊宇
彭际群
Original Assignee
阿里巴巴集团控股有限公司
Priority date
Filing date
Publication date
Application filed by 阿里巴巴集团控股有限公司
Publication of WO2019109698A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution

Definitions

  • One or more embodiments of the present disclosure relate to the field of computer technologies, and in particular, to a method and apparatus for determining a target user group.
  • In the conventional technology, when a target user group is selected from a large number of users, the information actively provided by those users is usually first reviewed manually, and the selection is then performed according to the target information determined by that manual review.
  • Alternatively, a target list or a thesaurus containing target information is created in advance, and the selection is performed by matching the text information of each user against the target list or the thesaurus.
  • One or more embodiments of the present specification describe a method and apparatus for determining a target user population that can determine a target user population more quickly and efficiently.
  • a method for determining a target user population including:
  • the corresponding text information is accurately matched with the keywords in the keyword library, and if the matching is successful, the matching score of the text information is determined;
  • the extended user is expanded to the core user group to obtain a target user group.
  • a determining apparatus for a target user group including:
  • a dividing unit configured to divide the entire user group acquired by the acquiring unit into two or more sub-user groups, wherein different sub-user groups respectively correspond to different text information
  • a screening unit configured to filter out corresponding candidate user groups from the respective sub-user groups according to the screening condition of the text information corresponding to each sub-user group divided by the dividing unit, to obtain two or more candidate user groups;
  • a matching unit configured to accurately match the corresponding text information with the keywords in the keyword library for each candidate user group selected by the screening unit, and if the matching is successful, determine a matching score of the text information
  • a merging unit configured to merge the two or more candidate user groups that are filtered by the screening unit to obtain a core user group
  • a selecting unit configured to select a seed user from the core user group according to a matching score of various types of text information of the user in the core user group;
  • a calculating unit configured to separately calculate a similarity between each type of text information of the seed user selected by the selecting unit and the type of text information of other users in the entire user group except the seed user;
  • the selecting unit is further configured to select an extended user from the other users according to the similarity calculated by the calculating unit;
  • an expansion unit configured to expand the extended user selected by the selecting unit to the core user group, thereby obtaining a target user group.
  • the method and device for determining a target user group divides the obtained entire user group into two or more sub-user groups according to different text information.
  • the corresponding candidate user groups are selected from the respective sub-user groups.
  • the text information corresponding to each candidate user group is precisely matched with the keywords in the keyword library, and if the matching is successful, the matching score of the text information is determined.
  • the seed user is selected from the core user group according to the matching scores of various types of text information of the users in the core user group. Calculate the similarity between each type of text information of the seed user and the text information of other users.
  • an extended user is selected from other users. Extend the extended user to the core user community to get the target user community. Thereby, the target user group can be determined more quickly and efficiently.
  • FIG. 1 is a schematic diagram of an application scenario of a method for determining a target user group according to an embodiment of the present disclosure
  • FIG. 2 is a flowchart of a method for determining a target user group according to an embodiment of the present disclosure
  • FIG. 3 is a schematic diagram of a matching process of text information of a user provided by the present specification
  • FIG. 4 is a schematic diagram of a process of calculating a similarity of a user's address book provided in the present specification
  • FIG. 5 is a schematic diagram of a method for determining a target user group according to another embodiment
  • FIG. 6 is a schematic diagram of a determining apparatus of a target user group according to an embodiment of the present disclosure.
  • the determining device of the target user group may determine a high net worth population from the entire user group according to the user's text information (including but not limited to the delivery address and the address book, etc.).
  • High net worth individuals here can refer to groups with stable jobs and higher incomes. They may include, but are not limited to, employees in the financial industry (including banking, securities, and insurance) and the IT industry (including software services and the Internet), employees of large state-owned enterprises, civil servants working in government agencies, and teachers, doctors, and other public officials working in administrative institutions. This group has high solvency, a strong willingness to repay, and a low level of credit risk.
  • the target user group determining device can push the high net worth group to the consumer credit system. Therefore, the consumer credit system can provide the corresponding consumer credit products for the group, thereby achieving the purpose of expanding the development of the credit business, and also providing tremendous assistance for the automated and personalized credit approval process and marketing process.
  • the method for determining the target user group provided by the embodiment of the present disclosure can be applied to other scenarios, for example, the determination of a high-consumer user group, and the like, which is not limited in this specification.
  • FIG. 2 is a flowchart of a method for determining a target user group according to an embodiment of the present disclosure.
  • the execution subject of the method may be a device having processing capabilities: a server or a system or device, such as the determining device of the target user group in FIG. As shown in FIG. 2, the method may specifically include:
  • Step 210: Acquire the entire user group.
  • the entire user community can be obtained from a back-end database of the Alipay system.
  • the user in the entire user group may have text information such as a delivery address and/or an address book.
  • For example, a user who has purchased physical goods on a shopping site and completed the order has a saved delivery address.
  • The address book may include the remark information (the note saved for each contact) and the corresponding phone number.
  • The remark information of a contact may include the contact's name, a nickname, and other information indicating the industry or company to which the contact belongs.
  • For example, the remark information can be "Facebook Zhang San", "Li Xingchang", and so on.
  • Step 220: The entire user group is divided into two or more sub-user groups.
  • different sub-user groups respectively correspond to different text information.
  • the textual information here can be used to characterize users in a sub-user community. It usually has a clear directivity and is related to the quality of the user's access to the service, so it usually has a high degree of recognition and credibility.
  • The user's text information includes, but is not limited to, one or more of the following: a delivery address, an address book, a wireless network (e.g., Wi-Fi) name, a company-class name corresponding to a Global Positioning System (GPS) location point, a company name corresponding to an Internet Protocol (IP) address, a company name corresponding to a Media Access Control (MAC) address, a remark name in social software, a group name in social software, a remark name in an instant messaging tool, and a group name in an instant messaging tool.
  • the entire user group can be divided into two sub-user groups.
  • The users in one sub-user group have a delivery address; that is, this sub-user group corresponds to the delivery address.
  • Users in another sub-user group have contacts, that is, another sub-user group corresponds to the address book.
  • Step 230: According to the screening condition of the text information corresponding to each sub-user group, select the corresponding candidate user group from each sub-user group, obtaining two or more candidate user groups.
  • the text information includes the delivery address and the address book.
  • The screening conditions of the delivery address include one or more of the following: the delivery address is used by the user personally (the consignee is the user, or the contact number is the user's own mobile phone number); the delivery address has been used by the user recently (for example, within the last year); and the delivery address is a company-class address.
  • the premise is that the contacts in the address book have a bound mobile number. In general, many websites require users to bind mobile numbers in order to facilitate identity verification and reach users.
  • The screening conditions of the address book may include one or more of the following: the telephone number of the user to whom the address book belongs is used by that user personally, and that telephone number is included in other users' address books.
  • the corresponding candidate user group may be selected from the one sub-user group according to the screening condition of the corresponding delivery address. It can be understood that the candidate user group also corresponds to the delivery address. That is, the users in the candidate group all have a delivery address. Similarly, for another sub-user group, the corresponding candidate user group can be selected from another sub-user group according to the corresponding filtering conditions of the address book. It can be understood that the candidate user group also corresponds to the address book. That is, the users in the candidate group all have an address book. Thereby two candidate user groups are obtained.
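  • The screening in step 230 can be illustrated with a minimal sketch for the delivery address. It assumes that all three listed conditions are applied together (the text allows any one or more of them) and that "recent" means within 365 days; the `DeliveryAddress` record and the function name are hypothetical and do not come from the specification.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class DeliveryAddress:
    """Hypothetical record for one user's saved delivery address."""
    user_id: str
    used_by_user: bool        # consignee is the user, or contact number is the user's own
    last_used: date           # most recent date the address was used
    is_company_address: bool  # address is classified as a company-class address


def screen_delivery_addresses(records, today, recent_days=365):
    """Return the ids of users whose delivery address passes every condition."""
    cutoff = today - timedelta(days=recent_days)
    return {
        r.user_id
        for r in records
        if r.used_by_user and r.last_used >= cutoff and r.is_company_address
    }
```

  • The same shape applies to the address book: a predicate per screening condition, and a candidate group made of the users that satisfy the chosen predicates.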
  • Step 240: For each candidate user group, the corresponding text information is exactly matched against the keywords in the keyword library, and if the matching is successful, the matching score of the text information is determined.
  • the process of matching and determining the matching score may be as shown in FIG. 3.
  • the following steps can be included:
  • the keyword library may include keywords of industries and companies of interest.
  • the keyword library may include keywords such as “Bank of China”, “Guotai Junan Securities” and “Pacific Insurance”.
  • the keyword library can include keywords such as “Alibaba”, “Tencent” and “Huawei”. It should be noted that the above keywords may include the full name, short name or other recognizable name of the company.
  • Step b: Text structuring. The text information is cleaned and structured according to its semantic components.
  • For a delivery address, the province, city, and county can be separated out and the key site (also called a point of interest, POI) can be extracted.
  • For example, the key site can be: "Alipay Company, 6th Floor, Block B, Huanglong Times Square, No. 18 Wantang Road".
  • For an address book, the remark information of each contact can be extracted, and irrelevant words can be removed from it. The irrelevant words may include the contact's name, nickname, and other irrelevant titles (e.g., "Ms." or "buddy").
  • the step of text structuring the delivery address may further include a step of segmentation.
  • the delivery address can be divided into "province / city / district / county / street / road / house number / office / floor / company / other" form.
  • Step c: Exact text matching.
  • Exact text matching determines whether the key site of a delivery address, or the remark information of a contact, contains a keyword in the keyword library. If so, the match succeeds; otherwise, the match fails.
  • the keywords in the keyword library include: “Bank of China”, “Alipay” and “Tencent”.
  • Suppose the key site is "Alipay Company, 6th Floor, Block B, Huanglong Times Square, 18 Wantang Road". Since the key site contains the keyword "Alipay", it matches the keyword library.
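  • The exact matching of step c can be sketched as follows, assuming that "contains a keyword" means plain substring containment. The function name `exact_match` is hypothetical; the keywords and the key site are the ones used in the example above.

```python
def exact_match(text, keyword_library):
    """Return the keywords from the library that occur verbatim in the text."""
    return [keyword for keyword in keyword_library if keyword in text]


keywords = ["Bank of China", "Alipay", "Tencent"]
key_site = "Alipay Company, 6th Floor, Block B, Huanglong Times Square, 18 Wantang Road"

# A non-empty result means the match succeeded.
matched = exact_match(key_site, keywords)
```

  • Substring containment is the simplest reading of the text; it is also what produces the mismatches that the later error-handling rules are meant to remove.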
  • The matching score of the delivery address may be determined as follows: the corresponding number of transaction days is determined from the uses of the delivery address within a preset time period, and that number of transaction days is used as the matching score of the delivery address.
  • The matching score of the address book may be determined as follows: count the other address books whose remark information records the user to whom the address book belongs, and use that count as the matching score of the address book.
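  • A hedged sketch of the two matching scores. It assumes that "transaction days" means the number of distinct days on which the address was used inside the preset period, and that the address book score counts the other users' address books whose remark for this user's phone number contains a keyword. Both readings, and all names below, are assumptions made for illustration.

```python
from datetime import date


def delivery_address_score(order_dates, start, end):
    """Count the distinct days in [start, end] on which the address was used."""
    return len({d for d in order_dates if start <= d <= end})


def address_book_score(user_phone, other_address_books, keyword_library):
    """Count other address books whose remark for user_phone contains a keyword.

    Each address book is modelled as a dict mapping phone number -> remark text.
    """
    count = 0
    for book in other_address_books:
        remark = book.get(user_phone)
        if remark and any(keyword in remark for keyword in keyword_library):
            count += 1
    return count
```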
  • step 240 is also a process of screening users.
  • A step of error handling can also be performed. Because natural language is flexible, mismatches are inevitable. For example, "Bank of China supermarket" and "Customer introduced by ICBC Xiaowang" should not be identified as targets, so rules can be designed (for example, a company name followed by a position word) to remove as many mismatches as possible. In addition, users who are obviously engaged in black-market activity, fraud, or sales can be removed.
  • Step 250: The two or more candidate user groups are merged to obtain a core user group.
  • a candidate user group corresponding to the delivery address and a candidate user group corresponding to the address book may be merged. It can be understood that since some users have both a receiving address and an address book, the number of users of the combined core user group will be less than the sum of the number of users of the two candidate user groups.
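  • The merge in step 250 is a set union. A minimal sketch, assuming each candidate group is held as a mapping from user id to matching score (the function and key names are hypothetical):

```python
def merge_candidates(address_scores, book_scores):
    """Union two candidate groups, keeping each user's score per text type."""
    core = {}
    for user, score in address_scores.items():
        core.setdefault(user, {})["delivery_address"] = score
    for user, score in book_scores.items():
        core.setdefault(user, {})["address_book"] = score
    return core


core_group = merge_candidates({"u1": 120, "u2": 60}, {"u2": 25, "u3": 8})
```

  • Here `u2` appears in both candidate groups, so the core group holds three users rather than four.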
  • Step 260: Select seed users from the core user group according to the matching scores of the various types of text information of the users in the core user group.
  • the corresponding level information may be determined according to a matching score of each type of text information of the user. Then, the level information corresponding to each type of text information is cross-fused to determine the matching level of the user. After determining the matching level, the seed user may be selected from the core user group according to the matching level of each user.
  • For example, the rating rule for the delivery address may be: when the matching score is greater than 100, the level information is high (represented by "2"); when the matching score is greater than 50 and at most 100, the level information is medium (represented by "1"); when the matching score is less than or equal to 50, the level information is low (represented by "0").
  • For the other type of text information (the address book in this example), the rule may be: when the matching score is greater than 20, the level information is high (represented by "2"); when the matching score is greater than 10 and at most 20, the level information is medium (represented by "1"); when the matching score is less than or equal to 10, the level information is low (represented by "0").
  • the matching level may include six: extra high (2+2), medium high (2+1), high (2+0), medium (1+1), medium low (1+0), and low (0+0).
  • the foregoing is only a simple method for determining the user matching level.
  • other complex algorithms may be combined to determine the matching level of the user.
  • For example, a weight value may be set for each piece of level information, and the matching level of the user may then be determined jointly from the level information and the weight values, which is not described further in this specification.
  • the level information determined according to the receiving address or the matching score of the address book can be used as the matching level of the user.
  • After the matching level of each user is determined, the seed users can be selected from the core user group according to the matching level. In the foregoing example, users in the core user group whose matching level is extra high or medium high may be selected as seed users.
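  • The simple rating and cross-fusion rule of step 260 can be sketched as below. The thresholds (100/50 and 20/10) and the six level names are taken from the example in the text; treating the two levels as an unordered pair, and the function names, are assumptions of this sketch.

```python
def level(score, high, low):
    """Map a matching score to 2 (high), 1 (medium), or 0 (low)."""
    if score > high:
        return 2
    if score > low:
        return 1
    return 0


LEVEL_NAMES = {
    (2, 2): "extra high",
    (2, 1): "medium high",
    (2, 0): "high",
    (1, 1): "medium",
    (1, 0): "medium low",
    (0, 0): "low",
}


def matching_level(address_score, book_score):
    """Cross-fuse the two per-type levels into one of the six matching levels."""
    pair = sorted((level(address_score, 100, 50), level(book_score, 20, 10)), reverse=True)
    return LEVEL_NAMES[tuple(pair)]


def is_seed_user(address_score, book_score):
    """Seed users are those rated extra high or medium high in the example."""
    return matching_level(address_score, book_score) in {"extra high", "medium high"}
```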
  • Step 270: Calculate the similarity between each type of text information of the seed users and the same type of text information of the other users in the entire user group.
  • In practice, delivery addresses show the following phenomena: 1) employees of the same company usually use the same delivery address in the real world (such as the company's gatehouse or mail room); 2) even for the same address, different users do not necessarily write it the same way. Based on this, this step merges the different written forms that actually represent the same address, so that users who fail exact text matching only because of differences in writing can still be recognized. To this end, the similarity between the delivery address of each other user and the delivery address of a seed user can be calculated. When the similarity satisfies a threshold, that delivery address is treated as "text similar" to the seed user's delivery address.
  • Here, "text similar" can mean that the Levenshtein distance (text edit distance) is small. The edit distance between two text strings is the minimum number of insert, delete, and substitute operations required to convert one string into the other.
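  • For reference, the Levenshtein distance can be computed with the standard dynamic-programming recurrence. The normalised `address_similarity` helper is an assumption added for illustration; the specification only says that the similarity must satisfy a threshold.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def address_similarity(a, b):
    """Assumed normalisation: 1.0 for identical strings, 0.0 for fully different."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest
```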
  • The address book differs from the delivery address. On the one hand, it lacks the standardized structure and clear directivity of the delivery address. On the other hand, it has no "user-address" relationship pairs that can be used for collaborative identification and no address that can serve as a "seed". However, considering the richness and colloquial character of address book remarks, contextual semantic information can be fully used: the keyword library is expanded by finding synonyms and related words (collectively referred to as related words), so that more of the target user group can be identified.
  • the calculation process of the address book similarity can be as shown in Fig. 4. In Fig. 4, the following steps can be included:
  • Step v: Word embedding.
  • the annotation information of the contacts in the address book of the sub-user group corresponding to the address book is subjected to word segmentation processing to obtain a full-quantity word collection.
  • the Word2Vec algorithm (a well-established word vectorization algorithm, but not limited to this algorithm) can then be used for unsupervised training to obtain the word vector for each word.
  • any two words can take the cosine similarity of the word vector (not limited to the similarity calculation method) as the similarity between the two, and then can determine the related words of each word in the full set of words.
  • related words of the seed words corresponding to the seed user's address book are also determined. It should be noted that when the number of seed words is plural, the plurality of seed words may constitute a set of seed words.
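  • The related-word lookup of step v can be sketched with plain cosine similarity. The word vectors below are made-up toy values standing in for vectors that would come from unsupervised training such as Word2Vec; the 0.8 threshold and the function names are likewise assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def related_words(word, word_vectors, threshold=0.8):
    """Return the other words whose vector is close enough to the given word's."""
    target = word_vectors[word]
    return sorted(
        other
        for other, vector in word_vectors.items()
        if other != word and cosine(target, vector) >= threshold
    )


# Toy vectors for illustration only (not trained values).
toy_vectors = {
    "Alibaba": [0.9, 0.1, 0.0],
    "Ali": [0.8, 0.2, 0.0],
    "Alipay": [0.7, 0.3, 0.1],
    "supermarket": [0.0, 0.1, 0.9],
}
```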
  • Step w: Expand the keyword library. The seed words are collected into a set and the word frequency of each seed word is counted. The expanded words are determined according to the word frequency of each seed word and its related words, and the expanded words are added to the keyword library. For example, suppose that in the set of seed words the word frequency of the seed word "Alibaba" is greater than a threshold, and the related words of "Alibaba" include "Ali" and "Alipay"; then "Alibaba", "Ali", and "Alipay" are added to the keyword library.
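  • Step w can be sketched as a frequency filter followed by a set union. This assumes that a seed word qualifies when its frequency is strictly greater than a threshold, and that all of its related words are then added; the function name and the `related` mapping are hypothetical.

```python
from collections import Counter


def expand_keyword_library(keyword_library, seed_words, related, min_frequency):
    """Add frequent seed words and their related words to the keyword library."""
    frequency = Counter(seed_words)
    expanded = set(keyword_library)
    for word, count in frequency.items():
        if count > min_frequency:
            expanded.add(word)
            expanded.update(related.get(word, []))
    return expanded
```

  • With the example from the text, a frequent seed word "Alibaba" brings "Ali" and "Alipay" into the library, while an infrequent seed word contributes nothing.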
  • Step x: Generate user vectors.
  • the target words appearing in the expanded keyword library are selected from the words corresponding to the address books of other users. Combine the full set of words and count the word frequency of the target words.
  • The user vectors of the other users are determined according to the word frequency of the target words and the corresponding word vectors (determined in step v). Similarly, the user vector of each seed user can be determined. That is, each user in the entire user group has a user vector.
  • Step y: Generate the seed vector.
  • the user vectors of all seed users are averaged to obtain a seed vector, which can be used to represent all seed users.
  • Step z: Calculate the similarity.
  • the cosine similarity of the user vector of the other users and the seed vector is calculated (not limited to the similarity calculation method).
  • the cosine similarity is used as the similarity between the address book of other users and the address book of the seed user. The higher the similarity, the higher the probability that other users belong to the target audience.
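  • Steps x to z can be sketched together. The sketch assumes that a user vector is the frequency-weighted average of the word vectors of that user's target words; the text only says the vector is determined from the word frequencies and word vectors, so the weighting scheme and all names here are assumptions.

```python
import math
from collections import Counter


def user_vector(words, keyword_library, word_vectors):
    """Frequency-weighted average of the vectors of the user's target words."""
    counts = Counter(w for w in words if w in keyword_library and w in word_vectors)
    total = sum(counts.values())
    if total == 0:
        return None  # the user has no target word
    dimension = len(next(iter(word_vectors.values())))
    vector = [0.0] * dimension
    for word, count in counts.items():
        for i, value in enumerate(word_vectors[word]):
            vector[i] += count * value / total
    return vector


def seed_vector(seed_user_vectors):
    """Average the user vectors of all seed users, component by component."""
    n = len(seed_user_vectors)
    return [sum(column) / n for column in zip(*seed_user_vectors)]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return 0.0 if norm_u == 0.0 or norm_v == 0.0 else dot / (norm_u * norm_v)
```

  • A user whose remarks use the same vocabulary as the seed users ends up with a vector close to the seed vector, and therefore with a high similarity.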
  • In summary, the target user group is identified from text information such as the delivery address and the address book through a text mining process of matching and extension: matching yields a matching level, and extension yields a similarity. Since the two data sources are independent of each other, the results can be cross-fused. The higher the matching level, the higher the fused level (called the confidence level); the higher the similarity, the higher the confidence level. If a user is identified from both sources, the confidence level is also higher. Finally, the identified users and their confidence levels are output. The higher the confidence level, the higher the probability that the user belongs to the target user group.
  • Step 280: Select extended users from the other users according to the similarity.
  • a user with similarity greater than a threshold among other users may be selected as an extended user.
  • the extended users can also be selected by other means.
  • For example, the set of a company's delivery addresses can be further expanded by using the latitude and longitude information of the delivery address: if all delivery addresses within a company's campus are considered to be the company's address, the users who use those addresses are considered employees of the company.
  • Alternatively, the network structure formed by the remark information of address book contacts can be used to propagate company-employee labels. For example, a person whom company employee A annotates as "boss" or "colleague" is also considered an employee of the company. After employees of the same company are identified in the above two ways, they can also be selected as extended users.
  • Step 290: The extended users are added to the core user group to obtain the target user group.
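  • Steps 280 and 290 reduce to a threshold filter and a set union. A minimal sketch, assuming the similarities are held in a mapping from user id to similarity value (names are hypothetical):

```python
def select_extended_users(similarities, threshold):
    """Return the other users whose similarity is greater than the threshold."""
    return {user for user, value in similarities.items() if value > threshold}


def build_target_group(core_users, extended_users):
    """The target user group is the core user group plus the extended users."""
    return set(core_users) | set(extended_users)
```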
  • Taking the identification of a high net worth group in the field of consumer credit as an example, after the above steps 210-290, the user's occupation and company information can be extracted and used as the user's occupation attribute tag. In turn, the high net worth group can be determined based on that tag.
  • This solution can be executed automatically by a program using existing data, without requiring users to fill in new information and without manual operation or supervision by an approver. While ensuring recognition accuracy, it can greatly reduce labor cost and improve the user experience.
  • This solution is not limited by the availability and regularity of text information.
  • The coverage of the two major data sources, the delivery address and the address book, is very high: both logistics records of website shopping and communication or social products can be included in the scope of identification. In fact, more than half of the users have both types of text information.
  • The step of calculating the similarity is introduced, which achieves an effect similar to fuzzy matching and effectively expands the coverage of the identified population.
  • the results obtained by two types of independent data sources are cross-validated, which effectively ensures the accuracy of the recognition results.
  • The identified target user group is on the order of millions of users, and its credit risk is about one-eighth of that of the entire user group.
  • Subsequent measures such as opening access, increasing credit limits, and lowering pricing can improve the scope and service quality of the consumer credit business while effectively controlling the overall risk.
  • FIG. 5 is a schematic diagram of a method for determining a target user group according to another embodiment of the present specification.
  • the target user group can be determined from the entire user community through the two processes of sample screening and text mining.
  • The process of sample screening is: from the entire user group, the users with an address book are placed in the first sub-user group, and the users with a delivery address are placed in the second sub-user group.
  • According to the screening conditions of the address book (including but not limited to: the phone number of the user to whom the address book belongs is used by that user personally, and the phone number is included in other address books), the corresponding first candidate user group is selected from the first sub-user group.
  • According to the screening conditions of the delivery address (including but not limited to: the delivery address is used by the user personally, the delivery address has been used by the user recently, and the delivery address is a company-class address), the corresponding second candidate user group is selected from the second sub-user group.
  • Text mining consists of two parts: matching and extension. Matching is to use the keyword library to accurately match the text information; the extension is based on the matching, further identifying the unmatched people to expand the coverage of the recognition.
  • The matching process may be: the address book of each user in the first candidate user group is matched against the keywords in the keyword library. If a user's address book matches a keyword in the keyword library, the user is retained and the matching score of the user's address book is determined; otherwise, the user is excluded.
  • Likewise, the delivery address of each user in the second candidate user group may be matched against the keywords in the keyword library. If a user's delivery address matches a keyword in the keyword library, the user is retained and the matching score of the user's delivery address is determined; otherwise, the user is excluded. After the above matching step is performed on the first candidate user group and the second candidate user group, the two candidate user groups may be merged.
  • the merged candidate user group may also be referred to as a core user group (ie, a union of two candidate user groups).
  • the matching level of the user can be determined according to the matching score of the user's receiving address and the matching score of the address book.
  • Then, the seed users (i.e., the intersection of the two candidate user groups) can be selected from the core user group according to the matching level.
  • The extension process may specifically be: calculating the similarity between the delivery address of the seed users and the delivery addresses of the users other than the seed users in the entire user group, and selecting extended users from the other users according to the similarity.
  • the similarity between the seed user's address book and other users' address books can be calculated, and the extended users are selected from other users according to the similarity. After selecting the extended user, the extended user and the core user group together constitute a target user group.
  • the above embodiment proposes a method of identifying a target user group using text mining technology.
  • the text precise matching algorithm is designed.
  • the cooperative identification method is used to expand the receiving address
  • the text vectorization method is used to expand the address book, thereby expanding the coverage of the identified population.
  • the data of the receiving address and the address book are mutually independent, and the cross-validation method improves the recognition accuracy.
  • a determining device for the target user group is also provided in an embodiment of the present specification. As shown in FIG. 6, the device includes:
  • the obtaining unit 601 is configured to acquire a total user group.
  • the dividing unit 602 is configured to divide the total user group acquired by the obtaining unit 601 into two or more sub-user groups, wherein different sub-user groups respectively correspond to different text information.
  • The text information may include several of the following: a delivery address, an address book, a wireless network name, a company-class name corresponding to a Global Positioning System (GPS) positioning point, a company name corresponding to an Internet Protocol (IP) address, a company name corresponding to a Media Access Control (MAC) address, a remark name in social software, a group name in social software, a remark name in an instant messaging tool, and a group name in an instant messaging tool.
  • the screening unit 603 is configured to filter out corresponding candidate user groups from each sub-user group according to the screening condition of the text information corresponding to each sub-user group divided by the dividing unit 602, and obtain two or more candidate user groups.
  • The screening condition of the delivery address includes one or more of the following: the delivery address is used by the user personally, the delivery address has been used by the user recently, and the delivery address is a company-class address.
  • When the text information is an address book, the address book includes the remark information of each contact and the corresponding phone number. The screening condition of the address book includes one or more of the following: the phone number of the user to whom the address book belongs is used by that user personally, and the phone number is included in other address books.
  • the matching unit 604 is configured to accurately match the corresponding text information with the keywords in the keyword library for each candidate user group selected by the screening unit 603, and if the matching is successful, determine the matching score of the text information.
  • For a delivery address, the matching unit 604 is specifically configured to determine, if the match succeeds, the corresponding number of transaction days according to the number of times the delivery address was used within a preset time period.
  • For an address book, the matching unit 604 is specifically configured to exactly match the annotation information, after irrelevant words have been removed, against the keywords in the keyword library.
  • If the match succeeds, the number of other address books whose annotation information contains the user to whom the address book belongs is determined.
  • The merging unit 605 is configured to merge the two or more candidate user groups screened by the screening unit 603 to obtain a core user group.
  • The selecting unit 606 is configured to select seed users from the core user group according to the matching scores of the various types of text information of the users in the core user group.
  • The calculating unit 607 is configured to separately calculate the similarity between each type of text information of the seed users selected by the selecting unit 606 and the same type of text information of the other users in the entire user group, other than the seed users.
  • The calculating unit 607 is specifically configured to:
  • Perform word segmentation on the annotation information of the contacts in the address books of the sub-user group corresponding to the address book, to obtain a full word set.
  • Determine, from the full word set, the set of seed words corresponding to the seed users' address books. The seed words have corresponding related words.
  • Determine extended words according to the word frequency of each seed word and the related words.
  • Select, from the words corresponding to the other users' address books, the target words that appear in the expanded keyword library.
  • Use the similarity between the target words and the seed words as the similarity between the seed users' address books and the other users' address books.
  • The calculating unit 607 is further specifically configured to:
  • Represent the target words and the seed words as corresponding word vectors.
  • Determine the user vector of the target words according to the word frequency of the target words and the corresponding word vectors.
  • Determine the user vector of the seed words according to the word frequency of the seed words and the corresponding word vectors.
  • Determine the similarity between the target words and the seed words according to the user vector of the target words and the user vector of the seed words.
  • The selecting unit 606 is further configured to select extended users from the other users according to the similarity calculated by the calculating unit 607.
  • The expansion unit 608 is configured to add the extended users selected by the selecting unit 606 to the core user group, thereby obtaining the target user group.
  • The obtaining unit 601 obtains the entire user group.
  • The dividing unit 602 divides the entire user group into two or more sub-user groups.
  • The screening unit 603 screens a corresponding candidate user group out of each sub-user group according to the screening condition of the text information corresponding to that sub-user group, obtaining two or more candidate user groups.
  • The matching unit 604 exactly matches, for each candidate user group, the corresponding text information against the keywords in the keyword library and, if the match succeeds, determines the matching score of the text information.
  • The merging unit 605 merges the two or more candidate user groups to obtain a core user group.
  • The selecting unit 606 selects seed users from the core user group according to the matching scores of the various types of text information of the users in the core user group.
  • The calculating unit 607 calculates the similarity between each type of text information of the seed users and the same type of text information of the other users in the entire user group, other than the seed users.
  • The selecting unit 606 selects extended users from the other users according to the similarity.
  • The expansion unit 608 adds the extended users to the core user group to obtain the target user group. The target user group can thereby be determined more quickly and effectively.
  • The functions described in this specification can be implemented in hardware, software, firmware, or any combination thereof.
  • When implemented in software, the functions may be stored in a computer-readable medium or transmitted as one or more instructions or code on a computer-readable medium.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

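A method and apparatus for determining a target user group. The method includes: dividing an entire user group into two or more sub-user groups according to different types of text information; screening a corresponding candidate user group out of each sub-user group according to the screening condition of the text information corresponding to that sub-user group; exactly matching the text information corresponding to each candidate user group against the keywords in a keyword library and, when the match succeeds, determining a matching score of the text information; merging the candidate user groups to obtain a core user group; selecting seed users from the core user group according to the matching scores of the various types of text information of the users in the core user group; separately calculating the similarity between each type of text information of the seed users and the same type of text information of other users; selecting extended users from the other users according to the similarity; and adding the extended users to the core user group to obtain the target user group.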

Description

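Method and Apparatus for Determining a Target User Group
Technical Field
One or more embodiments of this specification relate to the field of computer technologies, and in particular to a method and apparatus for determining a target user group.
Background
In conventional techniques, when a target user group is selected from a massive number of users, the information actively provided by those users is usually reviewed manually first, and the selection is then performed according to the target information determined by the manual review. Alternatively, a target list or lexicon containing target information is created in advance, and the selection is performed by matching the text information of each user against the target list or lexicon.
A solution that determines the target user group more quickly and effectively is therefore needed.
Summary
One or more embodiments of this specification describe a method and apparatus for determining a target user group, which can determine the target user group more quickly and effectively.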
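In a first aspect, a method for determining a target user group is provided, including:
obtaining an entire user group;
dividing the entire user group into two or more sub-user groups, where different sub-user groups correspond to different types of text information;
screening a corresponding candidate user group out of each sub-user group according to the screening condition of the text information corresponding to that sub-user group, to obtain two or more candidate user groups;
for each candidate user group, exactly matching the corresponding text information against the keywords in a keyword library and, if the match succeeds, determining a matching score of the text information;
merging the two or more candidate user groups to obtain a core user group;
selecting seed users from the core user group according to the matching scores of the various types of text information of the users in the core user group;
separately calculating the similarity between each type of text information of the seed users and the same type of text information of the other users in the entire user group, other than the seed users;
selecting extended users from the other users according to the similarity; and
adding the extended users to the core user group to obtain the target user group.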
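In a second aspect, an apparatus for determining a target user group is provided, including:
an obtaining unit, configured to obtain an entire user group;
a dividing unit, configured to divide the entire user group obtained by the obtaining unit into two or more sub-user groups, where different sub-user groups correspond to different types of text information;
a screening unit, configured to screen a corresponding candidate user group out of each sub-user group divided by the dividing unit, according to the screening condition of the text information corresponding to that sub-user group, to obtain two or more candidate user groups;
a matching unit, configured to exactly match, for each candidate user group screened by the screening unit, the corresponding text information against the keywords in a keyword library and, if the match succeeds, to determine a matching score of the text information;
a merging unit, configured to merge the two or more candidate user groups screened by the screening unit to obtain a core user group;
a selecting unit, configured to select seed users from the core user group according to the matching scores of the various types of text information of the users in the core user group;
a calculating unit, configured to separately calculate the similarity between each type of text information of the seed users selected by the selecting unit and the same type of text information of the other users in the entire user group, other than the seed users;
the selecting unit being further configured to select extended users from the other users according to the similarity calculated by the calculating unit; and
an expansion unit, configured to add the extended users selected by the selecting unit to the core user group to obtain the target user group.
With the method and apparatus provided by one or more embodiments of this specification, the obtained entire user group is divided into two or more sub-user groups according to different types of text information. A corresponding candidate user group is screened out of each sub-user group according to the screening condition of the text information corresponding to that sub-user group. The text information corresponding to each candidate user group is exactly matched against the keywords in the keyword library, and a matching score of the text information is determined when the match succeeds. The candidate user groups are merged to obtain a core user group. Seed users are selected from the core user group according to the matching scores of the various types of text information of its users. The similarity between each type of text information of the seed users and the same type of text information of other users is calculated separately. Extended users are selected from the other users according to the similarity and added to the core user group to obtain the target user group. The target user group can thereby be determined more quickly and effectively.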
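Brief Description of the Drawings
To describe the technical solutions of the embodiments of this specification more clearly, the drawings needed for describing the embodiments are briefly introduced below. The drawings described below are merely some embodiments of this specification, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
FIG. 1 is a schematic diagram of an application scenario of the method for determining a target user group provided by an embodiment of this specification;
FIG. 2 is a flowchart of the method for determining a target user group provided by an embodiment of this specification;
FIG. 3 is a schematic diagram of the matching process for a user's text information provided by this specification;
FIG. 4 is a schematic diagram of the process for calculating the similarity of users' address books provided by this specification;
FIG. 5 is a schematic diagram of the method for determining a target user group provided by another embodiment of this specification;
FIG. 6 is a schematic diagram of the apparatus for determining a target user group provided by an embodiment of this specification.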
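Detailed Description
The solutions provided in this specification are described below with reference to the drawings.
The method for determining a target user group provided by an embodiment of this specification can be applied in the scenario shown in FIG. 1. In FIG. 1, the apparatus for determining a target user group can identify a high-net-worth population from the entire user group according to users' text information (including but not limited to delivery addresses and address books). Here, the high-net-worth population may refer to a group with stable jobs and relatively high incomes. It may include, but is not limited to, employees in the financial industry (banking, securities, insurance) and the IT industry (software services, Internet), employees of large state-owned enterprises, civil servants in government agencies, and teachers, doctors, and other public employees in public institutions. Because this group has high solvency and a strong willingness to repay, it has a relatively low level of credit risk. The apparatus can therefore push the high-net-worth population to a consumer credit system, which can then offer the group corresponding consumer credit products. This serves the goal of expanding the credit business and can also greatly assist automated and personalized credit approval and marketing processes.
Of course, in practical applications the method provided by the embodiments of this specification can also be applied in other scenarios, such as determining a high-consumption user group; this specification places no limitation on this.
FIG. 2 is a flowchart of the method for determining a target user group provided by an embodiment of this specification. The method may be executed by a device with processing capability, such as a server, a system, or an apparatus, for example the apparatus for determining a target user group in FIG. 1. As shown in FIG. 2, the method may specifically include: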
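Step 210: obtain the entire user group.
For example, the entire user group can be obtained from the back-end database of the Alipay system. Note that users in the entire user group may have text information such as a delivery address and/or an address book. Generally, any user who has purchased physical goods on a shopping website and completed a transaction order has a saved delivery address. The address book may include annotation information of contacts and the corresponding phone numbers. The annotation information of a contact may include the contact's name, nickname, and other information indicating the industry or company to which the contact belongs. For example, the annotation information may be "阿里巴巴张三" (Alibaba Zhang San) or "李行长" (Bank President Li).
Step 220: divide the entire user group into two or more sub-user groups.
Different sub-user groups correspond to different types of text information. The text information here can be used to characterize the users in a sub-user group. It usually points clearly at a specific target and affects the quality of service the user receives, so it usually has high distinctiveness and credibility.
In this specification, a user's text information includes, but is not limited to, one or more of the following: a delivery address, an address book, a wireless network (e.g., Wi-Fi) name, a company-type place name corresponding to a Global Positioning System (GPS) positioning point, a company name corresponding to an Internet Protocol (IP) address, a company name corresponding to a Media Access Control (MAC) address, a remark name in social software, a group name in social software, a remark name in an instant messaging tool, and a group name in an instant messaging tool.
Taking text information that includes delivery addresses and address books as an example, the entire user group can be divided into two sub-user groups. The users in one sub-user group all have a delivery address, so that sub-user group corresponds to the delivery address. The users in the other sub-user group all have an address book, so that sub-user group corresponds to the address book.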
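Step 230: screen a corresponding candidate user group out of each sub-user group according to the screening condition of the text information corresponding to that sub-user group, obtaining two or more candidate user groups.
Taking text information that includes delivery addresses and address books as an example, the screening condition of the delivery address includes one or more of the following: the delivery address is used by the user personally (the recipient is the user, or the contact number is the user's own mobile number), the delivery address has been used by the user recently (e.g., within the past year), and the delivery address is a company-type address. For the address book, the precondition is that the contacts in the address book have bound mobile numbers. Generally, many websites require users to bind a mobile number to make identity verification and reaching the user easier. Under this precondition, the screening condition of the address book may include one or more of the following: the phone number of the user to whom the address book belongs is used by that user personally, and that phone number is included in other address books. The condition "the phone number is included in other address books" is explained as follows: a user's address book actually stores information about the user's contacts, and only the user's contacts store that user's information in their own address books. The user's phone number is therefore required to appear in other address books.
Continuing the example, for one sub-user group, the corresponding candidate user group can be screened out according to the screening condition of the delivery address. That candidate user group also corresponds to the delivery address; that is, every user in it has a delivery address. Likewise, for the other sub-user group, the corresponding candidate user group can be screened out according to the screening condition of the address book. That candidate user group also corresponds to the address book; that is, every user in it has an address book. Two candidate user groups are thus obtained.
Note that this step reduces unnecessary computation and processing, so that attention is paid only to candidate user groups that may belong to the high-net-worth population.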
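The division and screening of steps 220 and 230 can be sketched roughly as follows. This is only an illustration under stated assumptions: the `Address` and `User` records, their field names, and the choice to apply all listed conditions together (the text allows "one or more") are hypothetical and do not come from the specification.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Address:  # hypothetical record
    text: str
    used_by_owner: bool        # recipient or contact number is the user's own
    days_since_last_use: int   # how recently the address was used
    is_company_address: bool   # address is a company-type address


@dataclass
class User:  # hypothetical record
    user_id: str
    phone: str = ""
    phone_used_by_owner: bool = False
    addresses: List[Address] = field(default_factory=list)
    # contact phone number -> annotation, e.g. "阿里巴巴张三"
    address_book: Dict[str, str] = field(default_factory=dict)


def screen_address_group(users, recent_days=365):
    """Steps 220 and 230 for delivery addresses (all three conditions applied)."""
    sub_group = [u for u in users if u.addresses]            # step 220
    return [
        u for u in sub_group                                 # step 230
        if any(a.used_by_owner
               and a.days_since_last_use <= recent_days
               and a.is_company_address
               for a in u.addresses)
    ]


def screen_address_book_group(users):
    """Steps 220 and 230 for address books (both conditions applied)."""
    sub_group = [u for u in users if u.address_book]         # step 220
    candidates = []
    for u in sub_group:                                      # step 230
        in_other_books = any(
            u.phone in other.address_book
            for other in users if other.user_id != u.user_id
        )
        if u.phone_used_by_owner and in_other_books:
            candidates.append(u)
    return candidates
```

A user may land in both candidate groups, which is why the later merge in step 250 is a union rather than a concatenation.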
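Step 240: for each candidate user group, exactly match the corresponding text information against the keywords in the keyword library and, if the match succeeds, determine the matching score of the text information.
For the delivery addresses and address books corresponding to the two candidate user groups in the preceding example, the matching and scoring process may be as shown in FIG. 3, which may include the following steps:
Step a: create the keyword library. The keyword library may include keywords for the industries and companies of interest. For example, when the industry of interest is finance, the keyword library may include keywords such as "中国银行" (Bank of China), "国泰君安证券" (Guotai Junan Securities), and "太平洋保险" (Pacific Insurance). When the industry of interest is IT, the keyword library may include keywords such as "阿里巴巴" (Alibaba), "腾讯" (Tencent), and "华为" (Huawei). Note that the keywords may include a company's full name, abbreviation, or other distinctive names.
Step b: text structuring. The text information is cleaned and structured by semantic component. For a delivery address, the province, city, and county can be separated and the key street address (also called a point of interest, POI) extracted. Taking the delivery address "浙江省杭州市西湖区翠苑街道万塘路18号黄龙时代广场B座6楼支付宝公司" as an example, the extracted key street address may be "万塘路18号黄龙时代广场B座6楼支付宝公司". For an address book, the annotation information of the contacts can be extracted and irrelevant words removed from it; the irrelevant words may include the contact's name, nickname, and other irrelevant forms of address (e.g., "女士" (Ms.) or "哥们" (buddy)).
Note that in this specification, text structuring of a delivery address may also include segmentation. For example, the delivery address can be segmented into the form "province / city / district or county / subdistrict / road / house number / office building / floor / company / other".
Step c: exact text matching. In this specification, exact text matching means judging whether the key street address or the annotation information of a contact contains a keyword from the keyword library; if it does, the match succeeds, and otherwise it fails. For example, suppose the keyword library includes "中国银行", "支付宝公司", and "腾讯", and the key street address is "万塘路18号黄龙时代广场B座6楼支付宝公司". Because the key street address contains the keyword "支付宝公司", it matches the keyword library successfully.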
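Steps b and c can be illustrated with a short sketch. The containment test follows the description directly; the prefix-stripping regular expression, on the other hand, is only a crude hypothetical stand-in for real address parsing, and the function names are illustrative.

```python
import re

# Hypothetical pattern: strips leading province / city / district-or-county /
# subdistrict components. Real addresses would need a proper address parser.
_PREFIX = re.compile(r"(.+?省)?(.+?市)?(.+?[区县])?(.+?街道)?")


def extract_key_address(address):
    """Return the part of the address after the administrative prefix."""
    return address[_PREFIX.match(address).end():]


def exact_match(text, keyword_library):
    """Return the keywords contained in text; an empty list means no match."""
    return [kw for kw in keyword_library if kw in text]


address = "浙江省杭州市西湖区翠苑街道万塘路18号黄龙时代广场B座6楼支付宝公司"
key_address = extract_key_address(address)
matched = exact_match(key_address, ["中国银行", "支付宝公司", "腾讯"])
```

Here "exact" means that a keyword must appear verbatim as a substring; near-miss spellings are left to the similarity-based expansion in step 270.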
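In FIG. 3, after the exact text matching, and when the match succeeds, a step of determining the matching score follows. Specifically, the matching score of a delivery address may be determined as follows: determine the corresponding number of transaction days according to the number of times the delivery address was used within a preset time period, then use the number of transaction days as the matching score of the delivery address. The matching score of an address book may be determined as follows: determine the number of other address books whose annotation information contains the user to whom this address book belongs, and use that number as the matching score of the address book. Taking Zhang San's address book as an example, suppose 5 people's address books contain the contact annotation "阿里张三", 3 people's address books contain "阿里巴巴张三", and 1 person's address book contains "支付宝张三". The matching score of Zhang San's address book is then 5+3+1=9.
The above covers the case where the text match succeeds. When the match fails, for example when a delivery address contains no keyword from the keyword library, the delivery address can be deleted, which means the user corresponding to that delivery address is removed. Step 240 is therefore also a user screening process.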
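The two scoring rules can be sketched as below. Two points are assumptions rather than statements from the specification: transaction days are modeled as the number of distinct dates on which the address was used, and "another address book contains this user" is modeled as that book holding the user's phone number with an annotation containing a keyword. All function names are illustrative.

```python
from datetime import date


def address_match_score(use_dates, start, end):
    """Number of distinct days within [start, end] on which the address was used."""
    return len({d for d in use_dates if start <= d <= end})


def address_book_match_score(user_phone, other_books, keyword_library):
    """Count the other address books whose annotation for this user's
    phone number contains at least one keyword."""
    score = 0
    for book in other_books:              # book: phone number -> annotation
        note = book.get(user_phone, "")
        if any(kw in note for kw in keyword_library):
            score += 1
    return score


# The Zhang San example from the text: 5 + 3 + 1 = 9.
books = ([{"111": "阿里张三"}] * 5
         + [{"111": "阿里巴巴张三"}] * 3
         + [{"111": "支付宝张三"}]
         + [{"111": "张三"}])            # no keyword, so not counted
zhang_san_score = address_book_match_score(
    "111", books, ["阿里", "阿里巴巴", "支付宝"])
```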
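In FIG. 3, an error-handling step may also be performed after the matching score is determined. Because language is flexible and diverse, mismatches are unavoidable. For example, "中国银行旁超市" (supermarket next to Bank of China) and "工行小王介绍的客户" (customer introduced by Xiao Wang of ICBC) should not actually be treated as recognition targets. Corresponding rules (e.g., a company name followed by a locative word) can therefore be designed to eliminate mismatched cases as far as possible. In addition, persons clearly engaged in black-market activity, fraud, sales promotion, and the like are also removed.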
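One possible form of the "company name followed by a locative word" rule is sketched here. The list of locative words is a hypothetical example, not taken from the specification, and a production rule set would presumably need to be much richer (it would not, for instance, catch the "工行小王介绍的客户" case).

```python
# Hypothetical locative words meaning "beside", "nearby", "opposite",
# "next door", and "downstairs".
LOCATIVE_WORDS = ("旁", "附近", "对面", "隔壁", "楼下")


def is_locative_mismatch(text, keyword):
    """True if any occurrence of keyword is immediately followed by a locative word."""
    idx = text.find(keyword)
    while idx != -1:
        tail = text[idx + len(keyword):]
        if tail.startswith(LOCATIVE_WORDS):
            return True
        idx = text.find(keyword, idx + 1)
    return False
```

A text flagged by such a rule would be dropped from the matched set even though it contains a keyword.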
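Step 250: merge the two or more candidate user groups to obtain the core user group.
Continuing the example, the candidate user group corresponding to the delivery address and the candidate user group corresponding to the address book can be merged. Because some users have both a delivery address and an address book, the number of users in the merged core user group will be smaller than the sum of the numbers of users in the two candidate user groups.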
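Step 260: select seed users from the core user group according to the matching scores of the various types of text information of the users in the core user group.
In one implementation, a corresponding grade can be determined from the matching score of each type of a user's text information. The grades corresponding to the various types of text information are then cross-fused to determine the user's matching level. Once the matching levels are determined, seed users can be selected from the core user group according to each user's matching level.
Taking text information that includes delivery addresses and address books as an example, suppose the grade of a delivery address is determined as follows: a matching score greater than 100 gives a high grade (denoted "2"); a matching score in the interval (50, 100] gives a medium grade (denoted "1"); a matching score less than or equal to 50 gives a low grade (denoted "0"). Suppose further that the grade of an address book is determined as follows: a matching score greater than 20 gives a high grade ("2"); a matching score in the interval (10, 20] gives a medium grade ("1"); a matching score less than or equal to 10 gives a low grade ("0"). The matching level can then take six values: very high (2+2), medium-high (2+1), high (2+0), medium (1+1), medium-low (1+0), and low (0+0). If user A's delivery address has a matching score of 60 (a medium grade) and the address book has a matching score of 5 (a low grade), the user's matching level is medium-low (1+0=1).
Of course, the above is only a simple way of determining a user's matching level. In practical applications, other more complex algorithms can be combined to determine it. For example, a weight can be set for each grade, and the matching level determined jointly from the grades and the weights; this specification does not elaborate further.
When a user in the core user group has only a delivery address or only an address book, the grade determined from the matching score of that delivery address or address book can be used as the user's matching level.
After each user's matching level is determined, seed users can be selected from the core user group according to the matching level. Continuing the example, the users in the core user group whose matching level is very high or medium-high can be selected as seed users.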
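The grading and seed selection in the example above can be written out as a minimal sketch. The thresholds and level names follow the example in the text; the function names and the dictionary layout are illustrative assumptions.

```python
def grade(score, high, low):
    """2 = high (score > high), 1 = medium (low < score <= high), 0 = low."""
    if score > high:
        return 2
    if score > low:
        return 1
    return 0


# Cross-fusion of the two grades into the six matching levels of the example.
MATCH_LEVELS = {
    (2, 2): "very high",
    (2, 1): "medium-high",
    (2, 0): "high",
    (1, 1): "medium",
    (1, 0): "medium-low",
    (0, 0): "low",
}


def match_level(address_score, book_score):
    a = grade(address_score, high=100, low=50)   # delivery address thresholds
    b = grade(book_score, high=20, low=10)       # address book thresholds
    return MATCH_LEVELS[(max(a, b), min(a, b))]


def select_seed_users(scores):
    """scores: user id -> (address score, address book score)."""
    return [user for user, (a, b) in scores.items()
            if match_level(a, b) in ("very high", "medium-high")]
```

For user A in the text, `match_level(60, 5)` evaluates to `"medium-low"`, so user A would not be chosen as a seed.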
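Step 270: separately calculate the similarity between each type of text information of the seed users and the same type of text information of the other users in the entire user group, other than the seed users.
Take delivery addresses as an example. Delivery addresses exhibit the following phenomena: 1) employees of the same company often use the same real-world delivery address (e.g., the company's gatehouse or mail room); 2) even for the same address, different users do not necessarily write it identically. The purpose of this step is therefore to fold the different written forms that actually denote the same address under that address, so that users missed by text matching because of differences in writing can also be identified. To this end, the similarity between other users' delivery addresses and the seed users' delivery addresses can be calculated. When the similarity satisfies a threshold, the delivery address is treated as "textually similar" to the seed user's delivery address. "Textually similar" here may mean that the text edit (Levenshtein) distance is small; the edit distance between two text strings is the minimum number of insert, delete, and substitute operations required to transform one into the other.
Note that if the delivery address was also segmented during text structuring, each component simply needs to be treated as one "character" when computing the edit distance. For example, compare "浙江省/杭州市/西湖区/翠苑街道/万塘路/18号/黄龙时代广场B座/6楼/支付宝公司" with "浙江省/杭州市/西湖区/翠苑街道/万塘路/18号/黄龙时代广场B座/6楼". The latter lacks "支付宝公司", so the two differ by one component. Because that component can be regarded as one character, the latter can still be regarded as a delivery address of "支付宝公司". Of course, this presupposes a preset rule that two texts differing by one character count as similar.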
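The component-level edit distance described above can be sketched with the standard dynamic-programming formulation of Levenshtein distance, applied to sequences of address components instead of characters. The function names and the default threshold of one component are illustrative assumptions.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or lists of components)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute x with y
        prev = curr
    return prev[-1]


def is_similar_address(addr_a, addr_b, max_distance=1):
    """Compare two '/'-segmented addresses, one component per 'character'."""
    return edit_distance(addr_a.split("/"), addr_b.split("/")) <= max_distance


seed = "浙江省/杭州市/西湖区/翠苑街道/万塘路/18号/黄龙时代广场B座/6楼/支付宝公司"
other = "浙江省/杭州市/西湖区/翠苑街道/万塘路/18号/黄龙时代广场B座/6楼"
```

With these inputs the two addresses differ by exactly one component, so `is_similar_address(seed, other)` holds under the one-component threshold.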
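Address books differ from delivery addresses. On the one hand, they are less regularly structured and less clearly targeted than delivery addresses. On the other hand, they lack the "user-address" relationship pairs usable for "collaborative discrimination" and the addresses usable as "seeds" that delivery addresses offer. However, given the richness and colloquial nature of address books, the contextual semantic information in them can be fully exploited: the keyword library is expanded by finding synonyms and associated words (collectively, related words), thereby identifying more of the target user group. The address book similarity calculation may proceed as shown in FIG. 4, which may include the following steps:
Step v: word embedding. Word segmentation is performed on the annotation information of the contacts in the address books of the sub-user group corresponding to the address book, yielding a full word set. The Word2Vec algorithm (a widely recognized word vectorization algorithm; other algorithms may be used) can then be trained without supervision to obtain a word vector for each word. For any two words, the cosine similarity of their word vectors (other similarity measures may be used) can serve as their similarity, from which the related words of each word in the full word set can be determined.
This step also determines the related words of the seed words corresponding to the seed users' address books. Note that when there are multiple seed words, they form a set of seed words.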
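The related-word lookup in step v can be sketched as follows, assuming the word vectors have already been trained (for example with Word2Vec). The two-dimensional vectors and the 0.8 threshold are made-up illustrative values, not outputs of any real model.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def related_words(word, word_vectors, threshold=0.8):
    """Words whose vectors are at least `threshold` similar to the given word."""
    target = word_vectors[word]
    return {w for w, vec in word_vectors.items()
            if w != word and cosine(target, vec) >= threshold}


# Toy vectors for illustration only.
toy_vectors = {
    "阿里巴巴": [1.0, 0.0],
    "阿里": [0.9, 0.1],
    "支付宝": [0.8, 0.3],
    "超市": [0.0, 1.0],
}
```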
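Step w: expand the keyword library. The word frequency of each seed word is counted over the set of seed words. Extended words are determined according to the word frequency of each seed word and its related words, and the extended words are added to the keyword library. For example, suppose that in the set of seed words the seed word "阿里巴巴" has a word frequency above the threshold, and that the related words of "阿里巴巴" include "阿里" and "支付宝". Then "阿里巴巴", "阿里", and "支付宝" can be added to the keyword library.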
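Step w can be sketched like this. The function name, the `related` mapping (seed word to its related words), and the frequency threshold are illustrative assumptions; the specification only says that extended words are determined from the seed word frequencies and the related words.

```python
from collections import Counter


def expand_keyword_library(keyword_library, seed_words, related, min_freq):
    """Add each sufficiently frequent seed word and its related words to the library."""
    freq = Counter(seed_words)
    expanded = set(keyword_library)
    for word, count in freq.items():
        if count > min_freq:
            expanded.add(word)
            expanded.update(related.get(word, ()))
    return expanded


seed_words = ["阿里巴巴", "阿里巴巴", "阿里巴巴", "张三"]
related = {"阿里巴巴": ["阿里", "支付宝"], "张三": ["小张"]}
expanded = expand_keyword_library({"腾讯"}, seed_words, related, min_freq=2)
```

In this toy run "张三" occurs only once, so neither it nor its related word enters the library.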
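Step x: generate user vectors. From the words corresponding to the other users' address books, select the target words that appear in the expanded keyword library. Count the word frequency of the target words over the full word set. Determine the other users' user vectors according to the word frequency of the target words and the corresponding word vectors (determined in step v). The seed users' user vectors can be determined in the same way, so every user in the entire user group has a user vector.
Step y: generate the seed vector. The user vectors of all seed users are averaged to obtain the seed vector, which can be used to represent all of the seed users.
Step z: calculate the similarity. Calculate the cosine similarity between each other user's user vector and the seed vector (other similarity measures may be used). This cosine similarity is used as the similarity between the other user's address book and the seed users' address books. The higher the similarity, the more likely it is that the other user belongs to the target user group.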
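Steps x to z can be sketched together. The specification does not state how word frequency and word vectors are combined; the frequency-weighted average used here is an assumption, as are the function names. `global_freq` is expected to be a `collections.Counter` built over the full word set.

```python
import math
from collections import Counter


def user_vector(user_words, keyword_library, word_vectors, global_freq):
    """Frequency-weighted average of the word vectors of the user's target words."""
    targets = {w for w in user_words
               if w in keyword_library and w in word_vectors}
    total = sum(global_freq[w] for w in targets)
    if total == 0:
        return None                       # user has no usable target word
    dim = len(next(iter(word_vectors.values())))
    vec = [0.0] * dim
    for w in targets:
        for i, x in enumerate(word_vectors[w]):
            vec[i] += global_freq[w] * x
    return [x / total for x in vec]


def mean_vector(vectors):
    """Step y: the seed vector is the average of all seed users' vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def cosine(u, v):
    """Step z: cosine similarity between a user vector and the seed vector."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0
```

Averaging the seed users into one vector keeps the comparison cost at one cosine computation per other user, at the price of blurring differences between seed users from different companies.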
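At this point, the target user group has been identified from the two data sources, delivery addresses and address books, through text mining processes such as matching and expansion. Users identified by matching carry a matching level, and users identified by expansion carry a similarity. Because the two data sources are independent of each other, the results can be cross-fused. The higher the matching level, the higher the fused level (called the confidence level); the higher the similarity, the higher the confidence level. A user who can be identified from both sources also receives a higher confidence level. The final output is the identified population together with confidence levels; the higher the confidence level, the more likely it is that the user belongs to the target user group.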
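Step 280: select extended users from the other users according to the similarity.
For example, the other users whose similarity exceeds a threshold can be selected as extended users.
Of course, in practical applications extended users can also be selected in other ways. Taking delivery addresses as an example, the latitude and longitude of delivery addresses can be used to further expand a company's set of delivery addresses. For instance, if all delivery addresses within a company's campus are regarded as addresses of that company, then the users of those addresses are all employees of the company. Taking address books as an example, the network structure formed by the annotation information of address book contacts can be used to propagate company membership. For instance, if B is annotated with a term such as "老板" (boss) or "同事" (colleague) by A, an employee of a company, then B is also regarded as an employee of that company. Employees of the same company identified in these two ways can also be selected as extended users.
Step 290: add the extended users to the core user group to obtain the target user group.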
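Steps 280 and 290 reduce to a threshold filter followed by a set union. The sketch below assumes the similarities are held in a dictionary keyed by user id; the function names and the threshold value are illustrative.

```python
def select_extended_users(similarities, threshold):
    """Step 280: other users whose similarity to the seed users exceeds the threshold."""
    return {user for user, sim in similarities.items() if sim > threshold}


def build_target_group(core_users, similarities, threshold):
    """Step 290: the target group is the core group plus the extended users."""
    return set(core_users) | select_extended_users(similarities, threshold)


target = build_target_group(
    core_users={"A", "B"},
    similarities={"C": 0.93, "D": 0.41, "E": 0.88},
    threshold=0.8,
)
```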
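Note that although the above embodiments all use delivery addresses and address books as examples, the process of determining the target user group is similar when the text information is other information such as a wireless network name; this specification does not elaborate further.
Note also that, taking the determination of the high-net-worth population in the consumer credit field as an example, after steps 210 to 290 information such as a user's occupation and company can be extracted and used as the user's occupational attribute label. The high-net-worth population can then be determined according to users' occupational attribute labels.
In summary, this solution can be executed automatically by a prepared program using existing data. Users do not need to fill in additional information, and reviewers do not need to operate or supervise manually. While maintaining recognition accuracy, it greatly reduces labor costs and improves the user experience.
This solution is not limited by the availability or regularity of the text information. On the one hand, the two major data sources, delivery addresses and address books, have very high coverage: users with logistics records from shopping on a website and users of communication and social products can all be included in the recognition scope. In fact, more than half of users have both types of text information. On the other hand, even if delivery addresses and the annotation information of address book contacts are written irregularly, the similarity calculation step introduced on top of exact matching achieves an effect similar to fuzzy matching and effectively extends the coverage of the identified population. In addition, the results identified from the two mutually independent data sources are cross-validated, which effectively ensures the accuracy of the recognition results.
The identified target user group is on the order of millions of users, and its credit risk is about one eighth of that of the entire user group. Subsequent measures such as opening access, raising credit limits, and lowering pricing can improve the population coverage and service quality of the consumer credit business while effectively controlling overall risk.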
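FIG. 5 is a schematic diagram of the method for determining a target user group provided by another embodiment of this specification. In FIG. 5, the target user group can be determined from the entire user group through two processes: sample screening and text mining. Sample screening proceeds as follows: within the entire user group, users with an address book are placed in a first sub-user group and users with a delivery address are placed in a second sub-user group. A corresponding first candidate user group is then screened out of the first sub-user group according to the screening condition of the address book (including but not limited to: the phone number of the user to whom the address book belongs is used by that user personally, and the phone number is included in other address books). A corresponding second candidate user group is screened out of the second sub-user group according to the screening condition of the delivery address (including but not limited to: the delivery address is used by the user personally, the delivery address has been used by the user recently, and the delivery address is a company-type address).
In FIG. 5, text mining is performed on the users who pass sample screening, from both the delivery address and the address book perspectives. Text mining consists of two parts: matching and expansion. Matching uses the keyword library to exactly match the text information. Expansion builds on matching to further identify people who were not matched, thereby extending the coverage of recognition.
The matching process may be as follows. The address book of each user in the first candidate user group is matched against the keywords in the keyword library. If a user's address book matches successfully, the user is retained and the matching score of the user's address book is determined; otherwise the user is removed. Likewise, the delivery address of each user in the second candidate user group is matched against the keywords in the keyword library. If a user's delivery address matches successfully, the user is retained and the matching score of the user's delivery address is determined; otherwise the user is removed. After the matching step has been performed on both candidate user groups, the two groups can be merged. The merged candidate user group may also be called the core user group (that is, the union of the two candidate user groups). For each user in the core user group, the user's matching level can be determined from the matching score of the delivery address and the matching score of the address book. Seed users can then be selected from the core user group according to the matching level (that is, the intersection of the two candidate user groups). Once the seed users are selected, the expansion part can begin.
The expansion process may be as follows. The similarity between the seed users' delivery addresses and the delivery addresses of the other users in the entire user group, other than the seed users, is calculated, and extended users are selected from the other users according to that similarity. In addition, the similarity between the seed users' address books and the other users' address books can be calculated, and extended users selected from the other users according to that similarity. Once selected, the extended users together with the core user group form the target user group.
In short, the above embodiments propose a method of identifying a target user group using text mining techniques. For the two different forms of text information, delivery addresses and address books, exact text matching algorithms are designed specifically around the corpus characteristics of the target industries. Delivery addresses are expanded by collaborative discrimination and address books are expanded by text vectorization, which extends the coverage of the identified population. The two independently sourced types of data are fused, and recognition accuracy is improved through cross-validation.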
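Corresponding to the above method, an embodiment of this specification further provides an apparatus for determining a target user group. As shown in FIG. 6, the apparatus includes:
An obtaining unit 601, configured to obtain an entire user group.
A dividing unit 602, configured to divide the entire user group obtained by the obtaining unit 601 into two or more sub-user groups, where different sub-user groups correspond to different types of text information.
The text information may include several of the following: a delivery address, an address book, a wireless network name, a company-type place name corresponding to a Global Positioning System (GPS) positioning point, a company name corresponding to an Internet Protocol (IP) address, a company name corresponding to a Media Access Control (MAC) address, a remark name in social software, a group name in social software, a remark name in an instant messaging tool, and a group name in an instant messaging tool.
A screening unit 603, configured to screen a corresponding candidate user group out of each sub-user group divided by the dividing unit 602, according to the screening condition of the text information corresponding to that sub-user group, to obtain two or more candidate user groups.
Optionally, when the text information is a delivery address, the screening condition of the delivery address includes one or more of the following: the delivery address is used by the user personally, the delivery address has been used by the user recently, and the delivery address is a company-type address.
Optionally, when the text information is an address book, the address book includes annotation information of contacts and the corresponding phone numbers; the screening condition of the address book includes one or more of the following: the phone number of the user to whom the address book belongs is used by that user personally, and the phone number is included in other address books.
A matching unit 604, configured to exactly match, for each candidate user group screened by the screening unit 603, the corresponding text information against the keywords in the keyword library and, if the match succeeds, to determine the matching score of the text information.
Optionally, the matching unit 604 may be specifically configured to:
Extract the key street address from the delivery address.
Exactly match the key street address against the keywords in the keyword library.
If the match succeeds, determine the corresponding number of transaction days according to the number of times the delivery address was used within a preset time period.
Use the number of transaction days as the matching score of the delivery address.
Optionally, the matching unit 604 may be specifically configured to:
Extract the annotation information of contacts from the address book.
Remove irrelevant words from the annotation information, the irrelevant words including the contact's name, nickname, and other irrelevant forms of address.
Exactly match the annotation information, after the irrelevant words have been removed, against the keywords in the keyword library.
If the match succeeds, determine the number of other address books whose annotation information contains the user to whom the address book belongs.
Use the number of other address books as the matching score of the address book.
A merging unit 605, configured to merge the two or more candidate user groups screened by the screening unit 603 to obtain a core user group.
A selecting unit 606, configured to select seed users from the core user group according to the matching scores of the various types of text information of the users in the core user group.
A calculating unit 607, configured to separately calculate the similarity between each type of text information of the seed users selected by the selecting unit 606 and the same type of text information of the other users in the entire user group, other than the seed users.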
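Optionally, the calculating unit 607 may be specifically configured to:
Perform word segmentation on the annotation information of the contacts in the address books of the sub-user group corresponding to the address book, to obtain a full word set.
Determine the related words of each word in the full word set.
Determine, from the full word set, the set of seed words corresponding to the seed users' address books. The seed words have corresponding related words.
Count the word frequency of each seed word over the set of seed words.
Determine extended words according to the word frequency of each seed word and the related words.
Add the extended words to the keyword library.
Select, from the words corresponding to the other users' address books, the target words that appear in the expanded keyword library.
Calculate the similarity between the target words and the seed words.
Use that similarity as the similarity between the seed users' address books and the other users' address books.
Optionally, the calculating unit 607 may be further specifically configured to:
Count the word frequency of the target words over the full word set.
Represent the target words and the seed words as corresponding word vectors according to a word vectorization algorithm.
Determine the user vector of the target words according to the word frequency of the target words and the corresponding word vectors, and determine the user vector of the seed words according to the word frequency of the seed words and the corresponding word vectors.
Determine the similarity between the target words and the seed words according to the user vector of the target words and the user vector of the seed words.
The selecting unit 606 is further configured to select extended users from the other users according to the similarity calculated by the calculating unit 607.
An expansion unit 608, configured to add the extended users selected by the selecting unit 606 to the core user group, thereby obtaining the target user group.
The functions of the functional modules of the apparatus in the above embodiment can be implemented through the steps of the above method embodiment. The specific working process of the apparatus provided by an embodiment of this specification is therefore not repeated here.
In the apparatus for determining a target user group provided by an embodiment of this specification, the obtaining unit 601 obtains the entire user group. The dividing unit 602 divides the entire user group into two or more sub-user groups. The screening unit 603 screens a corresponding candidate user group out of each sub-user group according to the screening condition of the text information corresponding to that sub-user group, obtaining two or more candidate user groups. The matching unit 604 exactly matches, for each candidate user group, the corresponding text information against the keywords in the keyword library and, if the match succeeds, determines the matching score of the text information. The merging unit 605 merges the two or more candidate user groups to obtain the core user group. The selecting unit 606 selects seed users from the core user group according to the matching scores of the various types of text information of the users in the core user group. The calculating unit 607 separately calculates the similarity between each type of text information of the seed users and the same type of text information of the other users in the entire user group, other than the seed users. The selecting unit 606 selects extended users from the other users according to the similarity. The expansion unit 608 adds the extended users to the core user group to obtain the target user group. The target user group can thereby be determined more quickly and effectively.
A person skilled in the art should appreciate that, in one or more of the above examples, the functions described in this specification can be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, the functions can be stored in a computer-readable medium or transmitted as one or more instructions or code on a computer-readable medium.
The specific implementations described above further explain the objectives, technical solutions, and beneficial effects of this specification in detail. It should be understood that the above are merely specific implementations of this specification and are not intended to limit its scope of protection. Any modification, equivalent replacement, or improvement made on the basis of the technical solutions of this specification shall fall within its scope of protection.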

Claims (16)

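  1. A method for determining a target user group, characterized by comprising:
    obtaining an entire user group;
    dividing the entire user group into two or more sub-user groups, wherein different sub-user groups correspond to different types of text information;
    screening a corresponding candidate user group out of each sub-user group according to the screening condition of the text information corresponding to that sub-user group, to obtain two or more candidate user groups;
    for each candidate user group, exactly matching the corresponding text information against keywords in a keyword library and, if the match succeeds, determining a matching score of the text information;
    merging the two or more candidate user groups to obtain a core user group;
    selecting seed users from the core user group according to the matching scores of the various types of text information of the users in the core user group;
    separately calculating the similarity between each type of text information of the seed users and the same type of text information of the other users in the entire user group, other than the seed users;
    selecting extended users from the other users according to the similarity; and
    adding the extended users to the core user group to obtain the target user group.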
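  2. The method according to claim 1, characterized in that the text information comprises several of the following: a delivery address, an address book, a wireless network name, a company-type place name corresponding to a Global Positioning System (GPS) positioning point, a company name corresponding to an Internet Protocol (IP) address, a company name corresponding to a Media Access Control (MAC) address, a remark name in social software, a group name in social software, a remark name in an instant messaging tool, and a group name in an instant messaging tool.
  3. The method according to claim 1, characterized in that,
    when the text information is a delivery address, the screening condition of the delivery address comprises one or more of the following: the delivery address is used by the user personally, the delivery address has been used by the user recently, and the delivery address is a company-type address.
  4. The method according to claim 3, characterized in that exactly matching the corresponding text information against the keywords in the keyword library and, if the match succeeds, determining the matching score of the text information comprises:
    extracting a key street address from the delivery address;
    exactly matching the key street address against the keywords in the keyword library;
    if the match succeeds, determining the corresponding number of transaction days according to the number of times the delivery address was used within a preset time period; and
    using the number of transaction days as the matching score of the delivery address.
  5. The method according to claim 1, characterized in that,
    when the text information is an address book, the address book comprises annotation information of contacts and the corresponding phone numbers; and the screening condition of the address book comprises one or more of the following: the phone number of the user to whom the address book belongs is used by that user personally, and the phone number is included in other address books.
  6. The method according to claim 5, characterized in that matching the corresponding text information against the keywords in the keyword library and, if the match succeeds, determining the matching score of the text information comprises:
    extracting the annotation information of contacts from the address book;
    removing irrelevant words from the annotation information, the irrelevant words comprising the contact's name, nickname, and other irrelevant forms of address;
    exactly matching the annotation information, after the irrelevant words have been removed, against the keywords in the keyword library;
    if the match succeeds, determining the number of other address books whose annotation information contains the user to whom the address book belongs; and
    using the number of other address books as the matching score of the address book.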
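  7. The method according to claim 5 or 6, characterized in that separately calculating the similarity between each type of text information of the seed users and the same type of text information of the other users in the entire user group, other than the seed users, comprises:
    performing word segmentation on the annotation information of the contacts in the address books of the sub-user group corresponding to the address book, to obtain a full word set;
    determining the related words of each word in the full word set;
    determining, from the full word set, the set of seed words corresponding to the seed users' address books, the seed words having corresponding related words;
    counting the word frequency of each seed word over the set of seed words;
    determining extended words according to the word frequency of each seed word and the related words;
    adding the extended words to the keyword library;
    selecting, from the words corresponding to the other users' address books, the target words that appear in the expanded keyword library;
    calculating the similarity between the target words and the seed words; and
    using that similarity as the similarity between the seed users' address books and the other users' address books.
  8. The method according to claim 7, characterized in that calculating the similarity between the target words and the seed words comprises:
    counting the word frequency of the target words over the full word set;
    representing the target words and the seed words as corresponding word vectors according to a word vectorization algorithm;
    determining the user vector of the target words according to the word frequency of the target words and the corresponding word vectors, and determining the user vector of the seed words according to the word frequency of the seed words and the corresponding word vectors; and
    determining the similarity between the target words and the seed words according to the user vector of the target words and the user vector of the seed words.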
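  9. An apparatus for determining a target user group, characterized by comprising:
    an obtaining unit, configured to obtain an entire user group;
    a dividing unit, configured to divide the entire user group obtained by the obtaining unit into two or more sub-user groups, wherein different sub-user groups correspond to different types of text information;
    a screening unit, configured to screen a corresponding candidate user group out of each sub-user group divided by the dividing unit, according to the screening condition of the text information corresponding to that sub-user group, to obtain two or more candidate user groups;
    a matching unit, configured to exactly match, for each candidate user group screened by the screening unit, the corresponding text information against keywords in a keyword library and, if the match succeeds, to determine a matching score of the text information;
    a merging unit, configured to merge the two or more candidate user groups screened by the screening unit to obtain a core user group;
    a selecting unit, configured to select seed users from the core user group according to the matching scores of the various types of text information of the users in the core user group;
    a calculating unit, configured to separately calculate the similarity between each type of text information of the seed users selected by the selecting unit and the same type of text information of the other users in the entire user group, other than the seed users;
    the selecting unit being further configured to select extended users from the other users according to the similarity calculated by the calculating unit; and
    an expansion unit, configured to add the extended users selected by the selecting unit to the core user group to obtain the target user group.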
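  10. The apparatus according to claim 9, characterized in that the text information comprises several of the following: a delivery address, an address book, a wireless network name, a company-type place name corresponding to a Global Positioning System (GPS) positioning point, a company name corresponding to an Internet Protocol (IP) address, a company name corresponding to a Media Access Control (MAC) address, a remark name in social software, a group name in social software, a remark name in an instant messaging tool, and a group name in an instant messaging tool.
  11. The apparatus according to claim 9, characterized in that,
    when the text information is a delivery address, the screening condition of the delivery address comprises one or more of the following: the delivery address is used by the user personally, the delivery address has been used by the user recently, and the delivery address is a company-type address.
  12. The apparatus according to claim 11, characterized in that the matching unit is specifically configured to:
    extract a key street address from the delivery address;
    exactly match the key street address against the keywords in the keyword library;
    if the match succeeds, determine the corresponding number of transaction days according to the number of times the delivery address was used within a preset time period; and
    use the number of transaction days as the matching score of the delivery address.
  13. The apparatus according to claim 9, characterized in that,
    when the text information is an address book, the address book comprises annotation information of contacts and the corresponding phone numbers; and the screening condition of the address book comprises one or more of the following: the phone number of the user to whom the address book belongs is used by that user personally, and the phone number is included in other address books.
  14. The apparatus according to claim 13, characterized in that the matching unit is specifically configured to:
    extract the annotation information of contacts from the address book;
    remove irrelevant words from the annotation information, the irrelevant words comprising the contact's name, nickname, and other irrelevant forms of address;
    exactly match the annotation information, after the irrelevant words have been removed, against the keywords in the keyword library;
    if the match succeeds, determine the number of other address books whose annotation information contains the user to whom the address book belongs; and
    use the number of other address books as the matching score of the address book.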
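  15. The apparatus according to claim 13 or 14, characterized in that the calculating unit is specifically configured to:
    perform word segmentation on the annotation information of the contacts in the address books of the sub-user group corresponding to the address book, to obtain a full word set;
    determine the related words of each word in the full word set;
    determine, from the full word set, the set of seed words corresponding to the seed users' address books, the seed words having corresponding related words;
    count the word frequency of each seed word over the set of seed words;
    determine extended words according to the word frequency of each seed word and the related words;
    add the extended words to the keyword library;
    select, from the words corresponding to the other users' address books, the target words that appear in the expanded keyword library;
    calculate the similarity between the target words and the seed words; and
    use that similarity as the similarity between the seed users' address books and the other users' address books.
  16. The apparatus according to claim 15, characterized in that the calculating unit is further specifically configured to:
    count the word frequency of the target words over the full word set;
    represent the target words and the seed words as corresponding word vectors according to a word vectorization algorithm;
    determine the user vector of the target words according to the word frequency of the target words and the corresponding word vectors, and determine the user vector of the seed words according to the word frequency of the seed words and the corresponding word vectors; and
    determine the similarity between the target words and the seed words according to the user vector of the target words and the user vector of the seed words.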
PCT/CN2018/104939 2017-12-06 2018-09-11 目标用户群体的确定方法及装置 WO2019109698A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201711279551.6 2017-12-06
CN201711279551.6A CN108153824B (zh) 2017-12-06 2017-12-06 目标用户群体的确定方法及装置

Publications (1)

Publication Number Publication Date
WO2019109698A1 true WO2019109698A1 (zh) 2019-06-13

Family

ID=62466539

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/104939 WO2019109698A1 (zh) 2017-12-06 2018-09-11 目标用户群体的确定方法及装置

Country Status (3)

Country Link
CN (1) CN108153824B (zh)
TW (1) TWI709927B (zh)
WO (1) WO2019109698A1 (zh)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108153824B (zh) * 2017-12-06 2020-04-24 阿里巴巴集团控股有限公司 目标用户群体的确定方法及装置
CN109101562B (zh) * 2018-07-13 2023-07-21 中国平安人寿保险股份有限公司 寻找目标群体的方法、装置、计算机设备及存储介质
CN110895587B (zh) * 2018-08-23 2022-08-26 百度在线网络技术(北京)有限公司 用于确定目标用户的方法和装置
CN109561166B (zh) * 2018-11-13 2021-10-12 创新先进技术有限公司 定位目标对象的方法、装置和电子设备
CN109902681B (zh) * 2019-03-04 2022-06-21 苏州达家迎信息技术有限公司 用户群体关系确定方法、装置、设备及存储介质
CN110413875B (zh) * 2019-06-26 2024-06-07 腾讯科技(深圳)有限公司 一种文本信息推送的方法以及相关装置
CN110674390B (zh) * 2019-08-14 2022-05-20 国家计算机网络与信息安全管理中心 基于置信度的群体发现方法及装置
US20210110343A1 (en) * 2019-10-10 2021-04-15 United States Postal Service Methods and systems for generating address score information
CN111626462B (zh) * 2020-02-27 2023-02-10 进佳科技(国际)有限公司 电商货件自取点的集中选定方法
CN113742606A (zh) * 2020-05-29 2021-12-03 京东城市(北京)数字科技有限公司 对象识别方法、装置、电子设备及可读存储介质

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015021937A1 (zh) * 2013-08-14 2015-02-19 腾讯科技(深圳)有限公司 用户推荐方法和装置
CN105550903A (zh) * 2015-12-25 2016-05-04 腾讯科技(深圳)有限公司 目标用户确定方法及装置
US20160379268A1 (en) * 2013-12-10 2016-12-29 Tencent Technology (Shenzhen) Company Limited User behavior data analysis method and device
CN107220852A (zh) * 2017-05-26 2017-09-29 北京小度信息科技有限公司 用于确定目标推荐用户的方法、装置和服务器
CN108153824A (zh) * 2017-12-06 2018-06-12 阿里巴巴集团控股有限公司 目标用户群体的确定方法及装置

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160034968A1 (en) * 2014-07-31 2016-02-04 Huawei Technologies Co., Ltd. Method and device for determining target user, and network server
CN105184616B (zh) * 2015-09-29 2020-06-19 北京奇艺世纪科技有限公司 一种业务对象定向投放的方法和装置
CN106874925A (zh) * 2015-12-14 2017-06-20 阿里巴巴集团控股有限公司 对象分群方法、模型训练方法及装置
CN106021230B (zh) * 2016-05-19 2018-11-23 无线生活(杭州)信息科技有限公司 一种分词方法及装置
CN107194815B (zh) * 2016-11-15 2018-06-22 平安科技(深圳)有限公司 客户分类方法及系统
CN106874093B (zh) * 2017-02-14 2021-09-14 阿里巴巴(中国)有限公司 基于用户画像计算目标人群的方法、计算引擎及计算设备
CN107122349A (zh) * 2017-04-24 2017-09-01 无锡中科富农物联科技有限公司 一种基于word2vec‑LDA模型的文本主题词提取方法

Also Published As

Publication number Publication date
TWI709927B (zh) 2020-11-11
CN108153824B (zh) 2020-04-24
TW201926170A (zh) 2019-07-01
CN108153824A (zh) 2018-06-12

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18885196

Country of ref document: EP

Kind code of ref document: A1