CN115759326A - Method, apparatus, medium, and program product for age prediction - Google Patents


Info

Publication number: CN115759326A
Application number: CN202211190701.7A
Authority: CN (China)
Prior art keywords: data, target, word, numbers, age
Legal status: Pending (the legal status listed is an assumption and is not a legal conclusion)
Other languages: Chinese (zh)
Inventor: 许文龙
Current Assignee: Shanghai Zhangmen Science and Technology Co Ltd
Original Assignee: Shanghai Zhangmen Science and Technology Co Ltd
Application filed by Shanghai Zhangmen Science and Technology Co Ltd

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An object of the present application is to provide a method, an apparatus, a medium, and a program product for age prediction. The method includes: determining at least one associated number corresponding to each number according to a plurality of address book data, obtaining annotation data associated with each number, and segmenting the annotation data to obtain word data corresponding to the annotation data; vectorizing the word data and the at least one associated number to obtain one or more vectorization features corresponding to each number; and performing supervised learning based on a predetermined machine learning regressor according to the one or more vectorization features and the age label corresponding to each number to obtain an age prediction model. The method can significantly improve the coverage and accuracy of age prediction. Because the specific annotation content need not be examined, it need not be understood either, so the method can be extended to any foreign language and conveniently ported and extended to businesses involving overseas languages or domestic minority languages.

Description

Method, apparatus, medium, and program product for age prediction
Technical Field
The present application relates to the field of communications, and more particularly, to a technique for age prediction.
Background
In the traditional mobile information field, age estimation is mainly based on images, voice, app installation lists, and in-app shopping or entertainment behavior data, which suffers from problems such as a large data-collection burden, insufficient coverage, and low accuracy.
Disclosure of Invention
It is an object of the present application to provide a method, apparatus, medium and program product for age prediction.
According to an aspect of the present application, there is provided a method for age prediction, the method comprising:
determining at least one associated number corresponding to each number according to a plurality of address book data, obtaining labeled data associated with each number, and segmenting the labeled data to obtain word data corresponding to the labeled data;
vectorizing the word data and the at least one associated number to obtain one or more vectorization characteristics corresponding to each number;
and performing supervised learning based on a preset machine learning regressor according to the one or more vectorization features and the age label corresponding to each number to obtain an age prediction model.
According to another aspect of the present application, there is provided a method for age prediction, the method comprising:
segmenting target marking data associated with a target number to obtain target word data corresponding to the target marking data;
vectorizing the target word data and at least one target associated number corresponding to the target number to obtain target vectorization characteristics corresponding to the target number;
and inputting the target vectorization characteristics into an age prediction model to obtain predicted age information corresponding to the target number output by the age prediction model.
According to an aspect of the present application, there is provided a computer apparatus for age prediction, the apparatus comprising:
a one-one module, configured to determine at least one associated number corresponding to each number according to a plurality of address book data, obtain annotation data associated with each number, and segment the annotation data to obtain word data corresponding to the annotation data;
a one-two module, configured to vectorize the word data and the at least one associated number to obtain one or more vectorization features corresponding to each number;
and a one-three module, configured to perform supervised learning based on a predetermined machine learning regressor according to the one or more vectorization features and the age label corresponding to each number to obtain an age prediction model.
According to another aspect of the present application, there is provided a computer apparatus for age prediction, the apparatus comprising:
a two-one module, configured to segment target annotation data associated with a target number to obtain target word data corresponding to the target annotation data;
a two-two module, configured to vectorize the target word data and at least one target associated number corresponding to the target number to obtain a target vectorization feature corresponding to the target number;
and a two-three module, configured to input the target vectorization feature into an age prediction model to obtain predicted age information corresponding to the target number output by the age prediction model.
According to an aspect of the present application, there is provided a computer device for age prediction, comprising a memory, a processor and a computer program stored on the memory, wherein the processor executes the computer program to implement the operations of any of the methods as described above.
According to an aspect of the application, there is provided a computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, performs the operations of any of the methods described above.
According to an aspect of the application, a computer program product is provided, comprising a computer program which, when executed by a processor, carries out the steps of any of the methods as described above.
Compared with the prior art, the present application determines at least one associated number corresponding to each number according to a plurality of address book data, obtains annotation data associated with each number, and segments the annotation data to obtain word data corresponding to the annotation data; vectorizes the word data and the at least one associated number to obtain one or more vectorization features corresponding to each number; and performs supervised learning based on a predetermined machine learning regressor according to the one or more vectorization features and the age label corresponding to each number to obtain an age prediction model. In this way, vectorization features for each annotated number can be obtained, via an embedding-based vectorization, from the number-annotation information recorded by address book owners (which is relatively objective for the owner) and from the association relationships between numbers. Inputting these vectorization features into the age prediction model allows the age of an annotated number to be predicted, which can significantly improve the coverage and accuracy of age prediction. Because the specific annotation content need not be examined, it need not be understood either, so the method can be extended to any foreign language, and related businesses involving foreign languages or domestic minority languages can be conveniently ported and extended.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the detailed description of non-limiting embodiments made with reference to the following drawings:
FIG. 1 shows a flow diagram of a method for age prediction according to one embodiment of the present application;
FIG. 2 illustrates a flow diagram of a method for age prediction according to one embodiment of the present application;
FIG. 3 illustrates a computer device architecture diagram for age prediction, according to one embodiment of the present application;
FIG. 4 illustrates a computer device architecture diagram for age prediction, according to one embodiment of the present application;
FIG. 5 illustrates an exemplary system that can be used to implement the various embodiments described in this application.
The same or similar reference numbers in the drawings identify the same or similar elements.
Detailed Description
The present application is described in further detail below with reference to the attached figures.
In a typical configuration of the present application, the terminal, the device serving the network, and the trusted party each include one or more processors (e.g., Central Processing Units (CPUs)), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory, Random Access Memory (RAM), and/or non-volatile memory in a computer-readable medium, such as Read-Only Memory (ROM) or flash memory. Memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, Phase-Change Memory (PCM), Programmable Random Access Memory (PRAM), Static Random-Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read-Only Memory (ROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), flash memory or other memory technology, Compact Disc Read-Only Memory (CD-ROM), Digital Versatile Disc (DVD) or other optical storage, magnetic tape storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible to a computing device.
The device referred to in the present application includes, but is not limited to, a terminal, a network device, or a device formed by integrating a terminal and a network device through a network. The terminal includes, but is not limited to, any mobile electronic product capable of human-computer interaction with a user (e.g., through a touch panel), such as a smart phone or a tablet computer, and the mobile electronic product may employ any operating system, such as the Android operating system or the iOS operating system. The network device includes an electronic device capable of automatically performing numerical calculation and information processing according to preset or stored instructions, whose hardware includes, but is not limited to, a microprocessor, an Application-Specific Integrated Circuit (ASIC), a Programmable Logic Device (PLD), a Field-Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), an embedded device, and the like. The network device includes, but is not limited to, a computer, a network host, a single network server, a set of multiple network servers, or a cloud of multiple servers; here, the cloud is composed of a large number of computers or network servers based on Cloud Computing, a kind of distributed computing in which one virtual supercomputer consists of a collection of loosely coupled computers. The network includes, but is not limited to, the Internet, a wide area network, a metropolitan area network, a local area network, a VPN, a wireless ad hoc network, etc. Preferably, the device may also be a program running on the terminal, on the network device, or on a device formed by integrating the terminal with the network device or the touch terminal, or the network device with the touch terminal, through a network.
Of course, those skilled in the art will appreciate that the foregoing is by way of example only, and that other existing or future devices, which may be suitable for use in the present application, are also encompassed within the scope of the present application and are hereby incorporated by reference.
In the description of the present application, "a plurality" means two or more unless specifically limited otherwise.
Fig. 1 shows a flowchart of a method for age prediction according to an embodiment of the present application, the method comprising step S11, step S12 and step S13. In step S11, the computer device determines at least one associated number corresponding to each number according to a plurality of address book data, obtains tagging data associated with each number, and performs word segmentation on the tagging data to obtain word data corresponding to the tagging data; in step S12, the computer device vectorizes the word data and the at least one associated number to obtain one or more vectorization features corresponding to each number; in step S13, the computer device performs supervised learning based on a predetermined machine learning regressor according to the one or more vectorization features and the age label corresponding to each number, so as to obtain an age prediction model.
In step S11, the computer device determines at least one associated number corresponding to each number according to the plurality of address book data, obtains annotation data associated with each number, and performs word segmentation on the annotation data to obtain word data corresponding to the annotation data. In some embodiments, for a certain number, an associated number corresponding to that number is another number having a predetermined association with it. For example, the predetermined association may mean that the other number exists in the address book of the user of the number, or that the number exists in the address book of the user of the other number, or that the degree of relationship between the two numbers is less than or equal to a predetermined degree threshold (for example, one degree). For example, if number B exists in the address book of the user of number A, the degree of relationship between number A and number B is one degree; if number B exists in the address book of the user of number A and number C exists in the address book of the user of number B, the degree of relationship between number A and number C is two degrees; likewise, if number B exists in the address book of the user of number A and number B also exists in the address book of the user of number C, the degree of relationship between number A and number C is two degrees. In some embodiments, a user may add remarks (e.g., title, nickname, occupation, name, etc.) to a certain contact number in an address book. The address books of multiple users are collected in advance, and the data of the address books are integrated and summarized. For a certain contact number, if remark information corresponding to that number exists in one or more address books, the contact number is associated with those address books, and the annotation data associated with the contact number includes the remark information corresponding to it in each associated address book. In some embodiments, if the annotation data associated with a number includes multiple pieces of remark information, they may be stitched together using a predetermined separator (e.g., a space, a comma, etc.). In some embodiments, for the annotation data associated with each number, a predetermined word segmentation algorithm is used to segment the annotation data into a plurality of word data; for example, the jieba word segmentation algorithm may be used (e.g., the Python version at https://github.com/fxsjy/jieba).
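As a sketch of the joining-and-segmentation step, the snippet below merges a number's remark strings with a separator and splits them into word data. A real embodiment would apply a segmentation algorithm such as jieba for Chinese text; the regex tokenizer, function name, and sample remarks here are illustrative stand-ins, not the patent's implementation.

```python
import re

def segment_annotations(remarks, separator=" "):
    """Join one number's remark strings with a separator and split them
    into word data. A deployed system would use a word-segmentation
    algorithm such as jieba; this stand-in splits on whitespace and
    punctuation, which suffices for space-delimited languages."""
    joined = separator.join(remarks)
    # Split on runs of whitespace, commas, and semicolons (ASCII or full-width).
    return [w for w in re.split(r"[\s,;，；]+", joined) if w]

# Remarks for one contact number collected from several address books.
words = segment_annotations(["Dad, driver", "Uncle Wang"])
```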
In step S12, the computer device performs vectorization on the word data and the at least one associated number to obtain one or more vectorization features corresponding to each number. In some embodiments, the word data corresponding to each contact number and the at least one associated number corresponding to it (each associated number being treated as a word datum) may be vectorized through a predetermined embedding algorithm, representing each word datum as a dense, low-dimensional numerical vector, so as to obtain one or more vectorization features corresponding to the contact number. Each vectorization feature corresponds to one word datum and represents it as one vector. For example, the word2vec embedding algorithm may be used (word2vec is an open-source word vector algorithm published by Google in 2013).
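The embedding step can be pictured as a lookup from word data (associated numbers included) to dense vectors. The table, dimensionality, and values below are purely illustrative stand-ins for a trained word2vec model:

```python
# Toy embedding table standing in for a trained word2vec model: each word
# datum (including each associated number, which is treated as a word)
# maps to a dense numerical vector. All names and values are made up.
EMBEDDINGS = {
    "Dad":         [0.9, 0.1, 0.0],
    "Uncle":       [0.8, 0.2, 0.1],
    "13800000000": [0.5, 0.5, 0.3],  # an associated number, embedded like a word
}

def vectorize(tokens, table=EMBEDDINGS):
    """Return one vectorization feature (one vector) per known token,
    mirroring step S12; tokens without an embedding are skipped."""
    return [table[t] for t in tokens if t in table]

features = vectorize(["Dad", "13800000000", "unknown-word"])
```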
In step S13, the computer device performs supervised learning based on a predetermined machine learning regressor according to the one or more vectorization features and the age label corresponding to each number, so as to obtain an age prediction model. In some embodiments, the ground-truth age data corresponding to each contact number is used as that number's supervised-learning label, i.e., its age label; supervised learning is performed with a predetermined machine learning regressor on the one or more vectorization features corresponding to each contact number and its age label, so that an age prediction model is obtained through training. In some embodiments, the type of machine learning regressor includes, but is not limited to, random forests, gradient-boosted regression trees, and the like. In some embodiments, supervised learning refers to training a model by letting a machine learn from a large amount of labeled sample data, adjusting the model's parameters so that it can predict new, unlabeled data. In some embodiments, the age prediction model is used to predict the age of the user of a certain number according to the remark information for that number in one address book, or according to the annotation data associated with the number in one or more address books together with at least one associated number corresponding to it. In some embodiments, products or services appropriate to different age levels can then be provided to different age groups according to the prediction result, improving the efficiency of the products or services and controlling or reducing corresponding content or privacy risks.
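To make the supervised-learning step concrete, the sketch below fits a trivial nearest-neighbor regressor on (vectorization feature, age label) pairs. The patent's embodiments would use a random forest or gradient-boosted regression trees (e.g., via scikit-learn) instead; the regressor, data, and ages here are invented for illustration only.

```python
import math

class NearestNeighborRegressor:
    """A minimal stand-in for the patent's machine learning regressor
    (a random forest or gradient-boosted regression trees in practice).
    It predicts the age label of the closest training vector."""
    def fit(self, X, y):
        self.X, self.y = X, y
        return self

    def predict(self, x):
        # Euclidean distance to every training feature; return the label
        # of the nearest one.
        dists = [math.dist(x, xi) for xi in self.X]
        return self.y[dists.index(min(dists))]

# One vectorization feature per number, paired with its age label.
X = [[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]]
y = [55, 16, 34]
model = NearestNeighborRegressor().fit(X, y)
pred = model.predict([0.85, 0.2])
```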
According to the method, the vectorization features of each annotated number are obtained, via an embedding-based vectorization, from the number-annotation information recorded by address book owners (which is relatively objective for the owner) and from the association relationships between numbers. Inputting these vectorization features into the age prediction model allows the age of an annotated number to be predicted, which can significantly improve the coverage and accuracy of age prediction. Because the specific annotation content need not be examined, it need not be understood either, so the method can be extended to any foreign language, and related businesses involving foreign languages or minority languages can be conveniently ported and extended.
In some embodiments, the determining at least one associated number corresponding to each number includes: for each number, using other numbers that exist in the same address book as the number as the at least one associated number corresponding to it. In some embodiments, for each number, the associated numbers corresponding to the number include the other numbers that exist in a user's address book at the same time as the number. For example, for number A: if the address book of the user of number B contains numbers A, M and N, and the address book of the user of number C contains numbers A, P and Q, then the associated numbers corresponding to number A include numbers M, N, P and Q.
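The co-occurrence rule above reduces to a set operation over the collected address books. The sketch below reproduces the number-A example from this embodiment; the dictionary-of-sets representation is an illustrative assumption.

```python
def cooccurring_associated_numbers(address_books, number):
    """For a given number, collect every other number that appears in the
    same address book as it. `address_books` maps an owner's number to
    the set of numbers in that owner's book."""
    associated = set()
    for contacts in address_books.values():
        if number in contacts:
            associated |= contacts - {number}
    return associated

# B's book contains A, M, N; C's book contains A, P, Q (as in the example).
books = {
    "B": {"A", "M", "N"},
    "C": {"A", "P", "Q"},
}
```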
In some embodiments, the at least one associated number comprises at least one first associated number, and the determining at least one associated number corresponding to each number includes: for each number, if the number exists in the address books of one or more first numbers, using the one or more first numbers as the at least one first associated number corresponding to the number. In some embodiments, for each number, if the number is included in the address book of the user of one or more other numbers, those other numbers may be used as associated numbers corresponding to the number. For example, if number A exists in the address book of the user of number B and also in the address book of the user of number C, then the associated numbers corresponding to number A include numbers B and C.
In some embodiments, the at least one associated number further comprises at least one second associated number, and the determining at least one associated number corresponding to each number includes: for each number, using one or more second numbers existing in the address book of the number as the at least one second associated number corresponding to it. In some embodiments, for each number, the other numbers existing in the address book of the user of that number may also be used as associated numbers corresponding to the number. For example, if numbers B and C exist in the address book of the user of number A, the associated numbers corresponding to number A include numbers B and C.
In some embodiments, for each number, a predetermined call condition is satisfied between the number and each of the at least one associated number. In some embodiments, for each number, the predetermined call condition needs to be satisfied between the number and each corresponding associated number, where the call condition includes, but is not limited to: the number of calls being greater than or equal to a predetermined count threshold, the total call duration being greater than or equal to a predetermined first duration threshold, and the time interval between the latest call time and the current time being less than or equal to a predetermined second duration threshold. In some embodiments, for each number, one or more associated numbers corresponding to the number may first be determined according to the foregoing methods, and then the at least one associated number that satisfies the predetermined call condition with the number may be selected from them.
In some embodiments, the call condition comprises: the number of calls is greater than or equal to a predetermined count threshold; the total call duration is greater than or equal to a predetermined first duration threshold; and the time interval between the latest call time and the current time is less than or equal to a predetermined second duration threshold.
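The three call conditions can be checked with a simple predicate over a call-statistics record. The threshold values and field names below are invented examples; the patent only requires that the thresholds be predetermined.

```python
def satisfies_call_condition(record, now,
                             min_calls=5,                  # example count threshold
                             min_total_secs=600,           # example first duration threshold
                             max_gap_secs=90 * 24 * 3600): # example second duration threshold
    """Check the three example call conditions between a number and a
    candidate associated number: enough calls, enough total talk time,
    and a recent-enough latest call."""
    return (record["call_count"] >= min_calls
            and record["total_duration_secs"] >= min_total_secs
            and now - record["last_call_ts"] <= max_gap_secs)

now = 1_700_000_000  # an arbitrary "current time" in epoch seconds
record = {"call_count": 8, "total_duration_secs": 1200,
          "last_call_ts": now - 3600}
```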
In some embodiments, the at least one associated number further comprises at least one third associated number, and the determining at least one associated number corresponding to each number includes at least one of the following: for each number, if the number exists in the address books of one or more first numbers and those first numbers exist in the address books of one or more third numbers, using the third numbers as at least one third associated number corresponding to the number; for each number, if one or more second numbers exist in the address book of the number and one or more fourth numbers exist in the address books of the second numbers, using the one or more fourth numbers as at least one third associated number corresponding to the number; and for each number, if one or more fifth numbers exist in the address book of the number and those fifth numbers exist in the address books of one or more sixth numbers other than the number, using the one or more sixth numbers as at least one third associated number corresponding to the number. In some embodiments, for each number, if the number exists in the address book of the user of a first number and the first number exists in the address book of the user of a third number, then in addition to the first number, the third number is also used as an associated number corresponding to the number. In some embodiments, for each number, if a second number exists in the address book of the user of the number and a fourth number exists in the address book of the user of the second number, then in addition to the second number, the fourth number is also used as an associated number corresponding to the number.
In some embodiments, if a fifth number exists in the address book of the user of the number and the fifth number also exists in the address book of the user of a sixth number, then in addition to the fifth number, the sixth number is also used as an associated number corresponding to the number.
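The one-degree and two-degree associations above amount to a bounded breadth-first search over a graph in which "X's address book contains Y" is read as an undirected edge. This is a sketch under that reading, with illustrative data; the patent itself enumerates the cases rather than prescribing a graph traversal.

```python
from collections import deque

def numbers_within_degree(address_books, number, max_degree=2):
    """Collect numbers whose degree of relationship to `number` is at
    most `max_degree`, treating address-book containment as an
    undirected edge, as in the one- and two-degree examples above."""
    adjacency = {}
    for owner, contacts in address_books.items():
        for c in contacts:
            adjacency.setdefault(owner, set()).add(c)
            adjacency.setdefault(c, set()).add(owner)
    seen, frontier, result = {number}, deque([(number, 0)]), set()
    while frontier:
        current, degree = frontier.popleft()
        if degree == max_degree:
            continue
        for neighbor in adjacency.get(current, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                result.add(neighbor)
                frontier.append((neighbor, degree + 1))
    return result

# A's book contains B, and B's book contains C: C is two degrees from A.
books = {"A": {"B"}, "B": {"C"}}
```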
In some embodiments, the segmenting the annotation data to obtain word data corresponding to the annotation data includes: performing word segmentation on the annotation data and removing meaningless words from the segmentation result to obtain the word data corresponding to the annotation data. In some embodiments, for the annotation data associated with each number, a predetermined word segmentation algorithm is used to segment the annotation data into a plurality of segmented words, one or more meaningless words among them are then removed, and the remaining segmented words are used as the word data corresponding to the annotation data, where meaningless words include, but are not limited to, symbols, numbers and the like without actual meaning.
In some embodiments, the segmenting the annotation data, removing meaningless words from the segmentation result, and obtaining the word data corresponding to the annotation data includes: performing word segmentation on the annotation data, removing meaningless words from the segmentation result, and using the one or more segmented words whose occurrence frequency in the segmentation result is greater than or equal to a predetermined frequency threshold as the word data corresponding to the annotation data. In some embodiments, for the annotation data associated with each number, a predetermined word segmentation algorithm is used to segment the annotation data into a plurality of segmented words, one or more meaningless words among them are removed, the segmented words whose occurrence frequency is greater than or equal to the predetermined frequency threshold are selected from the remainder, and those words are used as the word data corresponding to the annotation data.
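A minimal sketch of the filtering described above, assuming "meaningless" means pure symbols or digits and using an example frequency threshold of 2; both assumptions are mine, since the patent only names the categories and a predetermined threshold.

```python
from collections import Counter
import re

# Segments made up only of non-word characters, digits, or underscores
# are treated as meaningless (symbols and numbers without actual meaning).
MEANINGLESS = re.compile(r"^[\W\d_]+$")

def filter_word_data(segmented_words, min_freq=2):
    """Remove meaningless segments, then keep only the words whose
    occurrence frequency in the segmentation result meets the
    predetermined threshold."""
    kept = [w for w in segmented_words if not MEANINGLESS.match(w)]
    counts = Counter(kept)
    return [w for w in kept if counts[w] >= min_freq]

words = ["Dad", "!!", "Dad", "007", "driver"]
```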
In some embodiments, the step S12 includes: the computer device inputs the word data and the at least one associated number into a trained word vector model to obtain the one or more vectorization features corresponding to each number output by the word vector model. In some embodiments, the word data corresponding to each contact number and the at least one associated number corresponding to it (each associated number being treated as a word datum) may be input into a trained word vector model, and the word vector model outputs the one or more vectorization features corresponding to the contact number.
In some embodiments, the method further comprises: the computer device sets training parameters corresponding to the word vector model, and trains the word vector model according to the annotation data respectively associated with the plurality of numbers to obtain a trained word vector model. In some embodiments, the training parameters corresponding to the word vector model are first preset, including, but not limited to, the vectorization dimension count (e.g., 32), the minimum word frequency (e.g., 100), and the like; annotation data respectively associated with a plurality of numbers are then randomly sampled as training data; and the training data are then used to perform unsupervised (label-free) training of the word vector model, obtaining the trained word vector model.
In some embodiments, the training parameters include at least one of: the vectorization dimension count; the minimum word frequency. In some embodiments, the vectorization dimension count refers to how many dimensions the vector representing a word datum has. In some embodiments, the minimum word frequency (min_count) is used to discard rare, low-frequency words, i.e., such low-frequency words are not represented by vectorization features and have no corresponding vectorization feature.
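The effect of the minimum word frequency can be sketched as vocabulary pruning: words seen fewer than min_count times receive no vector, and each kept word would be represented by a vector_size-dimensional vector. The parameter names mirror those of gensim's Word2Vec (vector_size, min_count), but this stdlib stand-in only builds the kept vocabulary rather than training embeddings.

```python
from collections import Counter

def build_vocabulary(sentences, min_count=100, vector_size=32):
    """Return the set of words that would receive a vectorization
    feature: those appearing at least `min_count` times across all
    training sentences. `vector_size` is carried only to document the
    other training parameter (each kept word would get a vector of that
    many dimensions)."""
    counts = Counter(w for s in sentences for w in s)
    return {w for w, c in counts.items() if c >= min_count}

# Tiny example corpus of segmented annotation data.
sentences = [["Dad", "driver"], ["Dad", "teacher"], ["Dad"]]
vocab = build_vocabulary(sentences, min_count=2)
```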
In some embodiments, the machine learning regressor comprises any of: a random forest; a gradient-boosted regression tree. In some embodiments, a Random Forest regressor is a regressor comprising multiple regression trees, which trains on and predicts samples using those multiple trees. In some embodiments, a Gradient Boosting Regression Tree is a regression method based on regression trees that, in a regression problem, fits an approximation of the residual by using the negative gradient of the loss function evaluated at the current model's value.
In some embodiments, the method further comprises: the computer device performs word segmentation on target annotation data associated with a target number to obtain target word data corresponding to the target annotation data; vectorizes the target word data and at least one target associated number corresponding to the target number to obtain a target vectorization feature corresponding to the target number; and inputs the target vectorization feature into the age prediction model to obtain predicted age information corresponding to the target number output by the age prediction model. In some embodiments, in the prediction stage, for a contact number whose age is to be predicted, i.e., a target number, the target annotation data associated with the target number in one or more address books and the at least one target associated number corresponding to the target number are obtained first; then the target vectorization features corresponding to the target annotation data and the at least one target associated number (each target associated number being treated as a word datum) are obtained (specifically, vectorization features are obtained from annotation data as described above: word segmentation first, then vectorization, which is not repeated here); the target vectorization features are then input into the age prediction model, and the age prediction model outputs the predicted age corresponding to the number.
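The prediction stage (segment, vectorize, regress) can be strung together as follows. The embedding table, the averaging of token vectors into a single target feature, and the lambda regressor are all illustrative assumptions of this sketch, not the patent's actual models.

```python
import re

# Toy embedding table standing in for the trained word vector model;
# all names and values are made up for illustration.
EMBEDDINGS = {"Dad": [0.9, 0.1], "Uncle": [0.8, 0.2], "Classmate": [0.1, 0.9]}

def predict_age(annotation, associated_numbers, regress):
    """Prediction stage for a target number: segment the target
    annotation data, vectorize the tokens (associated numbers are
    appended as tokens and embedded when known), pool the vectors into
    one target feature by averaging, and regress an age from it."""
    tokens = [w for w in re.split(r"[\s,]+", annotation) if w]
    tokens += associated_numbers
    vectors = [EMBEDDINGS[t] for t in tokens if t in EMBEDDINGS]
    feature = [sum(col) / len(vectors) for col in zip(*vectors)]
    return regress(feature)

# A trivial stand-in regressor: a larger first dimension maps to a higher age.
age = predict_age("Dad, Uncle", ["13800000000"],
                  regress=lambda f: round(20 + 40 * f[0]))
```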
Fig. 2 shows a flowchart of a method for age prediction according to an embodiment of the present application, the method comprising step S21, step S22 and step S23. In step S21, the computer device performs word segmentation on target annotation data associated with a target number to obtain target word data corresponding to the target annotation data; in step S22, the computer device vectorizes the target word data and at least one target associated number corresponding to the target number to obtain a target vectorization feature corresponding to the target number; in step S23, the computer device inputs the target vectorization feature into an age prediction model and obtains predicted age information, corresponding to the target number, output by the age prediction model.
In step S21, the computer device performs word segmentation on the target annotation data associated with the target number to obtain target word data corresponding to the target annotation data. The related operations are described in detail above, and are not described in detail herein.
In step S22, the computer device performs vectorization on the target word data and at least one target associated number corresponding to the target number to obtain a target vectorization feature corresponding to the target number. The related operations are described in detail above, and are not described herein again.
In step S23, the computer device inputs the target vectorization feature into an age prediction model, and obtains predicted age information corresponding to the target number output by the age prediction model. The related operations are described in detail above, and are not described in detail herein.
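The three steps S21 to S23 can be sketched as a single pipeline. Everything below (the toy tokenizer, vectorizer, regressor and the sample numbers and remarks) is an illustrative stand-in; a real system would plug in the trained word vector model and machine learning regressor described above.

```python
def predict_age(target_number, annotation_db, associated_db, tokenize, vectorize, model):
    """Steps S21-S23: segment the annotation data, vectorize, run the regressor."""
    words = tokenize(annotation_db[target_number])    # S21: word segmentation
    tokens = words + associated_db[target_number]     # associated numbers treated as word data
    feature = vectorize(tokens)                       # S22: vectorization
    return model(feature)                             # S23: age prediction

# Illustrative stand-ins for the trained components described above:
def toy_tokenize(text):
    return text.split()

def toy_vectorize(tokens):
    return [float(len(t)) for t in tokens]      # one toy scalar per token

def toy_model(feature):
    return 20.0 + sum(feature) / len(feature)   # toy "regressor"

annotation_db = {"13800000000": "mom teacher"}       # invented sample data
associated_db = {"13800000000": ["13900000001"]}
age = predict_age("13800000000", annotation_db, associated_db,
                  toy_tokenize, toy_vectorize, toy_model)
```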
In some embodiments, the method further comprises: the computer device determines at least one associated number corresponding to each number according to a plurality of address book data, obtains the annotation data associated with each number, and performs word segmentation on the annotation data to obtain word data corresponding to the annotation data; vectorizes the word data and the at least one associated number to obtain one or more vectorization features corresponding to each number; and performs supervised learning based on a predetermined machine learning regressor according to the one or more vectorization features and the age label corresponding to each number to obtain the age prediction model. The related operations are described in detail above and are not repeated here.
Fig. 3 shows a block diagram of a computer device for age prediction according to an embodiment of the present application, the device comprising a module 11, a module 12 and a module 13. The module 11 is configured to determine at least one associated number corresponding to each number according to a plurality of address book data, obtain annotation data associated with each number, and perform word segmentation on the annotation data to obtain word data corresponding to the annotation data; the module 12 is configured to vectorize the word data and the at least one associated number to obtain one or more vectorization features corresponding to each number; and the module 13 is configured to perform supervised learning based on a predetermined machine learning regressor according to the one or more vectorization features and the age label corresponding to each number to obtain an age prediction model.
The module 11 is configured to determine at least one associated number corresponding to each number according to a plurality of address book data, obtain annotation data associated with each number, and perform word segmentation on the annotation data to obtain word data corresponding to the annotation data. In some embodiments, for a certain number, an associated number corresponding to that number is another number having a predetermined association with it. For example, the predetermined association may be that the other number exists in the address book of the user of the number, or that the number exists in the address book of the user of the other number, or that the degree of relationship between the two numbers is less than or equal to a predetermined degree threshold (for example, one degree). For example, if number B exists in the address book of the user of number A, the degree of relationship between number A and number B is one degree; if number B exists in the address book of the user of number A, and number C exists in the address book of the user of number B, the degree of relationship between number A and number C is two degrees; likewise, if number B exists in the address book of the user of number A, and number B also exists in the address book of the user of number C, the degree of relationship between number A and number C is two degrees. In some embodiments, a user may add remarks (e.g., title, nickname, occupation, name, etc.) to a certain contact number in an address book. Address books of multiple users are collected in advance, and the data of each address book are integrated and summarized. For a certain contact number, if remark information corresponding to that contact number exists in one or more address books, the contact number is associated with those address books, and the annotation data associated with the contact number comprises the remark information corresponding to it in each of the associated address books. In some embodiments, if the annotation data associated with a number includes multiple pieces of remark information, they may be spliced together using a predetermined separator (e.g., a space, a comma, etc.). In some embodiments, for the annotation data associated with each number, the annotation data is segmented into a plurality of word data using a predetermined word segmentation algorithm, for example, the jieba word segmentation algorithm (e.g., the Python version at https://github.com/fxsjy/jieba).
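The splicing and segmentation steps above can be sketched as follows. A plain whitespace tokenizer stands in for a real segmenter here; in practice a Chinese word segmentation library such as jieba would be used, and the address book contents below are invented sample data.

```python
from collections import defaultdict

def collect_annotation_data(address_books, separator=" "):
    """Splice together the remark information for each contact number
    across all address books, using a predetermined separator."""
    remarks = defaultdict(list)
    for book in address_books:               # each book maps number -> remark text
        for number, remark in book.items():
            remarks[number].append(remark)
    return {num: separator.join(texts) for num, texts in remarks.items()}

def segment(annotation, tokenizer=str.split):
    """Split annotation data into word data. In practice a Chinese word
    segmenter such as jieba (https://github.com/fxsjy/jieba) would be
    used; whitespace splitting merely stands in for it here."""
    return tokenizer(annotation)

books = [
    {"13800000000": "mom", "13900000001": "Dr. Li dentist"},
    {"13800000000": "mother teacher"},
]
annotations = collect_annotation_data(books)
words = segment(annotations["13800000000"])
```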
The module 12 is configured to vectorize the word data and the at least one associated number to obtain one or more vectorization features corresponding to each number. In some embodiments, the word data corresponding to each contact number and the at least one associated number corresponding to that contact number (each associated number being regarded as word data) may be vectorized by a predetermined embedding algorithm, which represents each word data as a dense, low-dimensional numerical vector, yielding one or more vectorization features corresponding to the contact number; each vectorization feature corresponds to one word data and represents it as one vector. For example, the word2vec embedding algorithm may be used (word2vec is an open-source word vector algorithm released by Google in 2013).
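A minimal sketch of the vectorization step, under the assumption that a trained embedding is available: here a deterministic hash-derived vector stands in for a real word2vec embedding, and the associated numbers are fed in as ordinary tokens alongside the remark words, exactly as the text describes.

```python
import hashlib

def toy_embed(token, dim=4):
    """Deterministic stand-in for a trained embedding such as word2vec:
    derive a small dense vector from a hash of the token."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return [digest[i] / 255.0 for i in range(dim)]

def vectorize_number(word_data, associated_numbers, dim=4):
    """One vectorization feature per token; each associated number is
    treated as word data alongside the segmented remark words."""
    tokens = list(word_data) + list(associated_numbers)
    return {tok: toy_embed(tok, dim) for tok in tokens}

# Invented sample word data and associated number:
features = vectorize_number(["mom", "teacher"], ["13900000001"])
```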
The module 13 is configured to perform supervised learning based on a predetermined machine learning regressor according to the one or more vectorization features and the age label corresponding to each number, to obtain an age prediction model. In some embodiments, the ground-truth age data corresponding to each contact number is used as that contact number's supervised-learning label, i.e., its age label, and supervised learning is performed with a predetermined machine learning regressor on the one or more vectorization features and the age label of each contact number, so that an age prediction model is obtained through training. In some embodiments, the type of machine learning regressor includes, but is not limited to, random forests, gradient boosting regression trees, and the like. In some embodiments, supervised learning refers to training a model on a large amount of labeled sample data, adjusting the model's parameters so that it can make predictions on new, unlabeled data. In some embodiments, the age prediction model is configured to predict the age of the user of a certain number according to the remark information for that number in one address book, or according to the annotation data associated with that number in one or more address books together with at least one associated number corresponding to it. In some embodiments, products or services suited to different age levels can be provided to user groups of different ages according to the prediction results, improving the efficiency of the products or services and controlling or reducing the corresponding content or privacy risks.
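The supervised-learning step can be sketched with a deliberately tiny stand-in regressor. A nearest-neighbor rule replaces the random forest or GBRT named in the text purely for illustration, and the feature vectors and age labels are invented: the point is only that features plus age labels are enough to fit a model that maps new features to ages.

```python
import math

def mean_vector(vecs):
    """Average the per-token vectors into one fixed-length feature per number."""
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

class NearestNeighborRegressor:
    """Stand-in for the random-forest / GBRT regressor named in the text:
    predicts the age label of the closest training feature."""
    def fit(self, features, age_labels):
        self.data = list(zip(features, age_labels))
        return self

    def predict(self, feature):
        return min(self.data, key=lambda pair: math.dist(pair[0], feature))[1]

# Invented training features (e.g. averaged embeddings) and age labels:
train_features = [[0.1, 0.2], [0.8, 0.9], [0.5, 0.5]]
train_ages = [16.0, 62.0, 35.0]
reg = NearestNeighborRegressor().fit(train_features, train_ages)
```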
According to the method above, a vectorization feature is obtained by embedding for each annotated number, based on the relatively objective annotation information recorded by address book owners for user numbers and on the association relationships among numbers; the vectorization feature is input into the age prediction model so that age prediction can be performed for the annotated number. This can significantly improve the coverage and accuracy of age prediction. Moreover, because the method does not depend on the specific annotated content, the annotated content does not need to be understood, so the method can be extended to any foreign language and can be conveniently ported and extended to businesses involving overseas languages or domestic minority languages.
In some embodiments, the determining of at least one associated number corresponding to each number comprises: for each number, taking other numbers that exist in the same address book as the number as the at least one associated number corresponding to the number. Here, the related operations are the same as or similar to those of the embodiment shown in fig. 1, and therefore are not described again, and are included herein by reference.
In some embodiments, the at least one associated number comprises at least one first associated number; wherein the determining of at least one associated number corresponding to each number comprises: for each number, if the number exists in the address books of one or more first numbers, taking the one or more first numbers as at least one first associated number corresponding to the number. Here, the related operations are the same as or similar to those of the embodiment shown in fig. 1, and thus are not described again, and are included herein by reference.
In some embodiments, the at least one associated number further comprises at least one second associated number; wherein the determining of at least one associated number corresponding to each number further comprises: for each number, taking one or more second numbers existing in the address book of the number as at least one second associated number corresponding to the number. Here, the related operations are the same as or similar to those of the embodiment shown in fig. 1, and therefore are not described again, and are included herein by reference.
In some embodiments, for each number, a predetermined call condition is satisfied between the number and each of the at least one associated number. Here, the related operations are the same as or similar to those of the embodiment shown in fig. 1, and therefore are not described again, and are included herein by reference.
In some embodiments, the call condition comprises: the number of calls is greater than or equal to a predetermined count threshold; the total call duration is greater than or equal to a predetermined first duration threshold; the time interval between the most recent call and the current time is less than or equal to a predetermined second duration threshold.
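A minimal sketch of checking these call conditions. All threshold values below (3 calls, 300 seconds, 90 days) are illustrative placeholders, not values specified by the patent, and the call-record shape is an assumption made for illustration:

```python
import time

def satisfies_call_condition(record,
                             min_calls=3,                       # count threshold (placeholder)
                             min_total_seconds=300,             # first duration threshold (placeholder)
                             max_idle_seconds=90 * 24 * 3600,   # second duration threshold (placeholder)
                             now=None):
    """Check the predetermined call conditions between two numbers:
    enough calls, enough total talk time, and a recent-enough last call."""
    now = time.time() if now is None else now
    return (record["call_count"] >= min_calls
            and record["total_seconds"] >= min_total_seconds
            and now - record["last_call_ts"] <= max_idle_seconds)

record = {"call_count": 5, "total_seconds": 600, "last_call_ts": 1_000_000}
ok = satisfies_call_condition(record, now=1_000_000 + 24 * 3600)
```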
In some embodiments, the at least one associated number further comprises at least one third associated number; wherein the determining of at least one associated number corresponding to each number comprises at least one of: for each number, if the number exists in the address books of one or more first numbers and those first numbers exist in the address books of one or more third numbers, taking the one or more third numbers as at least one third associated number corresponding to the number; for each number, if one or more second numbers exist in the address book of the number and one or more fourth numbers exist in the address books of those second numbers, taking the one or more fourth numbers as at least one third associated number corresponding to the number; and for each number, if one or more fifth numbers exist in the address book of the number and those fifth numbers exist in the address books of one or more sixth numbers other than the number, taking the one or more sixth numbers as at least one third associated number corresponding to the number. Here, the related operations are the same as or similar to those of the embodiment shown in fig. 1, and therefore are not described again, and are included herein by reference.
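The one-degree and two-degree relationships described above can be sketched over a toy address book mapping. The data structure (owner number to set of contact numbers) and the sample numbers are assumptions made for illustration:

```python
def one_degree(books, number):
    """Numbers at one degree of relationship: owners whose address book
    contains `number` (first associated numbers), plus the numbers in
    `number`'s own address book (second associated numbers)."""
    assoc = set()
    for owner, contacts in books.items():   # books: owner number -> set of contacts
        if number in contacts:
            assoc.add(owner)
    assoc |= books.get(number, set())
    assoc.discard(number)
    return assoc

def two_degree(books, number):
    """Third associated numbers: one degree away from a one-degree neighbor,
    covering all three branches described in the text."""
    first = one_degree(books, number)
    second = set()
    for n in first:
        second |= one_degree(books, n)
    return second - first - {number}

# Toy address books: A's book contains B; B's book contains C; C's book contains B.
books = {
    "A": {"B"},
    "B": {"C"},
    "C": {"B"},
}
```

For example, A and C end up two degrees apart both via the chain A → B → C and via B appearing in the address books of both A and C.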
In some embodiments, the performing word segmentation on the annotation data to obtain word data corresponding to the annotation data comprises: performing word segmentation on the annotation data, removing meaningless words from the word segmentation result, and obtaining word data corresponding to the annotation data. Here, the related operations are the same as or similar to those of the embodiment shown in fig. 1, and thus are not described again, and are included herein by reference.
In some embodiments, the performing word segmentation on the annotation data, removing meaningless words from the word segmentation result, and obtaining word data corresponding to the annotation data comprises: performing word segmentation on the annotation data, removing meaningless words from the word segmentation result, and taking one or more segmented words whose occurrence frequency in the word segmentation result is greater than or equal to a predetermined frequency threshold as the word data corresponding to the annotation data. Here, the related operations are the same as or similar to those of the embodiment shown in fig. 1, and therefore are not described again, and are included herein by reference.
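A sketch of the stop-word removal and frequency-threshold filtering described above, with an invented stop-word list standing in for the "meaningless words" and invented sample tokens:

```python
from collections import Counter

STOPWORDS = {"the", "a", "of"}   # illustrative "meaningless words" only

def filter_word_data(segmented_words, min_frequency=2):
    """Drop meaningless words, then keep only segmented words whose
    occurrence frequency reaches the predetermined frequency threshold."""
    kept = [w for w in segmented_words if w not in STOPWORDS]
    counts = Counter(kept)
    return [w for w, c in counts.items() if c >= min_frequency]

words = ["mom", "the", "mom", "dentist", "teacher", "teacher"]
word_data = filter_word_data(words)
```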
In some embodiments, the module 12 is configured to: input the word data and the at least one associated number into a trained word vector model to obtain the one or more vectorization features, corresponding to each number, output by the word vector model. Here, the related operations are the same as or similar to those of the embodiment shown in fig. 1, and therefore are not described again, and are included herein by reference.
In some embodiments, the apparatus is further configured to: set training parameters corresponding to the word vector model; and train the word vector model according to the annotation data respectively associated with a plurality of numbers to obtain a trained word vector model. Here, the related operations are the same as or similar to those of the embodiment shown in fig. 1, and therefore are not described again, and are included herein by reference.
In some embodiments, the training parameters include at least one of: the number of vectorization dimensions; the minimum word frequency. Here, the related operations are the same as or similar to those of the embodiment shown in fig. 1, and therefore are not described again, and are included herein by reference.
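These two training parameters map naturally onto word2vec-style trainers. The parameter names below follow gensim's Word2Vec API (`vector_size` for the number of vectorization dimensions, `min_count` for the minimum word frequency) and the values are illustrative, not taken from the patent:

```python
# Illustrative word-vector training parameters; the names follow gensim's
# Word2Vec API, not terminology from the patent itself.
word_vector_params = {
    "vector_size": 100,  # dimensionality of each vectorization feature
    "min_count": 2,      # words occurring fewer times are dropped
}
# In practice one might then train, e.g.:
#   model = gensim.models.Word2Vec(sentences, **word_vector_params)
```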
In some embodiments, the machine learning regressor comprises any of: a random forest; a gradient boosting regression tree. Here, the related operations are the same as or similar to those of the embodiment shown in fig. 1, and thus are not described again, and are included herein by reference.
In some embodiments, the apparatus is further configured to: perform word segmentation on target annotation data associated with a target number to obtain target word data corresponding to the target annotation data; vectorize the target word data and at least one target associated number corresponding to the target number to obtain a target vectorization feature corresponding to the target number; and input the target vectorization feature into the age prediction model to obtain predicted age information, corresponding to the target number, output by the age prediction model. Here, the related operations are the same as or similar to those of the embodiment shown in fig. 1, and thus are not described again, and are included herein by reference.
Fig. 4 shows a block diagram of a computer device for age prediction according to an embodiment of the present application, the device comprising a module 21, a module 22 and a module 23. The module 21 is configured to perform word segmentation on target annotation data associated with a target number to obtain target word data corresponding to the target annotation data; the module 22 is configured to vectorize the target word data and at least one target associated number corresponding to the target number to obtain a target vectorization feature corresponding to the target number; and the module 23 is configured to input the target vectorization feature into an age prediction model to obtain predicted age information, corresponding to the target number, output by the age prediction model.
The module 21 is configured to perform word segmentation on the target annotation data associated with the target number to obtain target word data corresponding to the target annotation data. The related operations are described in detail above and are not repeated here.
The module 22 is configured to vectorize the target word data and the at least one target associated number corresponding to the target number to obtain a target vectorization feature corresponding to the target number. The related operations are described in detail above and are not repeated here.
The module 23 is configured to input the target vectorization feature into an age prediction model to obtain predicted age information, corresponding to the target number, output by the age prediction model. The related operations are described in detail above and are not repeated here.
In some embodiments, the apparatus is further configured to: determine at least one associated number corresponding to each number according to a plurality of address book data, obtain annotation data associated with each number, and perform word segmentation on the annotation data to obtain word data corresponding to the annotation data; vectorize the word data and the at least one associated number to obtain one or more vectorization features corresponding to each number; and perform supervised learning based on a predetermined machine learning regressor according to the one or more vectorization features and the age label corresponding to each number to obtain the age prediction model. The related operations are described in detail above and are not repeated here.
In addition to the methods and apparatus described in the embodiments above, the present application also provides a computer readable storage medium storing computer code that, when executed, performs the method as described in any of the preceding claims.
The present application also provides a computer program product, which when executed by a computer device, performs the method of any of the preceding claims.
The present application further provides a computer device, comprising:
one or more processors;
a memory for storing one or more computer programs;
the one or more computer programs, when executed by the one or more processors, cause the one or more processors to implement the method of any preceding claim.
FIG. 5 illustrates an exemplary system that can be used to implement the various embodiments described herein.
in some embodiments, as shown in FIG. 5, the system 300 can be implemented as any of the devices in the various embodiments described. In some embodiments, system 300 may include one or more computer-readable media (e.g., system memory or NVM/storage 320) having instructions and one or more processors (e.g., processor(s) 305) coupled with the one or more computer-readable media and configured to execute the instructions to implement modules to perform the actions described herein.
For one embodiment, system control module 310 may include any suitable interface controllers to provide any suitable interface to at least one of processor(s) 305 and/or any suitable device or component in communication with system control module 310.
The system control module 310 may include a memory controller module 330 to provide an interface to the system memory 315. Memory controller module 330 may be a hardware module, a software module, and/or a firmware module.
System memory 315 may be used, for example, to load and store data and/or instructions for system 300. For one embodiment, system memory 315 may include any suitable volatile memory, such as suitable DRAM. In some embodiments, the system memory 315 may include double data rate type four synchronous dynamic random access memory (DDR4 SDRAM).
For one embodiment, system control module 310 may include one or more input/output (I/O) controllers to provide an interface to NVM/storage 320 and communication interface(s) 325.
For example, NVM/storage 320 may be used to store data and/or instructions. NVM/storage 320 may include any suitable non-volatile memory (e.g., flash memory) and/or may include any suitable non-volatile storage device(s) (e.g., one or more Hard Disk Drives (HDDs), one or more Compact Disc (CD) drives, and/or one or more Digital Versatile Disc (DVD) drives).
NVM/storage 320 may include storage resources that are physically part of the device on which system 300 is installed or may be accessed by the device and not necessarily part of the device. For example, NVM/storage 320 may be accessible over a network via communication interface(s) 325.
Communication interface(s) 325 may provide an interface for system 300 to communicate over one or more networks and/or with any other suitable device. System 300 may wirelessly communicate with one or more components of a wireless network according to any of one or more wireless network standards and/or protocols.
For one embodiment, at least one of the processor(s) 305 may be packaged together with logic for one or more controller(s) (e.g., memory controller module 330) of the system control module 310. For one embodiment, at least one of the processor(s) 305 may be packaged together with logic for one or more controller(s) of the system control module 310 to form a System In Package (SiP). For one embodiment, at least one of the processor(s) 305 may be integrated on the same die with logic for one or more controller(s) of the system control module 310. For one embodiment, at least one of the processor(s) 305 may be integrated on the same die with logic for one or more controller(s) of the system control module 310 to form a system on a chip (SoC).
In various embodiments, system 300 may be, but is not limited to being: a server, a workstation, a desktop computing device, or a mobile computing device (e.g., a laptop computing device, a handheld computing device, a tablet, a netbook, etc.). In various embodiments, system 300 may have more or fewer components and/or different architectures. For example, in some embodiments, system 300 includes one or more cameras, a keyboard, a Liquid Crystal Display (LCD) screen (including a touch screen display), a non-volatile memory port, multiple antennas, a graphics chip, an Application Specific Integrated Circuit (ASIC), and speakers.
It should be noted that the present application may be implemented in software and/or a combination of software and hardware, for example, implemented using Application Specific Integrated Circuits (ASICs), general purpose computers or any other similar hardware devices. In one embodiment, the software programs of the present application may be executed by a processor to implement the steps or functions described above. Likewise, the software programs (including associated data structures) of the present application may be stored in a computer readable recording medium, such as RAM memory, magnetic or optical drive or diskette and the like. Additionally, some of the steps or functions of the present application may be implemented in hardware, for example, as circuitry that cooperates with the processor to perform various steps or functions.
In addition, some of the present application may be implemented as a computer program product, such as computer program instructions, which when executed by a computer, may invoke or provide methods and/or techniques in accordance with the present application through the operation of the computer. Those skilled in the art will appreciate that the form in which the computer program instructions reside on a computer-readable medium includes, but is not limited to, source files, executable files, installation package files, and the like, and that the manner in which the computer program instructions are executed by a computer includes, but is not limited to: the computer directly executes the instruction, or the computer compiles the instruction and then executes the corresponding compiled program, or the computer reads and executes the instruction, or the computer reads and installs the instruction and then executes the corresponding installed program. Computer-readable media herein can be any available computer-readable storage media or communication media that can be accessed by a computer.
Communication media includes media whereby communication signals, including, for example, computer readable instructions, data structures, program modules, or other data, are transmitted from one system to another. Communication media may include conductive transmission media such as cables and wires (e.g., fiber optics, coaxial, etc.) and wireless (non-conductive transmission) media capable of propagating energy waves such as acoustic, electromagnetic, RF, microwave, and infrared. Computer readable instructions, data structures, program modules, or other data may be embodied in a modulated data signal, for example, in a wireless medium such as a carrier wave or similar mechanism such as is embodied as part of spread spectrum techniques. The term "modulated data signal" means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. The modulation may be analog, digital or hybrid modulation techniques.
By way of example, and not limitation, computer-readable storage media may include volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules or other data. For example, computer-readable storage media include, but are not limited to, volatile memory such as random access memory (RAM, DRAM, SRAM); non-volatile memory such as flash memory, various read-only memories (ROM, PROM, EPROM, EEPROM), and magnetic and ferromagnetic/ferroelectric memories (MRAM, FeRAM); and magnetic and optical storage devices (hard disk, tape, CD, DVD); or other media, now known or later developed, that can store computer-readable information/data for use by a computer system.
An embodiment according to the present application herein comprises an apparatus comprising a memory for storing computer program instructions and a processor for executing the program instructions, wherein the computer program instructions, when executed by the processor, trigger the apparatus to perform a method and/or solution according to embodiments of the present application as described above.
It will be evident to those skilled in the art that the present application is not limited to the details of the foregoing illustrative embodiments, and that the present application may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the application being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned. Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. A plurality of units or means recited in the apparatus claims may also be implemented by one unit or means in software or hardware. The terms first, second, etc. are used to denote names, but not to denote any particular order.

Claims (19)

1. A method for age prediction, wherein the method comprises:
determining at least one associated number corresponding to each number according to a plurality of address book data, obtaining annotation data associated with each number, and performing word segmentation on the annotation data to obtain word data corresponding to the annotation data;
vectorizing the word data and the at least one associated number to obtain one or more vectorization characteristics corresponding to each number;
and performing supervised learning based on a preset machine learning regressor according to the one or more vectorization features and the age label corresponding to each number to obtain an age prediction model.
2. The method of claim 1, wherein the determining of at least one associated number corresponding to each number comprises:
for each number, taking other numbers that exist in the same address book as the number as the at least one associated number corresponding to the number.
3. The method of claim 1, wherein the at least one associated number comprises at least one first associated number;
wherein the determining of at least one associated number corresponding to each number comprises:
for each number, if the number exists in the address books of one or more first numbers, taking the one or more first numbers as at least one first associated number corresponding to the number.
4. The method of claim 3, wherein the at least one associated number further comprises at least one second associated number;
wherein the determining of at least one associated number corresponding to each number further comprises:
for each number, taking one or more second numbers existing in the address book of the number as at least one second associated number corresponding to the number.
5. A method according to claim 3 or 4, wherein for each number a predetermined call condition is satisfied between that number and each of the at least one associated number.
6. The method of claim 5, wherein the call condition comprises:
the number of calls is greater than or equal to a predetermined count threshold;
the total call duration is greater than or equal to a predetermined first duration threshold;
the time interval between the most recent call and the current time is less than or equal to a predetermined second duration threshold.
7. The method of claim 4, wherein the at least one association number further comprises at least one third association number;
wherein determining the at least one associated number corresponding to each number comprises at least one of:
for each number, if the number exists in the address books of one or more first numbers and the one or more first numbers exist in the address books of one or more third numbers, taking the one or more third numbers as at least one third associated number corresponding to the number;
for each number, if one or more second numbers exist in the address book of the number and one or more fourth numbers exist in the address books of the one or more second numbers, taking the one or more fourth numbers as at least one third associated number corresponding to the number;
for each number, if one or more fifth numbers exist in the address book of the number and the one or more fifth numbers exist in the address books of one or more sixth numbers other than the number, taking the one or more sixth numbers as at least one third associated number corresponding to the number.
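The first, second, and third associated numbers of claims 3, 4, and 7 amount to one-hop and two-hop neighbours in the directed graph of address-book entries. A minimal sketch, assuming each address book is given as a set of numbers and ignoring the call-condition filter of claim 5:

```python
def associated_numbers(number, books):
    """books: dict mapping each number to the set of numbers in its address
    book.  Returns (first, second, third) associated number sets following
    claims 3, 4, and 7; the call-condition filtering of claim 5 is omitted."""
    def stored_by(n):  # numbers (other than n) whose address books contain n
        return {o for o, bk in books.items() if n in bk and o != n}

    first = stored_by(number)                 # claim 3: who stores me
    second = books.get(number, set())         # claim 4: whom I store
    third = set()                             # claim 7: three two-hop cases
    for f in first:                           # case 1: who stores a number
        third |= stored_by(f)                 #         that stores me
    for s in second:                          # case 2: contacts of my contacts
        third |= books.get(s, set())
    for s in second:                          # case 3: others who also store
        third |= stored_by(s)                 #         my contacts
    third -= {number}
    return first, second, third
```

In practice the three cases of claim 7 overlap heavily, which is why the claim frames them as alternatives ("at least one of").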
8. The method of claim 1, wherein segmenting the labeled data to obtain the word data corresponding to the labeled data comprises:
performing word segmentation on the labeled data and removing meaningless words from the segmentation result to obtain the word data corresponding to the labeled data.
9. The method of claim 8, wherein performing word segmentation on the labeled data and removing meaningless words from the segmentation result to obtain the word data corresponding to the labeled data comprises:
performing word segmentation on the labeled data, removing meaningless words from the segmentation result, and taking one or more segmented words whose frequency of occurrence in the segmentation result is greater than or equal to a preset frequency threshold as the word data corresponding to the labeled data.
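The segment / filter / threshold pipeline of claims 8 and 9 can be sketched as below. This is a minimal tokenization stand-in: real Chinese label data would require a CJK word segmenter (e.g. jieba), and both the stop list and the frequency threshold are hypothetical placeholders for the "meaningless words" and "preset frequency threshold" of the claims:

```python
from collections import Counter

# Hypothetical stop list and threshold; the claims leave both unspecified.
STOP_WORDS = {"the", "a", "of", "and"}
FREQ_THRESHOLD = 2

def extract_word_data(labels):
    """labels: list of label strings associated with one number.
    Tokenize, drop meaningless words, and keep only tokens whose frequency
    in the segmentation result meets the preset threshold (claim 9)."""
    tokens = [w.lower() for text in labels for w in text.split()
              if w.lower() not in STOP_WORDS]
    counts = Counter(tokens)
    return [w for w in counts if counts[w] >= FREQ_THRESHOLD]
```

The frequency cutoff serves the same purpose as a minimum-document-frequency filter: rare, idiosyncratic labels contribute noise rather than age signal.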
10. The method of claim 1, wherein vectorizing the word data and the at least one associated number to obtain the one or more vectorized features corresponding to each number comprises:
inputting the word data and the at least one associated number into a trained word vector model to obtain the one or more vectorized features, corresponding to each number, output by the word vector model.
11. The method of claim 10, wherein the method further comprises:
setting training parameters corresponding to the word vector model;
training the word vector model according to the labeled data respectively associated with the plurality of numbers to obtain the trained word vector model.
12. The method of claim 11, wherein the training parameters comprise at least one of:
a vectorization dimension count;
a minimum word frequency.
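The two training parameters of claim 12 map directly onto the standard knobs of word2vec-style models. The sketch below is a toy stand-in that only illustrates how the two parameters gate the output: vectors here are deterministic byte hashes, not learned embeddings, and the parameter values are hypothetical. A real pipeline would call something like gensim's `Word2Vec(sentences, vector_size=..., min_count=...)`:

```python
import hashlib
from collections import Counter

# Hypothetical parameter values; claim 12 only names the two knobs.
PARAMS = {"vector_size": 8, "min_count": 2}

def train_word_vectors(corpus, params=PARAMS):
    """corpus: list of token lists.  Only words meeting the minimum word
    frequency receive a vector; the vector length is the vectorization
    dimension count.  Hash-based vectors stand in for learned embeddings."""
    counts = Counter(w for sentence in corpus for w in sentence)
    def vec(word):
        digest = hashlib.md5(word.encode("utf-8")).digest()
        return [b / 255.0 for b in digest[:params["vector_size"]]]
    return {w: vec(w) for w, c in counts.items() if c >= params["min_count"]}
```

Raising `min_count` shrinks the vocabulary to frequent, reliable tokens; raising `vector_size` trades memory and training time for representational capacity.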
13. The method of claim 1, wherein the machine learning regressor comprises any one of:
a random forest;
a gradient boosted regression tree.
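Claim 13 names random forests and gradient boosted regression trees as candidate regressors. As a self-contained illustration of the latter, here is a minimal 1-D gradient-boosting loop over depth-1 trees (stumps) with squared loss; a production system would instead use a library implementation such as scikit-learn's `GradientBoostingRegressor` or `RandomForestRegressor`:

```python
def fit_gbrt_1d(xs, ys, n_rounds=20, lr=0.5):
    """Fit a 1-D gradient boosted stump ensemble under squared loss.
    Each round fits one threshold split to the current residuals and
    subtracts a learning-rate-scaled step.  Returns a predict function."""
    base = sum(ys) / len(ys)                     # initial constant model
    residual = [y - base for y in ys]
    stumps = []
    for _ in range(n_rounds):
        best = None
        for t in sorted(set(xs)):                # candidate split points
            left = [r for x, r in zip(xs, residual) if x <= t]
            right = [r for x, r in zip(xs, residual) if x > t]
            if not left or not right:
                continue
            lv, rv = sum(left) / len(left), sum(right) / len(right)
            sse = sum((r - (lv if x <= t else rv)) ** 2
                      for x, r in zip(xs, residual))
            if best is None or sse < best[0]:
                best = (sse, t, lv, rv)
        if best is None:
            break
        _, t, lv, rv = best
        step_l, step_r = lr * lv, lr * rv
        stumps.append((t, step_l, step_r))
        residual = [r - (step_l if x <= t else step_r)
                    for x, r in zip(xs, residual)]

    def predict(x):
        return base + sum(sl if x <= t else sr for t, sl, sr in stumps)
    return predict
```

The same loop generalizes to multi-dimensional features by searching splits over every feature, which is what the library implementations do.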
14. The method of claim 1, wherein the method further comprises:
segmenting target labeled data associated with a target number to obtain target word data corresponding to the target labeled data;
vectorizing the target word data and at least one target associated number corresponding to the target number to obtain a target vectorized feature corresponding to the target number;
inputting the target vectorized feature into the age prediction model to obtain predicted age information, corresponding to the target number, output by the age prediction model.
15. A method for age prediction, wherein the method comprises:
segmenting target labeled data associated with a target number to obtain target word data corresponding to the target labeled data;
vectorizing the target word data and at least one target associated number corresponding to the target number to obtain a target vectorized feature corresponding to the target number;
inputting the target vectorized feature into an age prediction model to obtain predicted age information, corresponding to the target number, output by the age prediction model.
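The inference path of claim 15 can be sketched end to end as follows. Mean-pooling the word and number vectors into a single feature vector is an assumption for illustration only — the claims do not specify how the individual vectors are combined — and all three models (word vectors, number vectors, regressor) are assumed pre-trained:

```python
def predict_age(target_number, label_texts, associated_numbers,
                word_vectors, number_vectors, age_model):
    """Claim 15 sketch: segment the target's label data, vectorize the
    words and the associated numbers, pool them into one feature vector
    (mean pooling is an assumption), and feed it to a trained regressor.
    Returns None when no known word or number vector is available."""
    words = [w for text in label_texts for w in text.split()]
    vecs = [word_vectors[w] for w in words if w in word_vectors]
    vecs += [number_vectors[n] for n in associated_numbers
             if n in number_vectors]
    if not vecs:
        return None
    dim = len(vecs[0])
    feature = [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]
    return age_model(feature)
```

Because the feature uses only who labels the number and how, not what any individual label means, the pipeline transfers to other languages without a language-specific understanding step, which is the portability argument made in the abstract.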
16. The method of claim 15, wherein the method further comprises:
determining at least one associated number corresponding to each number according to a plurality of address book data, obtaining labeled data associated with each number, and segmenting the labeled data to obtain word data corresponding to the labeled data;
vectorizing the word data and the at least one associated number to obtain one or more vectorized features corresponding to each number;
performing supervised learning with a preset machine learning regressor according to the one or more vectorized features and an age label corresponding to each number to obtain the age prediction model.
17. A computer device for age prediction, comprising a memory, a processor, and a computer program stored on the memory, characterized in that the processor executes the computer program to implement the steps of the method according to any one of claims 1 to 16.
18. A computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 16.
19. A computer program product comprising a computer program, characterized in that the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 16.
CN202211190701.7A 2022-09-28 2022-09-28 Method, apparatus, medium, and program product for age prediction Pending CN115759326A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211190701.7A CN115759326A (en) 2022-09-28 2022-09-28 Method, apparatus, medium, and program product for age prediction


Publications (1)

Publication Number Publication Date
CN115759326A true CN115759326A (en) 2023-03-07

Family

ID=85350499


Country Status (1)

Country Link
CN (1) CN115759326A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination