CN114185880A - Affiliated industry data determining method and device - Google Patents

Publication number
CN114185880A
Authority
CN
China
Prior art keywords
data
industry
affiliated
behavior
affiliated industry
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111508754.4A
Other languages
Chinese (zh)
Inventor
刘杰辰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jindi Technology Co Ltd
Original Assignee
Beijing Jindi Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jindi Technology Co Ltd
Priority to CN202111508754.4A
Publication of CN114185880A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/285 Clustering or classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/18 Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis


Abstract

The disclosure relates to a method and a device for determining affiliated industry data. The method comprises the following steps: determining a user data record that is missing affiliated industry data, and the attribute data in the user data record; judging whether affiliated industry associated data having an association relationship with the attribute data exists; if it exists, determining the affiliated industry data according to the affiliated industry associated data; if it does not exist, generating the affiliated industry data according to the attribute data. Determining the affiliated industry data from the associated data is equivalent to inferring or predicting the affiliated industry data missing from the user data record; alternatively, the affiliated industry data is generated directly from the attribute data. Either way, the quality and reliability of user data records are improved, which facilitates subsequent data-based applications.

Description

Affiliated industry data determining method and device
Technical Field
The application relates to the technical field of data processing, in particular to a method and a device for determining affiliated industry data, a computer storage medium and electronic equipment.
Background
In big data applications, user data generally needs to be collected. To this end, fields are configured to store the different attribute data of users. During collection, however, network delay or unreliable data sources often leave some fields without the attribute data they were meant to hold. When the data is subsequently structured, any data record containing an empty field is deleted in its entirety, so the fields that did contain data are deleted as well, which reduces the quality and the reliability of the data.
Disclosure of Invention
Embodiments of the present application provide an affiliated industry data determining method and device, a computer storage medium, and an electronic device, so as to overcome or alleviate the above technical problems in the prior art.
The technical scheme adopted by the application is as follows:
An affiliated industry data determination method, comprising:
determining a user data record that is missing affiliated industry data, and the attribute data in the user data record;
judging whether affiliated industry associated data having an association relationship with the attribute data exists;
if it exists, determining the affiliated industry data according to the affiliated industry associated data; and if it does not exist, generating the affiliated industry data according to the attribute data.
Optionally, the attribute data includes enterprise data to which the user belongs; in this case, the judging whether affiliated industry associated data having an association relationship with the attribute data exists includes: judging whether enterprise information data corresponding to the enterprise data to which the user belongs exists;
the determining the affiliated industry data according to the affiliated industry associated data includes: taking the enterprise information data corresponding to the enterprise data to which the user belongs as the affiliated industry associated data, and determining the affiliated industry data according to the affiliated industry associated data.
Optionally, the attribute data includes a unique ID assigned to the corresponding user; in this case, the judging whether affiliated industry associated data having an association relationship with the attribute data exists includes: judging, according to the unique ID assigned to the corresponding user, whether corresponding application usage behavior data exists;
the determining the affiliated industry data according to the affiliated industry associated data includes: taking the enterprise information data corresponding to the application usage behavior data as the affiliated industry associated data, and determining the affiliated industry data according to the affiliated industry associated data.
Optionally, the taking the enterprise information data corresponding to the application usage behavior data as the affiliated industry associated data includes:
determining the behavior timestamp of each piece of application usage behavior data, and selecting the application usage behavior data whose behavior timestamp falls within the set timestamp range, so as to take the corresponding enterprise information data as the affiliated industry associated data.
Optionally, the application usage behavior data includes content browsing behavior data; in this case, the judging, according to the unique ID assigned to the corresponding user, whether corresponding application usage behavior data exists includes: judging, according to the unique ID assigned to the corresponding user, whether content browsing behavior data exists;
the determining the affiliated industry data according to the affiliated industry associated data includes: taking the enterprise information data corresponding to the content browsing behavior data as the affiliated industry associated data, and determining the affiliated industry data according to the affiliated industry associated data.
Optionally, the generating the affiliated industry data according to the attribute data includes: generating predicted affiliated industry data according to the attribute data based on a pre-trained industry classification model.
Optionally, the pre-trained industry classification model comprises a logistic regression model, the logistic regression model comprising a plurality of weight parameter matrices, wherein each weight parameter matrix corresponds to one class of candidate affiliated industry data and comprises a plurality of classification weight values, the number of classification weight values being the same as the dimensionality of the attribute data;
the generating predicted affiliated industry data according to the attribute data based on the pre-trained logistic regression model comprises:
for each weight parameter matrix, multiplying the data of each dimension in the attribute data by the corresponding classification weight value in the weight parameter matrix, and then summing the products to obtain a predicted value;
summing the predicted values corresponding to all the weight parameter matrices to obtain the sum of the predicted values;
calculating, for each weight parameter matrix, the ratio of its predicted value to the sum of the predicted values, and taking the ratio as the probability value that the corresponding candidate affiliated industry data is the affiliated industry data;
and taking the candidate affiliated industry data corresponding to the maximum probability value as the affiliated industry data.
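The scoring steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the applicant's implementation: the function name `predict_industry`, the industry labels, and the weight values are all hypothetical, and each weight parameter matrix is represented as a flat list of classification weights. The sketch follows the text literally by dividing each weighted sum by the sum of all weighted sums; this only yields meaningful probabilities when the predicted values are non-negative, whereas a conventional multinomial logistic regression would exponentiate the values first (softmax).

```python
from typing import Dict, List, Tuple


def predict_industry(
    attributes: List[float],
    weights_by_industry: Dict[str, List[float]],
) -> Tuple[str, Dict[str, float]]:
    """Return the candidate industry with the largest normalized score."""
    # One predicted value per candidate industry: the sum of each attribute
    # dimension multiplied by the matching classification weight.
    scores: Dict[str, float] = {}
    for industry, weights in weights_by_industry.items():
        if len(weights) != len(attributes):
            raise ValueError("weight count must equal the attribute dimensionality")
        scores[industry] = sum(x * w for x, w in zip(attributes, weights))

    total = sum(scores.values())
    if total == 0:
        raise ValueError("sum of predicted values is zero; ratios are undefined")

    # Ratio of each predicted value to the sum, used as the probability value.
    probabilities = {industry: s / total for industry, s in scores.items()}
    best = max(probabilities, key=probabilities.get)
    return best, probabilities


# Hypothetical weights and attribute vector, for illustration only.
WEIGHTS = {
    "finance": [0.5, 0.25, 0.0],
    "manufacturing": [0.0, 0.25, 0.5],
}
best, probs = predict_industry([2.0, 4.0, 0.0], WEIGHTS)
```

With these made-up numbers the weighted sums are 2.0 and 1.0, so the ratios are 2/3 and 1/3 and the first candidate is selected.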
Optionally, the pre-trained industry classification model comprises a decision tree, the decision tree comprising nodes and inter-node branch connecting lines, wherein the nodes comprise a root node, internal nodes and leaf nodes, each inter-node branch connecting line starts from the root node, passes through internal nodes and reaches a leaf node, the root node and each internal node correspond to one dimension in the attribute data, and each leaf node corresponds to one class of candidate affiliated industry data;
the generating predicted affiliated industry data according to the attribute data based on the pre-trained decision tree comprises:
searching the decision tree for the node corresponding to the data of each dimension in the attribute data;
determining, according to the nodes found, the corresponding inter-node branch connecting line and the leaf node located on it;
and taking the candidate affiliated industry data corresponding to the leaf node as the affiliated industry data.
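The decision tree lookup can be sketched as a walk from the root to a leaf. This is an illustrative sketch only: the `Node` and `Leaf` classes, the dimension names (`job_title`, `region`), and the industry labels are hypothetical, and the tree is written by hand rather than trained.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Union


@dataclass
class Leaf:
    industry: str  # candidate affiliated industry data


@dataclass
class Node:
    dimension: str  # attribute dimension tested at this root or internal node
    branches: Dict[str, Union["Node", Leaf]] = field(default_factory=dict)


def classify(tree: Union[Node, Leaf], attributes: Dict[str, str]) -> Optional[str]:
    """Follow the branch matching each attribute value until a leaf is reached."""
    current = tree
    while isinstance(current, Node):
        value = attributes.get(current.dimension)
        # A missing or unseen attribute value has no branch, which ends the walk.
        current = current.branches.get(value)
    return current.industry if isinstance(current, Leaf) else None


# Hypothetical hand-built tree, for illustration only.
TREE = Node(
    dimension="job_title",
    branches={
        "analyst": Node(
            dimension="region",
            branches={"shanghai": Leaf("finance"), "shenzhen": Leaf("technology")},
        ),
        "engineer": Leaf("manufacturing"),
    },
)
```

For example, `classify(TREE, {"job_title": "analyst", "region": "shanghai"})` follows the `analyst` branch and then the `shanghai` branch; returning `None` when no branch matches is a design choice of this sketch, not something the text specifies.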
An affiliated industry data determination device, comprising:
a first processing unit, used for determining a user data record that is missing affiliated industry data, and the attribute data in the user data record;
a second processing unit, used for judging whether affiliated industry associated data having an association relationship with the attribute data exists;
a third processing unit, used for determining the affiliated industry data according to the affiliated industry associated data when the affiliated industry associated data exists, and for generating the affiliated industry data according to the attribute data when the affiliated industry associated data does not exist.
A computer storage medium having stored thereon a computer executable program, the computer executable program being operative to perform a method as in any one of the embodiments of the present application.
An electronic device comprising a memory for storing thereon a computer-executable program and a processor for executing the computer-executable program to implement the method of any of the embodiments of the present application.
A computer program product having stored thereon a computer executable program, the computer executable program being operative to perform a method as in any one of the embodiments of the present application.
According to the embodiment of the application, a user data record that is missing affiliated industry data, and the attribute data in the user data record, are determined; whether affiliated industry associated data having an association relationship with the attribute data exists is judged; if it exists, the affiliated industry data is determined according to the affiliated industry associated data; if it does not exist, the affiliated industry data is generated according to the attribute data. Determining the affiliated industry data based on the affiliated industry associated data is equivalent to inferring or predicting the affiliated industry data missing from the user data record; alternatively, the affiliated industry data is generated directly from the attribute data. In this way, the quality and reliability of user data records are improved, which facilitates subsequent data-based applications.
Drawings
FIG. 1A is a schematic flow chart of an affiliated industry data determination method in an embodiment of the present application;
FIG. 1B is a schematic flow chart of generating predicted affiliated industry data based on a logistic regression model in an embodiment of the present application;
FIG. 1C is a schematic flow chart of generating predicted affiliated industry data based on a decision tree in an embodiment of the present application;
FIG. 2 is a schematic structural diagram of an affiliated industry data determination device in an embodiment of the present application;
FIG. 3 is a schematic structural diagram of an electronic device in an embodiment of the present application.
Detailed Description
To make the technical problems to be solved by the present application, and its technical solutions and advantages, clearer, the following detailed description is made with reference to the accompanying drawings and specific embodiments.
An example scenario for the present application is as follows. For an application that provides queries of enterprise or natural-person information, a user portrait is constructed from the collected user data records, but the data for some data field has not been collected (for example, the affiliated industry data of the user is missing), so the constructed user portrait has low accuracy. Alternatively, when a model is to be trained on the collected user data records, the data for that field has not been collected and therefore cannot be used during training, which results in low model accuracy.
According to the embodiment of the application, a user data record that is missing affiliated industry data, and the attribute data in the user data record, are determined; whether affiliated industry associated data having an association relationship with the attribute data exists is judged; if it exists, the affiliated industry data is determined according to the affiliated industry associated data; if it does not exist, the affiliated industry data is generated according to the attribute data. Determining the affiliated industry data based on the affiliated industry associated data is equivalent to inferring or predicting the affiliated industry data missing from the user data record; alternatively, the affiliated industry data is generated directly from the attribute data. In this way, the quality and reliability of user data records are improved, which facilitates subsequent data-based applications such as user portrait construction or model training.
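The overall control flow described above can be sketched as a single function. This is a rough illustration under stated assumptions: the function `fill_missing_industry`, the field name `industry`, the `company` attribute, and the three callables passed in are all hypothetical stand-ins for whatever lookup and prediction components an implementation would actually use.

```python
from typing import Callable, Dict, Optional

Record = Dict[str, object]


def fill_missing_industry(
    record: Record,
    find_associated: Callable[[Record], Optional[object]],
    derive_from_associated: Callable[[object], str],
    predict_from_attributes: Callable[[Record], str],
    industry_field: str = "industry",
) -> Record:
    """Fill the industry field of one record, leaving complete records untouched."""
    if record.get(industry_field):
        return record  # the record is not missing its affiliated industry data

    # Attribute data: every other field of the record.
    attributes = {k: v for k, v in record.items() if k != industry_field}

    associated = find_associated(attributes)  # does associated data exist?
    if associated is not None:
        industry = derive_from_associated(associated)  # determine from associated data
    else:
        industry = predict_from_attributes(attributes)  # generate from attribute data
    return {**record, industry_field: industry}


# Hypothetical enterprise information store and stand-in model, for illustration.
ENTERPRISE_INFO = {"Acme Ltd": {"industry": "manufacturing"}}


def lookup(attrs: Record) -> Optional[object]:
    return ENTERPRISE_INFO.get(attrs.get("company"))


def from_info(info: object) -> str:
    return info["industry"]


def stand_in_model(attrs: Record) -> str:
    return "retail"  # placeholder for a trained industry classification model


filled_a = fill_missing_industry(
    {"user_id": "u1", "company": "Acme Ltd", "industry": None},
    lookup, from_info, stand_in_model,
)
filled_b = fill_missing_industry(
    {"user_id": "u2", "company": "Unknown Co", "industry": None},
    lookup, from_info, stand_in_model,
)
filled_c = fill_missing_industry(
    {"user_id": "u3", "company": "Acme Ltd", "industry": "finance"},
    lookup, from_info, stand_in_model,
)
```

Passing the lookup and prediction steps in as callables keeps the sketch independent of any particular database or model.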
In the following embodiments of the present application, an execution subject of the industry data determination method may be a background server.
FIG. 1A is a schematic flow chart of an affiliated industry data determination method in an embodiment of the present application. As shown in FIG. 1A, the method includes:
S101, determining a user data record that is missing affiliated industry data, and the attribute data in the user data record;
Optionally, in this embodiment, an affiliated industry field is set in the user data record of each user, and whether a user data record is missing affiliated industry data is determined by checking whether the value of its affiliated industry field is valid.
Specifically, in one application scenario, it is stipulated that if the affiliated industry data is not collected when the user data record is created, the value of the affiliated industry field is left null. When step S101 is performed, a null value in the affiliated industry field therefore indicates that the corresponding user data record is missing affiliated industry data. Alternatively, in another scenario, the value in the affiliated industry field is not a valid value, for example because it does not conform to a predefined rule, and the corresponding user data record can likewise be determined to be missing affiliated industry data.
In this embodiment, the user data records may be stored in a predetermined database, with one user data record per user, so that multiple users correspond to multiple user data records. When step S101 is executed, each user data record in the predetermined database is traversed to identify every record missing affiliated industry data; all such records form a data set (also referred to as a wide data table), and the data set is then traversed so that steps S102-S103A described below are executed for each user data record in it.
It should be noted that, alternatively, steps S102-S103A described below may be executed each time the traversal of the predetermined database in step S101 identifies a user data record missing affiliated industry data; in that case, the above data set does not need to be formed.
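The two ways of carrying out step S101 can be sketched as follows. This is a minimal sketch under assumptions: the field name `industry`, the set `VALID_INDUSTRIES` standing in for the "predefined rule", and the sample records are all hypothetical.

```python
from typing import Dict, Iterable, Iterator, Optional, Set

Record = Dict[str, Optional[str]]

# Hypothetical predefined rule: a value is valid only if it is a known label.
VALID_INDUSTRIES: Set[str] = {"finance", "manufacturing", "technology"}


def is_industry_missing(record: Record, field: str = "industry") -> bool:
    """A record is missing its industry if the field is null or not a valid value."""
    value = record.get(field)
    if value is None or value == "":
        return True
    return value not in VALID_INDUSTRIES


def iter_missing(records: Iterable[Record]) -> Iterator[Record]:
    """Yield records one at a time, so later steps can run without a data set."""
    for record in records:
        if is_industry_missing(record):
            yield record


# Hypothetical user data records, for illustration only.
RECORDS = [
    {"user_id": "u1", "industry": "finance"},
    {"user_id": "u2", "industry": None},
    {"user_id": "u3", "industry": "???"},
    {"user_id": "u4", "industry": ""},
]

# Batch variant: materialize the data set of incomplete records first.
missing_set = list(iter_missing(RECORDS))
missing_ids = [r["user_id"] for r in missing_set]
```

Consuming `iter_missing` directly in a loop corresponds to the streaming variant, in which each incomplete record is processed as soon as it is identified.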
S102, judging whether affiliated industry associated data having an association relationship with the attribute data exists;
In this embodiment, the affiliated industry associated data may be stored in a database and includes any data that can directly or indirectly reflect the affiliated industry data.
In this embodiment, step S102 may involve, for example, at least one of the following cases:
(I) Case one
Optionally, the attribute data includes enterprise data to which the user belongs; in this case, judging whether affiliated industry associated data having an association relationship with the attribute data exists includes: judging whether enterprise information data corresponding to the enterprise data to which the user belongs exists.
Illustratively, the enterprise data and the enterprise information data are stored in one-to-one correspondence in the corresponding databases, that is, one enterprise identifier corresponds to one piece of enterprise information data. If one enterprise identifier corresponds to multiple pieces of enterprise information data, they are merged so that only one piece of enterprise information data is kept. The structure of the enterprise information data is not particularly limited here and may be any structure applicable to the present application.
(II) Case two
Optionally, the attribute data includes a unique ID assigned to the corresponding user; in this case, judging whether affiliated industry associated data having an association relationship with the attribute data exists includes: judging, according to the unique ID assigned to the corresponding user, whether corresponding application usage behavior data exists.
For example, the unique ID assigned to the corresponding user is used to search for the corresponding application usage behavior data. If such behavior data is found, the enterprise information data corresponding to it is directly used as the affiliated industry associated data in the subsequent step S103A, and the affiliated industry data is determined by analyzing the enterprise information data.
In addition, considering that there may be many pieces of application usage behavior data, in order to reduce the data volume and speed up processing, step S102 may judge, based on a set timestamp and according to the unique ID assigned to the corresponding user, whether corresponding application usage behavior data exists.
Specifically, this judgment may include: determining, based on the set timestamp, the behavior timestamp range within which the existence of application usage behavior data is judged; and judging whether application usage behavior data corresponding to the unique ID exists within that behavior timestamp range. For example, in one application scenario the set timestamp is the timestamp at which step S101 starts, also referred to as the start timestamp of executing this embodiment. This ensures that the application usage behavior data is timely, so that the affiliated industry data obtained after step S103A is more accurate.
Specifically, in this embodiment, determining the behavior timestamp range based on the set timestamp may include: determining the behavior timestamp range based on the set timestamp and a set time span unit.
Illustratively, the time span unit is, for example, one day, and the behavior timestamp range is given by the set timestamp plus a multiple of the time span unit.
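The timestamp range computation can be sketched as follows. The text does not state whether the multiple of the time span unit extends the range forward or backward from the set timestamp, so this sketch accepts a signed count and orders the two endpoints; that choice, the function names, and the record layout (`user_id`, `timestamp`, `enterprise`) are all assumptions made for illustration.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Tuple

Behavior = Dict[str, object]


def behavior_window(
    set_timestamp: datetime, span_unit: timedelta, span_count: int
) -> Tuple[datetime, datetime]:
    """Range between the set timestamp and the set timestamp plus N span units."""
    other = set_timestamp + span_count * span_unit
    return (min(set_timestamp, other), max(set_timestamp, other))


def behaviors_in_window(
    behaviors: List[Behavior], user_id: str, window: Tuple[datetime, datetime]
) -> List[Behavior]:
    """Keep only this user's behavior data whose timestamp falls in the range."""
    start, end = window
    return [
        b for b in behaviors
        if b["user_id"] == user_id and start <= b["timestamp"] <= end
    ]


# Hypothetical data: a one-day unit and a negative count look back 30 days.
WINDOW = behavior_window(datetime(2021, 12, 10), timedelta(days=1), -30)
BEHAVIORS = [
    {"user_id": "u1", "timestamp": datetime(2021, 12, 1), "enterprise": "Acme Ltd"},
    {"user_id": "u1", "timestamp": datetime(2021, 10, 1), "enterprise": "Old Co"},
    {"user_id": "u2", "timestamp": datetime(2021, 12, 5), "enterprise": "Other Co"},
]
recent = behaviors_in_window(BEHAVIORS, "u1", WINDOW)
```

Judging whether behavior data exists then reduces to checking whether `recent` is non-empty.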
Further, the application usage behavior data may include content browsing behavior data; in this case, judging, according to the unique ID assigned to the corresponding user, whether corresponding application usage behavior data exists includes: judging, according to the unique ID assigned to the corresponding user, whether content browsing behavior data exists;
Illustratively, in this embodiment, the content browsing behavior data includes behavior data corresponding to at least one of the browsing, searching, monitoring and following behaviors performed by the user.
For example, the unique ID assigned to the corresponding user is used to search for content browsing behavior data. If application usage behavior data is found (for example, behavior data corresponding to a browsing, searching, monitoring or following behavior of the user), the corresponding enterprise information data is directly used as the affiliated industry associated data in the subsequent step S103A, and the affiliated industry data is determined by analyzing the enterprise information data.
In addition, considering that multiple pieces of content browsing behavior data may be found, corresponding either to the same behavior of the user or to different behaviors, step S102 may judge, based on a set timestamp and according to the unique ID assigned to the corresponding user, whether content browsing behavior data corresponding to all behaviors exists, where the content browsing behavior data includes behavior data corresponding to one or more of the browsing, searching, monitoring and following behaviors performed by the user.
Specifically, judging, based on the set timestamp and according to the unique ID, whether content browsing behavior data corresponding to all behaviors exists may include: determining, based on the set timestamp, the behavior timestamp range within which the existence of behavior data is judged; and judging whether content browsing behavior data corresponding to all behaviors associated with the unique ID exists within that behavior timestamp range. For the timestamp range, refer to the exemplary description above.
Of course, in another embodiment, the content browsing behavior data may be searched according to priority. For example, the priorities of the behavior data corresponding to the monitoring, following, searching and browsing behaviors decrease in that order. The behavior data corresponding to the monitoring behavior is therefore searched first; if it is found, the subsequent step S103A is executed; if not, the behavior data corresponding to the following behavior is searched. If that is found, step S103A is executed; if not, the behavior data corresponding to the searching behavior is searched. If that is found, step S103A is executed; if not, the behavior data corresponding to the browsing behavior is searched. If that is found, step S103A is executed; if not, the process jumps back to searching for the behavior data corresponding to the monitoring behavior. In one case, if the number of searches reaches a set threshold, it is determined that no behavior data has been found, the search process ends, step S103A is not executed, and the missing affiliated industry data is left in the missing state.
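The priority-ordered search with a search-count threshold can be sketched as follows. This is an illustration under assumptions: the behavior type names, the `lookup` callable, and the in-memory `STORE` are hypothetical, and the threshold is assumed to count individual lookups rather than complete rounds, which the text leaves open.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Priority decreases from left to right.
PRIORITY = ("monitor", "follow", "search", "browse")


def find_by_priority(
    lookup: Callable[[str, str], List[dict]],
    user_id: str,
    max_lookups: int,
) -> Optional[List[dict]]:
    """Search behavior types in priority order, cycling until the threshold."""
    lookups = 0
    while lookups < max_lookups:
        for behavior in PRIORITY:
            if lookups >= max_lookups:
                break
            lookups += 1
            found = lookup(user_id, behavior)
            if found:
                return found  # proceed to step S103A with this behavior data
    return None  # threshold reached: the industry data stays missing


# Hypothetical in-memory behavior store, for illustration only.
STORE: Dict[Tuple[str, str], List[dict]] = {
    ("u1", "search"): [{"enterprise": "Acme Ltd"}],
    ("u1", "browse"): [{"enterprise": "Other Co"}],
}


def store_lookup(user_id: str, behavior: str) -> List[dict]:
    return STORE.get((user_id, behavior), [])


found_u1 = find_by_priority(store_lookup, "u1", max_lookups=8)
found_u2 = find_by_priority(store_lookup, "u2", max_lookups=8)
```

Here the searching behavior data is returned for `u1` because no monitoring or following data exists, and the lower-priority browsing data is never consulted.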
If yes, go to step S103A: determining the affiliated industry data according to the affiliated industry associated data;
Corresponding to the possible cases in step S102, step S103A may likewise involve at least one of the following cases:
(I) Case one
Optionally, determining the affiliated industry data according to the affiliated industry associated data includes: taking the enterprise information data corresponding to the enterprise data to which the user belongs as the affiliated industry associated data, and determining the affiliated industry data according to the affiliated industry associated data.
As described above, the enterprise data and the enterprise information data are in one-to-one correspondence, so exactly one piece of enterprise information data is identified. Since this single piece of enterprise information data is directly used as the affiliated industry associated data, the affiliated industry data can be determined directly by analyzing it and, for example, filled into the affiliated industry field of the user data record. Alternatively, if the enterprise information data itself contains affiliated industry data, that data is taken as the affiliated industry data missing from the user data record, for example by filling it directly into the affiliated industry field of the user data record.
(II) Case two
Optionally, determining the affiliated industry data according to the affiliated industry associated data includes: taking the enterprise information data corresponding to the application usage behavior data as the affiliated industry associated data, and determining the affiliated industry data according to the affiliated industry associated data. For example, if only one piece of application usage behavior data is found, the corresponding enterprise information data is directly used as the affiliated industry associated data, and the affiliated industry data is determined by analyzing the enterprise information data and, for example, filled into the affiliated industry field of the user data record. Alternatively, if the enterprise information data itself contains affiliated industry data, that data is taken as the affiliated industry data missing from the user data record, for example by filling it directly into the affiliated industry field of the user data record.
Considering that multiple pieces of application usage behavior data may be found, if the search for application usage behavior data in step S102 was not performed based on the timestamp, then taking the enterprise information data corresponding to the application usage behavior data as the affiliated industry associated data in step S103A further includes: determining the behavior timestamp of each piece of application usage behavior data, and selecting the application usage behavior data whose behavior timestamp falls within the set timestamp range, so as to take the corresponding enterprise information data as the affiliated industry associated data.
This approach applies when multiple pieces of application usage behavior data are associated with the unique ID: part of the application usage behavior data is selected based on the behavior timestamps and the set timestamp range, which ensures the timeliness of the affiliated industry associated data and makes the affiliated industry data obtained after step S103A more accurate.
Specifically, in step S103A, taking the enterprise information data corresponding to the application usage behavior data as the affiliated industry associated data further includes: determining, based on the set timestamp, the behavior timestamp range used for screening the application usage behavior data.
Specifically, in this embodiment, determining the behavior timestamp range based on the set timestamp may include: determining the behavior timestamp range used for screening the application usage behavior data based on the set timestamp and the set time span unit.
When there are multiple pieces of application usage behavior data, there are correspondingly multiple pieces of enterprise information data. For this reason, in this embodiment, determining the affiliated industry data according to the affiliated industry associated data includes: determining the affiliated industry data corresponding to each piece of enterprise information data serving as affiliated industry associated data, performing frequency statistics on the determined affiliated industry data, and taking the affiliated industry data with the highest frequency as the affiliated industry data missing from the user data record.
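The frequency statistics step can be sketched with a counter over the industries of the associated enterprises. The function name, the `industry` key, and the sample data are hypothetical, and the text does not say how ties are resolved; this sketch keeps whichever of the tied industries was encountered first.

```python
from collections import Counter
from typing import Dict, List, Optional


def most_frequent_industry(enterprise_infos: List[Dict[str, str]]) -> Optional[str]:
    """Return the industry that occurs most often among the enterprise records."""
    counts = Counter(
        info["industry"] for info in enterprise_infos if info.get("industry")
    )
    if not counts:
        return None  # no enterprise record carries an industry value
    return counts.most_common(1)[0][0]


# Hypothetical enterprise information data linked to one user's behavior data.
INFOS = [
    {"enterprise": "Acme Ltd", "industry": "manufacturing"},
    {"enterprise": "Bolt Co", "industry": "manufacturing"},
    {"enterprise": "Coin Bank", "industry": "finance"},
]
chosen = most_frequent_industry(INFOS)
```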
Illustratively, if the plurality of pieces of behavior data of the usage application include: and monitoring, paying attention to, searching and browsing behavior data corresponding to behaviors, considering the behavior data corresponding to the behaviors as a whole, and carrying out frequency statistics on the affiliated industry data so as to take the affiliated industry data with the maximum frequency as the affiliated industry data missing from the user data record.
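The overall frequency statistics can be sketched as below, assuming a hypothetical mapping `industry_of` from enterprise name to affiliated industry data (standing in for the enterprise information data) and a list of enterprises gathered from all behavior types together.

```python
from collections import Counter


def most_frequent_industry(enterprises, industry_of):
    """Return the industry that occurs most often among the given enterprises.

    Enterprises without a known industry are skipped; None is returned when
    no industry can be determined at all.
    """
    counts = Counter(industry_of[e] for e in enterprises if e in industry_of)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# Hypothetical enterprise information data and behavior-derived enterprises.
industry_of = {"A Co": "software", "B Co": "finance", "C Co": "software"}
enterprises = ["A Co", "B Co", "C Co", "Unknown Co"]

filled = most_frequent_industry(enterprises, industry_of)
```

When two industries are tied, `Counter.most_common` returns the one encountered first; the source does not specify a tie-breaking rule.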
Alternatively, if the behavior data of the plurality of using applications includes: and monitoring, concerning, searching and browsing behavior data corresponding to the behaviors, respectively carrying out frequency statistics on corresponding affiliated industry data according to the behavior data corresponding to the behaviors, setting weights according to the frequencies of the affiliated industry data counted based on the behavior data corresponding to the behaviors, and referring to the principle of priority setting, specifically, sequentially decreasing the weight values, carrying out product on the frequency of the corresponding affiliated industry data and the weights thereof according to each type of behavior data to obtain a product result, and selecting the affiliated industry data corresponding to the largest product result as the affiliated industry data missing from the user data record.
Here, the principle of setting the priority and the weight ranking is that the affiliated industry data corresponding to the behavior data corresponding to the monitoring, focusing, searching and browsing behaviors are sequentially decreased in the accuracy of reflecting the affiliated industry data.
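A minimal sketch of the weighted variant follows. The weight values (4, 3, 2, 1), the behavior type names and the record layout are assumptions made for illustration; the source only requires that the weights decrease from monitoring to browsing. The sketch compares the product of frequency and weight per behavior type and per industry, and returns the industry with the largest single product.

```python
from collections import Counter

# Assumed weights, decreasing in the priority order given in the text.
WEIGHTS = {"monitor": 4.0, "follow": 3.0, "search": 2.0, "browse": 1.0}


def weighted_industry(records, weights=WEIGHTS):
    """records: iterable of (behavior_type, industry) pairs.

    Counts industries separately per behavior type, multiplies each count by
    the weight of its behavior type, and returns the industry with the
    largest product (None for empty input).
    """
    per_type = {}
    for behavior, industry in records:
        per_type.setdefault(behavior, Counter())[industry] += 1

    best, best_score = None, float("-inf")
    for behavior, counts in per_type.items():
        for industry, freq in counts.items():
            score = freq * weights.get(behavior, 0.0)
            if score > best_score:
                best, best_score = industry, score
    return best


# Three browsing hits on retail (3 * 1.0) versus one monitoring hit on
# software (1 * 4.0): the monitored industry wins under these weights.
records = [("browse", "retail")] * 3 + [("monitor", "software")]
chosen = weighted_industry(records)
```

Summing the products of one industry across behavior types would be another reading of the text; the sketch does not do that.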
If not, step S103B is executed: generating the affiliated industry data according to the attribute data.
Specifically, in this embodiment, generating the affiliated industry data according to the attribute data includes: generating predicted affiliated industry data according to the attribute data based on a pre-trained industry classification model.
For example, FIG. 1B is a schematic flow chart of generating predicted affiliated industry data based on a logistic regression model in an embodiment of the present application. Specifically, the pre-trained industry classification model comprises a logistic regression model, and the logistic regression model comprises a plurality of weight parameter matrices, where one weight parameter matrix corresponds to one class of alternative affiliated industry data, each weight parameter matrix comprises a plurality of classification weight values, and the number of classification weight values is the same as the number of dimensions of the attribute data. Accordingly, as shown in FIG. 1B, generating the predicted affiliated industry data according to the attribute data based on the pre-trained logistic regression model includes:
S113B, for each weight parameter matrix, multiplying the data of each dimension in the attribute data by the corresponding classification weight value in the weight parameter matrix, and then performing summation operation to obtain a predicted value;
S123B, performing summation operation on the predicted values corresponding to all the weight parameter matrixes to obtain the sum of the predicted values;
S133B, calculating the ratio of the predicted value corresponding to each weight parameter matrix to the sum of the predicted values, and taking the ratio as the probability value that the corresponding alternative affiliated industry data is the affiliated industry data;
S143B, taking the alternative affiliated industry data corresponding to the maximum probability value as the affiliated industry data.
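Steps S113B to S143B can be sketched as follows. The function name, the dictionary layout of the weights and the numeric values are hypothetical; the attribute data is assumed to have already been encoded as numbers.

```python
def predict_industry(attributes, weight_rows):
    """attributes: list of numeric attribute values, one per dimension.
    weight_rows: dict mapping each alternative industry to its weight values.

    Returns (best_industry, probabilities).
    """
    # S113B: weighted sum of the attribute dimensions for each industry.
    scores = {
        industry: sum(x * w for x, w in zip(attributes, weights))
        for industry, weights in weight_rows.items()
    }
    # S123B: sum of all predicted values.
    total = sum(scores.values())
    if total == 0:
        raise ValueError("sum of predicted values is zero; ratio is undefined")
    # S133B: ratio of each predicted value to the sum.
    probabilities = {industry: s / total for industry, s in scores.items()}
    # S143B: the industry with the largest probability value.
    best = max(probabilities, key=probabilities.get)
    return best, probabilities


weight_rows = {"software": [0.5, 1.0], "finance": [0.5, 0.25]}
best, probabilities = predict_industry([1.0, 2.0], weight_rows)
```

The plain ratio only behaves like a probability when every predicted value is non-negative. A standard Softmax model exponentiates each predicted value before normalizing, which removes that restriction.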
Here, to ensure the accuracy of the affiliated industry data obtained in the above process, the logistic regression model is pre-trained using a plurality of attribute data samples and a corresponding plurality of affiliated industry data labels (an affiliated industry data label is the accurate affiliated industry data corresponding to an attribute data sample), with one weight parameter matrix corresponding to one class of alternative affiliated industry data sample. At the start of training, the classification weight values in each weight parameter matrix are preset random initial values. The pre-training process is as follows:
for each weight parameter matrix, multiplying each dimension data sample in each attribute data sample by the corresponding classification weight value in the weight parameter matrix, and then performing a summation operation to obtain a predicted value;
summing the predicted values corresponding to all the weight parameter matrices to obtain the sum of the predicted values;
calculating the ratio of the predicted value corresponding to each weight parameter matrix to the sum of the predicted values, to determine the probability value that the corresponding alternative affiliated industry data sample is the affiliated industry data label;
taking the alternative affiliated industry data sample corresponding to the maximum probability value as the predicted affiliated industry data sample, and comparing the predicted affiliated industry data sample with the corresponding affiliated industry data label;
and adjusting, according to the comparison result, the corresponding classification weight values in the weight parameter matrices until the loss between the predicted affiliated industry data sample and the corresponding affiliated industry data label is minimized, thereby completing the training of the logistic regression model.
Specifically, in one application scenario, the logistic regression model is, for example, a Softmax model.
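As an illustration of such training, the sketch below fits a small Softmax model by stochastic gradient descent on the cross-entropy loss. The loss function, learning rate, epoch count and toy data are assumptions; the source states only that the weights start from random values and are adjusted until the loss is minimized.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def train_softmax(samples, labels, n_classes, lr=0.5, epochs=200, seed=0):
    """Return one weight row per class, trained with cross-entropy SGD."""
    rng = random.Random(seed)
    dim = len(samples[0])
    # Preset random initial values for every classification weight.
    weights = [[rng.uniform(-0.01, 0.01) for _ in range(dim)]
               for _ in range(n_classes)]
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            scores = [sum(w * v for w, v in zip(row, x)) for row in weights]
            probs = softmax(scores)
            for k in range(n_classes):
                # Gradient of cross-entropy with respect to the class score.
                grad = probs[k] - (1.0 if k == y else 0.0)
                for j in range(dim):
                    weights[k][j] -= lr * grad * x[j]
    return weights


def predict(weights, x):
    """Index of the class with the largest score."""
    scores = [sum(w * v for w, v in zip(row, x)) for row in weights]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy attribute data samples (two dimensions) and industry labels (0 or 1).
samples = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = [0, 0, 1, 1]
weights = train_softmax(samples, labels, n_classes=2)
predictions = [predict(weights, x) for x in samples]
```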
For example, FIG. 1C is a schematic flow chart of generating predicted affiliated industry data based on a decision tree in an embodiment of the present application. Specifically, the pre-trained industry classification model includes a decision tree, and the decision tree includes a plurality of nodes and inter-node branch connecting lines. The nodes comprise root nodes, internal nodes and leaf nodes; each inter-node branch connecting line starts from a root node, passes through internal nodes and reaches a leaf node; each root node and each internal node corresponds to one dimension of the attribute data; and one leaf node corresponds to one class of alternative affiliated industry data. Specifically, generating the predicted affiliated industry data according to the attribute data based on the pre-trained decision tree includes:
S113C, searching a node corresponding to the data of each dimension in the attribute data in the decision tree;
S123C, determining corresponding inter-node branch connecting lines and leaf nodes positioned on the inter-node branch connecting lines according to the searched nodes;
S133C, taking the alternative affiliated industry data corresponding to the leaf node as the affiliated industry data.
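Steps S113C to S133C amount to walking the tree from the root to a leaf. The sketch below assumes categorical attribute values and a hypothetical `Node`/`Leaf` representation in which each branch is keyed by an attribute value; none of these names come from the source.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Union


@dataclass
class Leaf:
    industry: str  # alternative affiliated industry data for this leaf


@dataclass
class Node:
    dimension: str  # attribute dimension tested at this node
    branches: Dict[str, Union["Node", Leaf]] = field(default_factory=dict)


def classify(tree: Union[Node, Leaf], attributes: Dict[str, str]) -> Optional[str]:
    """Follow the branch matching each tested dimension until a leaf is reached.

    Returns None when the attribute value has no matching branch.
    """
    node = tree
    while isinstance(node, Node):
        value = attributes.get(node.dimension)
        if value not in node.branches:
            return None
        node = node.branches[value]
    return node.industry


# Hypothetical tree over the dimensions "occupation" and "education".
tree = Node("occupation", {
    "engineer": Node("education", {
        "computer science": Leaf("software"),
        "civil engineering": Leaf("construction"),
    }),
    "teacher": Leaf("education"),
})

result = classify(tree, {"occupation": "engineer", "education": "computer science"})
```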
Here, to ensure the accuracy of the affiliated industry data obtained in the above process, the decision tree is pre-trained using a plurality of attribute data samples and a corresponding plurality of affiliated industry data labels (an affiliated industry data label is the accurate affiliated industry data corresponding to a user data record sample). At the start of training, each root node and each internal node of the decision tree randomly corresponds to one dimension of the attribute data samples, and each leaf node randomly corresponds to one affiliated industry data. The pre-training process is similar to the above process and is as follows:
searching a node corresponding to the data of each dimension in each attribute data sample in the decision tree;
determining corresponding inter-node branch connecting lines and leaf nodes positioned on the inter-node branch connecting lines according to the searched nodes;
taking the affiliated industry data corresponding to the leaf nodes as a predicted affiliated industry data sample, and comparing the affiliated industry data sample with the corresponding affiliated industry data label;
and adjusting, according to the comparison result, the dimensions corresponding to the root node and the internal nodes and the affiliated industry data corresponding to the leaf nodes, until the information entropy between the predicted affiliated industry data sample and the corresponding affiliated industry data label is maximized, thereby completing the training of the decision tree.
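The source does not spell out how the entropy criterion is evaluated. One common way to choose which dimension a node should test is information gain, as in ID3-style decision trees; the sketch below shows that technique as an assumption, not as the procedure of the application. The sample layout and function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a non-empty list of labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def information_gain(samples, labels, dimension):
    """Entropy reduction obtained by splitting the samples on one dimension."""
    groups = {}
    for sample, label in zip(samples, labels):
        groups.setdefault(sample[dimension], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def best_dimension(samples, labels, dimensions):
    """Dimension whose split yields the largest information gain."""
    return max(dimensions, key=lambda d: information_gain(samples, labels, d))


# Toy attribute data samples and their affiliated industry data labels.
samples = [
    {"occupation": "engineer", "gender": "f"},
    {"occupation": "engineer", "gender": "m"},
    {"occupation": "teacher", "gender": "f"},
    {"occupation": "teacher", "gender": "m"},
]
labels = ["software", "software", "education", "education"]

split = best_dimension(samples, labels, ["occupation", "gender"])
```

In this toy data, occupation separates the labels perfectly while gender carries no information, so occupation is the dimension selected for the node.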
Here, the affiliated industry data may also be generated using both a decision tree and a logistic regression model, and the better of the generated results may be selected as the affiliated industry data missing from the user data record.
In addition, it should be noted that the attribute data samples used in model training are not particularly limited and may include, but are not limited to, specific data in dimensions such as gender, age, occupation, residence and educational experience. For model training, the more dimensions the attribute data samples have, the higher the accuracy of the model and, correspondingly, the higher the accuracy of the generated affiliated industry data. Likewise, the attribute data is not particularly limited.
For example, in this embodiment, the affiliated industry data may be an industry name or a unique industry ID corresponding to that industry.
FIG. 2 is a schematic structural diagram of an affiliated industry data determination device in an embodiment of the present application; as shown in FIG. 2, the device includes:
the first processing unit 201 is configured to determine a user data record that is missing affiliated industry data, and attribute data in the user data record;
the second processing unit 202 is configured to determine whether there is industry related data having an association relationship with the attribute data;
the third processing unit 203 is configured to determine the affiliated industry data according to the affiliated industry related data when the affiliated industry related data exists; and generating the affiliated industry data according to the attribute data when the affiliated industry related data does not exist.
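The cooperation of the three processing units can be sketched as a single function. The record layout, the `ENTERPRISE_INFO` table and the placeholder `predict` function are hypothetical; in practice the prediction step would call a pre-trained industry classification model.

```python
# Hypothetical enterprise information data keyed by enterprise name.
ENTERPRISE_INFO = {"A Co": {"industry": "software"}}


def lookup_related(attributes):
    """Second unit: find affiliated industry related data, or None."""
    return ENTERPRISE_INFO.get(attributes.get("enterprise"))


def predict(attributes):
    """Placeholder for a pre-trained industry classification model."""
    return "unknown"


def fill_missing_industry(records):
    """Fill the industry field of every record in which it is missing."""
    for record in records:
        # First unit: select records missing the industry, and their attributes.
        if record.get("industry"):
            continue
        attributes = {k: v for k, v in record.items() if k != "industry"}
        # Second unit: check whether related data exists.
        related = lookup_related(attributes)
        # Third unit: derive the industry from related data, else predict it.
        if related is not None:
            record["industry"] = related["industry"]
        else:
            record["industry"] = predict(attributes)
    return records


records = fill_missing_industry([
    {"user_id": 1, "enterprise": "A Co", "industry": None},
    {"user_id": 2, "enterprise": "Z Co", "industry": None},
    {"user_id": 3, "enterprise": "B Co", "industry": "finance"},
])
industries = [r["industry"] for r in records]
```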
Optionally, if the attribute data includes the enterprise data to which the user belongs, the second processing unit is specifically configured to determine whether enterprise information data corresponding to the enterprise data to which the user belongs exists;
if it exists, the third processing unit is specifically configured to take the enterprise information data corresponding to the enterprise data to which the user belongs as the affiliated industry related data, and determine the affiliated industry data according to the affiliated industry related data.
Optionally, if the attribute data includes a unique ID assigned to the corresponding user, the second processing unit is specifically configured to determine, according to the unique ID assigned to the corresponding user, whether corresponding behavior data of using the application program exists;
if it exists, the third processing unit is specifically configured to take the enterprise information data corresponding to the behavior data of using the application program as the affiliated industry related data, and determine the affiliated industry data according to the affiliated industry related data.
Optionally, the third processing unit is specifically configured to determine the behavior timestamp corresponding to the behavior data of using the application program, and select the behavior data whose behavior timestamp falls within a set timestamp range, so as to take the corresponding enterprise information data as the affiliated industry related data.
Optionally, if the behavior data of using the application program includes content browsing behavior data, the second processing unit is specifically configured to determine whether content browsing behavior data exists according to the unique ID assigned to the corresponding user;
and if the content browsing behavior data exists, the third processing unit is specifically configured to use enterprise information data corresponding to the content browsing behavior data as the affiliated industry related data, and determine the affiliated industry data according to the affiliated industry related data.
Optionally, the third processing unit is specifically configured to generate the predicted industry data according to the attribute data based on a pre-trained industry classification model.
Optionally, the pre-trained industry classification model comprises a logistic regression model, and the logistic regression model comprises a plurality of weight parameter matrices, where one weight parameter matrix corresponds to one class of alternative affiliated industry data, each weight parameter matrix comprises a plurality of classification weight values, and the number of classification weight values is the same as the number of dimensions of the attribute data;
wherein the third processing unit is specifically configured to:
for each weight parameter matrix, multiplying the data of each dimension in the attribute data by the corresponding classification weight value in the weight parameter matrix, and then performing summation operation to obtain a predicted value;
summing the predicted values corresponding to all the weight parameter matrixes to obtain the sum of the predicted values;
calculating the ratio of the predicted value corresponding to each weight parameter matrix to the sum of the predicted values, and taking the ratio as the probability value that the corresponding alternative affiliated industry data is the affiliated industry data;
and taking the alternative affiliated industry data corresponding to the maximum probability value as the affiliated industry data.
Optionally, the pre-trained industry classification model comprises a decision tree, and the decision tree comprises a plurality of nodes and inter-node branch connecting lines, where the nodes comprise root nodes, internal nodes and leaf nodes, each inter-node branch connecting line starts from a root node, passes through internal nodes and reaches a leaf node, each root node and each internal node corresponds to one dimension of the attribute data, and one leaf node corresponds to one class of alternative affiliated industry data;
wherein the third processing unit is specifically configured to:
searching a node corresponding to the data of each dimension in the attribute data in the decision tree;
determining corresponding inter-node branch connecting lines and leaf nodes positioned on the inter-node branch connecting lines according to the searched nodes;
and taking the alternative affiliated industry data corresponding to the leaf node as the affiliated industry data.
Embodiments of the present application further provide a computer storage medium, where a computer executable program is stored on the computer storage medium, and the computer executable program is executed to implement any of the methods described in the embodiments of the present application.
Embodiments of the present application further provide a computer program product, where a computer executable program is stored on the computer program product, and the computer executable program is executed to implement any one of the methods described in the embodiments of the present application.
FIG. 3 is a schematic structural diagram of an electronic device in an embodiment of the present application; as shown in fig. 3, the electronic apparatus includes: a memory 301 storing a computer executable program and a processor 302 for running the computer executable program to implement the data processing method in any embodiment of the present application.
The above embodiments are only specific embodiments of the present application. They illustrate the technical solutions of the present application rather than limit them, and the protection scope of the present application is not limited thereto. Although the present application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that, within the technical scope disclosed in the present application, anyone skilled in the art may still modify the technical solutions described in the foregoing embodiments, readily conceive of changes to them, or make equivalent substitutions for some of their technical features. Such modifications, changes or substitutions do not depart from the spirit and scope of the embodiments of the present application and are intended to be covered by the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (11)

1. An affiliated industry data determination method, comprising:
determining user data records missing the industry data to which the user data records belong and attribute data in the user data records;
judging whether industry related data having a related relation with the attribute data exists or not;
if so, determining the affiliated industry data according to the affiliated industry related data; and if not, generating the affiliated industry data according to the attribute data.
2. The method of claim 1, wherein if the attribute data includes business data to which the user belongs;
wherein the judging whether the industry related data having the association relationship with the attribute data exists includes: judging whether enterprise information data corresponding to the enterprise data to which the user belongs exists or not;
the determining the affiliated industry data according to the affiliated industry related data comprises: and taking enterprise information data corresponding to the enterprise data to which the user belongs as the related data of the affiliated industry, and determining the affiliated industry data according to the related data of the affiliated industry.
3. The method of claim 1, wherein if the attribute data includes a unique ID assigned to the corresponding user;
wherein the judging whether the industry related data having the association relationship with the attribute data exists includes: judging whether corresponding behavior data using the application program exist or not according to the unique ID distributed to the corresponding user;
the determining the affiliated industry data according to the affiliated industry related data comprises: and taking enterprise information data corresponding to the behavior data of the application program as the related data of the affiliated industry, and determining the affiliated industry data according to the related data of the affiliated industry.
4. The method according to claim 3, wherein taking the enterprise information data corresponding to the behavior data of using the application program as the affiliated industry related data comprises:
determining the behavior timestamp corresponding to the behavior data of using the application program, and selecting the behavior data whose behavior timestamp falls within the set timestamp range, so as to take the corresponding enterprise information data as the affiliated industry related data.
5. The method of claim 4, wherein the behavior data of using the application program comprises content browsing behavior data;
the determining whether there is corresponding behavior data using the application program according to the unique ID allocated to the corresponding user includes: judging whether content browsing behavior data exists or not according to the unique ID distributed to the corresponding user;
the determining the affiliated industry data according to the affiliated industry related data comprises: and taking enterprise information data corresponding to the content browsing behavior data as the affiliated industry related data, and determining the affiliated industry data according to the affiliated industry related data.
6. The method of claim 1, wherein generating the affiliated industry data according to the attribute data comprises: generating predicted affiliated industry data according to the attribute data based on a pre-trained industry classification model.
7. The method of claim 6, wherein the pre-trained industry classification model comprises a logistic regression model; the logistic regression model comprises a plurality of weight parameter matrices, one weight parameter matrix corresponds to one class of alternative affiliated industry data, each weight parameter matrix comprises a plurality of classification weight values, and the number of classification weight values is the same as the number of dimensions of the attribute data;
the generating of the predicted affiliated industry data according to the attribute data based on the pre-trained logistic regression model comprises:
for each weight parameter matrix, multiplying the data of each dimension in the attribute data by the corresponding classification weight value in the weight parameter matrix, and then performing summation operation to obtain a predicted value;
summing the predicted values corresponding to all the weight parameter matrixes to obtain the sum of the predicted values;
calculating the ratio of the predicted value corresponding to each weight parameter matrix to the sum of the predicted values, and taking the ratio as the probability value that the corresponding alternative affiliated industry data is the affiliated industry data;
and taking the alternative affiliated industry data corresponding to the maximum probability value as the affiliated industry data.
8. The method of claim 6, wherein the pre-trained industry classification model comprises a decision tree; the decision tree comprises a plurality of nodes and inter-node branch connecting lines, the nodes comprise root nodes, internal nodes and leaf nodes, each inter-node branch connecting line starts from a root node, passes through internal nodes and reaches a leaf node, each root node and each internal node corresponds to one dimension of the attribute data, and one leaf node corresponds to one class of alternative affiliated industry data;
the generating of the predicted affiliated industry data according to the attribute data based on the pre-trained decision tree comprises:
searching a node corresponding to the data of each dimension in the attribute data in the decision tree;
determining corresponding inter-node branch connecting lines and leaf nodes positioned on the inter-node branch connecting lines according to the searched nodes;
and taking the alternative affiliated industry data corresponding to the leaf node as the affiliated industry data.
9. An affiliated industry data determination device, comprising:
a first processing unit, configured to determine a user data record that is missing affiliated industry data, and attribute data in the user data record;
a second processing unit, configured to judge whether affiliated industry related data having an association relationship with the attribute data exists;
a third processing unit, configured to determine the affiliated industry data according to the affiliated industry related data when the affiliated industry related data exists, and to generate the affiliated industry data according to the attribute data when the affiliated industry related data does not exist.
10. A computer storage medium having stored thereon a computer-executable program that is executed to implement the method of any of claims 1-8.
11. An electronic device comprising a memory for storing thereon a computer-executable program and a processor for executing the computer-executable program to implement the method of any of claims 1-8.
CN202111508754.4A 2021-12-10 2021-12-10 Affiliated industry data determining method and device Pending CN114185880A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111508754.4A CN114185880A (en) 2021-12-10 2021-12-10 Affiliated industry data determining method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111508754.4A CN114185880A (en) 2021-12-10 2021-12-10 Affiliated industry data determining method and device

Publications (1)

Publication Number Publication Date
CN114185880A true CN114185880A (en) 2022-03-15

Family

ID=80604419

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111508754.4A Pending CN114185880A (en) 2021-12-10 2021-12-10 Affiliated industry data determining method and device

Country Status (1)

Country Link
CN (1) CN114185880A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110135901A (en) * 2019-05-10 2019-08-16 重庆天蓬网络有限公司 A kind of enterprise customer draws a portrait construction method, system, medium and electronic equipment
CN110134759A (en) * 2019-05-13 2019-08-16 极智(上海)企业管理咨询有限公司 A method of obtaining the trade information of enterprise
CN110781380A (en) * 2019-09-09 2020-02-11 深圳壹账通智能科技有限公司 Information pushing method and device, computer equipment and storage medium
CN111126422A (en) * 2018-11-01 2020-05-08 百度在线网络技术(北京)有限公司 Industry model establishing method, industry determining method, industry model establishing device, industry determining equipment and industry determining medium
CN113609174A (en) * 2021-07-28 2021-11-05 江苏汇农天下信息科技有限公司 Industry user data searching method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination