CN112328779A - Training sample construction method and device, terminal equipment and storage medium - Google Patents

Info

Publication number
CN112328779A
Authority
CN
China
Prior art keywords
sharing
article
articles
sequence
user
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011217114.3A
Other languages
Chinese (zh)
Other versions
CN112328779B (en)
Inventor
老焯楠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Life Insurance Company of China Ltd
Original Assignee
Ping An Life Insurance Company of China Ltd

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33: Querying
    • G06F16/335: Filtering based on additional data, e.g. user or group profiles

Abstract

The application belongs to the technical field of artificial intelligence and provides a training sample construction method and device, a terminal device, and a storage medium. The method comprises the following steps: screening users according to article sharing data of different user accounts to obtain sample users; constructing a sharing sequence feature according to the article sharing data corresponding to each sample user; constructing a label sequence feature corresponding to the sample user according to the sharing sequence feature; and constructing a sample according to the sharing sequence feature and the label sequence feature to obtain a training sample. Because the sample is constructed from the sharing sequence feature and the label sequence feature, the training sample captures both the order in which a sample user shared different articles and the order of the labels corresponding to those articles. An estimation model trained on such training samples can therefore estimate users' article sharing behavior more accurately. In addition, the application also relates to blockchain technology.

Description

Training sample construction method and device, terminal equipment and storage medium
Technical Field
The present application relates to the field of artificial intelligence, and in particular, to a training sample construction method and apparatus, a terminal device, and a storage medium.
Background
Among recommendation algorithms, the CTR (Click-Through-Rate) estimation model is undoubtedly the most widely used ranking model scheme. The training samples used to train the estimation model are the key to the whole algorithm: constructing training samples consumes more than 70% of the effort and time spent on a recommendation algorithm, and the quality of the training samples directly influences the final effect of the estimation model. Methods for constructing training samples are therefore receiving more and more attention.
In the existing training sample construction process, features are extracted from users' click data and articles' exposure data, and training samples are constructed from the extracted click features and exposure features. An estimation model trained on such samples can estimate users' click behavior but cannot effectively estimate users' article sharing behavior. Its estimation accuracy for article sharing behavior is therefore low, which degrades the user experience.
Disclosure of Invention
In view of this, embodiments of the present application provide a training sample construction method and apparatus, a terminal device, and a storage medium, so as to solve the problem in the prior art that in a training sample construction process, due to the fact that a training sample is constructed according to extracted click features and exposure features, the estimation accuracy of an estimation model after training on an article sharing behavior of a user is low.
A first aspect of an embodiment of the present application provides a training sample construction method, including:
the method comprises the steps of obtaining article sharing data of different user accounts, screening users according to the article sharing data to obtain sample users, wherein the article sharing data comprise a plurality of articles shared by the user accounts;
establishing sharing sequence characteristics according to the article sharing data corresponding to the sample user, wherein the sharing sequence characteristics are used for representing the sharing sequence among different articles;
constructing a label sequence feature corresponding to the sample user according to the sharing sequence feature, wherein the label sequence feature is used for representing the sharing sequence of article labels among different articles;
and constructing a sample according to the sharing sequence characteristic and the label sequence characteristic to obtain the training sample.
Further, the constructing sharing sequence features according to the article sharing data corresponding to the sample user includes:
screening the articles in the article sharing data, and acquiring sharing time corresponding to the screened articles;
and sequencing the screened articles according to the sharing time to obtain the sharing sequence characteristics of the article sharing data corresponding to the sample user.
Further, the screening the articles in the article sharing data includes:
performing repeated sharing detection on the articles in the article sharing data, wherein the repeated sharing detection is used for detecting whether the same articles exist in the article sharing data;
if the same articles exist in the article sharing data, the sharing time of the same articles is respectively obtained, the articles are deleted according to the sharing time of the same articles, and the number of the deleted articles is smaller than that of the same articles.
Further, the deleting the articles according to the sharing time of the same article includes:
respectively calculating the time difference between the sharing time and the current time of the same article;
and reserving the same article corresponding to the minimum time difference, and deleting the same articles corresponding to the rest time differences in the article sharing data.
Further, the constructing a tag sequence feature corresponding to the sample user according to the sharing sequence feature includes:
respectively obtaining article identifiers of the articles in the sharing sequence characteristics, and performing label query according to the article identifiers to obtain article labels, wherein the article labels are used for representing keywords corresponding to the articles;
and sequencing the corresponding article labels according to the sharing sequence of the articles in the sharing sequence characteristics to obtain the label sequence characteristics.
Further, the sorting the corresponding article tags according to the sharing order of the articles in the sharing sequence feature to obtain the tag sequence feature includes:
respectively querying the one-hot codes of the first-level labels in the article labels, and sorting the one-hot codes corresponding to the first-level labels according to the sharing order of the articles in the sharing sequence feature to obtain a first-level label sequence feature;
respectively querying the one-hot codes of the second-level labels in the article labels, and sorting the one-hot codes corresponding to the second-level labels according to the sharing order of the articles in the sharing sequence feature to obtain a second-level label sequence feature;
and storing the first-level label sequence feature and the second-level label sequence feature to obtain the label sequence feature.
Further, the screening a user according to the article sharing data to obtain a sample user includes:
respectively matching the articles in the article sharing data with different preset articles;
if none of the articles in the article sharing data matches any of the preset articles, deleting the user corresponding to the article sharing data;
if the articles in the article sharing data match all of the preset articles, deleting the user corresponding to the article sharing data;
setting the remaining users as the sample users.
A second aspect of an embodiment of the present application provides a training sample construction apparatus, including:
a user screening unit, used for acquiring article sharing data of different user accounts and screening users according to the article sharing data to obtain sample users, wherein the article sharing data comprises a plurality of articles shared by the user accounts;
the sharing sequence construction unit is used for constructing sharing sequence characteristics according to the article sharing data corresponding to the sample user, and the sharing sequence characteristics are used for representing the sharing sequence among different articles;
the tag sequence construction unit is used for constructing tag sequence characteristics corresponding to the sample user according to the sharing sequence characteristics, and the tag sequence characteristics are used for representing the sharing sequence of article tags of different articles;
and the sample construction unit is used for constructing a sample according to the sharing sequence characteristic and the label sequence characteristic to obtain the training sample.
A third aspect of the embodiments of the present application provides a terminal device, which includes a memory, a processor, and a computer program stored in the memory and executable on the terminal device, where the processor implements the steps of the training sample construction method provided by the first aspect when executing the computer program.
A fourth aspect of the embodiments of the present application provides a storage medium, which stores a computer program that, when executed by a processor, implements the steps of the training sample construction method provided by the first aspect.
The training sample construction method, the training sample construction device, the terminal equipment and the storage medium have the following beneficial effects:
In the training sample construction method provided by the embodiments of the present application, screening users according to the article sharing data improves the relevance between the sample users and the article sharing data, which in turn improves the accuracy with which the sharing sequence feature and the label sequence feature are constructed. Constructing the sharing sequence feature from the article sharing data corresponding to a sample user effectively extracts the order in which the sample user shared different articles. Constructing the label sequence feature from the sharing sequence feature effectively extracts the order of the labels corresponding to those articles. Constructing the sample from both features yields training samples that carry the sharing order among the articles and among their labels. An estimation model trained on these training samples can therefore estimate users' article sharing behavior more accurately.
Drawings
In order to illustrate the technical solutions in the embodiments of the present application more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the present application; those skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a flowchart of an implementation of a training sample construction method provided in an embodiment of the present application;
FIG. 2 is a flowchart of an implementation of a training sample construction method according to another embodiment of the present application;
FIG. 3 is a flowchart of an implementation of a training sample construction method according to yet another embodiment of the present application;
FIG. 4 is a block diagram illustrating the structure of a training sample construction apparatus according to an embodiment of the present application;
FIG. 5 is a block diagram of a terminal device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The training sample construction method according to the embodiment of the present application may be executed by a control device or a terminal (hereinafter referred to as a "mobile terminal").
Referring to FIG. 1, FIG. 1 shows a flowchart of an implementation of a training sample construction method provided in an embodiment of the present application, including:
Step S10, obtaining article sharing data of different user accounts, and screening users according to the article sharing data to obtain sample users.
The article sharing data of different user accounts is obtained from the buried-point (event-tracking) data of those user accounts, and the article sharing data includes a plurality of articles shared by the corresponding user account. For example, when the users in this step include user A, user B, user C and user D, the article sharing data of users A, B, C and D is obtained from the buried-point data of the user accounts corresponding to users A, B, C and D.
In this step, the data identifier of the article sharing data is acquired and matched against the buried-point data of the corresponding user to obtain that user's article sharing data. Screening users according to the article sharing data improves the relevance between the sample users and the article sharing data.
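The extraction of article sharing data from buried-point (event-tracking) data can be sketched as follows. This is only an illustration under stated assumptions: the record layout `(user account, event type, article identifier, sharing time)`, the event type name `"share"`, and the function name `collect_sharing_data` are hypothetical and do not come from the application.

```python
from collections import defaultdict

# Hypothetical buried-point records:
# (user account, event type, article identifier, sharing time).
events = [
    ("user_A", "share", "article_d", 100),
    ("user_A", "click", "article_e", 110),
    ("user_B", "share", "article_e", 120),
    ("user_A", "share", "article_f", 130),
]


def collect_sharing_data(events, share_event="share"):
    """Group share events by user account and ignore every other event type."""
    sharing_data = defaultdict(list)
    for user, event_type, article_id, shared_at in events:
        if event_type == share_event:
            sharing_data[user].append((article_id, shared_at))
    return dict(sharing_data)


sharing_data = collect_sharing_data(events)
```

Here the event type plays the role of the data identifier that selects sharing records out of the full buried-point log.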
Optionally, in this step, the screening the user according to the article sharing data to obtain a sample user includes:
respectively matching the articles in the article sharing data with different preset articles;
if none of the articles in the article sharing data matches any of the preset articles, deleting the user corresponding to the article sharing data;
if the articles in the article sharing data match all of the preset articles, deleting the user corresponding to the article sharing data;
setting the remaining users as the sample users.
The articles in the article sharing data are matched against the different preset articles to determine whether the user has shared the preset articles. If none of the articles in the article sharing data matches a preset article, the user is judged not to have shared the preset articles and can only provide negative sample data for training sample construction, so the user corresponding to the article sharing data is deleted. If the articles in the article sharing data match all of the preset articles, the user is judged to have shared the preset articles and can only provide positive sample data, so that user is deleted as well. Deleting both kinds of users improves the accuracy of training sample construction.
Further, in this step, screening users according to the article sharing data to obtain sample users further includes: for each designated article, if the article sharing data corresponding to a user shows that the user shared the designated article, marking the user with a positive sample mark, and if it shows that the user did not share the designated article, marking the user with a negative sample mark, wherein the number of designated articles is more than 2; and setting the users that carry both a positive sample mark and a negative sample mark as sample users, which effectively improves the balance between positive sample data and negative sample data.
For example, when the designated articles include designated article a, designated article b and designated article c, whether the corresponding user has shared each of them is determined from the article sharing data. If the user has shared any one or two of designated articles a, b and c, the user is set as a sample user. A user who has shared none of them carries only negative sample marks and is deleted, and a user who has shared all of them carries only positive sample marks and is deleted.
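The screening rule in the example above can be sketched as follows. This is a minimal sketch, assuming each user's article sharing data is a plain list of article identifiers; the function name `screen_sample_users` and the user and article names are hypothetical.

```python
def screen_sample_users(sharing_data, designated_articles):
    """Keep users who carry both positive and negative sample marks."""
    designated = set(designated_articles)
    sample_users = {}
    for user, articles in sharing_data.items():
        shared = designated & set(articles)
        has_positive = len(shared) > 0                 # shared at least one designated article
        has_negative = len(shared) < len(designated)   # missed at least one designated article
        if has_positive and has_negative:
            # True = positive sample mark, False = negative sample mark.
            sample_users[user] = {a: a in shared for a in designated_articles}
    return sample_users


sharing_data = {
    "user_A": ["a", "x"],       # shares one designated article: kept
    "user_B": ["x", "y"],       # shares none: only negative marks, deleted
    "user_C": ["a", "b", "c"],  # shares all: only positive marks, deleted
    "user_D": ["b", "c"],       # shares two: kept
}
sample_users = screen_sample_users(sharing_data, ["a", "b", "c"])
```

Only `user_A` and `user_D` remain, and each keeps a per-article mark that can later serve as a label.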
Step S20, constructing sharing sequence characteristics according to the article sharing data corresponding to the sample user.
For example, when the article sharing data corresponding to the sample user includes an article d, an article e, and an article f, the sharing sequence feature of the corresponding user is constructed according to the article d, the article e, and the article f, and the sharing sequence feature is used for characterizing the sharing sequence among the article d, the article e, and the article f.
Step S30, constructing a tag sequence feature corresponding to the sample user according to the sharing sequence feature.
For example, when the article sharing data corresponding to the sample user includes an article d, an article e, and an article f, and the article tags corresponding to the article d, the article e, and the article f are an article tag g, an article tag h, and an article tag i, the tag sequence feature corresponding to the user is constructed according to the sharing sequence feature, and the tag sequence feature is used for representing the sharing sequence among the article tag g, the article tag h, and the article tag i.
Step S40, constructing a sample according to the sharing sequence characteristics and the label sequence characteristics to obtain the training sample.
The training sample is constructed according to the sharing sequence characteristics and the label sequence characteristics, and the characteristics of the sharing sequence between different articles and the characteristics of the sharing sequence between article labels of different articles are effectively carried, so that the estimation model after model training based on the training sample can accurately estimate the article sharing behavior of the user, and the estimation accuracy of the estimation model on the article sharing behavior of the user is improved.
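The application does not specify the concrete layout of a training sample, so the following is only one possible sketch: it assumes a sample simply stores the two sequences side by side, aligned position by position. The `TrainingSample` class, its field names, and `build_training_sample` are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TrainingSample:
    user: str
    sharing_sequence: List[str]  # article identifiers in sharing order
    label_sequence: List[str]    # article labels in the same order


def build_training_sample(user, sharing_sequence, label_sequence):
    """Combine the two sequence features into one sample (assumed layout)."""
    if len(sharing_sequence) != len(label_sequence):
        raise ValueError("sequences must be aligned position by position")
    return TrainingSample(user, list(sharing_sequence), list(label_sequence))


sample = build_training_sample(
    "user_A",
    ["article_d", "article_e", "article_f"],
    ["label_g", "label_h", "label_i"],
)
```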
In this embodiment, screening users according to the article sharing data improves the relevance between the sample users and the article sharing data, which in turn improves the accuracy with which the sharing sequence feature and the label sequence feature are constructed. Constructing the sharing sequence feature from the article sharing data corresponding to a sample user effectively extracts the order in which the sample user shared different articles, and constructing the label sequence feature from the sharing sequence feature effectively extracts the order of the labels corresponding to those articles. Constructing the sample from both features yields training samples that carry the sharing order among the articles and among their labels, so an estimation model trained on these training samples can estimate users' article sharing behavior more accurately.
Referring to FIG. 2, FIG. 2 is a flowchart of an implementation of a training sample construction method according to another embodiment of the present application. Relative to the embodiment corresponding to FIG. 1, this embodiment further refines step S20 of that embodiment, and includes:
Step S21, screening the articles in the article sharing data, and acquiring the sharing time corresponding to the screened articles.
The articles in the article sharing data are screened to prevent repeatedly shared articles from affecting the construction of the subsequent sharing sequence feature. The sharing time is the time at which the corresponding user shared the article.
Optionally, in this step, the screening the articles in the article sharing data includes:
performing repeated sharing detection on the articles in the article sharing data;
if the same articles exist in the article sharing data, respectively acquiring the sharing time of the same articles, and deleting the articles according to the sharing time of the same articles, wherein the number of the deleted articles is smaller than that of the same articles;
The repeated sharing detection is used to detect whether the same article appears more than once in the article sharing data. If it does, the article sharing data contains repeated sharing, that is, the user corresponding to the article sharing data has shared the same article repeatedly.
For example, suppose the article sharing data contains the user's sharing information for one day: article d shared in the morning, and article e and article d shared in the afternoon. Because article d is shared repeatedly within the day, the article sharing data contains identical articles. The sharing times of article d shared in the morning and article d shared in the afternoon are obtained as sharing time d1 and sharing time d2 respectively, and articles are deleted according to sharing time d1 and sharing time d2, which prevents repeatedly shared articles in the article sharing data from affecting the construction of the subsequent sharing sequence feature.
Specifically, in this step, if an article appears n times, n-1 copies of it are deleted. For example, if article d appears 2 times, 1 copy of article d is deleted and only 1 is retained; if article d appears 10 times, 9 copies are deleted.
Preferably, in this step, the deleting the articles according to the sharing time of the same article includes:
respectively calculating the time difference between the sharing time and the current time of the same article;
the same article corresponding to the minimum time difference is reserved, and the same articles corresponding to the rest time differences are deleted from the article sharing data;
The time difference between each sharing time of the same article and the current time is calculated, and only the copy with the minimum time difference is retained, so that the article sharing data keeps exactly one piece of sharing information per article. This prevents repeatedly shared articles from affecting the construction of the subsequent sharing sequence feature.
For example, suppose article d was shared at different times as article d1, article d2 and article d3. The sharing times of article d1, article d2 and article d3 are obtained as sharing time d3, sharing time d4 and sharing time d5 respectively, and the time differences between the current time and sharing times d3, d4 and d5 are calculated as time difference m1, time difference m2 and time difference m3. When time difference m1 is less than time difference m2 and time difference m2 is less than time difference m3, only article d1, which corresponds to time difference m1, is retained; article d1 is the sharing of article d closest to the current time.
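The deletion rule (keep the copy with the smallest time difference to the current time, delete the other n-1 copies) can be sketched as follows. This is a minimal sketch, assuming sharing times are numeric timestamps; the function name `keep_latest_share` and the sample values are hypothetical.

```python
def keep_latest_share(shares, now):
    """For each article keep only the share closest in time to `now`."""
    latest = {}
    for article_id, shared_at in shares:
        time_difference = now - shared_at
        # Replace the stored share only when this one is closer to `now`.
        if article_id not in latest or time_difference < now - latest[article_id]:
            latest[article_id] = shared_at
    return list(latest.items())


# Article d shared twice, article e shared once.
shares = [("article_d", 9), ("article_e", 14), ("article_d", 15)]
deduplicated = keep_latest_share(shares, now=20)
```

After the call, article d appears once with sharing time 15, so one copy (n-1 with n = 2) has been deleted.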
Step S22, sorting the screened articles according to the sharing time to obtain the sharing sequence characteristics of the article sharing data corresponding to the sample user.
After article screening, the article sharing data contains exactly one piece of sharing information per article. The articles are sorted according to their sharing times to obtain the sharing sequence feature of the article sharing data corresponding to the sample user.
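The sorting in step S22 can be sketched as follows. The application does not state the sort direction, so ascending order (earliest share first) is an assumption here, as are the function name `build_sharing_sequence` and the `(article identifier, sharing time)` tuple layout.

```python
def build_sharing_sequence(deduplicated_shares):
    """Order articles by sharing time (earliest first, an assumed direction)."""
    ordered = sorted(deduplicated_shares, key=lambda share: share[1])
    return [article_id for article_id, _ in ordered]


sharing_sequence = build_sharing_sequence(
    [("article_e", 14), ("article_f", 18), ("article_d", 9)]
)
```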
In this embodiment, the articles in the article sharing data are screened to prevent repeatedly shared articles from affecting the construction of the subsequent sharing sequence feature. Repeated sharing detection is performed on the articles, the sharing times of identical articles are obtained, and articles are deleted according to those sharing times, so that the screened article sharing data contains exactly one piece of sharing information per article, which improves the accuracy of sharing sequence feature construction.
Referring to FIG. 3, FIG. 3 is a flowchart of an implementation of a training sample construction method according to yet another embodiment of the present application. Relative to the embodiment corresponding to FIG. 1, this embodiment further refines step S30 of that embodiment, and includes:
Step S31, respectively obtaining article identifiers of the articles in the sharing sequence features, and performing label query according to the article identifiers to obtain article labels.
The article identifier uniquely identifies the corresponding article and may be stored as characters, digits, letters, or the like. In this embodiment a label lookup table is pre-stored, which records the correspondence between article identifiers and article labels. An article label represents a keyword of the corresponding article; for example, an article label may be "sports", "finance" or "movie".
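The label query against the pre-stored lookup table can be sketched as follows. This sketch assumes the table is an in-memory dictionary keyed by article identifier; the name `LABEL_LOOKUP`, the function `query_labels`, and the identifier-to-label pairs are hypothetical.

```python
# Hypothetical label lookup table: article identifier -> article label.
LABEL_LOOKUP = {
    "article_d": "sports",
    "article_e": "finance",
    "article_f": "movie",
}


def query_labels(sharing_sequence, lookup):
    """Return the label of each article, preserving the sharing order."""
    # A missing identifier raises KeyError rather than silently skipping an article,
    # so the label sequence always stays aligned with the sharing sequence.
    return [lookup[article_id] for article_id in sharing_sequence]


labels = query_labels(["article_e", "article_d", "article_f"], LABEL_LOOKUP)
```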
Step S32, sorting the corresponding article labels according to the sharing order of the articles in the sharing sequence feature to obtain the label sequence feature.
Specifically, in this step, sorting the corresponding article labels according to the sharing order of the articles in the sharing sequence feature to obtain the label sequence feature includes:
respectively querying the one-hot codes of the first-level labels in the article labels, and sorting the one-hot codes corresponding to the first-level labels according to the sharing order of the articles in the sharing sequence feature to obtain a first-level label sequence feature;
respectively querying the one-hot codes of the second-level labels in the article labels, and sorting the one-hot codes corresponding to the second-level labels according to the sharing order of the articles in the sharing sequence feature to obtain a second-level label sequence feature;
storing the first-level label sequence feature and the second-level label sequence feature to obtain the label sequence feature;
optionally, in this embodiment, the article label further includes a third-level label, and the third-level label may be used to represent a keyword of a specified paragraph in the corresponding article.
Specifically, in this step, the one-hot codes corresponding to the first-level tags are sorted according to the sharing sequence of the articles in the sharing sequence features to obtain first-level tag sequence features, the first-level tag sequence features are used for representing the sharing sequence among the first-level tags of different articles, the one-hot codes corresponding to the second-level tags are sorted according to the sharing sequence of the articles in the sharing sequence features to obtain second-level tag sequence features, and the second-level tag sequence features are used for representing the sharing sequence among the second-level tags of different articles.
For example, suppose the sharing order in the sharing sequence feature is article d, article e, article f; the first-level label of article d is label s1 and its second-level label is label s2; the first-level label of article e is label s3 and its second-level label is label s4; and the first-level label of article f is label s5 and its second-level label is label s6. Sorting the one-hot codes corresponding to the first-level labels according to the sharing order of the articles in the sharing sequence feature gives the first-level label sequence feature: label s1, label s3, label s5. Sorting the one-hot codes corresponding to the second-level labels in the same way gives the second-level label sequence feature: label s2, label s4, label s6.
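The example above can be sketched in code as follows. This is a minimal sketch, assuming each label level has its own fixed vocabulary and that a one-hot code is a list with a single 1 at the label's vocabulary index; the helper names, the vocabularies, and the dictionary layout of the result are hypothetical.

```python
def one_hot(index, size):
    """Return a list of `size` zeros with a single 1 at `index`."""
    vector = [0] * size
    vector[index] = 1
    return vector


def build_label_sequence(sharing_sequence, article_labels, vocabulary, level):
    """One-hot encode the label at `level` for each article, in sharing order."""
    size = len(vocabulary)
    return [
        one_hot(vocabulary.index(article_labels[article_id][level]), size)
        for article_id in sharing_sequence
    ]


# (first-level label, second-level label) per article, as in the example.
article_labels = {
    "article_d": ("s1", "s2"),
    "article_e": ("s3", "s4"),
    "article_f": ("s5", "s6"),
}
first_level_vocabulary = ["s1", "s3", "s5"]   # assumed vocabulary
second_level_vocabulary = ["s2", "s4", "s6"]  # assumed vocabulary
sharing_sequence = ["article_d", "article_e", "article_f"]

label_sequence_feature = {
    "first_level": build_label_sequence(
        sharing_sequence, article_labels, first_level_vocabulary, 0
    ),
    "second_level": build_label_sequence(
        sharing_sequence, article_labels, second_level_vocabulary, 1
    ),
}
```

Storing both levels in one structure mirrors the final step, in which the first-level and second-level label sequence features together form the label sequence feature.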
In this embodiment, sorting the one-hot codes corresponding to the first-level tags according to the sharing order of the articles in the sharing sequence feature yields the sharing order among the first-level tags of different articles, and sorting the one-hot codes corresponding to the second-level tags in the same way yields the sharing order among the second-level tags of different articles. The subsequently generated training samples therefore carry the order in which each sample user shared different articles. After the estimation model is trained on these training samples, it can estimate the article sharing behavior of users more accurately.
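The worked example with articles d, e and f can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function names, the tuple layout of `article_tags`, and the shared tag vocabulary `tag_vocab` used to derive the one-hot codes are all assumptions made for the example.

```python
from typing import Dict, List, Tuple


def one_hot(index: int, size: int) -> List[int]:
    """Return a one-hot vector of length `size` with a 1 at position `index`."""
    return [1 if i == index else 0 for i in range(size)]


def build_tag_sequence_features(
    sharing_sequence: List[str],
    article_tags: Dict[str, Tuple[str, str]],
    tag_vocab: Dict[str, int],
) -> Dict[str, List[List[int]]]:
    """Order the one-hot codes of first- and second-level tags by sharing order.

    `article_tags` maps an article id to (first-level tag, second-level tag);
    `tag_vocab` maps a tag to its one-hot position. Both layouts are assumed.
    """
    size = len(tag_vocab)
    first_level: List[List[int]] = []
    second_level: List[List[int]] = []
    for article_id in sharing_sequence:
        first_tag, second_tag = article_tags[article_id]
        first_level.append(one_hot(tag_vocab[first_tag], size))
        second_level.append(one_hot(tag_vocab[second_tag], size))
    # Storing both sequences together gives the tag sequence feature.
    return {"first_level": first_level, "second_level": second_level}


# Sharing order d - e - f with tags s1..s6, as in the example above.
sharing_sequence = ["d", "e", "f"]
article_tags = {"d": ("s1", "s2"), "e": ("s3", "s4"), "f": ("s5", "s6")}
tag_vocab = {f"s{i}": i - 1 for i in range(1, 7)}
features = build_tag_sequence_features(sharing_sequence, article_tags, tag_vocab)
```

Here `features["first_level"]` holds the one-hot codes of s1, s3 and s5 in that order, and `features["second_level"]` holds those of s2, s4 and s6.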
In all embodiments of the application, the training sample is obtained by performing sample construction based on the sharing sequence feature and the tag sequence feature. Uploading the constructed training sample to a blockchain helps ensure its security as well as fairness and transparency for the user. The user equipment may download the training sample from the blockchain to verify whether the training sample has been tampered with. The blockchain referred to in this example is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a series of data blocks linked by cryptographic methods, each of which contains information on a batch of network transactions and is used to verify the validity (anti-counterfeiting) of that information and to generate the next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
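The text does not specify how a downloaded training sample is checked for tampering. One common approach, shown here purely as an assumption, is to record a SHA-256 digest of the serialized sample at upload time and compare it with the digest of the downloaded copy. The function names and the JSON serialization are hypothetical, and no actual blockchain client is involved.

```python
import hashlib
import json


def sample_digest(sample: dict) -> str:
    """Return the SHA-256 hex digest of a canonical JSON form of the sample."""
    # Sorting keys and fixing separators makes the serialization deterministic.
    payload = json.dumps(sample, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def is_untampered(downloaded_sample: dict, recorded_digest: str) -> bool:
    """Check a downloaded sample against the digest recorded at upload time."""
    return sample_digest(downloaded_sample) == recorded_digest


# Hypothetical training sample and the digest assumed to be stored on-chain.
sample = {"sharing_sequence": ["d", "e", "f"], "tag_sequence": ["s1", "s3", "s5"]}
recorded_digest = sample_digest(sample)

# A copy whose sharing order was altered no longer matches the recorded digest.
tampered = dict(sample, sharing_sequence=["d", "f", "e"])
```

With these values, `is_untampered(sample, recorded_digest)` holds while `is_untampered(tampered, recorded_digest)` does not.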
Referring to fig. 4, fig. 4 is a block diagram illustrating a training sample construction apparatus 100 according to an embodiment of the present disclosure. The training sample construction apparatus 100 in this embodiment includes units for performing the steps in the embodiments corresponding to fig. 1, fig. 2, and fig. 3. For details, refer to the descriptions of the embodiments corresponding to fig. 1, fig. 2, and fig. 3. For convenience of explanation, only the portions related to the present embodiment are shown. As shown in fig. 4, the training sample construction apparatus 100 includes a user screening unit 10, a sharing sequence constructing unit 11, a tag sequence constructing unit 12 and a sample construction unit 13, wherein:
The user screening unit 10 is configured to acquire article sharing data of different user accounts, and to screen users according to the article sharing data to obtain sample users, where the article sharing data includes a plurality of articles shared by the user accounts.
Wherein the user screening unit 10 is further configured to: respectively matching the articles in the article sharing data with different preset articles;
if the article in the article sharing data is not matched with the preset article, deleting the user corresponding to the article sharing data;
if the articles in the article sharing data are matched with the preset articles, deleting the user corresponding to the article sharing data;
setting the remaining users as the sample users.
The sharing sequence constructing unit 11 is configured to construct a sharing sequence feature according to the article sharing data corresponding to the sample user, where the sharing sequence feature is used to characterize a sharing sequence between different articles.
Wherein, the sharing sequence constructing unit 11 is further configured to: screening the articles in the article sharing data, and acquiring sharing time corresponding to the screened articles;
and sequencing the screened articles according to the sharing time to obtain the sharing sequence characteristics of the article sharing data corresponding to the sample user.
Optionally, the sharing sequence constructing unit 11 is further configured to: perform duplicate-sharing detection on the articles in the article sharing data, where the duplicate-sharing detection is used to detect whether identical articles exist in the article sharing data;
if identical articles exist in the article sharing data, obtain the sharing time of each identical article and delete articles according to these sharing times, where the number of deleted articles is smaller than the number of identical articles.
Optionally, the sharing sequence constructing unit 11 is further configured to: calculate, for each identical article, the time difference between its sharing time and the current time;
and retain the identical article with the smallest time difference, and delete the identical articles with the remaining time differences from the article sharing data.
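The duplicate handling and time-based ordering performed by the sharing sequence constructing unit 11 can be sketched as follows. The function name, the `(article id, sharing time)` record layout and the choice of earliest-first ordering are assumptions for illustration; the text only states that articles are sorted according to sharing time.

```python
from datetime import datetime
from typing import Dict, List, Tuple


def build_sharing_sequence(
    shares: List[Tuple[str, datetime]], now: datetime
) -> List[str]:
    """Drop repeated shares of the same article, then order by sharing time.

    For identical articles, the share with the smallest time difference to
    `now` is retained and the others are deleted.
    """
    kept: Dict[str, datetime] = {}
    for article_id, shared_at in shares:
        previous = kept.get(article_id)
        if previous is None or (now - shared_at) < (now - previous):
            kept[article_id] = shared_at
    # Earliest share first (assumed direction of the ordering).
    ordered = sorted(kept.items(), key=lambda item: item[1])
    return [article_id for article_id, _ in ordered]


# Hypothetical share log in which article "a" was shared twice.
shares = [
    ("a", datetime(2020, 11, 1, 9, 0)),
    ("b", datetime(2020, 11, 1, 10, 0)),
    ("a", datetime(2020, 11, 2, 8, 0)),
    ("c", datetime(2020, 11, 1, 12, 0)),
]
sequence = build_sharing_sequence(shares, now=datetime(2020, 11, 4))
```

Only the later share of article "a" survives, so it moves to the end of the resulting sequence.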
The tag sequence constructing unit 12 is configured to construct, according to the sharing sequence feature, a tag sequence feature corresponding to the sample user, where the tag sequence feature is used to characterize the sharing order of the article tags of different articles.
Wherein the tag sequence constructing unit 12 is further configured to: obtain the article identifier of each article in the sharing sequence feature, and perform a tag query according to the article identifiers to obtain article tags, where the article tags are used to represent keywords corresponding to the articles;
and sort the corresponding article tags according to the sharing order of the articles in the sharing sequence feature to obtain the tag sequence feature.
Optionally, the tag sequence constructing unit 12 is further configured to: query the one-hot code of each first-level tag in the article tags, and sort the one-hot codes corresponding to the first-level tags according to the sharing order of the articles in the sharing sequence feature to obtain a first-level tag sequence feature;
query the one-hot code of each second-level tag in the article tags, and sort the one-hot codes corresponding to the second-level tags according to the sharing order of the articles in the sharing sequence feature to obtain a second-level tag sequence feature;
and store the first-level tag sequence feature and the second-level tag sequence feature to obtain the tag sequence feature.
The sample construction unit 13 is configured to perform sample construction according to the sharing sequence feature and the tag sequence feature to obtain the training sample.
As can be seen from the above, screening the users according to the article sharing data improves the relevance between the sample users and the article sharing data, which in turn improves the accuracy with which the sharing sequence feature and the tag sequence feature are constructed. Constructing the sharing sequence feature according to the article sharing data corresponding to the sample user effectively extracts the sharing order among the different articles shared by the sample user. Constructing the tag sequence feature corresponding to the sample user according to the sharing sequence feature effectively extracts the sharing order among the tags of those articles. Constructing the sample according to the sharing sequence feature and the tag sequence feature therefore yields training samples that carry both kinds of ordering information. As a result, after the estimation model is trained on the training samples, it can estimate the article sharing behavior of users more accurately.
Fig. 5 is a block diagram of a terminal device 2 according to another embodiment of the present application. As shown in fig. 5, the terminal device 2 of this embodiment includes: a processor 20, a memory 21 and a computer program 22, such as a program for a training sample construction method, stored in said memory 21 and executable on said processor 20. The processor 20, when executing the computer program 22, implements the steps in the above-mentioned embodiments of the training sample construction method, such as S10 to S40 shown in fig. 1, or S21 to S22 shown in fig. 2, or S31 to S32 shown in fig. 3. Alternatively, when the processor 20 executes the computer program 22, the functions of the units in the embodiment corresponding to fig. 4, for example, the functions of the units 10 to 13 shown in fig. 4, are implemented; for details, refer to the relevant description in the embodiment corresponding to fig. 4, which is not repeated herein.
Illustratively, the computer program 22 may be divided into one or more units, which are stored in the memory 21 and executed by the processor 20 to accomplish the present application. The one or more units may be a series of computer program instruction segments capable of performing specific functions, which are used to describe the execution of the computer program 22 in the terminal device 2. For example, the computer program 22 may be divided into a user screening unit 10, a sharing sequence constructing unit 11, a tag sequence constructing unit 12 and a sample construction unit 13, each of which functions as described above.
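As a rough picture of how such a division into units might look in code, the sketch below chains four small functions, one per unit. Everything here is hypothetical: the function names, the `(article id, timestamp)` layout, the single tag per article, the dictionary used as the training sample, and the screening rule, which is passed in as a predicate so that the sketch does not fix any particular matching rule.

```python
from typing import Callable, Dict, List, Tuple

Share = Tuple[str, int]  # (article id, sharing timestamp); assumed layout


def screen_users(
    data: Dict[str, List[Share]], keep: Callable[[List[Share]], bool]
) -> Dict[str, List[Share]]:
    """Unit 10: keep only the users accepted by the screening predicate."""
    return {user: shares for user, shares in data.items() if keep(shares)}


def build_sharing_sequence(shares: List[Share]) -> List[str]:
    """Unit 11: order article ids by sharing time (earliest first, assumed)."""
    return [article for article, _ in sorted(shares, key=lambda s: s[1])]


def build_tag_sequence(sequence: List[str], tags: Dict[str, str]) -> List[str]:
    """Unit 12: look up each article's tag, preserving the sharing order."""
    return [tags[article] for article in sequence]


def build_samples(
    data: Dict[str, List[Share]],
    tags: Dict[str, str],
    keep: Callable[[List[Share]], bool],
) -> Dict[str, Dict[str, List[str]]]:
    """Unit 13: pair the two sequence features for every sample user."""
    samples = {}
    for user, shares in screen_users(data, keep).items():
        sequence = build_sharing_sequence(shares)
        samples[user] = {
            "sharing_sequence": sequence,
            "tag_sequence": build_tag_sequence(sequence, tags),
        }
    return samples


# Hypothetical input: user u2 shared nothing and is screened out.
data = {"u1": [("e", 2), ("d", 1)], "u2": []}
tags = {"d": "s1", "e": "s3"}
samples = build_samples(data, tags, keep=lambda shares: len(shares) > 0)
```

Keeping each unit as a separate function mirrors the unit boundaries in fig. 4 and lets any one stage be replaced without touching the others.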
The terminal device 2 may include, but is not limited to, the processor 20 and the memory 21. It will be appreciated by those skilled in the art that fig. 5 is merely an example of the terminal device 2 and does not constitute a limitation on it; the terminal device 2 may include more or fewer components than those shown, combine some components, or use different components. For example, the terminal device may also include input/output devices, network access devices, buses, etc.
The processor 20 may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, etc. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
The memory 21 may be an internal storage unit of the terminal device 2, such as a hard disk or a memory of the terminal device 2. The memory 21 may also be an external storage device of the terminal device 2, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, which are provided on the terminal device 2. Further, the memory 21 may also include both an internal storage unit and an external storage device of the terminal device 2. The memory 21 is used for storing the computer program and other programs and data required by the terminal device. The memory 21 may also be used to temporarily store data that has been output or is to be output.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present application and are intended to be included within the scope of the present application.

Claims (10)

1. A training sample construction method, characterized by comprising the following steps:
acquiring article sharing data of different user accounts, and screening users according to the article sharing data to obtain sample users, wherein the article sharing data comprises a plurality of articles shared by the user accounts;
constructing a sharing sequence feature according to the article sharing data corresponding to the sample users, wherein the sharing sequence feature is used for representing the sharing order among different articles;
constructing a tag sequence feature corresponding to the sample user according to the sharing sequence feature, wherein the tag sequence feature is used for representing the sharing order of the article tags of different articles;
and performing sample construction according to the sharing sequence feature and the tag sequence feature to obtain the training sample.
2. The training sample construction method according to claim 1, wherein the construction of the sharing sequence feature according to the article sharing data corresponding to the sample user includes:
screening the articles in the article sharing data, and acquiring sharing time corresponding to the screened articles;
and sequencing the screened articles according to the sharing time to obtain the sharing sequence characteristics of the article sharing data corresponding to the sample user.
3. The training sample construction method according to claim 2, wherein the screening of the articles in the article sharing data includes:
performing repeated sharing detection on the articles in the article sharing data, wherein the repeated sharing detection is used for detecting whether the same articles exist in the article sharing data;
if the same articles exist in the article sharing data, the sharing time of the same articles is respectively obtained, the articles are deleted according to the sharing time of the same articles, and the number of the deleted articles is smaller than that of the same articles.
4. The training sample construction method according to claim 3, wherein the deleting of the article according to the sharing time of the same article includes:
respectively calculating the time difference between the sharing time and the current time of the same article;
and reserving the same article corresponding to the minimum time difference, and deleting the same articles corresponding to the rest time differences in the article sharing data.
5. The training sample construction method according to claim 1, wherein constructing the tag sequence feature corresponding to the sample user according to the sharing sequence feature comprises:
respectively obtaining article identifiers of the articles in the sharing sequence characteristics, and performing label query according to the article identifiers to obtain article labels, wherein the article labels are used for representing keywords corresponding to the articles;
and sequencing the corresponding article labels according to the sharing sequence of the articles in the sharing sequence characteristics to obtain the label sequence characteristics.
6. The training sample construction method according to claim 5, wherein the step of ranking the corresponding article tags according to the sharing order of the articles in the sharing sequence feature to obtain the tag sequence feature comprises:
querying the one-hot code of each first-level tag in the article tags, and sorting the one-hot codes corresponding to the first-level tags according to the sharing order of the articles in the sharing sequence feature to obtain a first-level tag sequence feature;
querying the one-hot code of each second-level tag in the article tags, and sorting the one-hot codes corresponding to the second-level tags according to the sharing order of the articles in the sharing sequence feature to obtain a second-level tag sequence feature;
and storing the first-level tag sequence feature and the second-level tag sequence feature to obtain the tag sequence feature.
7. The training sample construction method according to claim 1, wherein the step of screening users according to the article sharing data to obtain sample users comprises:
respectively matching the articles in the article sharing data with different preset articles;
if the article in the article sharing data is not matched with the preset article, deleting the user corresponding to the article sharing data;
if the articles in the article sharing data are matched with the preset articles, deleting the user corresponding to the article sharing data;
setting the remaining users as the sample users.
8. A training sample construction apparatus, comprising:
a user screening unit, configured to acquire article sharing data of different user accounts and to screen users according to the article sharing data to obtain sample users, wherein the article sharing data comprises a plurality of articles shared by the user accounts;
the sharing sequence construction unit is used for constructing sharing sequence characteristics according to the article sharing data corresponding to the sample user, and the sharing sequence characteristics are used for representing the sharing sequence among different articles;
the tag sequence construction unit is used for constructing a tag sequence feature corresponding to the sample user according to the sharing sequence feature, and the tag sequence feature is used for representing the sharing sequence of article tags among different articles;
and the sample construction unit is used for constructing a sample according to the sharing sequence characteristic and the label sequence characteristic to obtain the training sample.
9. A terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the method according to any of claims 1 to 7 when executing the computer program.
10. A storage medium storing a computer program, characterized in that the computer program realizes the steps of the method according to any one of claims 1 to 7 when executed by a processor.
CN202011217114.3A 2020-11-04 2020-11-04 Training sample construction method, device, terminal equipment and storage medium Active CN112328779B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011217114.3A CN112328779B (en) 2020-11-04 2020-11-04 Training sample construction method, device, terminal equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011217114.3A CN112328779B (en) 2020-11-04 2020-11-04 Training sample construction method, device, terminal equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112328779A true CN112328779A (en) 2021-02-05
CN112328779B CN112328779B (en) 2024-02-13

Family

ID=74323741

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011217114.3A Active CN112328779B (en) 2020-11-04 2020-11-04 Training sample construction method, device, terminal equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112328779B (en)

Also Published As

Publication number Publication date
CN112328779B (en) 2024-02-13

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant