CN112989824A - Information pushing method and device, electronic equipment and storage medium - Google Patents

Information pushing method and device, electronic equipment and storage medium


Publication number
CN112989824A
Authority
CN
China
Prior art keywords
information
content
user
keyword
pushed
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110515156.3A
Other languages
Chinese (zh)
Inventor
陈程
王贺
石奕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan Zhuoer Digital Media Technology Co ltd
Original Assignee
Wuhan Zhuoer Digital Media Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan Zhuoer Digital Media Technology Co ltd
Priority to CN202110515156.3A
Publication of CN112989824A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures

Abstract

The embodiment of the application discloses an information pushing method, comprising: acquiring metadata of user-generated content associated with a current application, and extracting a first keyword from the metadata; generating a user interest portrait of a target user according to the first keyword and the weight of the first keyword, wherein the user interest portrait comprises at least one user tag characterizing content of interest to the target user; generating an information content portrait of information to be pushed according to a second keyword of the information to be pushed and the weight of the second keyword, wherein the information content portrait comprises at least one content tag indicating the content of the information to be pushed; and selecting, according to the user interest portrait and the information content portrait, at least one piece of information from the information to be pushed and pushing it to the target user. The information to be pushed is thus selected according to the user interest portrait, so that the pushed information closely matches the content the user is interested in.

Description

Information pushing method and device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of information processing, and in particular, to an information pushing method and apparatus, an electronic device, and a storage medium.
Background
In the prior art, a user interest portrait is generally constructed by mapping the web page text browsed by a user onto ontology concept words that represent corresponding interest points, thereby determining the ontology concept words the user is interested in. However, web page text contains a large amount of interference information, such as advertisements and navigation bars, as well as traces of user misoperation. As a result, the interest point tags in the constructed user interest portrait carry considerable interference information, the portrait is inaccurate, and information to be pushed to the user, such as advertisements and texts, cannot be effectively matched with the content the user is interested in.
Disclosure of Invention
In view of this, embodiments of the present invention provide an information pushing method and apparatus, an electronic device, and a storage medium.
The technical scheme of the invention is realized as follows:
In a first aspect, an embodiment of the present invention provides an information pushing method, including:
acquiring metadata of user generated content associated with the current application, and extracting a first keyword from the metadata;
generating a user interest portrait of a target user according to the first keyword and the weight of the first keyword, wherein the user interest portrait comprises: at least one user tag characterizing content of interest to a target user;
generating an information content portrait of the information to be pushed according to a second keyword of the information to be pushed and the weight of the second keyword, wherein the information content portrait comprises at least one content tag indicating the information content of the information to be pushed;
and selecting, according to the user interest portrait and the information content portrait, at least one piece of information from the information to be pushed and pushing it to the target user.
Further, extracting the first keyword from the metadata includes:
performing word segmentation processing on the metadata to obtain a word sequence; wherein the word sequence comprises a plurality of words;
removing stop words in the word sequence;
and extracting a first keyword of which the information entropy and/or the occurrence frequency meet preset conditions from the word sequence without the stop words.
Further, extracting a first keyword whose information entropy and/or occurrence frequency meet preset conditions includes:
for each of a plurality of preset categories, extracting the first keywords whose information entropy and/or occurrence frequency meet the preset conditions in that category.
Further, the method further comprises:
determining the information entropy of each word according to the number of other words matched with each word in the information to be pushed;
and selecting a second keyword from all words contained in the information to be pushed according to the size of the information entropy.
Further, the user interest portrait includes a plurality of user tags ordered in sequence to form a first vector;
the information content portrait includes a plurality of content tags ordered in sequence to form a second vector;
selecting at least one piece of information from the information to be pushed and pushing it to the target user according to the user interest portrait and the information content portrait includes:
determining the similarity between the user interest portrait and the information content portrait according to the vector distance between the first vector and the second vector;
and selecting, from the information to be pushed, at least one information content portrait with the highest similarity, and pushing the information corresponding to that information content portrait to the target user.
Further, selecting at least one information content portrait with the highest similarity from the information to be pushed and pushing the corresponding information to the target user includes:
selecting, from the information to be pushed, the information corresponding to a preset number of information content portraits with the highest similarity;
classifying the preset number of pieces of information according to their content tags;
and, according to the user tags, selecting from the classification of the corresponding content tag the information corresponding to at least one information content portrait with the highest similarity, and pushing that information to the target user.
Further, the user tag includes: a first keyword and a weight of the first keyword; wherein, the weights of different first keywords are different.
In a second aspect, an embodiment of the present invention provides an information pushing apparatus, including:
an acquisition unit configured to acquire metadata of user-generated content associated with a current application, and to extract a first keyword from the metadata;
a generating unit configured to generate a user interest portrait of a target user according to the first keyword and the weight of the first keyword, wherein the user interest portrait comprises at least one user tag characterizing content of interest to the target user; and to generate an information content portrait of the information to be pushed according to a second keyword of the information to be pushed and the weight of the second keyword, wherein the information content portrait comprises at least one content tag indicating the content of the information to be pushed;
and a pushing unit configured to select, according to the user interest portrait and the information content portrait, at least one piece of information from the information to be pushed and to push it to the target user.
In a third aspect, an embodiment of the present invention provides an electronic device, where the electronic device includes: a processor and a memory for storing a computer program capable of running on the processor;
the processor, when running said computer program, performs the steps of one or more of the preceding claims.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium storing computer-executable instructions; the computer-executable instructions, when executed by a processor, are capable of implementing the methods described in one or more of the preceding claims.
The information pushing method provided by the invention comprises the following steps: acquiring metadata of user generated content associated with a current application, and extracting a first keyword from the metadata; generating a user interest portrait of a target user according to the first keyword and the weight of the first keyword, wherein the user interest portrait comprises: at least one user tag characterizing content of interest to a target user; generating an information content portrait of the information to be pushed according to a second keyword of the information to be pushed and the weight of the second keyword, wherein the information content portrait comprises at least one content tag indicating the information content of the information to be pushed; and selecting at least one piece of information from the information to be pushed to a target user according to the user interest portrait and the information content portrait. Therefore, the keywords are extracted through the content generated by the user in the application, the interference of other operations on the judgment of the user interest content is reduced, and the extracted keywords are more in line with the user interest content. Based on the method, the user interest and the information content to be pushed are respectively portrayed, and the information close to the user interest content can be selected more easily according to the characteristics of the portrayal.
Drawings
Fig. 1 is a schematic flowchart of an information pushing method according to an embodiment of the present invention;
Fig. 2 is a schematic flowchart of an information pushing method according to an embodiment of the present invention;
Fig. 3 is a schematic flowchart of an information pushing method according to an embodiment of the present invention;
Fig. 4 is a schematic flowchart of an information pushing method according to an embodiment of the present invention;
Fig. 5 is a schematic flowchart of an information pushing method according to an embodiment of the present invention;
Fig. 6 is a schematic flowchart of an information pushing method according to an embodiment of the present invention;
Fig. 7 is a schematic structural diagram of an information pushing apparatus according to an embodiment of the present invention;
Fig. 8 is a schematic flowchart of an information pushing method according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings. The described embodiments should not be construed as limiting the present invention, and all other embodiments obtained by a person of ordinary skill in the art without creative effort fall within the protection scope of the present invention.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with each other without conflict.
In the following description, the terms "first/second/third" merely distinguish similar objects and do not denote a particular order. It is to be understood that, where permitted, "first/second/third" may be interchanged in a specific order or sequence, so that the embodiments of the invention described herein can be practiced in an order other than that illustrated or described herein.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used herein is for the purpose of describing embodiments of the invention only and is not intended to be limiting of the invention.
As shown in fig. 1, an embodiment of the present invention provides an information pushing method, including:
S110: acquiring metadata of user-generated content associated with a current application, and extracting a first keyword from the metadata;
S120: generating a user interest portrait of the target user according to the first keyword and the weight of the first keyword, wherein the user interest portrait comprises at least one user tag characterizing content of interest to the target user;
S130: generating an information content portrait of the information to be pushed according to a second keyword of the information to be pushed and the weight of the second keyword, wherein the information content portrait comprises at least one content tag indicating the content of the information to be pushed;
S140: selecting, according to the user interest portrait and the information content portrait, at least one piece of information from the information to be pushed and pushing it to the target user.
Here, the application may be any of various social applications, reading applications, media applications and the like, for example a microblog or a blog. The information to be pushed may be content that needs to be recommended and delivered to the target user in the application, for example advertisement text information, information about other users the target user may be interested in, audio/video, articles, pictures, or other text information that may be of interest. The advertisement text information may include commercial advertisements, charitable advertisements, and the like. Taking a microblog as the application, the information to be pushed may be advertisement microblogs, information about bloggers the user may be interested in, microblog text, and the like.
In the embodiment of the present invention, user-generated content (UGC) generally refers to original content that a user displays or provides to other users through an internet platform. Here, the user-generated content may include various text contents originated by the user. Taking a microblog as the application, for example, the user-generated content may include original microblog posts published by the user, comments published by the user, and text entered in the search bar for searching and browsing. The metadata of the user-generated content associated with the current application may be log data that records the various original text contents produced by the user in the current application.
In one embodiment, the metadata of the user-generated content associated with the current application may be obtained by crawling the web page information of the current application and the metadata of the user-generated content with a web crawler. For example, the log data of the application can be crawled based on the Scrapy framework.
In one embodiment, taking a microblog as the application, crawling the log data based on the Scrapy framework includes: determining the person node of the target user, for example locating, among the recorded microblog log data of a plurality of users, the log data corresponding to the identity information of the target user; and crawling the user-generated content in the log data of the target user, for example the original microblog posts published by the target user, the published comments, and the searched text.
In another embodiment, after the metadata of the user-generated content associated with the current application is obtained, the metadata is preprocessed, which includes extracting text content. The text content may be extracted from the metadata based on preset tags, yielding the text content corresponding to each preset tag. For example, based on the fixed format that the text to be extracted has in the log data, regular-expression matching is applied to the metadata to obtain the text of each microblog post published by the target user, the text of each published comment, and so on. Matching the text content with regular expressions thus effectively filters out noise data in the log data, for example "@XXX" content that mentions other users in a microblog post, Uniform Resource Locator (URL) content that links to a website, and "[XX]" content that represents emoticons in the microblog text. Extracting the text content through regular-expression matching yields the text content that is most useful for subsequent processing.
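As a non-limiting illustration, the noise filtering described above could be sketched in Python as follows. The three patterns and the function name `clean_post` are assumptions made for this sketch; the actual log format and matching rules are not specified in this description.

```python
import re

# Hypothetical patterns for the three noise types named in the text.
# Real microblog log formats may require different expressions.
URL_RE = re.compile(r"https?://\S+")            # URL links
MENTION_RE = re.compile(r"@[\w\-]+")            # "@XXX" mentions of other users
EMOTICON_RE = re.compile(r"\[[^\[\]]{1,8}\]")   # "[XX]" emoticon codes


def clean_post(text: str) -> str:
    """Strip URLs, mentions and emoticon codes, then collapse whitespace."""
    # URLs are removed first so that an "@" inside a URL is not treated as a mention.
    for pattern in (URL_RE, MENTION_RE, EMOTICON_RE):
        text = pattern.sub(" ", text)
    return " ".join(text.split())


cleaned = clean_post("Great game tonight @alice [smile] http://t.cn/abc123 see you")
```

Only the remaining plain text would then be passed on to word segmentation and keyword extraction.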
In one embodiment, after the log data is crawled and the text content is extracted, first keywords are extracted from the text content. For example, the keywords in the text content may be extracted for a plurality of categories by the chi-square test (CHI), or the keywords may be extracted based on the Institute of Computing Technology, Chinese Lexical Analysis System (ICTCLAS). After at least one first keyword is extracted, the weight of each first keyword can be determined based on the frequency of the first keyword, its degree of correlation with the text content, its information entropy, or the like. For example, the weight may be proportional to the frequency with which the first keyword appears within the text content; the weight of each first keyword may also be determined by term frequency-inverse document frequency (TFIDF). The higher the weight, the stronger the relevance between the corresponding first keyword and the text content, and the better the keyword represents the interest of the user.
It can be understood that the second keywords of the information to be pushed and their weights may be extracted by the above keyword extraction method or by other methods.
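For illustration only, the TFIDF weighting mentioned above could be computed as in the following Python sketch. It assumes the text has already been segmented into word lists, uses the common tf × log(N/df) form, and the function name `tfidf_weights` is hypothetical; the description does not fix a particular variant.

```python
import math
from collections import Counter


def tfidf_weights(docs: list[list[str]]) -> list[dict[str, float]]:
    """Return one {word: weight} mapping per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents in which each word occurs.
    df = Counter(word for doc in docs for word in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            # Term frequency times inverse document frequency.
            word: (count / len(doc)) * math.log(n_docs / df[word])
            for word, count in tf.items()
        })
    return weights


posts = [["basketball", "game", "basketball"], ["game", "movie"]]
weights = tfidf_weights(posts)
```

In this sketch a word that appears in every document receives weight zero, which matches the intuition that such a word says little about any one user's interests.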
In another embodiment, a user interest portrait is generated for a target user U to characterize the interest preferences of the target user. The user interest portrait includes at least one user tag, and each user tag may include a first keyword and the weight of that first keyword. An information content portrait is generated for the i-th piece of information to be pushed to characterize the content of that information. The information content portrait includes at least one content tag, and each content tag may include a second keyword.
Based on this, a specific user interest portrait and a specific information content portrait are generated for the user and for the information to be pushed respectively, and the degree of coincidence or similarity between the two can then be determined, so that the information most consistent with the interest of the target user can be selected more accurately from the information to be pushed. Because the keywords are extracted from user-generated content, which is highly original, the subjective interest preferences of the user are reflected more accurately, and the influence of unrelated operations or misoperations on the generated user interest portrait is effectively suppressed. Extracting the text content greatly reduces the influence of interference data and improves the agreement between the user interest portrait and the content the user is actually interested in. On this basis, the pushed information can be as close as possible to the content the user is interested in, improving the user experience.
In some embodiments, as shown in Fig. 2, S110 includes:
S111: acquiring metadata of user-generated content associated with the current application, and performing word segmentation on the metadata to obtain a word sequence, wherein the word sequence comprises a plurality of words;
S112: removing stop words from the word sequence;
S113: extracting, from the word sequence with the stop words removed, the first keywords whose information entropy and/or occurrence frequency meet preset conditions.
In the embodiment of the present invention, the word segmentation of the metadata may be performed by ICTCLAS, or by other tools or algorithms, such as the Stanford word segmenter or other segmentation tools.
In one embodiment, taking a microblog as the current application, after the microblog log data of the content generated by the target user is acquired, text content is extracted from the log data, and word segmentation is then performed on the extracted text content, so that the text is divided into individual words that form a word sequence.
In one embodiment, stop words are removed from the word sequence obtained by word segmentation. Stop words are words that carry no specific meaning in the text being processed, for example function words such as "the", "at", and "a". Stop words may be removed by looking up the words of the word sequence in a preset stop-word list and filtering out those that match.
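A minimal Python sketch of the list-based stop-word filtering described above is given below. The stop-word list and the name `remove_stop_words` are placeholders chosen for illustration; a practical system would load a fuller preset list suited to the language of the text.

```python
# Hypothetical preset stop-word list; a real list would be much longer.
STOP_WORDS = frozenset({"the", "at", "a", "of", "in"})


def remove_stop_words(words: list[str], stop_words: frozenset = STOP_WORDS) -> list[str]:
    """Keep only words not found in the stop-word list, preserving order."""
    return [word for word in words if word.lower() not in stop_words]


filtered = remove_stop_words(["The", "game", "at", "the", "stadium"])
```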
In another embodiment, for the word sequence after word segmentation and stop-word removal, the first keyword may be determined based on the frequency of occurrence of each word in the text and/or its information entropy. Here, the information entropy reflects the number of words that can be collocated to the left and right of a word: the larger the information entropy, the richer the set of words that collocate with the word, and the more likely the word is a keyword.
Accordingly, the weight of the first keyword may also be determined according to the frequency of the first keyword and/or the information entropy, for example, the higher the frequency of the first keyword appearing in the text content, the higher the corresponding weight. In addition, the weight of each first keyword may also be determined by TFIDF.
In another embodiment, for the word sequence after word segmentation and stop-word removal, the first keywords in the text content may be extracted for a plurality of categories by the chi-square test (CHI), or the first keywords may be extracted in other ways, such as by ICTCLAS.
In this way, word segmentation and stop-word filtering optimize the metadata text content: the words of the text are divided clearly and accurately, inaccurate keyword extraction caused by confusion between adjacent words is suppressed, and the interference of meaningless function words with keyword extraction is reduced, so that the first keywords can be extracted from the metadata more conveniently.
In some embodiments, as shown in Fig. 3, S113 includes:
S1131: for a plurality of preset categories, extracting from the word sequence with the stop words removed, for each preset category, the first keywords whose information entropy and/or occurrence frequency meet the preset conditions.
In the embodiment of the present invention, keywords may be extracted from the word sequence that has undergone word segmentation and stop-word filtering by the chi-square test (CHI), so that for each of a plurality of preset categories a first keyword capable of representing that category is extracted from the word sequence. For example, for the category "sports", if the word in the word sequence that is most highly correlated with the category, or whose information entropy is largest, or whose occurrence frequency satisfies the preset condition, is determined based on CHI to be "basketball", then the first keyword of the category "sports" in the word sequence is "basketball".
Based on this, in the user interest portrait, the first user tag characterizes the first keyword and weight of the first preset category, the second user tag characterizes the first keyword and weight of the second preset category, and so on, up to the n-th user tag, which characterizes the first keyword and weight of the n-th preset category.
Therefore, based on the classification into preset categories, the first keywords corresponding to different categories can be determined separately and at a finer granularity, insufficient keyword extraction caused by performing only a general keyword extraction on the whole word sequence is suppressed, and the user interest portrait generated from the first keywords is more comprehensive.
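The per-category extraction by chi-square test described above could, purely as an illustrative sketch, look like the following Python code. It assumes a small labelled corpus of (category, words) pairs is available, uses the standard 2×2 chi-square statistic, and keeps only words positively associated with a category; the corpus, the positive-association filter and the function names are assumptions of this sketch rather than details given in the description.

```python
def chi_square(a: int, b: int, c: int, d: int) -> float:
    """Chi-square statistic of a 2x2 word/category contingency table.

    a: documents in the category containing the word
    b: documents outside the category containing the word
    c: documents in the category without the word
    d: documents outside the category without the word
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


def top_keyword_per_category(docs):
    """docs: list of (category, list_of_words). Returns {category: best word}."""
    categories = {cat for cat, _ in docs}
    vocab = {word for _, words in docs for word in words}
    best = {}
    for cat in categories:
        scores = {}
        for word in vocab:
            a = b = c = d = 0
            for doc_cat, words in docs:
                present = word in words
                if doc_cat == cat:
                    a += present
                    c += not present
                else:
                    b += present
                    d += not present
            # Assumption: keep only words that occur more often inside the category.
            if a * d > b * c:
                scores[word] = chi_square(a, b, c, d)
        if scores:
            best[cat] = max(scores, key=scores.get)
    return best


corpus = [
    ("sports", ["basketball", "game", "team"]),
    ("sports", ["basketball", "score"]),
    ("film", ["movie", "game"]),
    ("film", ["movie", "actor"]),
]
keywords = top_keyword_per_category(corpus)
```

With this toy corpus, "basketball" occurs in every "sports" document and in no other, so it is the word most strongly associated with that category.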
In some embodiments, as shown in fig. 4, the method further comprises:
S101: determining the information entropy of each word according to the number of other words collocated with that word in the information to be pushed;
S102: selecting a second keyword from all words contained in the information to be pushed according to the size of the information entropy.
In the embodiment of the invention, the second keywords of the information to be pushed are selected according to the information entropy of each word in the information to be pushed; for example, the second keywords can be extracted from the information to be pushed based on information entropy through ICTCLAS. Based on this, the second keywords in the information content portrait may be the n second keywords of the information to be pushed, arranged in descending order of information entropy.
In one embodiment, determining the information entropy of each word may include determining a left information entropy and a right information entropy of the word, the sum of which is the information entropy. The left information entropy can be determined according to the number of other words that are collocated with the word and located on its left side in the information to be pushed, and the right information entropy can be determined according to the number of other words that are collocated with the word and located on its right side. The second keyword may be selected based on a preset policy that combines the left information entropy and the right information entropy; for example, a word may be determined to be a second keyword when its part of speech together with its left or right information entropy satisfies a certain condition.
Therefore, the information entropy is determined according to the collocation abundance of the words, and then the second keyword is selected from the information to be pushed based on the size of the information entropy, so that the second keyword can better reflect the content of the information to be pushed, and the situation that the information content cannot be accurately represented by the second keyword due to the fact that the second keyword is selected only according to the occurrence frequency is suppressed.
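As a hedged illustration of the left and right information entropy described above, the following Python sketch computes, for each word, the Shannon entropy of the words immediately to its left and to its right and sums the two. The Shannon form over adjacent neighbours is an assumption of this sketch; the description only states that the entropy is determined from the number of collocated words. The function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def _entropy(counts: Counter) -> float:
    """Shannon entropy (base 2) of a neighbour-count distribution."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


def neighbor_entropy(words: list[str]) -> dict[str, float]:
    """Left entropy plus right entropy for every word in the sequence."""
    left = defaultdict(Counter)
    right = defaultdict(Counter)
    for prev, cur in zip(words, words[1:]):
        right[prev][cur] += 1   # cur appears to the right of prev
        left[cur][prev] += 1    # prev appears to the left of cur
    return {w: _entropy(left[w]) + _entropy(right[w]) for w in set(words)}


def select_keywords(words: list[str], n: int) -> list[str]:
    """Return the n words with the largest summed entropy, in descending order."""
    entropy = neighbor_entropy(words)
    return sorted(entropy, key=entropy.get, reverse=True)[:n]


sequence = ["new", "phone", "big", "phone", "new", "phone", "sale"]
entropies = neighbor_entropy(sequence)
```

In this toy sequence, "phone" is adjacent to several different words on both sides, so it has the largest entropy and would be chosen first.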
In some embodiments, the user interest portrait includes a plurality of user tags ordered in sequence to form a first vector;
the information content portrait includes a plurality of content tags ordered in sequence to form a second vector;
as shown in Fig. 5, S140 includes:
S141: determining the similarity between the user interest portrait and the information content portrait according to the vector distance between the first vector and the second vector;
S142: selecting, from the information to be pushed, at least one information content portrait with the highest similarity, and pushing the information corresponding to that information content portrait to the target user.
In an embodiment of the present invention, the user interest portrait is generated in the form of a first vector formed by a plurality of user tags, and each user tag may include a first keyword and the weight of the first keyword, so that the first vector can be expressed as an ordered sequence of keyword-weight pairs. The information content portrait is generated in the form of a second vector formed by a plurality of content tags, and each content tag may include a second keyword alone or a second keyword together with its weight, so that the second vector can be expressed as an ordered sequence of second keywords or of keyword-weight pairs.
In one embodiment, the plurality of user tags in the first vector may be ordered according to the weight of the first keyword; for example, from $u_1$ to $u_n$ the weights of the first keywords are arranged from high to low, so that the user tags nearer the front of the first vector better represent the interests of the user. Similarly, the content tags in the second vector may be sorted according to the weight or the information entropy of the corresponding second keyword.
In another embodiment, the similarity between the user interest representation and the information content representation is calculated based on the first vector and the second vector, and then a certain amount of information corresponding to the information content representation with the highest similarity to the user interest representation is selected as information pushed to the target user according to the size of the similarity.
The similarity between the user interest portrait and the information content portrait may be calculated from a vector distance between the first vector and the second vector, for example by determining the cosine similarity between the two portraits based on $\mathrm{sim}(\vec{U}_i, \vec{D}_i) = \dfrac{\vec{U}_i \cdot \vec{D}_i}{\lVert \vec{U}_i \rVert \, \lVert \vec{D}_i \rVert}$. Here, $\vec{U}_i$ is the first vector corresponding to the $i$-th target user, and $\vec{D}_i$ is the second vector corresponding to the $i$-th piece of information to be pushed. The higher the cosine similarity, the closer the information content portrait of that piece of information is to the user interest portrait, and the more likely the target user is to be interested in the information. The amount of information to push determines how many items are selected: for example, if 3 pieces of information are to be pushed to a target user, the information corresponding to the 3 information content portraits with the highest cosine similarity to the user interest portrait is selected from all the information to be pushed and pushed to the target user.
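The similarity-based selection can be illustrated with a short Python sketch. It rests on assumptions not stated in the embodiment: both portraits are stored as sparse `{keyword: weight}` dictionaries over a shared vocabulary, a content tag without a weight is given weight 1.0, and the names `cosine_similarity` and `top_k` are hypothetical.

```python
import math


def cosine_similarity(user_vec, content_vec):
    """Cosine similarity of two sparse vectors given as {keyword: weight}."""
    dot = sum(w * content_vec.get(k, 0.0) for k, w in user_vec.items())
    norm_u = math.sqrt(sum(w * w for w in user_vec.values()))
    norm_c = math.sqrt(sum(w * w for w in content_vec.values()))
    if norm_u == 0.0 or norm_c == 0.0:
        return 0.0  # an empty portrait is treated as matching nothing
    return dot / (norm_u * norm_c)


def top_k(user_vec, candidates, k=3):
    """Return the ids of the k candidates most similar to the user portrait.

    `candidates` maps an item id to its content portrait vector.
    Ties are broken by item id so that the result is deterministic.
    """
    scored = [(cosine_similarity(user_vec, vec), item_id)
              for item_id, vec in candidates.items()]
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [item_id for _, item_id in scored[:k]]
```

With `k=3` this corresponds to the example of pushing 3 pieces of information; keywords that occur in only one of the two portraits contribute nothing to the dot product but still enlarge the norms, which lowers the similarity.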
In some embodiments, as shown in FIG. 6, S142 includes:
S1421: selecting, from the information to be pushed, the information corresponding to a preset number of information content portraits with the highest similarity;
S1422: classifying the preset number of pieces of information according to the content tags;
S1423: selecting, according to the user tag, the information corresponding to at least one information content portrait with the highest similarity from the corresponding content tag classification, and pushing the information to the target user.
In the embodiment of the present invention, a portion of the information with high similarity to the user interest portrait is first screened out and classified according to the content tags in the information content portrait of each piece of information; for example, it may be classified according to the content tag with the highest weight in each information content portrait, or according to the content tag corresponding to the second keyword with the highest information entropy in each information content portrait. Then the classification is chosen according to the user tag in the user interest portrait, and the information with the highest similarity under that classification is pushed; for example, the classification may be chosen according to the first keyword with the highest weight in the user interest portrait.
For example, the preset number may be 100: the 100 pieces of information with the greatest similarity are selected from all the information to be pushed and are classified according to the second keyword with the maximum information entropy in each piece of information. For instance, 20 pieces of information whose maximum-entropy second keyword is "basketball" are placed in one class, 20 pieces whose maximum-entropy second keyword is "football" are placed in another class, and so on. If the first keyword with the highest weight in the user interest portrait is "basketball", one or more pieces of information with the highest similarity are selected from the 20 pieces in the corresponding basketball class for pushing.
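The classify-then-select step can be sketched as follows. The record layout (dictionaries with the keys `id`, `top_tag` and `similarity`) and the function names `classify_by_top_tag` and `pick_for_user` are assumptions made only for illustration; the embodiment does not prescribe a data structure.

```python
from collections import defaultdict


def classify_by_top_tag(shortlist):
    """Group shortlisted items by their top content tag.

    `shortlist` is a list of dicts with the hypothetical keys 'id',
    'top_tag' (e.g. the second keyword with the maximum information
    entropy) and 'similarity' (similarity to the user interest portrait).
    Inside each class the items are sorted by similarity, highest first.
    """
    classes = defaultdict(list)
    for item in shortlist:
        classes[item["top_tag"]].append(item)
    for items in classes.values():
        items.sort(key=lambda it: -it["similarity"])
    return dict(classes)


def pick_for_user(classes, user_top_keyword, count=1):
    """Return up to `count` item ids from the class matching the user's top keyword."""
    return [it["id"] for it in classes.get(user_top_keyword, [])[:count]]
```

If the user's highest-weight first keyword has no matching class, this sketch returns an empty list; a fallback, such as pushing the globally most similar items, would be a separate design decision.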
In one embodiment, the top 100 pieces of information with the highest similarity are selected from all the pieces of information to be pushed, and the information can be stored in a database to be recommended and classified. And if the information needs to be pushed to the target user, selecting the information according to the corresponding classification of the first keyword with the highest weight in the database to be recommended.
In another embodiment, the user interest representation is updated at regular intervals, for example, the user interest representation may be regenerated every 12 hours, so as to realize timely updating of the content currently interested by the user. Correspondingly, the information in the database to be recommended may also be updated once every 12 hours, or may also be updated at other time intervals, so as to keep the information in the database to be recommended matching the current content of interest of the user.
Therefore, by classifying the preset number of pieces of information to be pushed according to the content tags, the user interest portrait and the information content portrait can be matched on more finely divided keywords, which greatly improves the match with the first keywords the user is interested in and, in turn, the degree to which the pushed information matches the content of interest to the user.
In some embodiments, the user tag comprises: the first keyword and the weight of the first keyword; wherein the weights of different first keywords are different.
In the embodiment of the invention, when each user tag is composed of a keyword and the weight thereof, the user tags are more favorably sequenced, so that the first vector which represents the interesting content of the user more clearly can be obtained, and the information to be pushed is more accurately selected based on the similarity.
As shown in fig. 7, an embodiment of the present invention provides an information pushing apparatus, including:
an obtaining unit 110, configured to obtain metadata of user-generated content associated with a current application, and extract a first keyword from the metadata;
a generating unit 120, configured to generate a user interest representation of the target user according to the first keyword and the weight of the first keyword, where the user interest representation includes: at least one user tag characterizing content of interest to the target user; generating an information content portrait of the information to be pushed according to a second keyword of the information to be pushed and the weight of the second keyword, wherein the information content portrait comprises at least one content tag indicating the information content of the information to be pushed;
and a pushing unit 130, configured to select at least one piece of information from the information to be pushed, according to the user interest portrait and the information content portrait, and push it to the target user.
One specific example is provided below in connection with any of the embodiments described above:
This embodiment provides an advertisement recommendation method based on the interest portraits of microblog users. As shown in FIG. 8, the method includes:
S1: a crawler based on the Scrapy framework is used to acquire the data needed to analyze the interest portraits of microblog users. Scrapy is an application framework written for crawling website data and extracting structured data. The web page parsing code is developed in the Python programming language, and the data are stored, via the item pipeline, in MongoDB, an open source database system based on distributed file storage. During crawling, multiple crawler threads are created and run in a multithreaded manner; a dispatcher takes URLs from a priority queue and distributes them to the different threads for crawling. The web crawler mainly selects microblog users as the specific initial objects to crawl. First, a person node to be crawled is located, and then the background information, social information and microblog information related to that person node are obtained. The background information includes the person's identifier (ID), nickname, tags and the like; the social information is the interaction relationship between the person and other users; and the microblog information includes the microblog content published by the user, the comments published by the user, the user's @-mention information, and the like.
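The dispatching of URLs from a priority queue to several crawler threads can be sketched with the Python standard library alone. This is a simplified stand-in, not the crawler of the embodiment: it does not use any crawling framework or MongoDB, the worker threads pull from the queue themselves instead of being fed by a separate dispatcher, and `crawl` and the caller-supplied `fetch` function are hypothetical names.

```python
import queue
import threading


def crawl(seed_urls, fetch, num_workers=4):
    """Fetch prioritised URLs with worker threads and return {url: result}.

    `seed_urls` is an iterable of (priority, url) pairs; lower numbers are
    served first. `fetch` is a caller-supplied function that downloads and
    parses one URL (hypothetical; no network access happens in this sketch).
    """
    tasks = queue.PriorityQueue()
    for priority, url in seed_urls:
        tasks.put((priority, url))

    results = {}
    lock = threading.Lock()  # protects `results` across threads

    def worker():
        while True:
            try:
                _, url = tasks.get_nowait()
            except queue.Empty:
                return  # nothing left to crawl
            data = fetch(url)
            with lock:
                results[url] = data

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

In a real crawler `fetch` would also push newly discovered URLs back onto the queue, and the workers would then need a different stopping condition than an empty queue.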
S2: the crawled data are preprocessed. Since web page content is mainly written in HyperText Markup Language (HTML), processing web page information mainly means parsing HTML. Because HTML is composed of tags, the relevant text content can be extracted by targeting particular tags and their content. For microblog text, the relevant information, such as the user ID and the microblog content, needs to be extracted from the captured metadata. This extraction uses regular-expression matching on the web page information. Regular expressions are mainly used for text searching and editing, and extract substrings from strings by pattern matching. The following are removed by regular expression: 1. the "@XXX" type (used when forwarding a microblog or mentioning other users; noise data); 2. the URL type (a URL carries no useful information and is merely a link to another website; noise data); 3. emoticons (emoticons in Sina microblog usually take the form "[XX]"; noise data); and the like.
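The three kinds of noise listed above can be removed with regular expressions roughly as follows. The patterns are illustrative assumptions, not the expressions used in the embodiment: in particular, the mention pattern stops only at a non-word character, so a user name immediately followed by text without a separator would be removed together with that text.

```python
import re

# Assumed patterns; real microblog markup may need stricter rules.
URL = re.compile(r"https?://\S+")
MENTION = re.compile(r"@[\w\-]+")
EMOTICON = re.compile(r"\[[^\[\]]{1,8}\]")


def clean_microblog(text):
    """Strip URLs, @mentions and [emoticon] codes, then collapse whitespace."""
    text = URL.sub(" ", text)       # remove URLs first so their characters are not half-matched
    text = MENTION.sub(" ", text)
    text = EMOTICON.sub(" ", text)
    return " ".join(text.split())
```

For instance, `clean_microblog("Great game tonight @friend_01 [smile] http://t.cn/abc123 see you")` returns `"Great game tonight see you"`.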
S3: word segmentation is performed using the ICTCLAS open source tool (ICTCLAS is a program package for processing Chinese text that can perform text processing tasks such as word segmentation, keyword extraction and new word discovery).
Filtering stop words: during microblog word segmentation, stop words in the microblog text also need to be filtered. A stop word list is established, and each word in the segmented text is compared with the list: if the word is in the stop word list, it is removed from the text; otherwise it is kept. Stop words in the microblog text are thus filtered by word matching.
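Stop-word filtering by word matching amounts to a membership test against the stop word list. A minimal sketch, assuming the text has already been segmented into a list of words and using the hypothetical function name `remove_stop_words`:

```python
def remove_stop_words(tokens, stop_words):
    """Keep only the tokens that do not appear in the stop word list."""
    stop = set(stop_words)  # set lookup keeps each comparison O(1)
    return [t for t in tokens if t not in stop]
```

The order of the remaining words is preserved, which matters for later steps that look at the left and right neighbours of a word.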
S4: microblog text representation (user interest portrait):
After Chinese word segmentation has been performed on the text document, the chi-square statistic (CHI) is used for feature extraction in each category, and the feature words that can represent the category are selected. After feature selection, TF-IDF is used to compute the weights of the feature words. Using the vector space model (VSM), a user $U$ is expressed as:
$U = \{(t_1, w_1), (t_2, w_2), \ldots, (t_n, w_n)\}$
where $t_i$ denotes a feature word and $w_i$ denotes the weight of that feature word.
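Feature selection and weighting of this kind might be sketched as below. The chi-square score uses the usual 2x2 contingency form for a word and a category, and the TF-IDF uses a smoothed inverse document frequency; both are common textbook variants chosen here as assumptions, since the embodiment does not give its exact formulas. The names `chi_square` and `tfidf_portrait` are hypothetical.

```python
import math
from collections import Counter


def chi_square(a, b, c, d):
    """Chi-square score of a word for a category from a 2x2 document count table.

    a: documents in the category that contain the word
    b: documents outside the category that contain the word
    c: documents in the category that lack the word
    d: documents outside the category that lack the word
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom


def tfidf_portrait(user_tokens, corpus, top_n=5):
    """Return [(word, weight)] for one user, highest weight first.

    `user_tokens`: segmented, stop-word-filtered words of the user.
    `corpus`: list of token lists (one per user) used for document frequency.
    """
    tf = Counter(user_tokens)
    total = len(user_tokens)
    n_docs = len(corpus)
    weights = {}
    for word, count in tf.items():
        df = sum(1 for doc in corpus if word in doc)
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0  # smoothed IDF (assumed variant)
        weights[word] = (count / total) * idf
    return sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]
```

The returned list of `(word, weight)` pairs has the same shape as the user representation above, with the pairs already ordered from the highest weight to the lowest.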
S5 (for advertising microblogs): ICTCLAS is used to extract the microblog keywords and to build the vector space representation. ICTCLAS extracts keywords from text based on the principle of information entropy, mainly by considering the left and right information entropy of each word. A word can be regarded as a keyword because it collocates richly on both its left and its right; that is, if both the left and the right information entropy of a word are large, the word is likely to be a keyword.
After keywords are extracted from a microblog, a group of keywords is obtained to represent the microblog. Using the vector space model, the microblog $D$ is expressed as:
$D = \{k_1, k_2, \ldots, k_n\}$
where $k_1, k_2, \ldots, k_n$ are the $n$ keywords extracted from $D$.
S6: similarity calculation between the user interest portrait and the advertising microblog:
The advertising microblog is also given a microblog text representation, so both the interest portrait in the user portrait and the advertising microblog are text data. As prior knowledge, the more similar an advertising microblog is to the user interest portrait, the more interested the user is in that advertising microblog.
The user interest portrait is already represented in the form of a vector space model, that is, as a vector of weighted keywords, denoted $\vec{U}$. An advertising microblog text vector, denoted $\vec{D}$, is obtained by applying the vector space model to the advertising microblog text. The cosine similarity is then calculated as:
$\mathrm{sim}(\vec{U}, \vec{D}) = \dfrac{\vec{U} \cdot \vec{D}}{\lVert \vec{U} \rVert \, \lVert \vec{D} \rVert}$
Here $\mathrm{sim}(\vec{U}, \vec{D})$ represents the similarity between the user interest portrait and the advertising microblog; the higher its value, the more similar the advertising microblog is to the user interest portrait and the more interested the user is in the advertising microblog. A list of advertising microblogs to be recommended is obtained according to the value of $\mathrm{sim}(\vec{U}, \vec{D})$: the 100 advertising microblogs with the greatest similarity are selected as the final microblog recommendation result and stored in a to-be-recommended microblog advertisement database. The list of advertising microblogs to be recommended is then summarized and statistically analyzed, classified, and marked with keyword identifiers, and within each class the advertising microblogs are sorted by similarity from high to low.
S7 recommendation module: when a recommendation request is made, searching a corresponding keyword identifier in a microblog advertisement database to be recommended according to the user interest portrait keyword identifier, selecting microblogs from high to low in sequence according to the number of advertisement microblogs to be recommended, and then recommending advertisement delivery.
S8 information update module: and updating in real time according to the data of the user interest portrait, and correspondingly updating a microblog advertisement database to be recommended, thereby realizing the advertisement recommendation method based on the microblog user interest portrait.
An embodiment of the present invention further provides an electronic device, where the electronic device includes: a processor and a memory for storing a computer program capable of running on the processor, the computer program when executed by the processor performing the steps of one or more of the methods described above.
An embodiment of the present invention further provides a computer-readable storage medium, where computer-executable instructions are stored in the computer-readable storage medium, and after being executed by a processor, the computer-executable instructions can implement the method according to one or more of the foregoing technical solutions.
The computer storage media provided by the present embodiments may be non-transitory storage media.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The device embodiments described above are merely illustrative. For example, the division into units is only a logical functional division, and other divisions are possible in an actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not implemented. In addition, the coupling, direct coupling or communication connection between the components shown or discussed may be an indirect coupling or communication connection through some interfaces, devices or units, and may be electrical, mechanical or in other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed on a plurality of network units; some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, all the functional units in the embodiments of the present invention may be integrated into one processing module, or each unit may serve separately as one unit, or two or more units may be integrated into one unit; the integrated unit may be implemented in the form of hardware, or in the form of hardware plus software functional units.
In some cases, any two of the above technical features may be combined into a new method solution without conflict.
In some cases, any two of the above technical features may be combined into a new device solution without conflict.
Those of ordinary skill in the art will understand that: all or part of the steps for implementing the method embodiments may be implemented by hardware related to program instructions, and the program may be stored in a computer readable storage medium, and when executed, the program performs the steps including the method embodiments; and the aforementioned storage medium includes: various media capable of storing program codes, such as a removable Memory device, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, and an optical disk.
The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims.

Claims (10)

1. An information pushing method, characterized in that the method comprises:
acquiring metadata of user generated content associated with a current application, and extracting a first keyword from the metadata;
generating a user interest portrait of a target user according to the first keyword and the weight of the first keyword, wherein the user interest portrait comprises: at least one user tag characterizing content of interest to the target user;
generating an information content portrait of the information to be pushed according to a second keyword of the information to be pushed and the weight of the second keyword, wherein the information content portrait comprises at least one content tag indicating the information content of the information to be pushed;
and selecting at least one piece of information from the information to be pushed to the target user according to the user interest portrait and the information content portrait.
2. The method of claim 1, wherein extracting the first keyword from the metadata comprises:
performing word segmentation processing on the metadata to obtain a word sequence; wherein the sequence of words comprises a plurality of words;
removing stop words in the word sequence;
and extracting the first key words of which the information entropy and/or the occurrence frequency meet preset conditions from the word sequence without stop words.
3. The method according to claim 2, wherein the extracting the first keyword whose information entropy and/or frequency of occurrence satisfy a preset condition includes:
and aiming at a plurality of preset categories, respectively extracting the first keywords of which the information entropy and/or the occurrence frequency meet preset conditions in each preset category.
4. The method of claim 1, further comprising:
determining the information entropy of each word according to the number of other words which are matched with each word in the information to be pushed;
and selecting a second keyword from all words contained in the information to be pushed according to the size of the information entropy.
5. The method of claim 1, wherein the user interest portrait comprises: a plurality of user tags, which are ordered sequentially to form a first vector;
the information content representation includes: the content tags are sequentially ordered to form a second vector;
the selecting and pushing at least one piece of information from the information to be pushed to the target user according to the user interest portrait and the information content portrait comprises:
determining the similarity of the user interest portrait and the information content portrait according to the vector distance between the first vector and the second vector;
and selecting, from the information to be pushed, the information corresponding to at least one information content portrait with the highest similarity, and pushing that information to the target user.
6. The method according to claim 5, wherein the selecting, from the information to be pushed, of the information corresponding to at least one information content portrait with the highest similarity and pushing it to the target user comprises:
selecting, from the information to be pushed, the information corresponding to a preset number of information content portraits with the highest similarity;
classifying the preset number of pieces of information according to the content tags;
and selecting, according to the user tag, the information corresponding to at least one information content portrait with the highest similarity from the corresponding content tag classification, and pushing the information to the target user.
7. The method of claim 1, wherein the user tag comprises: the first keyword and the weight of the first keyword; wherein the weights of different first keywords are different.
8. An information pushing apparatus, characterized in that the apparatus comprises:
an acquisition unit configured to acquire metadata of user-generated content associated with a current application, and extract a first keyword in the metadata;
the generating unit is used for generating a user interest portrait of a target user according to the first keyword and the weight of the first keyword, wherein the user interest portrait comprises: at least one user tag characterizing content of interest to the target user; generating an information content portrait of the information to be pushed according to a second keyword of the information to be pushed and the weight of the second keyword, wherein the information content portrait comprises at least one content tag indicating the information content of the information to be pushed;
and the pushing unit is used for selecting at least one piece of information from the information to be pushed to the target user according to the user interest portrait and the information content portrait.
9. An electronic device, characterized in that the electronic device comprises: a processor and a memory for storing a computer program capable of running on the processor; wherein,
the processor, when executing the computer program, performs the steps of the information pushing method according to any one of claims 1 to 7.
10. A computer-readable storage medium having stored thereon computer-executable instructions; the computer-executable instructions, when executed by a processor, enable the information push method of any one of claims 1 to 7 to be implemented.
CN202110515156.3A 2021-05-12 2021-05-12 Information pushing method and device, electronic equipment and storage medium Pending CN112989824A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110515156.3A CN112989824A (en) 2021-05-12 2021-05-12 Information pushing method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN112989824A true CN112989824A (en) 2021-06-18

Family

ID=76337615

Country Status (1)

Country Link
CN (1) CN112989824A (en)

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210618