CN108829822B - Media content recommendation method and device, storage medium and electronic device - Google Patents


Info

Publication number
CN108829822B
CN108829822B (application number CN201810603143.XA)
Authority
CN
China
Prior art keywords
word
words
content
candidate
keywords
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810603143.XA
Other languages
Chinese (zh)
Other versions
CN108829822A (en)
Inventor
郑茂 (Zheng Mao)
颜景善 (Yan Jingshan)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN201810603143.XA priority Critical patent/CN108829822B/en
Publication of CN108829822A publication Critical patent/CN108829822A/en
Application granted granted Critical
Publication of CN108829822B publication Critical patent/CN108829822B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a media content recommendation method and apparatus, a storage medium, and an electronic device. The method includes: acquiring a recommendation request, the recommendation request requesting that media content be recommended to a target object; in response to the recommendation request, obtaining a subject word from the words of a first content, where the first similarity between the word vector of the subject word and the semantic vector of the first content is greater than or equal to a second similarity, the semantic vector of the first content is determined from the word vector and the weight of each of a plurality of keywords, and the plurality of keywords are keywords among the words of the first content; and selecting, from candidate media content, a second content matching the subject word and recommending the second content to the target object. The invention solves the technical problem of low accuracy of recommended media content in the related art.

Description

Media content recommendation method and device, storage medium and electronic device
Technical Field
The invention relates to the field of the Internet, and in particular to a media content recommendation method and apparatus, a storage medium, and an electronic device.
Background
With the rapid development of social media, people receive large amounts of information from both the physical world and the information world at every moment. However, the sheer volume of information, its complex structure, and its meaningless portions make it impossible for people to examine every item they receive and identify the valuable parts. How to obtain useful information from text is therefore key to processing information quickly and accurately.
In practice, keywords are the most intuitive representation of useful information, so obtaining the keywords people care about from text is an urgent problem to be solved. Extracting such keywords helps people quickly understand the content of information and provides important technical support for fields such as text mining, natural language processing, and knowledge engineering, giving it very broad application. In marketing, for example, keywords can reveal the aspects a customer cares about, so that content better matching the customer's habits can be recommended. However, when keywords are identified inaccurately, content cannot be pushed to users accurately: the recommended content fails to match their preferences, resulting in a low click-through rate on recommendations.
In view of the above problems, no effective solution has been proposed at present.
Disclosure of Invention
The embodiments of the present invention provide a media content recommendation method and apparatus, a storage medium, and an electronic device, to at least solve the technical problem of low accuracy of media content recommended in the related art.
According to an aspect of an embodiment of the present invention, there is provided a media content recommendation method, including: acquiring a recommendation request, the recommendation request requesting that media content be recommended to a target object; in response to the recommendation request, obtaining a subject word from the words of a first content, where the first similarity between the word vector of the subject word and the semantic vector of the first content is greater than or equal to a second similarity, the semantic vector of the first content is determined from the word vector and the weight of each of a plurality of keywords, and the plurality of keywords are keywords among the words of the first content; and selecting, from candidate media content, a second content matching the subject word and recommending the second content to the target object.
According to an aspect of an embodiment of the present invention, there is provided a method for screening media content subject words, including: acquiring media content; acquiring word vectors of the keywords in the media content; calculating the semantic vector of the media content from the word vector and the weight of each keyword, the keywords being keywords among the words of the media content; and calculating the similarity between the word vector of each word in the media content and the semantic vector of the media content, and confirming a word as a subject word of the media content when its similarity is greater than or equal to a threshold.
According to another aspect of the embodiments of the present invention, there is also provided a media content recommendation apparatus, including: a first acquisition unit for acquiring a recommendation request, the recommendation request requesting that media content be recommended to a target object; a second acquisition unit for obtaining, in response to the recommendation request, a subject word from the words of a first content, where the first similarity between the word vector of the subject word and the semantic vector of the first content is greater than or equal to a second similarity, the semantic vector of the first content is determined from the word vector and the weight of each of a plurality of keywords, and the plurality of keywords are keywords among the words of the first content; and a recommendation unit for selecting, from candidate media content, a second content matching the subject word and recommending the second content to the target object.
According to an aspect of an embodiment of the present invention, there is provided an apparatus for screening media content subject words, including: a third acquisition unit configured to acquire media content; a fourth acquisition unit configured to acquire word vectors of the words in the media content; a first calculation unit configured to calculate the semantic vector of the media content from the word vector and the weight of each keyword, the keywords being keywords among the words of the media content; and a second calculation unit configured to calculate the similarity between the word vectors of the words in the media content and the semantic vector of the media content, and to confirm a word as a subject word of the media content when its similarity is greater than a first threshold.
According to another aspect of the embodiments of the present invention, there is also provided a storage medium including a stored program that executes the above-described method when running.
According to another aspect of the embodiments of the present invention, there is also provided an electronic device including a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor executing the method described above by the computer program.
In the embodiments of the invention, when a recommendation request is acquired, a subject word representing the target object is acquired; the subject word is selected from the words of the first content, the semantic vector of the first content is determined from the word vector of each of a plurality of keywords and the weight configured for each keyword, and the plurality of keywords are keywords among the words of the first content. A second content matching the subject word is then selected from candidate media content and recommended to the target object. Because keywords embody the main content of a text, determining the semantic vector from the keywords makes the semantic vector of the first content more accurate, so that the subject words describing the target object's habits can be determined more accurately and used more accurately for content recommendation. This solves the technical problem of low accuracy of recommended media content in the related art and achieves the technical effect of accurate content recommendation.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute a limitation on the application. In the drawings:
FIG. 1 is a schematic diagram of a hardware environment of a media content recommendation method according to an embodiment of the present application;
FIG. 2 is a flow chart of an alternative media content recommendation method according to an embodiment of the present application;
FIG. 3 is a flow chart of an alternative media content recommendation method according to an embodiment of the present application;
FIG. 4 is a schematic diagram of an alternative user interface according to an embodiment of the application;
FIG. 5 is a schematic illustration of an alternative user interface according to an embodiment of the application;
FIG. 6 is a schematic diagram of an alternative user interface according to an embodiment of the application;
FIG. 7 is a schematic diagram of an alternative terminal interacting with a server in accordance with an embodiment of the present application;
FIG. 8 is a schematic diagram of an alternative weight iteration according to an embodiment of the present application;
FIG. 9 is a schematic diagram of an alternative media content recommendation device according to an embodiment of the present application;
and FIG. 10 is a structural block diagram of a terminal according to an embodiment of the present application.
Detailed Description
In order that those skilled in the art will better understand the present invention, the technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without inventive effort shall fall within the scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present invention and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
According to an aspect of the embodiment of the invention, a method embodiment of a recommendation method for media content is provided.
Optionally, in this embodiment, the above media content recommendation method may be applied in a hardware environment formed by the server 101 and the terminal 103 shown in FIG. 1. As shown in FIG. 1, the server 101 is connected to the terminal 103 via a network. The terminal 103 includes, but is not limited to, a PC, a mobile phone, or a tablet computer. The media content recommendation method of this embodiment of the present invention may be performed by the server 101.
FIG. 2 is a flow chart of an alternative media content recommendation method according to an embodiment of the present invention, as shown in FIG. 2, the method may include the steps of:
in step S202, the server acquires a recommendation request, where the recommendation request is used to request recommendation of media content to the target object.
The media content may be used for promotion and publicity, and may be one of text, voice, video, audio, picture, animated picture, or other types of content, or a combination of several of these.
In step S204, in response to the recommendation request, the server obtains a subject word from the words of the first content. The first similarity between the word vector of the subject word and the semantic vector of the first content is greater than or equal to the second similarity; the semantic vector of the first content is determined from the word vector and the weight of each of a plurality of keywords; and the plurality of keywords are keywords among the words of the first content. In this embodiment, the first content refers to media content the user has already viewed.
Optionally, the first similarity between the word vector of the subject word and the semantic vector of the first content is greater than the second similarity, where the second similarity is either a preset threshold or the similarity between the semantic vector of the first content and the word vector of a word, other than the subject word, among the words of the first content. The target object may be an object that views the media content, such as a user.
When words are extracted from media content such as the first content, the extraction method can be chosen according to the types of content involved (such as text, picture, video, and audio). If the media content includes text, the words can be taken from that text. If it does not, the words can be extracted in other ways: from the subtitles of a video; from the speech of a video, or directly from audio, via a speech-to-text tool; or from the text or labels on a picture or animated picture.
The above subject words are also called "descriptors": words used in indexing and retrieval to express the subject of documents such as media content, with conceptual and standardized features. Subject words can be selected according to a subject word table, for example by matching each term in the table against the media content one by one; a term that matches becomes one of the subject words of the media content. Alternatively, the subject words may be subject words of the target object (or of the target account identifying the target object), i.e. words expressing the target object's interests, hobbies, and habits, such as the entertainment star's name "Wang Mou", a story genre such as "thriller", or the mobile phone brand "XX".
To process natural language with machine-learning algorithms, language must first be made mathematical. Word vectors (distributed representations) are one way to do this for the words of a language: through training, each word in a given language is mapped to a fixed-length vector, and all the vectors together form a word vector space in which each vector is a point. By introducing a "distance" into this space, the similarity between words (lexical or semantic) can be judged from the distance between them.
Optionally, a word2vec model may be used to train word vectors on a media corpus such as news.
In the related art, the semantics of the first content is the synthesis of the word senses of all the words in it. For example, the text in the media content is segmented into n words, and the word vectors corresponding to those n words are summed with uniform weight 1/n to obtain the semantic vector of media content such as news.
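As a minimal illustration of this related-art baseline, the following sketch averages word vectors with uniform weight 1/n. The vectors are illustrative toy values, not trained embeddings:

```python
# Related-art baseline described above: the semantic vector of a content is
# the unweighted average (weight 1/n) of the word vectors of its n words.
# Vectors here are plain lists of floats; a real system would use trained
# word2vec embeddings.

def average_semantic_vector(word_vectors):
    """Average a list of equal-length word vectors with uniform weight 1/n."""
    n = len(word_vectors)
    dim = len(word_vectors[0])
    return [sum(vec[i] for vec in word_vectors) / n for i in range(dim)]

# Example: three toy 3-dimensional word vectors.
vectors = [[1.0, 0.0, 2.0], [3.0, 2.0, 0.0], [2.0, 4.0, 1.0]]
semantic = average_semantic_vector(vectors)  # [2.0, 2.0, 1.0]
```

This is the scheme the embodiment below improves on by weighting important words more heavily.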
When obtaining the similarity between a word's vector (for the subject word and for the other words in the first content) and the semantic vector of the first content, the cosine of the angle between each candidate word's vector and the news's semantic vector (cosine similarity) can be used as the measure of relevance between the candidate word and the news semantics.
It should be noted that when calculating the semantic vector, the related art simply adds and averages the word vectors. However, media content (such as a news text) may contain words irrelevant to its subject, which tend to lie far from the other words in the vector space, while the true semantics of the content usually depends on the important words and not on the subject-irrelevant ones. As a result, the semantic vector finally obtained in the related art cannot reflect the true semantics of the news text well.
In the technical solution provided by this embodiment of the application, the probability-graph model TextRank can be used to obtain importance scores for the words in the media content, and the word vectors are weighted and summed according to those scores to obtain the semantic vector of the media content. That is, only the word vectors of the keywords are considered when determining the semantic vector, which remedies the defect of the original semantic-vector calculation.
In step S206, the server selects the second content matched with the subject term from the candidate media content and recommends the second content to the target object.
The above embodiment takes execution of the media content recommendation method by the server 101 as an example. The method of this embodiment of the present application may also be executed by the terminal 103 (i.e. the terminal replaces the server as the executing subject of the above steps), or jointly by the server 101 and the terminal 103. When the terminal 103 executes the method, the execution may also be performed by a client installed on it.
Through steps S202 to S206, when the recommendation request is obtained, a subject word representing the target object is obtained; the subject word is selected from the words of the first content, the semantic vector of the first content is determined from the word vector of each of a plurality of keywords and the weight configured for each keyword, and the keywords are keywords among the words of the first content. A second content matching the subject word is then selected from candidate media content and recommended to the target object. Because keywords embody the main content of a text, determining the semantic vector from the keywords makes the semantic vector of the first content more accurate, so the subject words describing the target object's habits are determined more accurately and content recommendation based on them becomes more precise. This solves the technical problem of low accuracy of recommended media content in the related art and achieves the technical effect of accurate content recommendation.
The following further details the technical scheme of the present application in conjunction with the steps shown in fig. 2:
Subject-word extraction can be used in online recommendation systems for products such as express news, news, and entertainment. Articles the user has clicked are taken as input; a group of words semantically similar to the news subject, i.e. the subject words, is extracted by a subject-word extraction method based on a semantic graph model; and the extraction results serve as input for the user portrait, depicting the user's interests. When a recommendation is required, for example when the user opens or refreshes the product's client, execution of the technical solution of this application can be triggered. At that point, per the technical solution provided in step S202, the server acquires a recommendation request for recommending media content to the target object, and the recommendation system recalls corresponding articles according to the user's portrait interests and recommends them to the user.
In the technical solution provided in step S204, in response to the recommendation request, the server obtains a subject word representing the target object. The subject word is selected from the words of the first content; the first similarity between the subject word's word vector and the first content's semantic vector is greater than the second similarity, where the second similarity may be the similarity between the first content's semantic vector and the word vector of a word, other than the subject word, among the words of the first content; and the semantic vector of the first content is determined from the keywords among the words of the first content.
In an embodiment of the present application, acquiring the subject term for representing the target object may include the following steps 1 to 5:
step 1, obtaining a plurality of candidate words by word segmentation of the first content, wherein the candidate words are words in the first content.
Chinese word segmentation (Chinese Word Segmentation) is the process of cutting a sequence of Chinese characters into individual words, i.e. recombining a continuous character sequence into a word sequence according to certain specifications. Usable segmentation methods include, but are not limited to: string-matching-based segmentation, understanding-based segmentation, and statistics-based segmentation.
When segmenting the first content into a plurality of candidate words, the segmentation method can be applied directly; alternatively, the first content can first be denoised and the denoised content then segmented into words. Denoising removes interference words, i.e. unnecessary information in the first content, such as "XX report". The segmented words can then be filtered by part of speech (for example, filtering out adverbs), and the filtered words merged to obtain the plurality of candidate words, i.e. fine-grained words are fished back: for example, "intelligent" and "equipment" are fished back to form "intelligent equipment", and "intelligent equipment" is taken as one candidate word.
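A hedged sketch of this candidate-word pipeline is shown below: part-of-speech filtering followed by merging adjacent fine-grained tokens back into one candidate word. The tagged input, tag names, and merge rule are illustrative assumptions; a real system would obtain the tags from a Chinese segmenter such as jieba's posseg module.

```python
# Candidate-word pipeline sketch (assumed tags and merge rules):
# 1) filter words by part of speech (e.g. drop adverbs),
# 2) merge configured adjacent pairs back into a coarse-grained candidate.

def build_candidates(tagged_words,
                     drop_pos=("adv",),
                     merge_pairs=(("intelligent", "equipment"),)):
    # Step 1: filter out words with unwanted parts of speech.
    kept = [(w, pos) for w, pos in tagged_words if pos not in drop_pos]
    # Step 2: fish back fine-grained words into one candidate word.
    candidates, i = [], 0
    while i < len(kept):
        if i + 1 < len(kept) and (kept[i][0], kept[i + 1][0]) in merge_pairs:
            candidates.append(kept[i][0] + " " + kept[i + 1][0])
            i += 2
        else:
            candidates.append(kept[i][0])
            i += 1
    return candidates

tagged = [("intelligent", "adj"), ("equipment", "noun"),
          ("very", "adv"), ("popular", "adj")]
print(build_candidates(tagged))  # ['intelligent equipment', 'popular']
```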
In the above embodiment, denoising the first content includes:
Step 11: obtain the deletion probability of each word in the first content:

P(w_i) = 1 − sqrt(t / f(w_i))

where P(w_i) denotes the deletion probability of the i-th word w_i, f(w_i) denotes the frequency with which w_i occurs in the first content, and t is a parameter representing a threshold;
step 12, determining the ith word as an interference word if the deletion probability of the ith word is greater than a second threshold value, such as greater than 50%;
and step 13, deleting the interference words in the first content.
Optionally, deletion of a word may be triggered randomly according to its deletion probability: for example, if the deletion probability of the i-th word is 10%, then each occurrence of that word has a 10% chance of being deleted.
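The denoising steps above can be sketched as follows. The deletion probability follows the frequency-based formula described in step 11, and deletion is triggered randomly per occurrence; the threshold value t = 1e-3 is an assumption, since the text leaves t unspecified:

```python
import math
import random

# Frequency-based denoising sketch: each word w is deleted with probability
# P(w) = 1 - sqrt(t / f(w)), where f(w) is the word's frequency in the
# content and t is a small threshold (1e-3 here is an assumed value).

def deletion_probability(freq, t=1e-3):
    """Deletion probability of a word with relative frequency `freq`."""
    return max(0.0, 1.0 - math.sqrt(t / freq))

def denoise(words, freqs, t=1e-3, rng=random.random):
    """Keep each occurrence only if a random draw clears its deletion odds."""
    return [w for w in words if rng() >= deletion_probability(freqs[w], t)]

# A very frequent filler word has a high deletion probability,
# while a rare content word is never deleted.
print(deletion_probability(0.25))    # ~0.94
print(deletion_probability(0.0005))  # 0.0
```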
Step 2: obtain the word vectors of the plurality of candidate words.
Optionally, the model for semantic-vector recognition may be trained when the semantic vector of the first content is determined, or trained in advance before that determination. Model training in this embodiment of the application proceeds as follows:
when training a model, namely before using candidate words as input of a first model (such as a word2vec model), a training set is obtained by word segmentation of third content, and words stored in the training set are words obtained by word segmentation of the third content; and taking the words belonging to the same sentence in the third content in the training set as the input of the second model according to the sequence position in the sentence, so as to train the second model, and taking the trained second model as the first model, thereby obtaining the first model.
Optionally, during training the objective function may be optimized by negative sampling, which reduces computation because only a small portion of the model's weights needs to be updated for each training sample.
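A hedged sketch of the negative-sampling idea follows: for each positive (center, context) pair, only a handful of "negative" words are drawn from a noise distribution, so only those words' weights need updating. The unigram^0.75 noise distribution is the common word2vec choice, and the counts below are illustrative:

```python
import random

# Negative-sampling sketch: draw k negative words per positive pair from a
# unigram^0.75 distribution, never returning the true context word itself.
# Word counts here are illustrative.

def make_negative_sampler(word_counts, power=0.75, seed=0):
    words = list(word_counts)
    weights = [word_counts[w] ** power for w in words]
    rng = random.Random(seed)

    def sample(positive, k=5):
        out = []
        while len(out) < k:
            w = rng.choices(words, weights=weights)[0]
            if w != positive:  # skip the true context word
                out.append(w)
        return out

    return sample

sampler = make_negative_sampler({"the": 1000, "cat": 50, "sat": 40, "mat": 30})
negatives = sampler("cat", k=5)  # 5 negative words, none equal to "cat"
```

In full word2vec training, the model's weights would be updated only for the positive word and these k negatives rather than the whole vocabulary.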
In the above embodiment, before determining the semantic vector of the first content from the word vectors of the keywords, the candidate words may be input into the first model to obtain the word vector of each candidate word output by it. Since the candidate words include the keywords, the word vectors of the keywords are thereby also determined.
And 3, determining the semantic vector of the first content according to the word vector of the keyword in the plurality of candidate words.
In an embodiment of the present application, determining the semantic vector of the first content according to the word vector of the keyword in the plurality of candidate words may include:
step 31, determining keywords in the plurality of candidate words according to the sequence positions of the candidate words in the first content.
In an embodiment of step 31, determining at least one keyword of the plurality of candidate words according to the sequence position of the candidate word in the first content may include:
Construct a word graph containing the plurality of candidate words, where each candidate word is a node and nodes of candidate words belonging to the same sentence in the first content are connected according to their sequential positions in the sentence. For example, for a sentence such as "Wang takes a bus to the city center", the candidate words obtained by processing the sentence may include "king" (Wang), "sit", "bus", "go", and "city center". Each of these words is a node in the word graph; the nodes "king", "sit", "bus", "go", and "city center" are connected in sequence. The edge formed by connecting "king" and "sit" may point from node "king" to node "sit", the edge formed by connecting "sit" and "bus" may point from node "sit" to node "bus", and so on for the remaining nodes;
Perform an iterative operation on the weight parameter of each candidate word in the word graph according to the formula:

S(V_i) = (1 − d) + d · Σ_{V_j ∈ In(V_i)} S(V_j) / |Out(V_j)|

where S(V_i) denotes the weight parameter of the i-th candidate word in the current round of iteration, S(V_j) denotes the weight parameter of the j-th candidate word in the previous round, d denotes the damping coefficient, In(V_i) denotes the set of candidate words whose edges point to the i-th candidate word, and |Out(V_j)| denotes the number of outgoing edges of the j-th candidate word;
in any round of iterative operation, if the difference value between the weight parameter of the candidate word in the iterative operation of the round and the weight parameter of the candidate word in the previous iterative operation is not in the target range, continuing to execute the next round of iterative operation, otherwise stopping the iterative operation;
and after stopping the iterative operation, acquiring the keywords in the plurality of candidate words according to the weight parameters, wherein the weight parameters of the keywords are larger than the weight parameters of the words except the keywords in the plurality of candidate words.
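As an illustrative sketch of the word-graph iteration described above (function and variable names are hypothetical; a damping coefficient of 0.85 and a small convergence tolerance standing in for the target range are assumptions):

```python
# Hypothetical sketch of the TextRank-style iteration over the word graph.
# out_edges maps each node to the nodes it points to, built from sentence order.
def textrank(nodes, out_edges, d=0.85, tol=1e-4, max_iter=100):
    # Invert out_edges to get In(V_i), the words pointing to each node.
    in_edges = {v: [] for v in nodes}
    for j, succs in out_edges.items():
        for i in succs:
            in_edges[i].append(j)
    score = {v: 1.0 for v in nodes}  # initial weight parameters
    for _ in range(max_iter):
        new = {}
        for i in nodes:
            # S(V_i) = (1 - d) + d * sum_{j in In(V_i)} S(V_j) / |Out(V_j)|
            new[i] = (1 - d) + d * sum(
                score[j] / len(out_edges[j]) for j in in_edges[i]
            )
        # Stop when every weight change falls inside the target range.
        if max(abs(new[v] - score[v]) for v in nodes) < tol:
            score = new
            break
        score = new
    return score

# The example sentence from above, each word a node, connected in order.
words = ["Wang", "take", "bus", "go", "city center"]
edges = {"Wang": ["take"], "take": ["bus"], "bus": ["go"],
         "go": ["city center"], "city center": []}
scores = textrank(words, edges)
keywords = sorted(scores, key=scores.get, reverse=True)
```

In this chain-shaped graph the weight accumulates toward later nodes, so words that many others point at end up with the larger weight parameters.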
And step 32, summing the intermediate vectors of all the keywords to obtain a semantic vector of the first content, wherein the intermediate vector of the keywords is the product of the word vector of the keywords and the weights set for the keywords.
Alternatively, the word vector of the m-th keyword may be used as a row vector or a column vector W_m; the weights k_m of all the keywords sum to 1, and the intermediate vector can be expressed as k_m × W_m.
In the embodiment shown in step 32, summing the intermediate vectors of all keywords to obtain the semantic vector of the first content may include:
the weight set for each keyword is determined by normalizing the weight parameters of all keywords, calculated as follows: after the weight parameters (i.e., the weights initially assigned to the keywords) of all keywords are obtained, the sum of these weight parameters is calculated, and the ratio between the weight parameter of each keyword and this sum is used as the normalized weight of that keyword; the product of the word vector of a keyword and the weight set for the keyword is obtained as the intermediate vector of the keyword; and the intermediate vectors of all keywords are summed to obtain the semantic vector of the first content.
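The normalization and weighted summation above can be sketched as follows (a minimal illustration with hypothetical names; plain Python lists stand in for word vectors):

```python
# Illustrative sketch: normalize the keyword weight parameters, then sum
# the weighted word vectors to obtain the semantic vector of the content.
def semantic_vector(word_vectors, raw_weights):
    total = sum(raw_weights)
    norm = [w / total for w in raw_weights]   # normalized weights sum to 1
    dim = len(word_vectors[0])
    sem = [0.0] * dim
    for vec, k in zip(word_vectors, norm):
        for d in range(dim):
            sem[d] += k * vec[d]              # add intermediate vector k_m * W_m
    return sem

# Two keywords with raw weight parameters 3 and 1 (normalize to 0.75 / 0.25).
sem = semantic_vector([[1.0, 0.0], [0.0, 1.0]], [3.0, 1.0])
```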
And 4, acquiring the similarity between the word vector of each candidate word and the semantic vector of the first content, and optionally taking the cosine similarity between the vector of each candidate word and the semantic vector of the news as the measure of the correlation between each candidate word and the news semantic.
Cosine similarity takes the cosine value of the included angle of two vectors in the vector space as a measure for the difference between two individuals, and compared with distance measurement, the cosine similarity is more focused on the difference of the two vectors in the direction rather than the difference in the distance or the length.
cos θ = (Σ_{k=1}^{n} p_k q_k) / (√(Σ_{k=1}^{n} p_k²) × √(Σ_{k=1}^{n} q_k²))

wherein θ represents the included angle, p_k and q_k are the components of the two vectors, and the subscript k is not greater than n.
Similar to the Euclidean distance, a calculation method based on cosine similarity may treat the preference of a user as a point in an n-dimensional coordinate system; a straight line (vector) is formed by connecting the point with the origin of the coordinate system, and the similarity value between two vectors is the cosine of the included angle between the two straight lines (vectors): the smaller the included angle, the more similar the candidate word is to the content attribute, and the larger the included angle, the smaller the similarity. Meanwhile, in trigonometry, the cosine of an angle lies in [-1, 1]: the cosine of a 0-degree angle is 1, and the cosine of a 180-degree angle is -1.
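A minimal sketch of the cosine-similarity computation just described:

```python
import math

# cos(theta) = (p . q) / (|p| * |q|), a value in [-1, 1]:
# 1 for identical directions, -1 for opposite directions.
def cosine_similarity(p, q):
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return dot / (norm_p * norm_q)

same = cosine_similarity([1, 2], [2, 4])       # same direction, angle 0
opposite = cosine_similarity([1, 0], [-1, 0])  # angle 180 degrees
```

Note that scaling a vector does not change the result, which is exactly the "direction rather than length" property mentioned above.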
And 5, taking the candidate words with the similarity with the semantic vector of the first content being greater than a first threshold value as subject words. The first threshold may be set according to requirements, such as 0.8, 0.9.
In the technical solution provided in step S206, the server selects a second content matching the subject term from the candidate media contents, and recommends the second content to the target object.
The selection of the second content matching the subject word from the candidate media contents is similar to the above manner, and the similarity between the subject word of the user and the semantic vector of each candidate content can be calculated, and one or more of the candidate media contents with the maximum similarity can be selected.
According to an aspect of the embodiment of the invention, a method embodiment for screening media content subject terms is provided. The method may comprise the steps of:
and step 1, acquiring media content.
And 2, acquiring word vectors of the keywords in the media content.
And 3, calculating to obtain the semantic vector of the media content according to the word vector of each keyword in the media content and the weight of each keyword, wherein the keywords are keywords in the words of the media content.
Optionally, calculating the semantic vector of the media content according to the word vector of each keyword and the weight of each keyword in the media content includes: the weight set for each keyword is determined by carrying out normalization processing on the weight parameters of all keywords; obtaining the product of a word vector of a keyword and a weight set for the keyword as an intermediate vector of the keyword; and summing the intermediate vectors of all the keywords to obtain the semantic vector of the media content.
And 4, calculating the similarity between the word vector of each word in the media content and the semantic vector of the media content, and when the similarity is greater than or equal to a threshold (such as second similarity), determining the corresponding word as the subject word of the media content.
The specific subject term screening method can be referred to the previous embodiment.
As an alternative embodiment, the technical solution of the present application will be further described below by taking the application of the technical solution of the present application to news media recommendation as an example.
When extracting a group of words closest to the topic semantics expressed by the news, the probability graph model TextRank (used for keyword extraction, and also usable for extracting phrases and automatic abstracts) is combined with the word vector model word2vec to finally obtain a measure of the semantic relevance between words and the news text. The probability graph model TextRank is used to rank the importance of words in each news item according to the position information of word occurrences in the news text and to assign importance scores; word2vec is used to train word semantic vectors on the news corpus; the word vectors of the words with the top TextRank scores (such as the top 20) are weighted and summed according to the importance scores to obtain the semantic vector of the news text; the cosine similarity between the vector of each candidate word and the news semantic vector is taken as the measure of the similarity between each candidate word and the news semantics, a semantic relevance threshold is set, and words whose similarity is greater than the threshold are considered subject words of the news.
A specific implementation flow of the subject word extraction method based on the semantic graph model of the recommendation method based on the media content is shown in fig. 3.
The computation of the text semantic vector (i.e., the semantic vector of the content) requires as inputs the importance weights of words computed by the TextRank model and the semantic vectors of words trained in advance.
the steps by which the TextRank model calculates the importance weights of words are shown in fig. 3:
in step S302, the server receives an online triggered request for acquiring a news.
As shown in fig. 4, when a user starts an application (such as a "flash" application), the request is triggered to request to push news conforming to the habit of the user, as shown in fig. 5; when the user is already in the news browsing interface, as shown in fig. 6, the request is triggered to update the recommended news when the bottom of the news recommendation list is pulled down or the "refresh" button is clicked.
Alternatively, as shown in fig. 7, after the user initiates a flash report or clicks on an update, the user terminal generates the request and transmits it to the server through the network.
In step S304, denoising is performed on the news text, mainly to filter out unnecessary information such as "XX report", "XX message", etc.
Step S306, word segmentation is carried out on the news text through a word segmentation system.
Recalling candidate words: fine-grained segmentation results are recalled into coarser candidates, for example "smart" and "refrigerator" are recalled into "smart refrigerator", and filtering is performed by part of speech, etc.
The generated candidate words and the segmented text context are used as the input of the TextRank model to calculate the importance of the words.
TextRank is implemented as follows, mainly comprising the calculation of the weights required by the text semantic vector:
1) The news text is split into sentences, i.e., T = [S1, S2, ..., Sm], where Si is each sentence and the words in each sentence are the candidate words remaining after filtering.
2) A candidate word graph G = (V, E) is constructed from the candidate words in each sentence, where V is the set of nodes in the graph and E is the set of edges between candidate words; whether two nodes form an edge is determined according to the co-occurrence window of words within each sentence, for example the co-occurrence window may be set to 2 (the value also used in the online code).
3) The TextRank value of each candidate word is iteratively calculated according to the formula until convergence; the calculation formula is:

S(V_i) = (1 - d) + d × Σ_{V_j ∈ In(V_i)} S(V_j) / |Out(V_j)|

wherein S(V_i) represents the importance of candidate word i in the news text, d is a damping coefficient, for example set to 0.85 or 0.88 (the online code uses such a value as well), In(V_i) represents the set of candidate words pointing to candidate word i, |Out(V_j)| represents the number of outgoing edges of candidate word j, i.e., the edges by which candidate word j points to other candidate words, and S(V_j) represents the importance of candidate word j in the text at the previous iteration.
4) The importance values of the candidate words are normalized and then sorted in descending order to obtain the top candidate words (such as the 20 with the greatest importance) and their corresponding importance in the text.
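Steps 2) and 4) above can be sketched as follows (illustrative names; a co-occurrence window of 2 and a top-k selection are assumptions consistent with the values mentioned in the text):

```python
# Build directed word-graph edges from each sentence with a co-occurrence
# window of 2: an earlier word points to the word that follows it.
def build_edges(sentences, window=2):
    edges = set()
    for sent in sentences:
        for idx, word in enumerate(sent):
            for nxt in sent[idx + 1: idx + window]:
                edges.add((word, nxt))
    return edges

# Normalize importance scores and keep the top-k candidate words.
def top_k(scores, k=20):
    total = sum(scores.values())
    norm = {w: s / total for w, s in scores.items()}
    return sorted(norm, key=norm.get, reverse=True)[:k]

sents = [["smart", "fridge", "sale"], ["fridge", "sale"]]
edges = build_edges(sents)
best = top_k({"smart": 0.5, "fridge": 1.5, "sale": 1.0}, k=2)
```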
The word2vec model trains the semantic vectors of the words as follows:
step S308, denoising the historical data, such as selecting news historical data within half a year from the current time.
In step S310, the article is segmented, and fine-grained segmented words are recalled through a high-quality vocabulary. For example, the phrase "news boy" is a media name, and neither "news" nor "boy" alone can express that meaning; therefore, whenever the text is segmented into "news" and "boy", the pieces are recalled into "news boy" so that a semantic vector for "news boy" can be generated later.
Step S312, generating a training corpus.
1) And taking the training corpus as the input of the word2vec model to obtain model parameters, namely required word vectors.
Words are abstract crystallizations of human intelligence and can be converted into a numerical form recognizable by a computer, namely word embeddings (English: word embedding). The word2vec model mainly either takes a word as input to obtain the probabilities of the words in its context (as with the Skip-gram model structure), or takes the context of a word as input to obtain the probability of the word itself (as with the CBOW model structure). Here, training is performed with the Skip-gram model structure, using news history data as the corpus, to obtain the word vectors, as shown in fig. 8.
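As an illustration of the Skip-gram setup described above, the following sketch generates (input word, context word) training pairs from a token sequence; the names and the window size are assumptions, not the patent's implementation:

```python
# Skip-gram takes a word as input and predicts its context words.
# This helper enumerates the (input, context) pairs for training.
def skipgram_pairs(tokens, window=2):
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))  # (input word, word to predict)
    return pairs

pairs = skipgram_pairs(["news", "history", "data", "corpus"], window=1)
```

A real training run would feed such pairs to a shallow network whose hidden-layer weights become the word vectors, as described next.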
2) When model training is completed, the weights in the neural network are obtained. For example, if the input word is "news boy", its one-hot encoding (English: one-hot encoder) is represented as [1, 0, 0, ..., 0]; among the weights of the hidden layer, only the weights corresponding to the position of the 1 are activated, and their number equals the configured number of hidden units (a hyperparameter). The semantic vector of "news boy" is thus represented by this vector of weights; since each word in the vocabulary has its 1 at a unique position of its one-hot encoding, each word obtains a correspondingly unique representation.
Points to note during training: in addition to treating common word combinations as single words, high-frequency words can be subsampled.
For example, if the text is "foreign media thus evaluate Zhang San" and the window size set in training is 2, training samples such as ("foreign media", "thus") are produced. A training sample pairing "thus" with "foreign media" does not provide much semantic information about "foreign media", because a word like "thus" can appear in the context of many words, and the number of training samples pairing "thus" with some other word "XX" is far greater than the number of samples needed to learn the word vector of "thus".
Thus, for each word, the probability of the word being deleted can be calculated from the frequency of the word, as follows:

P(w_i) = 1 - √(t / f(w_i))

wherein P(w_i) represents the probability that word w_i is deleted, f(w_i) represents the occurrence frequency of w_i, and t represents a threshold, typically 1e-3 to 1e-5; for example, 1e-5 is taken here.
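A minimal sketch of this deletion-probability rule (assuming the standard word2vec subsampling form, which is consistent with the definitions above):

```python
import math

# P(w_i) = 1 - sqrt(t / f(w_i)): the more frequent the word, the more
# likely it is to be dropped from the training data. Clamped at 0 for
# words at or below the threshold frequency.
def deletion_probability(freq, t=1e-5):
    return max(0.0, 1.0 - math.sqrt(t / freq))

p_common = deletion_probability(0.01)   # a very frequent word like "thus"
p_rare = deletion_probability(1e-5)     # a word exactly at the threshold
```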
The objective function is optimized by the negative sampling method ("negative sampling"), which avoids a large amount of computation because training on each sample only needs to update a small part of the model weights, thereby reducing the computational load. Besides the known target words (positive samples), the selection of non-target words (negative samples) is also related to word frequency: the higher the frequency, the more easily a word is selected as a negative sample. The formula is:

P(w_i) = f(w_i)^(3/4) / Σ_j f(w_j)^(3/4)

wherein P(w_i) is the probability that word w_i is selected as a negative sample, and f(w_i) represents the frequency of word w_i.
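A sketch of the negative-sampling selection probability (the 3/4 power is an assumption carried over from common word2vec practice; the surrounding text states only that higher-frequency words are selected more easily):

```python
# Frequency-based negative-sampling distribution: raise each frequency to
# the 3/4 power, then normalize so the probabilities sum to 1.
def negative_sampling_probs(freqs):
    powered = {w: f ** 0.75 for w, f in freqs.items()}
    total = sum(powered.values())
    return {w: p / total for w, p in powered.items()}

probs = negative_sampling_probs({"the": 0.1, "fridge": 0.001})
```

The 3/4 exponent flattens the raw frequency distribution slightly, so rare words are sampled a bit more often than their raw frequency would give.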
In step S314, a text semantic vector is calculated.
1) According to the TextRank calculation, the weight of each candidate word among the text candidate words is obtained. The iterative calculation formula is:

O_i = S(V_i) = (1 - d) + d × Σ_{V_j ∈ In(V_i)} S(V_j) / |Out(V_j)|

wherein O_i represents the non-normalized importance (or weight parameter) of candidate word i, S(V_i) represents the importance (or weight) of candidate word i in the news text in the semantic graph, d is a damping coefficient, which may be set to 0.85 (the value also used in the online code), In(V_i) represents the set of candidate words pointing to candidate word i, |Out(V_j)| represents the number of outgoing edges of candidate word j, and S(V_j) represents the importance of candidate word j in the news text in the semantic graph at the previous iteration.
2) The obtained O_i are then normalized to obtain the normalized importance H_i of each candidate word i, calculated as:

H_i = O_i / Σ_j O_j

wherein O_i represents the importance of candidate word i in the news text before normalization, and H_i, the normalized importance of candidate word i in the semantic graph, is a scalar.
3) According to the normalized importance H_i of the candidate words, the top 20 candidate words i (1 ≤ i ≤ 20) are selected.
4) Vector representations of the training corpus words are obtained by the word2vec model, and the representation forms are as follows:
W_i = (x_1, x_2, x_3, ..., x_{n-2}, x_{n-1}, x_n)

where n may be set to 100 online, i.e., each word is represented by a vector of dimension 100, and each x_i is a scalar.
The normalized weight of each of the top twenty candidate words is multiplied by the vector representation of the corresponding candidate word, and the results are summed to obtain the text semantic vector. The calculation formula is:

S = Σ_{i=1}^{20} P(i|graph) × W_i

wherein i denotes the i-th word among the top-20 candidate words, P(i|graph) denotes the normalized importance of word i in the text, W_i is its trained word vector, and S denotes the text semantic vector, which can be represented by a 100-dimensional vector.
In step S316, a correlation (or similarity) is calculated.
The cosine similarity R_i between the word vector of each candidate word and the news text semantic vector is taken as the measure of the correlation between each candidate word and the news semantics, calculated as follows:

R_i = (S · W_i) / (|S| × |W_i|)

where S represents the text semantic vector, e.g., a vector of dimension 100, W_i represents the word2vec vector of the i-th of the top-20 candidate words, which may also be of dimension 100, and R_i is a scalar.
A semantic relevance threshold is set, and words above the threshold are considered as subject words of news.
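Putting the pieces together, the thresholding step can be sketched as follows (hypothetical names; a threshold of 0.8 is an assumed value):

```python
import math

# Keep the candidate words whose cosine similarity with the text semantic
# vector exceeds the semantic relevance threshold.
def select_topic_words(candidates, word_vecs, sem_vec, threshold=0.8):
    def cos(p, q):
        dot = sum(a * b for a, b in zip(p, q))
        return dot / (math.sqrt(sum(a * a for a in p)) *
                      math.sqrt(sum(b * b for b in q)))
    return [w for w in candidates if cos(word_vecs[w], sem_vec) > threshold]

# "fridge" is nearly aligned with the text vector; "weather" is orthogonal.
vecs = {"fridge": [1.0, 0.1], "weather": [0.0, 1.0]}
topics = select_topic_words(["fridge", "weather"], vecs, [1.0, 0.0])
```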
In the technical scheme provided by the application, the probability graph model TextRank is used to obtain importance scores of the words in the text, and the word vectors are weighted and summed according to the importance scores to obtain the semantic vector of the news text, thereby remedying the deficiency of the original semantic computation for news text.
Through offline evaluation and comparison, subject-word extraction with the proposed technical scheme improves the NDCG metric (a metric measuring ranking quality) by more than 30% compared with subject-word extraction using related technical schemes, and after the subject-word recommendation strategy went online, the observed user click-through rate improved by more than 20%.
It should be noted that, for simplicity of description, the foregoing method embodiments are all described as a series of acts, but it should be understood by those skilled in the art that the present application is not limited by the order of acts described, as some steps may be performed in other orders or concurrently in accordance with the present application. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all preferred embodiments, and that the acts and modules referred to are not necessarily required for the present application.
From the description of the above embodiments, it will be clear to a person skilled in the art that the method according to the above embodiments may be implemented by means of software plus the necessary general hardware platform, but of course also by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (e.g. ROM/RAM, magnetic disk, optical disk) comprising instructions for causing a terminal device (which may be a mobile phone, a computer, a server, or a network device, etc.) to perform the method according to the embodiments of the present invention.
According to another aspect of the embodiment of the present invention, there is also provided a media content recommendation apparatus for implementing the media content recommendation method. FIG. 9 is a schematic diagram of an alternative media content recommendation device, as shown in FIG. 9, according to an embodiment of the present invention, the device may include:
a first obtaining unit 901, configured to obtain a recommendation request, where the recommendation request is used to request to recommend media content to a target object.
The second obtaining unit 903 is configured to obtain, in response to the recommendation request, a subject word from words of a first content, where a first similarity between a word vector of the subject word and a semantic vector of the first content is greater than or equal to a second similarity, the semantic vector of the first content is determined according to a word vector of each keyword of a plurality of keywords and a weight of each keyword, and the plurality of keywords are keywords in the words of the first content.
When words are extracted from media contents such as first content and second content, the words can be extracted in a corresponding mode according to the types of the contents (such as text, pictures, video and audio) included in the media contents, if the media contents comprise text, the words in the media contents can be words in the text, and if the media contents do not comprise text, the words can be extracted in a mode such as extracting the words in the media contents from subtitles of the video, extracting the words in the media contents from the voice of the video or directly from the audio through a voice-to-text conversion tool, and extracting the words in the media contents from the words in the pictures or the tags.
Subject words, also called descriptors, are used in indexing and retrieval to express the subjects of media content and the like, and have conceptual and standardized characteristics. The selection of subject words can be carried out according to a subject word list, for example, the words in the subject word list are looked up one by one in the media content, and if a word is hit, that word is taken as one of the subject words of the media content. Alternatively, the aforementioned subject words may serve as subject words of the target object (or of the target account identifying the target object); such subject words may represent the interests, hobbies, habits and the like of the target object, such as the entertainment star name "Wang Mou", the story types "thriller" and "folk tales", the mobile phone brand "XX", and the like.
If natural language is to be processed by machine-learning algorithms, the language must first be put into mathematical form. Word vectors (English: Distributed Representation) are one way to mathematize the words of a language: through training, each word in a language can be mapped to a vector of fixed length, and all of these vectors together form a word vector space in which each vector is a point. By introducing a "distance" on this space, the similarity between words (such as lexical and semantic similarity) can be judged according to the distance between them.
Alternatively, word2vec models may be utilized to train word semantic vectors on media corpora such as news.
In the related art, the semantics of the first content is the synthesis of the word senses represented by all words in the first content. For example, the text in the media content is segmented into n words, and the word vectors corresponding to the n words are weighted and summed, each with a weight of 1/n, to obtain the semantic vector of media content such as news.
When the similarity between the word vector of the word (including the subject word and the word except the subject word in the first content) and the semantic vector of the first content is obtained, the similarity between the vector of each candidate word and the cosine of the semantic vector of the news (or called cosine similarity) can be used as a measure of the correlation between each candidate word and the news semantic.
It should be noted that, when calculating the semantic vector, the related art simply adds and averages the word vectors. However, words irrelevant to the subject may exist in the media content (such as a news text), and such words are often far from the other words in the vector space; the real semantics of the media content usually depend on the important words and are unrelated to the subject-irrelevant words, so the semantic vector finally obtained in the related art cannot well reflect the real semantics of the news text.
In the technical scheme provided by the embodiment of the application, the probability graph model TextRank can be used to obtain importance scores for the words in the media content, and the word vectors are weighted and summed according to the importance scores to obtain the semantic vector of the media content; that is, only the word vectors of the keywords are considered when determining the semantic vector of the media content, thereby remedying the deficiency of the original semantic vector calculation.
And a recommending unit 905, configured to select a second content matching the subject term from the candidate media contents, and recommend the second content to the target object.
It should be noted that, the first acquiring unit 901 in this embodiment may be used to perform step S202 in the embodiment of the present application, the second acquiring unit 903 in this embodiment may be used to perform step S204 in the embodiment of the present application, and the recommending unit 905 in this embodiment may be used to perform step S206 in the embodiment of the present application.
It should be noted that the above modules are the same as examples and application scenarios implemented by the corresponding steps, but are not limited to what is disclosed in the above embodiments. It should be noted that the above modules may be implemented in software or hardware as a part of the apparatus in the hardware environment shown in fig. 1.
Through the module, when a recommendation request is acquired, acquiring a subject word for representing a target object, wherein the subject word is selected from words of first content, a semantic vector of the first content is determined according to a word vector of each keyword in a plurality of keywords and a weight configured for each keyword in the plurality of keywords, and the plurality of keywords are keywords in the words of the first content; the second content matched with the subject term is selected from the candidate media content, the second content is recommended to the target object, and the key term can embody the main content of the content, so that the determined semantic vector of the first content can be more accurate by determining the semantic vector of the key term, the subject term used for describing the habit of the target object can be more accurately determined, the subject term can be more accurately used for content recommendation conveniently, the technical problem that the accuracy of the recommended media content in the related technology is lower can be solved, and the technical effect of accurately performing content recommendation is achieved.
Alternatively, the second acquisition unit may include: the word segmentation module is used for obtaining a plurality of candidate words through word segmentation of the first content, wherein the candidate words are words in the first content; a first determining module, configured to determine a semantic vector of the first content according to word vectors of keywords in the plurality of candidate words; the acquisition module is used for acquiring the similarity between the word vector of each candidate word and the semantic vector of the first content; and the second determining module is used for taking the candidate words with the similarity with the semantic vector of the first content being greater than or equal to the second similarity as subject words.
The first determining module may include: a determining submodule, configured to determine a keyword in a plurality of candidate words according to a sequence position of the candidate word in the first content; and the operation sub-module is used for summing the intermediate vectors of all the keywords to obtain semantic vectors of the first content, wherein the intermediate vectors of the keywords are products between the word vectors of the keywords and weights set for the keywords.
Optionally, the determining submodule is further operable to:
constructing a word graph comprising a plurality of candidate words, wherein each candidate word is used as a node in the word graph, and the nodes where the candidate words belonging to the same sentence in the first content are connected in the word graph according to the sequence position in the sentence;
Performing an iterative operation on the weight parameter of each candidate word in the word graph according to the following formula:

S(V_i) = (1 - d) + d × Σ_{V_j ∈ In(V_i)} S(V_j) / |Out(V_j)|

wherein S(V_i) represents the weight parameter of the i-th candidate word in the current round of the iterative operation, S(V_j) represents the weight parameter of the j-th candidate word in the previous round of the iterative operation, d represents the damping coefficient, In(V_i) represents the set of candidate words pointing to the i-th candidate word, and |Out(V_j)| represents the number of outgoing edges of the j-th candidate word;
in any round of iterative operation, if the difference value between the weight parameter of the candidate word in the iterative operation of the round and the weight parameter of the candidate word in the previous iterative operation is not in the target range, continuing to execute the next round of iterative operation, otherwise stopping the iterative operation;
and after stopping the iterative operation, acquiring the keywords in the plurality of candidate words according to the weight parameters, wherein the weight parameters of the keywords are larger than the weight parameters of the words except the keywords in the plurality of candidate words.
The operator module described above may also be used to:
the weight set for each keyword is determined by carrying out normalization processing on the weight parameters of all keywords;
obtaining the product of a word vector of a keyword and a weight set for the keyword as an intermediate vector of the keyword;
And summing the intermediate vectors of all the keywords to obtain the semantic vector of the first content.
Alternatively, the word segmentation module of the present application may include: the denoising module is used for denoising the first content and segmenting the denoised first content to obtain a plurality of words, wherein the denoising module is used for eliminating interference words in the first content; the filtering module is used for filtering the plurality of words according to the part of speech and merging the filtered words to obtain a plurality of candidate words.
Optionally, the first determining module is further configured to take the candidate word as an input of the first model before determining the semantic vector of the first content according to the word vector of the keyword in the plurality of candidate words, and obtain the word vector of the candidate word output by the first model.
In an alternative embodiment, the apparatus of the present application may further comprise: training unit for: before the candidate words are used as the input of the first model, a training set is obtained by word segmentation of the third content, wherein the words stored in the training set are words obtained by word segmentation of the third content; and taking the words belonging to the same sentence in the third content in the training set as the input of the second model according to the sequence position in the sentence, so as to train the second model and obtain the first model.
The denoising module can be further used for:
acquiring the deletion probability of words in the first content:

P(w_i) = 1 - √(t / f(w_i))

wherein P(w_i) represents the probability that the i-th word w_i is deleted, f(w_i) represents the occurrence frequency of w_i, and t is a parameter;
determining the ith word as an interference word under the condition that the deletion probability of the ith word is larger than a second threshold value;
the interfering words in the first content are deleted.
According to another aspect of the embodiment of the present invention, there is also provided an apparatus for filtering media content keywords for implementing the method for filtering media content keywords. The apparatus may include:
a third acquisition unit configured to acquire media content;
a fourth obtaining unit, configured to obtain a word vector of a word in the media content;
a first computing unit, configured to compute the semantic vector of the media content according to the word vectors of the keywords in the media content and the weights of the keywords, where the keywords are keywords among the words of the media content;
and a second computing unit, configured to compute the similarity between the word vector of each word in the media content and the semantic vector of the media content, and to confirm a word as a subject word of the media content when the similarity is greater than a first threshold.
In the technical scheme of the application, when extracting a group of words closest to the semantics of the theme expressed by a news item, the probability graph model TextRank (usable for extracting keywords, and also for extracting phrases and automatic abstracts) is combined with the word vector model word2vec to finally obtain a measure of the semantic relevance between each word and the news text. TextRank is used to rank the importance of the words in each news item according to the position information of word occurrences in the news text and to assign importance scores; word2vec is used to train word semantic vectors on the news corpus; the word vectors of the top-ranked words by TextRank score (such as the top 20) are weighted and summed according to their importance scores to obtain the semantic vector of the news text; the cosine similarity between the vector of each candidate word and the news semantic vector is taken as the measure of the similarity between the candidate word and the news semantics, a semantic relevance threshold is set, and the words whose similarity is greater than the threshold are considered the subject words of the news.
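The combination described above can be sketched end to end. The sketch assumes the TextRank scores and the word2vec vectors are already available (the tiny scores and two-dimensional vectors below are toy stand-ins, not trained values), and the 0.9 relevance threshold is illustrative.

```python
# Sketch: TextRank importance scores weight the word2vec vectors of the
# top-scored keywords into a text-level semantic vector; cosine similarity
# against that vector, compared with a threshold, selects the subject words.
from math import sqrt

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na, nb = sqrt(sum(x * x for x in a)), sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def semantic_vector(word_vectors, textrank_scores):
    # weighted sum of the word vectors of the top-scored keywords
    dim = len(next(iter(word_vectors.values())))
    vec = [0.0] * dim
    for word, score in textrank_scores.items():
        for i, x in enumerate(word_vectors[word]):
            vec[i] += score * x
    return vec

vectors = {"market": [1.0, 0.2], "stocks": [0.9, 0.3], "weather": [0.1, 1.0]}
scores = {"market": 0.6, "stocks": 0.4}          # toy TextRank importance
doc_vec = semantic_vector(vectors, scores)       # news-text semantic vector
threshold = 0.9                                  # semantic relevance threshold
subject_words = [w for w, v in vectors.items() if cosine(v, doc_vec) > threshold]
```

In this toy run the finance-related words align with the document vector and pass the threshold, while "weather" does not, which is exactly the filtering behaviour the scheme describes.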
It should be noted that the above modules correspond to the examples and application scenarios implemented by the corresponding steps, but are not limited to the disclosure of the above embodiments. It should also be noted that the above modules may be implemented in software or in hardware as part of the apparatus shown in fig. 1, where the hardware environment includes a network environment.
According to another aspect of the embodiment of the present invention, there is also provided a server or a terminal for implementing the recommendation method of media content.
Fig. 10 is a block diagram of a terminal according to an embodiment of the present invention. As shown in fig. 10, the terminal may include: one or more processors 1001 (only one is shown in fig. 10), a memory 1003, and a transmission device 1005 (such as the transmission device in the above embodiment), and may further include an input-output device 1007.
The memory 1003 may be used to store software programs and modules, such as program instructions/modules corresponding to the method and apparatus for recommending media content in the embodiment of the present invention, and the processor 1001 executes the software programs and modules stored in the memory 1003, thereby executing various functional applications and data processing, that is, implementing the method for recommending media content described above. Memory 1003 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, memory 1003 may further include memory located remotely from processor 1001, which may be connected to the terminal by a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device 1005 is used for receiving or transmitting data via a network, and may also be used for data transmission between the processor and the memory. Specific examples of the above network may include wired networks and wireless networks. In one example, the transmission device 1005 includes a network adapter (Network Interface Controller, NIC) that can be connected to other network devices and routers via a network cable so as to communicate with the internet or a local area network. In another example, the transmission device 1005 is a Radio Frequency (RF) module for communicating with the internet wirelessly.
In particular, the memory 1003 is used to store an application program.
The processor 1001 may call an application program stored in the memory 1003 through the transmission device 1005 to perform the following steps:
acquiring a recommendation request, wherein the recommendation request is used for requesting to recommend media content to a target object;
responding to a recommendation request, acquiring a subject word from words of a first content, wherein the first similarity of a word vector of the subject word and a semantic vector of the first content is greater than or equal to a second similarity, the semantic vector of the first content is determined according to the word vector of each keyword in a plurality of keywords and the weight of each keyword, and the keywords are keywords in the words of the first content;
And selecting second content matched with the subject term from the candidate media content, and recommending the second content to the target object.
The processor 1001 is further configured to perform the steps of:
constructing a word graph comprising a plurality of candidate words, wherein each candidate word is used as a node in the word graph, and the nodes of the candidate words belonging to the same sentence in the first content are connected according to the positions in the sentence in the word graph;
performing iterative operation on the weight parameter of each candidate word in the word graph by combining the damping coefficient;
in any round of iterative operation, if the difference value between the weight parameter of the candidate word in the iterative operation of the round and the weight parameter of the candidate word in the previous iterative operation is not in the target range, continuing to execute the next round of iterative operation, otherwise stopping the iterative operation;
and after stopping the iterative operation, acquiring the keywords in the plurality of candidate words according to the weight parameters, wherein the weight parameters of the keywords are larger than the weight parameters of the words except the keywords in the plurality of candidate words.
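The iteration in the four steps above is the classic damped TextRank/PageRank update, and can be sketched as follows. The word graph here is a toy adjacency map rather than one built from real text, and the tolerance and damping values are illustrative defaults.

```python
# Sketch of the weight-parameter iteration: each round updates every
# candidate word's weight from its neighbours' weights, damped by the
# damping coefficient; iteration stops once every per-word change falls
# within the target range (tol), and the top-weighted words are keywords.
def textrank(graph, damping=0.85, tol=1e-4, max_iter=100):
    words = list(graph)
    weight = {w: 1.0 for w in words}
    for _ in range(max_iter):
        new = {}
        for w in words:
            # sum damped contributions from every node linking to w
            rank = sum(weight[u] / len(graph[u]) for u in graph if w in graph[u])
            new[w] = (1 - damping) + damping * rank
        converged = all(abs(new[w] - weight[w]) < tol for w in words)
        weight = new
        if converged:
            break
    return weight

# toy word graph: "a" co-occurs with both "b" and "c" in the same sentences
graph = {"a": ["b", "c"], "b": ["a"], "c": ["a"]}
weights = textrank(graph)
keywords = sorted(weights, key=weights.get, reverse=True)[:1]
```

The word with the most in-sentence connections accumulates the largest weight parameter, so it is the one selected as a keyword, matching the rule that keyword weights exceed those of the remaining candidate words.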
When the recommendation request is acquired, a subject word used for representing the target object is acquired, where the subject word is selected from the words of the first content, the first similarity between the word vector of the subject word and the semantic vector of the first content is greater than the second similarity, the second similarity is the similarity between the word vectors of the words other than the subject word among the words of the first content and the semantic vector of the first content, and the semantic vector of the first content is determined according to the keywords in the words of the first content. The second content matched with the subject word is then selected from the candidate media content and recommended to the target object. Since the keywords embody the main content of the content, determining the semantic vector from the keywords makes the determined semantic vector of the first content more accurate, so that the subject word used for describing the habits of the target object can be determined more accurately, and content recommendation based on the subject word becomes more accurate. This solves the technical problem of low accuracy of recommended media content in the related art and achieves the technical effect of accurate content recommendation.
Alternatively, specific examples in this embodiment may refer to examples described in the foregoing embodiments, and this embodiment is not described herein.
It will be appreciated by those skilled in the art that the structure shown in fig. 10 is only illustrative, and the terminal may be a smart phone (such as an Android phone, an iOS phone, etc.), a tablet computer, a palmtop computer, a mobile internet device (Mobile Internet Devices, MID), a PAD, etc. Fig. 10 does not limit the structure of the electronic device. For example, the terminal may also include more or fewer components than shown in fig. 10 (e.g., network interfaces, display devices, etc.), or have a different configuration from that shown in fig. 10.
Those of ordinary skill in the art will appreciate that all or part of the steps in the various methods of the above embodiments may be implemented by a program instructing related hardware of a terminal device. The program may be stored in a computer-readable storage medium, and the storage medium may include: a flash disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and the like.
The embodiment of the invention also provides a storage medium. Alternatively, in the present embodiment, the storage medium described above may be used for executing the program code of the recommendation method of media content.
Alternatively, in this embodiment, the storage medium may be located on at least one network device of the plurality of network devices in the network shown in the above embodiment.
Alternatively, in the present embodiment, the storage medium is configured to store program code for performing the steps of:
s12, acquiring a recommendation request, wherein the recommendation request is used for requesting to recommend media content to a target object;
s14, acquiring a subject word from words of the first content in response to a recommendation request, wherein the first similarity of a word vector of the subject word and a semantic vector of the first content is greater than or equal to the second similarity, the semantic vector of the first content is determined according to the word vector of each keyword in a plurality of keywords and the weight of each keyword, and the keywords are keywords in the words of the first content;
s16, selecting second content matched with the subject word from the candidate media content, and recommending the second content to the target object.
Optionally, the storage medium is further arranged to store program code for performing the steps of:
s22, constructing a word graph comprising a plurality of candidate words, wherein each candidate word is used as a node in the word graph, and the nodes of the candidate words belonging to the same sentence in the first content are connected in the word graph according to the positions in the sentence;
S24, carrying out iterative operation on the weight parameter of each candidate word in the word graph by combining the damping coefficient;
s26, in any round of iterative operation, if the difference value between the weight parameter of the candidate word in the iterative operation of the round and the weight parameter of the candidate word in the previous iterative operation is not in the target range, continuing to execute the next round of iterative operation, otherwise stopping the iterative operation;
and S28, after stopping the iterative operation, acquiring the keywords in the plurality of candidate words according to the weight parameters, wherein the weight parameters of the keywords are larger than the weight parameters of the words except the keywords in the plurality of candidate words.
Alternatively, specific examples in this embodiment may refer to examples described in the foregoing embodiments, and this embodiment is not described herein.
Alternatively, in the present embodiment, the storage medium may include, but is not limited to: a USB flash disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, an optical disk, or various other media capable of storing program code.
The foregoing embodiment numbers of the present invention are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments.
The integrated units in the above embodiments may be stored in the above-described computer-readable storage medium if implemented in the form of software functional units and sold or used as separate products. Based on such understanding, the technical solution of the present invention may be embodied in essence or a part contributing to the prior art or all or part of the technical solution in the form of a software product stored in a storage medium, comprising several instructions for causing one or more computer devices (which may be personal computers, servers or network devices, etc.) to perform all or part of the steps of the method described in the embodiments of the present invention.
In the foregoing embodiments of the present application, the descriptions of the embodiments are emphasized, and for a portion of this disclosure that is not described in detail in this embodiment, reference is made to the related descriptions of other embodiments.
In several embodiments provided by the present application, it should be understood that the disclosed client may be implemented in other manners. The above-described embodiments of the apparatus are merely exemplary; for example, the division of the units is merely a logical function division, and there may be another division in actual implementation. For example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the coupling or direct coupling or communication connection shown or discussed may be through some interfaces, units or modules, and may be in electrical or other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The foregoing is merely a preferred embodiment of the present invention and it should be noted that modifications and adaptations to those skilled in the art may be made without departing from the principles of the present invention, which are intended to be comprehended within the scope of the present invention.

Claims (11)

1. A method of recommending media content, comprising:
acquiring a recommendation request, wherein the recommendation request is used for requesting to recommend media content to a target object, and the media content comprises texts and audios and videos;
responding to the recommendation request, dividing the first content according to the type of the media content to obtain a plurality of sentences, and extracting words from each sentence to obtain a plurality of candidate words associated with the plurality of sentences, wherein the first content is the media content watched by the user;
Carrying out normalization processing on the weight parameters of the plurality of candidate words, determining normalized weight parameters of the plurality of candidate words, and determining N candidate words with the largest normalized weight parameters as candidate subject words, wherein the weight parameters are used for indicating importance scores of the plurality of candidate words, and N is a positive integer;
determining keywords in all candidate subject words according to the sequence positions of the candidate subject words in the first content;
summing the intermediate vectors of all the keywords to obtain semantic vectors of the first content, wherein the intermediate vectors of the keywords are products between word vectors of the keywords and weights set for the keywords;
taking the candidate words with the similarity to the semantic vector of the first content being greater than or equal to a second similarity as the subject words;
and selecting second content matched with the subject term from the candidate media content, and recommending the second content to the target object.
2. The method according to claim 1, wherein the method further comprises:
constructing a word graph comprising a plurality of candidate words, wherein each candidate word is used as a node in the word graph, and the nodes of the candidate words belonging to the same sentence in the first content are connected in the word graph according to the sequence position in the sentence;
According to the weight parameter of the candidate word in the previous iteration operation, carrying out iteration operation on the weight parameter of each candidate word in the word graph by combining a damping coefficient;
in any round of iterative operation, when the difference value between the weight parameter of the candidate word in the iterative operation of the round and the weight parameter of the candidate word in the iterative operation of the previous round exceeds a target range, continuing to execute the next round of iterative operation, and when the difference value between the weight parameter of the candidate word in the iterative operation of the round and the weight parameter of the candidate word in the iterative operation of the previous round is within the target range, stopping the iterative operation;
and after stopping the iterative operation, acquiring the keywords in the plurality of candidate words according to weight parameters, wherein the weight parameters of the keywords are larger than the weight parameters of the words except the keywords in the plurality of candidate words.
3. The method according to claim 1, wherein the method further comprises:
the weight set for each candidate subject term is determined by carrying out normalization processing on the weight parameters of all the candidate subject terms;
obtaining the product of the word vector of the candidate subject word and the weight set for the candidate subject word as the intermediate vector of the candidate subject word;
And summing the intermediate vectors of all the candidate subject terms to obtain the semantic vector of the first content.
4. The method of claim 2, wherein extracting words from each sentence to obtain a plurality of candidate words associated with a plurality of sentences comprises:
denoising the first content, and segmenting the denoised first content to obtain a plurality of words;
and filtering the plurality of words according to the part of speech, and merging the filtered words to obtain the plurality of candidate words.
5. The method according to claim 1, wherein the method further comprises:
and taking the candidate word as the input of a first model, and acquiring a word vector of the candidate word output by the first model, wherein the first model is used for acquiring the word vector of the word.
6. The method of claim 5, wherein prior to entering the candidate word as the first model, the method further comprises:
obtaining a training set by word segmentation of the third content;
and taking the words belonging to the same sentence in the third content in the training set as the input of a second model according to the sequence position in the sentence, so as to train the second model and obtain the first model.
7. The method of claim 5, wherein denoising the first content comprises:
determining the deletion probability of the words in the first content according to the occurrence frequency of the words in the first content;
determining the ith word as an interference word under the condition that the deletion probability of the ith word is larger than a second threshold value;
and deleting the interference words in the first content.
8. The method of claim 1, wherein the second similarity is a preset threshold or is a similarity of a word vector of words of the first content other than the subject word to a semantic vector of the first content.
9. A recommendation device for media content, comprising:
the first acquisition unit is used for acquiring a recommendation request, wherein the recommendation request is used for requesting to recommend media content to a target object, and the media content comprises texts and audios and videos;
the second acquisition unit is used for responding to the recommendation request, dividing the first content according to the type of the media content to obtain a plurality of sentences, and extracting words from each sentence to obtain a plurality of candidate words associated with the plurality of sentences, wherein the first content is the media content watched by the user;
The device is further used for carrying out normalization processing on the weight parameters of the plurality of candidate words, determining normalized weight parameters of the plurality of candidate words, and determining N candidate words with the largest normalized weight parameters as candidate subject words, wherein the weight parameters are used for indicating importance scores of the plurality of candidate words, and N is a positive integer;
the device is further used for determining keywords in all the candidate subject words according to the sequence positions of the candidate subject words in the first content; summing the intermediate vectors of all the keywords to obtain semantic vectors of the first content, wherein the intermediate vectors of the keywords are products between word vectors of the keywords and weights set for the keywords; taking the candidate words with the similarity to the semantic vector of the first content being greater than or equal to a second similarity as the subject words;
and the recommending unit is used for selecting second content matched with the subject word from the candidate media content and recommending the second content to the target object.
10. A storage medium comprising a stored program, wherein the program when run performs the method of any one of the preceding claims 1 to 8.
11. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor performs the method of any of the preceding claims 1 to 8 by means of the computer program.
CN201810603143.XA 2018-06-12 2018-06-12 Media content recommendation method and device, storage medium and electronic device Active CN108829822B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810603143.XA CN108829822B (en) 2018-06-12 2018-06-12 Media content recommendation method and device, storage medium and electronic device

Publications (2)

Publication Number Publication Date
CN108829822A CN108829822A (en) 2018-11-16
CN108829822B true CN108829822B (en) 2023-10-27

Family

ID=64143877

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810603143.XA Active CN108829822B (en) 2018-06-12 2018-06-12 Media content recommendation method and device, storage medium and electronic device

Country Status (1)

Country Link
CN (1) CN108829822B (en)

Families Citing this family (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109684458A (en) * 2018-12-26 2019-04-26 北京壹捌零数字技术有限公司 A kind of calculation method and device of sentence vector
CN110033851B (en) * 2019-04-02 2022-07-26 腾讯科技(深圳)有限公司 Information recommendation method and device, storage medium and server
CN111866610B (en) * 2019-04-08 2022-09-30 百度时代网络技术(北京)有限公司 Method and apparatus for generating information
CN110147499B (en) * 2019-05-21 2021-09-14 智者四海(北京)技术有限公司 Labeling method, recommendation method and recording medium
CN110795936B (en) * 2019-08-14 2023-09-22 腾讯科技(深圳)有限公司 Word vector acquisition method and device, storage medium and electronic device
CN110750640B (en) * 2019-09-17 2022-11-04 平安科技(深圳)有限公司 Text data classification method and device based on neural network model and storage medium
CN110909550B (en) * 2019-11-13 2023-11-03 北京环境特性研究所 Text processing method, text processing device, electronic equipment and readable storage medium
CN110895879A (en) * 2019-11-26 2020-03-20 浙江大华技术股份有限公司 Method and device for detecting co-running vehicle, storage medium and electronic device
CN111259232B (en) * 2019-12-03 2022-08-12 江苏艾佳家居用品有限公司 Recommendation system optimization method based on personalized recall
CN111180086B (en) * 2019-12-12 2023-04-25 平安医疗健康管理股份有限公司 Data matching method, device, computer equipment and storage medium
CN111079010B (en) * 2019-12-12 2023-03-31 国网四川省电力公司 Data processing method, device and system
CN111090741B (en) * 2019-12-13 2023-04-07 国网四川省电力公司 Data processing method, device and system
CN111191119B (en) * 2019-12-16 2023-12-12 绍兴市上虞区理工高等研究院 Neural network-based scientific and technological achievement self-learning method and device
CN111191126B (en) * 2019-12-24 2023-11-03 绍兴市上虞区理工高等研究院 Keyword-based scientific and technological achievement accurate pushing method and device
CN111274785B (en) * 2020-01-21 2023-06-20 北京字节跳动网络技术有限公司 Text error correction method, device, equipment and medium
CN111476029A (en) * 2020-04-13 2020-07-31 武汉联影医疗科技有限公司 Resource recommendation method and device
CN111914564B (en) * 2020-07-13 2023-03-14 北京邮电大学 Text keyword determination method and device
CN113326385B (en) * 2021-08-04 2021-12-07 北京达佳互联信息技术有限公司 Target multimedia resource acquisition method and device, electronic equipment and storage medium
CN113792131B (en) * 2021-09-23 2024-02-09 深圳平安智慧医健科技有限公司 Keyword extraction method and device, electronic equipment and storage medium
CN114417021B (en) * 2022-01-24 2023-08-25 中国电子科技集团公司第五十四研究所 Semantic information accurate distribution method based on time, space and sense multi-constraint fusion
CN115344787B (en) * 2022-08-23 2023-07-04 华南师范大学 Multi-granularity recommendation method, system, device and storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105069143A (en) * 2015-08-19 2015-11-18 百度在线网络技术(北京)有限公司 Method and device for extracting keywords from document
WO2017084267A1 (en) * 2015-11-18 2017-05-26 乐视控股(北京)有限公司 Method and device for keyphrase extraction
CN106951420A (en) * 2016-01-06 2017-07-14 富士通株式会社 Literature search method and apparatus, author's searching method and equipment
CN107133315A (en) * 2017-05-03 2017-09-05 有米科技股份有限公司 A kind of smart media based on semantic analysis recommends method
CN107193803A (en) * 2017-05-26 2017-09-22 北京东方科诺科技发展有限公司 A kind of particular task text key word extracting method based on semanteme
CN107247780A (en) * 2017-06-12 2017-10-13 北京理工大学 A kind of patent document method for measuring similarity of knowledge based body
CN108052593A (en) * 2017-12-12 2018-05-18 山东科技大学 A kind of subject key words extracting method based on descriptor vector sum network structure

Also Published As

Publication number Publication date
CN108829822A (en) 2018-11-16

Similar Documents

Publication Publication Date Title
CN108829822B (en) Media content recommendation method and device, storage medium and electronic device
CN108363790B (en) Method, device, equipment and storage medium for evaluating comments
CN106649818B (en) Application search intention identification method and device, application search method and server
CN108073568B (en) Keyword extraction method and device
CN112131350B (en) Text label determining method, device, terminal and readable storage medium
US20180336193A1 (en) Artificial Intelligence Based Method and Apparatus for Generating Article
CN109376222B (en) Question-answer matching degree calculation method, question-answer automatic matching method and device
CN109388743B (en) Language model determining method and device
CN109597493B (en) Expression recommendation method and device
CN109086265B (en) Semantic training method and multi-semantic word disambiguation method in short text
CN112559684A (en) Keyword extraction and information retrieval method
US10915756B2 (en) Method and apparatus for determining (raw) video materials for news
CN110909145A (en) Training method and device for multi-task model
CN112926308B (en) Method, device, equipment, storage medium and program product for matching text
CN113342958B (en) Question-answer matching method, text matching model training method and related equipment
CN110597968A (en) Reply selection method and device
CN113806588A (en) Method and device for searching video
CN116150306A (en) Training method of question-answering robot, question-answering method and device
CN110852071A (en) Knowledge point detection method, device, equipment and readable storage medium
AU2018226420B2 (en) Voice assisted intelligent searching in mobile documents
CN115794898B (en) Financial information recommendation method and device, electronic equipment and storage medium
CN115827990A (en) Searching method and device
CN115017886A (en) Text matching method, text matching device, electronic equipment and storage medium
CN111221880B (en) Feature combination method, device, medium, and electronic apparatus
CN113688633A (en) Outline determination method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant