CN104915399A

CN104915399A - Recommendation data processing method based on news headlines and recommendation data processing system based on news headlines

Info

Publication number: CN104915399A
Application number: CN201510290279.6A
Authority: CN
Inventors: 罗剑波; 张俊彬; 蔡勋梁
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Baidu Online Network Technology Beijing Co Ltd; Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2015-05-29
Filing date: 2015-05-29
Publication date: 2015-09-16

Abstract

The invention discloses a recommendation data processing method based on news headlines. The method comprises the following steps: identifying, from web pages, news headlines related to an entity pair; calculating a keyword set of the entity pair; intercepting text fragments from the news headlines to obtain a set of text fragments with time information, and extracting a first feature value of each text fragment in the set; calculating a semantic vector of each text fragment in the set, and extracting a second feature value of each text fragment according to the semantic vectors; and fitting the first feature values and the second feature values according to users' click data to obtain a ranking of recommendation reasons. The method addresses the problem that the recommendation reasons of intelligent web page recommendation systems in the prior art lack interest, so that recommendation reasons are both accurate and attractive.

Description

Recommendation data processing method and system based on news headlines

Technical field

The present invention relates to the field of computer networks, and in particular to a recommendation data processing method and system based on news headlines.

Background art

With the rapid development of online information, people commonly use the Internet to read all kinds of news and information. As Internet news has evolved, the recommendation reason has become an indispensable component of a mature commercial recommendation system: it states the recommendation logic objectively and accurately.

Letting users perceive the intelligence of the recommendation system is important for improving user experience. At present, recommendation reasons are mainly generated from predefined templates; limited by the richness of the templates, they lack diversity of expression. In entertainment recommendation scenarios such as recommending entertainment celebrities, they are still limited to formulaic reasons such as "related people", "guess you like" and "others are also searching", which are out of keeping with an entertainment-first spirit and struggle to win users' favor.

To solve the problem that the recommendation reasons of intelligent web page recommendation systems in the prior art lack interest, and to make recommendation reasons both accurate and attractive, a brand-new recommendation data processing method and system are urgently needed.

Summary of the invention

To solve the problem that the recommendation reasons of intelligent web page recommendation systems in the prior art lack interest, embodiments of the present invention provide a recommendation data processing method and system based on news headlines.

In one aspect, an embodiment of the present invention provides a recommendation data processing method based on news headlines, the method comprising:

Identifying, from web pages, news headlines related to an entity pair;

Calculating a keyword set of the entity pair;

Intercepting text fragments from the news headlines to obtain a set of text fragments with time information, and extracting a first feature value of each text fragment in the set;

Calculating a semantic vector of each text fragment in the set, and extracting a second feature value of each text fragment according to the semantic vector;

Fitting the first feature values and the second feature values according to users' click data to obtain a ranking of recommendation reasons.

Correspondingly, an embodiment of the present invention also provides a recommendation data processing system based on news headlines, the system comprising:

A headline identification module, for identifying, from web pages, news headlines related to an entity pair;

A keyword computing module, for calculating a keyword set of the entity pair;

A text fragment interception module, for intercepting text fragments from the news headlines to obtain a set of text fragments with time information, and extracting a first feature value of each text fragment in the set;

A feature value calculating module, for calculating a semantic vector of each text fragment in the set, and extracting a second feature value of each text fragment according to the semantic vector;

A screening module, for fitting the first feature values and the second feature values according to users' click data to obtain a ranking of recommendation reasons.

Implementing the various embodiments of the present invention has the following beneficial effect: online information that is more interesting and attractive can be recommended to users both accurately and intelligently.

Brief description of the drawings

Fig. 1 is a flowchart of the recommendation data processing method based on news headlines according to an embodiment of the present invention;

Fig. 2 shows a detailed flowchart of step S5 of the method shown in Fig. 1;

Fig. 3 is an architecture diagram of the recommendation data processing system based on news headlines according to an embodiment of the present invention;

Fig. 4 shows a block diagram of the screening module 500 shown in Fig. 3.

Detailed description

Various aspects of the present invention are described in detail below with reference to the drawings and specific embodiments. Well-known modules, units and their connections, links, communications or operations are not shown or described in detail. Furthermore, the described features, architectures or functions may be combined in any manner in one or more embodiments. Those skilled in the art will understand that the following embodiments are for illustration only and do not limit the scope of the invention. It will also be readily understood that the modules, units or processing modes in the embodiments described herein and shown in the drawings may be combined and designed in various different configurations.

Fig. 1 is a flowchart of the recommendation data processing method based on news headlines according to an embodiment of the present invention. Referring to Fig. 1, the method comprises the following steps:

S1, identifying, from web pages, news headlines related to an entity pair;

S2, calculating a keyword set of the entity pair;

S3, intercepting text fragments from the news headlines to obtain a set of text fragments with time information, and extracting a first feature value of each text fragment in the set;

S4, calculating a semantic vector of each text fragment in the set, and extracting a second feature value of each text fragment according to the semantic vector;

S5, fitting the first feature values and the second feature values according to users' click data to obtain a ranking of recommendation reasons.

In an embodiment of the present invention, the recommendation data processing method based on news headlines may comprise: performing step S1, identifying, from web pages, news headlines related to an entity pair. Between step S1 and step S2, the method may further comprise the step of detecting the time interval in which news about the entity pair bursts. A Gaussian outlier detection model can be used to detect this time interval. For example, the total volume of news about a certain celebrity within period A can be measured; if the volume of news about this celebrity increases abnormally within period B, then period B is the celebrity's news burst period. By detecting the time interval in which news about the entity pair bursts, the period in which news related to the entity pair is concentrated can be found, which narrows the query range for recommendation reason data and improves query efficiency.
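The burst detection described above could be sketched as follows. This is only an illustration under stated assumptions: the source names a Gaussian outlier detection model but gives no formula, so the z-score test, the function name `detect_burst_days`, the threshold of 3 standard deviations and the daily counts are all hypothetical.

```python
from statistics import mean, stdev

def detect_burst_days(baseline_counts, candidate_counts, z_threshold=3.0):
    """Flag candidate days whose news volume is an outlier under a Gaussian
    fitted to the baseline period (period A in the text).

    Assumption: a simple z-score test stands in for the unspecified
    Gaussian outlier detection model.
    """
    mu = mean(baseline_counts)
    # Guard against a zero standard deviation on a perfectly flat baseline.
    sigma = max(stdev(baseline_counts), 1e-9)
    return [
        day
        for day, count in enumerate(candidate_counts)
        if (count - mu) / sigma > z_threshold
    ]

# Hypothetical daily headline counts for one entity pair.
baseline = [3, 4, 2, 5, 3, 4, 3, 4]   # period A: ordinary volume
candidate = [4, 40, 38, 5]            # period B: contains a burst
burst_days = detect_burst_days(baseline, candidate)
print(burst_days)
```

Fitting the Gaussian on the baseline period only, rather than on all days, keeps the burst itself from inflating the mean and standard deviation it is tested against.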

Next, step S2 is performed: calculating the keyword set of the entity pair. Specifically, this may comprise calculating the keyword set of the entity pair within a certain time interval according to the tf-idf algorithm. tf-idf (term frequency – inverse document frequency) is a common weighting technique for information retrieval and data mining. A keyword list can be extracted with the tf-idf model; for example, within a given time period, the top N keywords are taken in descending order of tf-idf value.
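A minimal sketch of top-N keyword selection by tf-idf follows. It assumes each "document" is the pooled, already tokenised headlines of one time interval for the entity pair; that grouping, the function name `top_keywords` and the sample tokens are illustrative assumptions, not details given in the source.

```python
import math
from collections import Counter

def top_keywords(documents, target_index, n=3):
    """Return the n terms of documents[target_index] with the highest tf-idf.

    documents: list of token lists, one per time interval (assumed grouping).
    """
    # Document frequency: in how many intervals does each term occur?
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))

    target = documents[target_index]
    term_counts = Counter(target)
    scores = {
        term: (count / len(target)) * math.log(len(documents) / doc_freq[term])
        for term, count in term_counts.items()
    }
    # Sort by descending score; break ties alphabetically for determinism.
    return sorted(scores, key=lambda term: (-scores[term], term))[:n]

# Hypothetical tokenised headlines pooled per time interval.
intervals = [
    ["concert", "tour", "album", "concert"],
    ["proposal", "wedding", "proposal", "album"],
    ["movie", "premiere", "album"],
]
print(top_keywords(intervals, target_index=1, n=2))
```

Terms that appear in every interval (here "album") get an inverse document frequency of zero, so they drop out of the keyword set for any single interval.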

Next, step S3 is performed: intercepting text fragments from the news headlines to obtain a set of text fragments with time information, and extracting a first feature value of each text fragment in the set. For example, regular expressions can be used to intercept text fragments from the news headlines, giving a set of text fragments for the entity pair with time information.
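One way to intercept fragments with a regular expression is sketched below. The source does not give the actual pattern, so splitting on punctuation, the minimum length of 4 characters, and the names `SEPARATORS` and `intercept_fragments` are assumptions made for illustration.

```python
import re
from datetime import date

# Assumed pattern: split headlines on common ASCII and full-width punctuation.
SEPARATORS = re.compile(r"[,:;!?|，：；！？、]+")

def intercept_fragments(headlines, min_length=4):
    """Split each dated headline into fragments, keeping the publication date.

    headlines: iterable of (published_date, title) pairs.
    Returns a list of (published_date, fragment) pairs.
    """
    fragments = []
    for published, title in headlines:
        for piece in SEPARATORS.split(title):
            piece = piece.strip()
            # Very short pieces rarely read as a standalone reason (assumption).
            if len(piece) >= min_length:
                fragments.append((published, piece))
    return fragments

# Hypothetical headline for an entity pair.
headlines = [
    (date(2015, 5, 20), "Star A and Star B: romantic proposal succeeds, fans cheer"),
]
for published, fragment in intercept_fragments(headlines):
    print(published.isoformat(), fragment)
```

Carrying the publication date with every fragment is what later allows a timeliness feature to be looked up for it.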

Then, step S4 is performed: calculating a semantic vector of each text fragment in the set, and extracting a second feature value of each text fragment according to the semantic vector. For example, using a convolutional neural network deep learning model, a 200-dimensional semantic feature vector can be obtained for each text fragment. For instance, "romantic proposal succeeds" yields V1 and "successful proposal makes front-page headlines" yields V2; because these two text fragments are semantically similar, the cosine similarity of V1 and V2 is close to 1, whereas the cosine similarity of semantically different text fragments tends towards 0 or is even below 0.
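The cosine similarity used to compare semantic vectors can be written out directly. In this sketch the three-dimensional vectors are toy stand-ins for the 200-dimensional outputs of the convolutional network, which is not reproduced here; the numbers are invented for illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention for an all-zero vector
    return dot / (norm_u * norm_v)

# Toy vectors (assumed values, not real model outputs).
v1 = [0.9, 0.1, 0.3]    # stands in for one fragment
v2 = [0.8, 0.2, 0.4]    # a semantically similar fragment
v3 = [-0.5, 0.9, -0.2]  # a semantically different fragment

print(round(cosine_similarity(v1, v2), 3))  # close to 1
print(round(cosine_similarity(v1, v3), 3))  # below 0
```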

The first feature values comprise a syntactic structure feature and a timeliness feature; the second feature values comprise a relevance feature, an attention feature and an attractiveness feature. Specifically, a dependency parsing tool can be used to compute the syntactic structure feature of a text fragment, and text fragments that do not conform to Chinese syntactic structure are deleted. From the text fragments with time information, the timeliness feature of the entity pair can be looked up, for example the time interval in which news bursts. A batch of text fragments manually labelled as attractive or not can serve as a standard data set for training an SVM (Support Vector Machine) classification model, and this SVM model is used to predict the attractiveness of text fragments, giving the attractiveness feature. Hot search terms for the entity pair are mined from the search engine's web search logs, and the semantic similarity between the hot search terms and the entity pair's text fragments is computed, giving the user attention feature. The relation of the entity pair is obtained from a knowledge base, and the semantic similarity between that relation and a text fragment is computed, giving the relevance feature. For example, a convolutional neural network can produce semantic feature vectors for entity relation words such as "husband and wife", "girlfriend" and "boyfriend", and their semantic similarity to a text fragment represents the relevance feature between the relation and the text fragment. For instance, the text fragment "romantic proposal succeeds" is more similar to "boyfriend" than the text fragment "drone wants to make front-page headlines" is; the cosine similarity between the semantic feature vector of the entity pair's relation and that of the text fragment can therefore represent the relevance feature.
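Two of these features lend themselves to a short sketch: a similarity-based feature (relevance or user attention) and the timeliness feature. The source does not say how several reference vectors are combined or how timeliness is encoded, so taking the maximum similarity and using a 0/1 indicator for the burst interval are assumptions, as are the helper names and toy vectors.

```python
import math
from datetime import date

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms if norms else 0.0

def max_similarity(fragment_vec, reference_vecs):
    """Similarity feature: best match between a fragment and a set of
    reference vectors (relation words for relevance, hot search terms for
    attention). Using the maximum is an assumption."""
    return max((cosine(fragment_vec, ref) for ref in reference_vecs), default=0.0)

def timeliness(published, burst_start, burst_end):
    """Assumed encoding: 1.0 if the fragment falls in the burst interval."""
    return 1.0 if burst_start <= published <= burst_end else 0.0

# Toy 2-d vectors standing in for the 200-dimensional semantic vectors.
relation_vecs = [[1.0, 0.0], [0.0, 1.0]]   # e.g. two relation words
fragment_vec = [1.0, 0.1]

print(round(max_similarity(fragment_vec, relation_vecs), 3))
print(timeliness(date(2015, 5, 21), date(2015, 5, 20), date(2015, 5, 25)))
```

The syntactic structure and attractiveness features depend on a dependency parser and a trained SVM respectively, so they are left out of this sketch.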

By adopting the method of the present invention, the problem that the recommendation reasons of intelligent web page recommendation systems in the prior art lack interest can be solved, so that recommendation reasons are both accurate and attractive.

Fig. 2 shows a detailed flowchart of step S5 of the method shown in Fig. 1. Referring to Fig. 2, step S5 comprises:

S51, converting the click data into voting data on the first feature values and the second feature values;

S52, obtaining the ranking of recommendation reasons according to the voting data, and extracting recommendation reasons in descending order of that ranking.

In an embodiment of the present invention, a ranking model for text fragments is trained from manual annotation results and online click data, taking into account features such as attractiveness, structure, user attention, relevance and timeliness. Within each entity pair, the highest-ranked text fragment serves as the recommendation reason for that entity pair. Each user click can be understood as a positive vote for a text fragment: the more clicks a text fragment receives, the more popular it is and the better suited it is to serve as a recommendation reason. In this way, users' click behavior is converted into training data for the ranking model. Using this training data, a logistic regression model can be trained on the five basic features of the text fragments, so as to select high-quality text fragments as recommendation reasons; the top-ranked text fragment, or the top N text fragments, can also be extracted as recommendation reasons.
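The conversion of clicks into votes and the logistic regression ranking can be sketched in plain Python. Everything concrete here is assumed for illustration: the feature order, the toy click log, the batch gradient descent settings and the function names. The source states only that a logistic regression model is trained on the five features from click-derived training data.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic_regression(samples, labels, learning_rate=0.5, epochs=2000):
    """Batch gradient descent on the logistic loss; returns (weights, bias)."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(samples, labels):
            error = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - learning_rate * g / len(samples) for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / len(samples)
    return weights, bias

def rank_fragments(candidates, weights, bias):
    """Sort (text, features) candidates by predicted click probability."""
    scored = [
        (sigmoid(sum(w * xi for w, xi in zip(weights, feats)) + bias), text)
        for text, feats in candidates
    ]
    return [text for _, text in sorted(scored, reverse=True)]

# Hypothetical click log: (features, clicked). Assumed feature order:
# [syntax, timeliness, relevance, attention, attractiveness].
click_log = [
    ([1.0, 1.0, 0.9, 0.8, 0.9], True),
    ([1.0, 1.0, 0.8, 0.7, 0.8], True),
    ([1.0, 0.0, 0.2, 0.1, 0.2], False),
    ([0.0, 0.0, 0.1, 0.2, 0.1], False),
]
# Each click is a positive vote (label 1); an unclicked impression is label 0.
samples = [features for features, _ in click_log]
labels = [1.0 if clicked else 0.0 for _, clicked in click_log]
weights, bias = train_logistic_regression(samples, labels)

candidates = [
    ("related people", [1.0, 0.0, 0.1, 0.1, 0.1]),
    ("romantic proposal succeeds", [1.0, 1.0, 0.9, 0.9, 0.9]),
]
ranked = rank_fragments(candidates, weights, bias)
print(ranked[0])
```

Taking `ranked[0]` corresponds to extracting the top-ranked fragment; slicing `ranked[:N]` gives the top N.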

Fig. 3 is an architecture diagram of the recommendation data processing system 1 based on news headlines according to an embodiment of the present invention. Referring to Fig. 3, the system 1 comprises:

A headline identification module 100, for identifying, from web pages, news headlines related to an entity pair;

A keyword computing module 200, for calculating a keyword set of the entity pair;

A text fragment interception module 300, for intercepting text fragments from the news headlines to obtain a set of text fragments with time information, and extracting a first feature value of each text fragment in the set;

A feature value calculating module 400, for calculating a semantic vector of each text fragment in the set, and extracting a second feature value of each text fragment according to the semantic vector;

A screening module 500, for fitting the first feature values and the second feature values according to users' click data to obtain a ranking of recommendation reasons.
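The way the five modules chain together might be sketched as a small pipeline object. The class name, the callable signatures and the stub implementations are all hypothetical; the source does not state how the keyword set feeds the later modules, so this sketch simply returns it alongside the ranking.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class RecommendationSystem:
    """Wires modules 100-500 together; each module is an injected callable."""
    identify_headlines: Callable    # module 100
    compute_keywords: Callable      # module 200
    intercept_fragments: Callable   # module 300
    compute_features: Callable      # module 400
    screen: Callable                # module 500

    def run(self, web_pages, entity_pair, click_data):
        headlines = self.identify_headlines(web_pages, entity_pair)
        keywords = self.compute_keywords(headlines, entity_pair)
        fragments, first_features = self.intercept_fragments(headlines)
        second_features = self.compute_features(fragments)
        ranking = self.screen(fragments, first_features, second_features, click_data)
        return keywords, ranking

# Stub modules, for illustration only.
system = RecommendationSystem(
    identify_headlines=lambda pages, pair: [p for p in pages if all(n in p for n in pair)],
    compute_keywords=lambda headlines, pair: [],
    intercept_fragments=lambda headlines: (headlines, [[1.0] for _ in headlines]),
    compute_features=lambda fragments: [[float(len(f))] for f in fragments],
    screen=lambda frags, first, second, clicks: sorted(frags, key=lambda f: -clicks.get(f, 0)),
)

pages = ["Star A and Star B wed", "Star A and Star B deny split", "Star C wins award"]
clicks = {"Star A and Star B wed": 10, "Star A and Star B deny split": 3}
keywords, ranking = system.run(pages, ("Star A", "Star B"), clicks)
print(ranking)
```

Injecting each module as a callable mirrors the statement that the modules can be combined in different configurations, and lets any one of them be replaced without touching the others.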

In an embodiment of the present invention, the recommendation data processing system based on news headlines may comprise the headline identification module 100, which identifies, from web pages, news headlines related to an entity pair. The system may also comprise a detection module, for detecting the time interval in which news about the entity pair bursts, after the news headlines related to the entity pair are identified from web pages and before the keyword set of the entity pair is calculated. For example, the total volume of news about a certain celebrity within period A can be measured; if the volume of news about this celebrity increases abnormally within period B, then period B is the celebrity's news burst period. By detecting the time interval in which news about the entity pair bursts, the period in which news related to the entity pair is concentrated can be found, which narrows the query range for recommendation reason data and improves query efficiency.

The keyword computing module 200 calculates the keyword set of the entity pair. Specifically, this may comprise calculating the keyword set of the entity pair within a certain time interval according to the tf-idf algorithm. tf-idf is a common weighting technique for information retrieval and data mining. A keyword list can be extracted with the tf-idf model; for example, within a given time period, the top N keywords are taken in descending order of tf-idf value.

The text fragment interception module 300 intercepts text fragments from the news headlines to obtain a set of text fragments with time information, and extracts a first feature value of each text fragment in the set. For example, regular expressions can be used to intercept text fragments from the news headlines, giving a set of text fragments for the entity pair with time information.

The feature value calculating module 400 calculates a semantic vector of each text fragment in the set, and extracts a second feature value of each text fragment according to the semantic vector. For example, using a convolutional neural network deep learning model, a 200-dimensional semantic feature vector can be obtained for each text fragment. For instance, "romantic proposal succeeds" yields V1 and "successful proposal makes front-page headlines" yields V2; because these two text fragments are semantically similar, the cosine similarity of V1 and V2 is close to 1, whereas the cosine similarity of semantically different text fragments tends towards 0 or is even below 0.

The first feature values comprise a syntactic structure feature and a timeliness feature; the second feature values comprise a relevance feature, an attention feature and an attractiveness feature. Specifically, a dependency parsing tool can be used to compute the syntactic structure feature of a text fragment, and text fragments that do not conform to Chinese syntactic structure are deleted. From the text fragments with time information, the timeliness feature of the entity pair can be looked up, for example the time interval in which news bursts. A batch of text fragments manually labelled as attractive or not can serve as a standard data set for training an SVM classification model, and this SVM model is used to predict the attractiveness of text fragments, giving the attractiveness feature. Hot search terms for the entity pair are mined from the search engine's web search logs, and the semantic similarity between the hot search terms and the entity pair's text fragments is computed, giving the user attention feature. The relation of the entity pair is obtained from a knowledge base, and the semantic similarity between that relation and a text fragment is computed, giving the relevance feature. For example, a convolutional neural network can produce semantic feature vectors for entity relation words such as "husband and wife", "girlfriend" and "boyfriend", and their semantic similarity to a text fragment represents the relevance feature between the relation and the text fragment. For instance, the text fragment "romantic proposal succeeds" is more similar to "boyfriend" than the text fragment "drone wants to make front-page headlines" is; the cosine similarity between the semantic feature vector of the entity pair's relation and that of the text fragment can therefore represent the relevance feature.

By adopting the system of the present invention, the problem that the recommendation reasons of intelligent web page recommendation systems in the prior art lack interest can be solved, so that recommendation reasons are both accurate and attractive.

Fig. 4 shows a block diagram of the screening module 500 shown in Fig. 3. Referring to Fig. 4, the screening module 500 comprises:

A ranking unit 510, for converting the click data into voting data on the first feature values and the second feature values;

An extraction unit 520, for obtaining the ranking of recommendation reasons according to the voting data, and extracting recommendation reasons in descending order of that ranking.

In an embodiment of the present invention, a ranking model for text fragments is trained from manual annotation results and online click data, taking into account features such as attractiveness, structure, user attention, relevance and timeliness. Within each entity pair, the highest-ranked text fragment serves as the recommendation reason for that entity pair. Each user click can be understood as a positive vote for a text fragment: the more clicks a text fragment receives, the more popular it is and the better suited it is to serve as a recommendation reason. In this way, users' click behavior is converted into training data for the ranking model. Using this training data, a logistic regression model can be trained on the five basic features of the text fragments, so as to select high-quality text fragments as recommendation reasons; the top-ranked text fragment, or the top N text fragments, can also be extracted as recommendation reasons.

From the above description of the embodiments, those skilled in the art will clearly understand that the present invention can be implemented by software combined with a hardware platform, or entirely in hardware. On this understanding, all or part of the contribution that the technical solution of the present invention makes over the background art can be embodied in the form of a software product. This computer software product can be stored in a storage medium, such as ROM/RAM, a magnetic disk or an optical disc, and comprises instructions that cause a computer device (which may be a personal computer, a server, a smartphone, a network device or the like) to perform the methods described in the embodiments of the present invention or in certain parts of the embodiments.

The terms and wording used in the specification of the present invention are for illustration only and are not meant to be limiting. Those skilled in the art will understand that various changes can be made to the details of the above embodiments without departing from the basic principles of the disclosed embodiments. The scope of the present invention is therefore determined only by the claims, in which, unless otherwise stated, all terms should be understood in their broadest reasonable sense.

Claims

1. A recommendation data processing method based on news headlines, characterized in that the method comprises:

Identifying, from web pages, news headlines related to an entity pair;

Calculating a keyword set of the entity pair;

2. The method of claim 1, characterized in that, after identifying, from web pages, the news headlines related to the entity pair and before calculating the keyword set of the entity pair, the method comprises:

Detecting the time interval in which news about the entity pair bursts.

3. The method of claim 2, characterized in that calculating the keyword set of the entity pair comprises:

Calculating the keyword set of the entity pair within the time interval according to the tf-idf algorithm.

4. The method of claim 1, characterized in that the first feature values comprise a syntactic structure feature and a timeliness feature; and the second feature values comprise a relevance feature, an attention feature and an attractiveness feature.

5. The method of claim 1, characterized in that fitting the first feature values and the second feature values according to users' click data to obtain the ranking of recommendation reasons comprises:

Converting the click data into voting data on the first feature values and the second feature values, obtaining the ranking of recommendation reasons according to the voting data, and extracting recommendation reasons in descending order of that ranking.

6. A recommendation data processing system based on news headlines, characterized in that the system comprises:

7. The system of claim 6, characterized in that the system comprises:

A detection module, for detecting the time interval in which news about the entity pair bursts, after the news headlines related to the entity pair are identified from web pages and before the keyword set of the entity pair is calculated.

8. The system of claim 7, characterized in that calculating the keyword set of the entity pair in the keyword computing module comprises:

9. The system of claim 6, characterized in that the first feature values comprise a syntactic structure feature and a timeliness feature; and the second feature values comprise a relevance feature, an attention feature and an attractiveness feature.

10. The system of claim 6, characterized in that the screening module comprises:

A ranking unit, for converting the click data into voting data on the first feature values and the second feature values;

An extraction unit, for obtaining the ranking of recommendation reasons according to the voting data, and extracting recommendation reasons in descending order of that ranking.