CN102855282B

CN102855282B - Document recommendation method and device

Info

Publication number: CN102855282B
Application number: CN201210272764.7A
Authority: CN
Inventors: 徐兴军
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2012-08-01
Filing date: 2012-08-01
Publication date: 2018-10-16
Anticipated expiration: 2032-08-01
Also published as: CN102855282A

Abstract

The invention discloses a document recommendation method and device. The document recommendation method includes: in a preset document collection, clustering documents around a document A according to the similarity of document content; determining, according to existing document association information, the documents associated with the documents in the same cluster as document A; and using the determined associated documents of those same-cluster documents to form a first recommendation result for document A. Compared with the prior art, the technical solution provided by the invention does not require newly published documents to be preprocessed manually, which effectively saves labor cost. Recommendation results can thus be generated even for a newly published document, which effectively addresses the cold-start problem and the data sparsity problem.

Description

Document recommendation method and device

Technical field

The present invention relates to the field of computer application technology, and in particular to a document recommendation method and device.

Background art

With the development of Internet technology, the amount of information on the Internet is growing explosively. To let users obtain this information more conveniently and quickly, recommendation technology is widely used in information systems. Related recommendation has in turn become an important part of recommendation technology. Its basic idea is to find correlations between different pieces of information based on one or more of their features, and then to establish association relationships between them. When a user browses a piece of information, the recommendation system can also recommend to the user the information associated with it.

Besides mining more features that can be used for recommendation, research on related recommendation focuses on how to use these features to establish relationships between pieces of information in practice. Currently, the more common approach is to establish these relationships from user behavior. Taking document recommendation as an example, the interests of users can be analyzed from records of historical behavior such as browsing and searching for documents; association relationships between documents are then established according to the similarity of interests of one or more users, and documents are finally recommended according to the established relationships.

However, existing related recommendation methods suffer from serious cold-start and data sparsity problems. Cold start refers to newly published information. Data sparsity means that a piece of information has very few associated user behavior records (or none), so it is difficult to generate recommendation results from user behavior. The usual solution is manual intervention: some recommendation results are preset for the newly published information. This consumes labor cost and requires operators with rich prior knowledge, and the resulting recommendations are limited and subjective, so in practice they often fail to meet the actual needs of the people browsing the information.

Summary of the invention

To solve the above technical problems, embodiments of the present invention provide a document recommendation method and device that address the cold-start problem and the data sparsity problem in related document recommendation. The specific technical solution is as follows:

A document recommendation method, including:

in a preset document collection, clustering documents around a document A according to the similarity of document content;

determining, according to existing document association information, the documents associated with the documents in the same cluster as document A;

using the determined associated documents of the same-cluster documents of document A to form a first recommendation result for document A.

In a specific embodiment of the present invention, the document association information is:

association information between different documents established from user behavior records related to the documents; or

association information between different documents established according to the categories to which the documents belong.

In a specific embodiment of the present invention, clustering documents around document A according to the similarity of document content includes:

performing duplicate detection on document content, and grouping the documents whose content overlap with document A exceeds a preset threshold into one document cluster.

In a specific embodiment of the present invention, clustering documents according to the similarity of document content includes:

performing retrieval using document A, and, according to the retrieval results, grouping the documents whose content relevance to document A exceeds a preset threshold into one document cluster.

In a specific embodiment of the present invention, the method further includes:

using the same-cluster documents of document A to form a second recommendation result for document A.

A document recommendation device, including:

a clustering unit, configured to cluster documents around a document A in a preset document collection according to the similarity of document content;

an association unit, configured to determine, according to existing document association information, the documents associated with the documents in the same cluster as document A;

a recommendation unit, configured to use the determined associated documents of the same-cluster documents of document A to form a first recommendation result for document A.

In a specific embodiment of the present invention, the document association information is:

association information between different documents established from user behavior records related to the documents.

In a specific embodiment of the present invention, the clustering unit is specifically configured to:

In a specific embodiment of the present invention, the recommendation unit is further configured to:

The technical solution provided by the embodiments of the present invention clusters documents based on the similarity of their actual content and then recommends documents according to the clustering result. In effect, several documents with similar content are treated as a single point. Recommendation results can thus be generated even for a newly published document; for documents that already have recommendation results, those results can also be further optimized according to the clustering.

Compared with the prior art, the technical solution provided by the present invention does not require newly published documents to be preprocessed manually, which effectively saves labor cost. Moreover, provided that the association relationships currently existing between documents are reasonable, the recommendation results obtained after clustering by content similarity remain reasonable. In other words, the solution can provide high-confidence recommendation results for newly published documents without introducing the influence of operators or individual subjective factors, further improving the performance of the recommendation system.

Description of the drawings

To explain the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from them.

Fig. 1 is a flowchart of a document recommendation method according to an embodiment of the present invention;

Fig. 2 is a schematic structural diagram of a document recommendation device according to an embodiment of the present invention.

Detailed description

A document recommendation method provided by the embodiments of the present invention is described first. The method may include the following steps:

Documents in the embodiments of the present invention can take many forms, for example files in formats such as TXT, DOC or PDF, or web pages; this does not affect the implementation of the solution.

The document recommendation method provided by the embodiments of the present invention operates within a certain range of documents; that is, for each application environment there is a preset document collection. For example, for recommendation in an online document library, the files uploaded by all users of the library form the preset document collection; for recommendation on a knowledge platform, all knowledge topics on the platform form the preset document collection; for recommendation on a news website, all news pages of the site form the preset document collection. Of course, the size of the recommendation range can be set flexibly according to the needs of the application, from as small as a specific document subject category to as large as the whole Internet, and the present invention does not limit this.

The technical solution provided by the embodiments of the present invention first clusters documents based on the similarity of their actual content and then recommends documents according to the clustering result. In effect, several documents with similar content are treated as a single point.

Suppose A is a newly published document. After clustering around document A, documents B, C and D, whose content is similar to that of document A, are grouped into the same cluster. If B, C and D themselves have associated documents, those associated documents can be returned to the user as the recommendation results for A.

To help those skilled in the art better understand the technical solutions of the present invention, the technical solutions in the embodiments are described in detail below with reference to the drawings. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention fall within the scope of protection of the present invention.

Fig. 1 is a flowchart of a document recommendation method provided by an embodiment of the present invention. The method may include the following steps:

S101: in a preset document collection, cluster documents around document A according to the similarity of document content.

The amount of information on the Internet is currently very large, but studies show that much of it is similar or even completely duplicated. For example, there may be many news reports with similar content about the same trending event, and different users may upload documents with identical content to a document library platform. For documents with similar content, various factors (such as when they were published, how many resources the publishers have, and how they were published) may cause the amount of associated document data they have to differ. For example, document A and document B have identical content, but document A has only just been published and has no data that can be used to establish association relationships, whereas document B has already accumulated a large amount of association data. From the standpoint of content similarity, it is then entirely reasonable to also use the associated documents of document B as recommendation results for document A.

Following this principle, for any document A, the present invention clusters around document A according to the similarity of document content to find all documents in the document collection whose content is similar to that of document A, and then uses the associated documents of the other members of the cluster as recommendation candidates for document A to generate its recommendation results.

In a specific embodiment of the present invention, text duplicate-detection techniques can be used to cluster the documents.

In an Internet-based application environment there will inevitably be many documents with duplicated content. Many text duplicate-detection techniques have been developed to manage these duplicates effectively, for example detection based on document-level signature algorithms; commonly used algorithms include MD5 and simhash. The solution provided here can directly use these mature duplicate-detection techniques to process the different documents in the preset document collection and group documents with identical content together.
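
As a rough illustration of the document-level signatures mentioned above, the following Python sketch computes an MD5 digest for exact duplicates and a simhash-style fingerprint for near duplicates. The whitespace tokenization, the 64-bit width and all function names are assumptions made for this example; the patent does not specify them.

```python
import hashlib

BITS = 64  # fingerprint width; an assumption of this sketch


def md5_signature(text: str) -> str:
    """Exact-duplicate signature: identical text gives an identical digest."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def simhash(text: str) -> int:
    """Near-duplicate fingerprint built from per-token hashes."""
    votes = [0] * BITS
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[:8], "big")  # first 64 bits
        for bit in range(BITS):
            votes[bit] += 1 if (token_hash >> bit) & 1 else -1
    # A fingerprint bit is set when more tokens voted 1 than 0 at that position.
    return sum(1 << bit for bit in range(BITS) if votes[bit] > 0)


def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")
```

Two documents would then be treated as near duplicates when the Hamming distance between their fingerprints falls below a small cutoff chosen for the corpus.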

In a specific implementation, the document can first be split into sentences, for example at segmentation marks such as line breaks, periods, exclamation marks and question marks. The resulting sentences are then normalized, for example by full-width/half-width conversion, case conversion, traditional/simplified Chinese conversion, removal of noise characters, and collapsing of repeated whitespace. Finally the sentences are signed, and the common length or the similarity of the signature vectors of two documents is computed; the common length or similarity represents the degree of content overlap.
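
The split, normalize and sign procedure above can be sketched in Python as follows. This is a minimal example under stated assumptions: MD5 is used as the sentence signature, the overlap is measured as the Jaccard ratio of the two signature sets, and traditional/simplified conversion and noise-character removal are left out. The function names are hypothetical.

```python
import hashlib
import re
import unicodedata

# Split on line breaks and sentence-ending punctuation (ASCII and full-width).
_SENTENCE_BREAK = re.compile(r"[\n.!?。！？]+")


def normalize(sentence: str) -> str:
    """Fold full-width forms and case, then collapse repeated whitespace."""
    folded = unicodedata.normalize("NFKC", sentence).lower()
    return re.sub(r"\s+", " ", folded).strip()


def sentence_signatures(text: str) -> set:
    """Return the set of MD5 signatures of the normalized sentences."""
    signatures = set()
    for raw in _SENTENCE_BREAK.split(text):
        sentence = normalize(raw)
        if sentence:
            signatures.add(hashlib.md5(sentence.encode("utf-8")).hexdigest())
    return signatures


def content_overlap(text_a: str, text_b: str) -> float:
    """Share of sentence signatures the two documents have in common (0 to 1)."""
    sig_a = sentence_signatures(text_a)
    sig_b = sentence_signatures(text_b)
    if not sig_a or not sig_b:
        return 0.0
    return len(sig_a & sig_b) / len(sig_a | sig_b)
```

With this measure, two documents that differ only in letter case, character width or spacing get an overlap of 1.0, while each changed sentence lowers the score.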

It should be understood that the duplicate-detection procedure given above is only an illustrative example and should not be construed as limiting the solution of the present invention.

In practice, because of user modifications and the like, the content of some documents may differ in details while remaining consistent overall. Since the purpose of the present invention is to recommend based on the content similarity of documents, a content overlap threshold (for example 80% or 90%) can be preset. During duplicate detection, if the similarity between documents exceeds this threshold, the differences between them are considered small and they can be grouped into the same document cluster; associated documents can then be shared among members of the same cluster.
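
The threshold rule can be illustrated with a short Python sketch. The word-level Jaccard similarity used here is only a stand-in chosen to keep the example self-contained; any duplicate-detection score could be passed in instead. The names `cluster_around` and `word_jaccard` are hypothetical.

```python
def word_jaccard(text_a: str, text_b: str) -> float:
    """Toy similarity: ratio of shared words to all words in both texts."""
    words_a = set(text_a.lower().split())
    words_b = set(text_b.lower().split())
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


def cluster_around(center_id, documents, similarity=word_jaccard, threshold=0.8):
    """Return the ids of documents whose similarity to the center exceeds the threshold.

    documents maps a document id to its text; the center itself is excluded.
    """
    center_text = documents[center_id]
    return [
        doc_id
        for doc_id, text in documents.items()
        if doc_id != center_id and similarity(center_text, text) > threshold
    ]
```

Raising the threshold makes the cluster stricter, so fewer documents share their associated documents with the center.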

In another specific embodiment of the present invention, retrieval techniques can also be used to cluster the documents.

The basic function of a search engine is to find, for given search keywords, network resources whose content is identical or similar to the keywords. Accordingly, in the present invention the content of document A (the cluster center) can be used to form search keywords that are submitted to a search engine; the search is run within the preset document collection, and the cluster members are determined from the search results.

The most basic implementation is to submit the title of document A directly to the search engine as the search keywords. If the title of a search result is identical or similar to the title of document A, that result can be added to the document cluster centered on A. For example, if document A is titled "High School Entrance Exam Reading (Chinese)" and retrieval returns another document B titled "High School Entrance Exam Chinese Reading", document B can be added to the cluster directly.

In practice, a search result whose body content is similar to the title of document A can also be considered to satisfy the clustering condition; the condition need not be limited to similar titles. In principle, parts of document A other than the title, such as the author or the abstract, can also be used for retrieval. Preprocessing such as word segmentation and stop-word removal can be applied when forming the search keywords. In addition, many current search engines perform word segmentation, stop-word removal and similar preprocessing automatically, and generally sort search results by their degree of relevance (similarity) to the keywords, so the top n search results (n being a positive integer) can be taken directly as the same-cluster members of A. In short, those skilled in the art can flexibly choose the specific strategy for clustering from search results according to the application requirements and scenario, and the present invention does not limit this.
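
The top-n strategy can be sketched as follows. The `search` function here is a toy stand-in for a real search engine (it ranks titles by the number of shared query terms), and the stop-word list and all names are assumptions made for illustration.

```python
import re

STOP_WORDS = frozenset({"the", "a", "of", "for"})  # illustrative list only


def tokenize(text: str) -> list:
    """Lower-case word segmentation with stop-word removal."""
    return [t for t in re.findall(r"\w+", text.lower()) if t not in STOP_WORDS]


def search(query: str, titles: dict) -> list:
    """Toy search engine: rank document ids by number of shared query terms."""
    query_terms = set(tokenize(query))
    scored = []
    for doc_id, title in titles.items():
        score = len(query_terms & set(tokenize(title)))
        if score > 0:
            scored.append((score, doc_id))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored]


def cluster_by_retrieval(center_id: str, titles: dict, n: int = 3) -> list:
    """Use the center's title as the query and keep the top n other hits."""
    hits = [d for d in search(titles[center_id], titles) if d != center_id]
    return hits[:n]
```

In a real deployment the toy `search` would be replaced by a call to whatever search engine indexes the preset document collection.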

Compared with clustering based on duplicate detection, clustering based on search is less accurate in judging similarity, but it can use an existing search engine directly, so its implementation cost is lower. In practice, the two schemes can be used independently or in combination. Of course, without departing from the basic idea of the present invention, those skilled in the art may also use other clustering methods, either independently or in combination with the methods provided by the embodiments of the present invention.

S102: according to existing document association information, determine the documents associated with the documents in the same cluster as document A.

After the documents similar in content to document A have been obtained by clustering, the associated documents of those similar documents must first be determined in order to make recommendations for document A.

The solution of the present invention is based on the following assumption: in the preset document collection there are some documents that already have association information. If such documents are grouped into the same cluster as document A, their existing association information can be used to generate recommendation results for document A.

In a specific embodiment of the present invention, association information between different documents can be established from user behavior records related to the documents.

If document B and document B1 show a correlation in users' access behavior, an association relationship between document B and document B1 can be established. Here, user access may include behaviors such as browsing, searching and actively recommending. For example, if during one browsing session a user first browses document B, "High School Entrance Exam Chinese Reading", and then browses document B1, "High School Entrance Exam Chinese Composition", an association relationship between document B and document B1 can be established.

In a specific embodiment, the preset document collection can be initialized as a graph. Each document in the collection is a vertex of the graph, and whenever a new document is subsequently added to the collection, a vertex is added to the graph.

The edge set of the graph is initially empty (that is, the edge weight between any two vertices is 0). For any two vertices, if a correlation is shown in the access behavior of one user, an edge is added between them; if a correlation is also shown in the access behavior of another user, the weight of the existing edge is increased; and so on. By analyzing the historical behavior records of a large number of users, the number and weights of the edges are gradually increased, finally yielding the association information of all documents in the collection.

In practice, different weights can also be assigned to different user behaviors. For example, a correlation shown by search behavior is given a weight of 0.5 units, a correlation shown by browsing behavior is given a weight of 1 unit, and a correlation shown by a user actively recommending is given a weight of 2 units, and so on.
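
The weighted association graph can be sketched in Python as below. The weights 0.5, 1 and 2 come from the example above; treating every pair of documents accessed in the same session as correlated, and the session record format itself, are assumptions made for this sketch.

```python
from collections import defaultdict
from itertools import combinations

# Behavior weights taken from the example in the text.
BEHAVIOR_WEIGHTS = {"search": 0.5, "browse": 1.0, "recommend": 2.0}


def build_association_graph(sessions):
    """Build an undirected weighted graph from (behavior, doc_ids) records.

    Each record lists documents that one user accessed together; every pair
    in a record gains the weight of that behavior.
    """
    graph = defaultdict(lambda: defaultdict(float))
    for behavior, doc_ids in sessions:
        weight = BEHAVIOR_WEIGHTS[behavior]
        for u, v in combinations(sorted(set(doc_ids)), 2):
            graph[u][v] += weight
            graph[v][u] += weight
    return graph


def associated_documents(graph, doc_id):
    """Neighbors of doc_id, sorted by descending edge weight."""
    neighbors = graph.get(doc_id, {})
    return sorted(neighbors, key=lambda d: (-neighbors[d], d))
```

A document with no behavior records simply has no neighbors in this graph, which is the cold-start situation the clustering step is meant to work around.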

In a specific embodiment of the present invention, association information between different documents can also be established according to the categories to which the documents belong.

Document classification means assigning a category to each document in the document collection according to the attributes or content of the document. Users can then conveniently browse documents within a specific category, and can find documents more easily by restricting the search scope.

If document B and document B1 belong to the same category, an association relationship between them can be established. For example, if document B, "High School Entrance Exam Chinese Reading", and document B1, "High School Entrance Exam Chinese Composition", both belong to the category "High School Entrance Exam Chinese", an association relationship between document B and document B1 can be established.

It should be understood that the existing association information of documents can be obtained in any way, and the two schemes above are only illustrative. In practice, the two schemes can be used independently or in combination; for example, belonging to the same category can be given a certain weight that acts together with the correlation shown by user access behavior. Of course, without departing from the basic idea of the present invention, those skilled in the art may also use other methods of establishing association information, either independently or in combination with the methods provided by the embodiments of the present invention.
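
One way to combine the two schemes is sketched below: same-category pairs receive a fixed weight, which is then added to behavior-based edge weights. The category weight of 0.5 and the edge-map representation (a dict keyed by sorted id pairs) are assumptions of this example.

```python
from collections import defaultdict
from itertools import combinations


def category_associations(categories, weight=0.5):
    """Link every pair of documents in the same category with a fixed weight.

    categories maps a document id to its category name.
    """
    by_category = defaultdict(list)
    for doc_id, category in categories.items():
        by_category[category].append(doc_id)
    edges = {}
    for members in by_category.values():
        for u, v in combinations(sorted(members), 2):
            edges[(u, v)] = weight
    return edges


def merge_edges(*edge_maps):
    """Sum the weights of edges that appear in several edge maps."""
    merged = defaultdict(float)
    for edges in edge_maps:
        for pair, weight in edges.items():
            merged[pair] += weight
    return dict(merged)
```

For instance, a pair that already has a behavior weight of 1.0 and also shares a category would end up with a combined weight of 1.5 under these settings.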

S103: use the determined associated documents of the same-cluster documents of document A to form a first recommendation result for document A.

For document A, suppose that after clustering around document A, documents B, C and D, whose content is similar to that of document A, are grouped into the same cluster, and that B, C and D have the following associated documents:

The associated documents of B are B1, B2, B3, B4 (sorted by association weight, likewise below);

The associated documents of C are C1, C2, C3;

The associated documents of D are D1, D2;

Then the associated documents of A's same-cluster members, namely B1, B2, B3, B4, C1, C2, C3, D1 and D2, form the recommendation candidate set of A, and the recommendation results of document A can be generated from this set.

Depending on actual requirements, different strategies can be used to generate recommendation results from the candidate set. For example, the top N associated documents of each same-cluster member can be selected to generate the recommendation results.

Alternatively, different numbers of associated documents can be selected according to the distance of each cluster member from the cluster center. For example, 3 associated documents are selected from the nearest cluster member, 2 from the second nearest, and 1 from each of the remaining cluster members, and so on.
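
Both selection strategies can be sketched in a few lines of Python. The quotas (3, 2, then 1) follow the example above; the function names and the removal of repeated entries are assumptions of this sketch.

```python
def top_n_per_member(associations, n):
    """Take the first n associated documents of every cluster member.

    associations maps a member id to its associated documents, already
    sorted by association weight.
    """
    results = []
    for member in associations:
        for doc in associations[member][:n]:
            if doc not in results:
                results.append(doc)
    return results


def quota_by_distance(members_by_distance, associations, quotas=(3, 2), default_quota=1):
    """Give nearer members a larger quota of associated documents.

    members_by_distance lists member ids from nearest to farthest.
    """
    results = []
    for rank, member in enumerate(members_by_distance):
        quota = quotas[rank] if rank < len(quotas) else default_quota
        for doc in associations.get(member, [])[:quota]:
            if doc not in results:
                results.append(doc)
    return results
```

With the B, C, D example and B nearest to the center, the quota strategy would select B1, B2, B3, then C1, C2, then D1.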

In addition, if different cluster members are found to share the same associated document while the recommendation results are being generated, the confidence of that association is considered higher, and the document can be added to the recommendation results with priority. For example:

The associated documents of B are B1, B2, B3, B4;

The associated documents of C are C1, C2, C3, X;

The associated documents of D are D1, D2, X;

According to the existing association information, document X is an associated document of both document C and document D. When the recommendation results are generated, document X can therefore be given additional ranking weight according to its degree of co-occurrence.
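
A minimal sketch of such a co-occurrence bonus is given below. The scoring formula (a position-based base score plus a fixed bonus per additional cluster member that lists the document) is purely an assumption for illustration; the source does not specify how the extra weight is computed.

```python
from collections import Counter


def rank_with_cooccurrence(associations, bonus=1.0):
    """Rank candidates, boosting documents associated with several members.

    associations maps a member id to its associated documents, sorted by
    association weight. A candidate's base score is the best 1/(position+1)
    it reaches in any list; each extra list it appears in adds the bonus.
    """
    base = {}
    counts = Counter()
    for docs in associations.values():
        for position, doc in enumerate(docs):
            counts[doc] += 1
            base[doc] = max(base.get(doc, 0.0), 1.0 / (position + 1))
    scores = {doc: base[doc] + bonus * (counts[doc] - 1) for doc in base}
    return sorted(scores, key=lambda d: (-scores[d], d))
```

Applied to the example above, X is ranked first even though it sits last in the lists of both C and D, because it is the only candidate shared by two cluster members.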

In addition, since B, C and D are themselves documents whose content is similar to that of A, B, C and D may also be added to the recommendation results in actual recommendation.

With the above technical solution, if A is a newly published document, the associated documents of B, C and D can be returned to the user as the recommendation results for A. If, on the other hand, document A already has some associated documents for recommendation, clustering gives A additional recommendation candidates, which also helps to further optimize its recommendation results.
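
The two kinds of result can be put side by side in a short sketch: the first recommendation result comes from the associated documents of the same-cluster members, and the second from the members themselves. The function name and the choice to drop the center document and repeated entries are assumptions of this example.

```python
def build_recommendations(center_id, cluster_members, associations):
    """Return (first_result, second_result) for the center document.

    first_result: associated documents of the same-cluster members.
    second_result: the same-cluster members themselves.
    """
    first_result = []
    for member in cluster_members:
        for doc in associations.get(member, []):
            if doc != center_id and doc not in first_result:
                first_result.append(doc)
    second_result = [m for m in cluster_members if m != center_id]
    return first_result, second_result
```

Note that the center document needs no association data of its own here, which is why a newly published document can still receive recommendations.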

Corresponding to the above method embodiment, the present invention also provides a document recommendation device. As shown in Fig. 2, the device includes:

A clustering unit 110, configured to cluster documents around document A in a preset document collection according to the similarity of document content;

For any document A, the present invention clusters around document A according to the similarity of document content to find all documents in the document collection whose content is similar to that of document A, and then uses the associated documents of the other members of the cluster as recommendation candidates for document A to generate its recommendation results.

An association unit 120, configured to determine, according to existing document association information, the documents associated with the documents in the same cluster as document A;

In a specific embodiment, the preset document collection can be initialized as a graph. Each document in the collection is a vertex of the graph, and whenever a new document is subsequently added, a vertex is added accordingly.

The edge set of the graph is initially empty. For any two vertices, if a correlation is shown in the access behavior of one user, an edge is added between them; if a correlation is also shown in the access behavior of another user, the weight of the existing edge is increased; and so on. By analyzing the historical behavior records of a large number of users, the number and weights of the edges are gradually increased, finally yielding the association information of all documents in the collection.

A recommendation unit 130, configured to use the determined associated documents of the same-cluster documents of document A to form a first recommendation result for document A.

For convenience of description, the above device is described in terms of separate functional units. Of course, when implementing the present invention, the functions of the units may be implemented in one or more pieces of software and/or hardware.
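
As one possible software arrangement of the three units, the sketch below models the clustering unit and the association unit as pluggable callables and implements the recommendation unit as a method that combines them. The class name and the callable signatures are hypothetical; the patent leaves the implementation open.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class DocumentRecommendationDevice:
    # Clustering unit: document id -> ids of its same-cluster documents.
    clustering_unit: Callable[[str], List[str]]
    # Association unit: document id -> ids of its associated documents.
    association_unit: Callable[[str], List[str]]

    def recommend(self, doc_id: str) -> List[str]:
        """Recommendation unit: merge the associated documents of the
        same-cluster members into the first recommendation result."""
        results: List[str] = []
        for member in self.clustering_unit(doc_id):
            for doc in self.association_unit(member):
                if doc != doc_id and doc not in results:
                    results.append(doc)
        return results
```

Because the units are passed in as callables, a duplicate-detection clusterer could be swapped for a retrieval-based one without touching the recommendation logic.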

From the above description of the embodiments, those skilled in the art can clearly understand that the present invention can be implemented by software plus a necessary general-purpose hardware platform. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes over the prior art, can be embodied as a software product. The computer software product can be stored in a storage medium such as ROM/RAM, a magnetic disk or an optical disc, and includes instructions that cause a computer device (which may be a personal computer, a server, a network device or the like) to execute the methods described in the embodiments of the present invention or in certain parts of the embodiments.

The embodiments in this specification are described in a progressive manner. For the parts that are the same or similar across embodiments, the embodiments can be referred to one another, and each embodiment focuses on its differences from the others. The device embodiment in particular is described relatively briefly because it is substantially similar to the method embodiment; for the relevant details, see the description of the method embodiment. The device embodiment described above is only illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the solution of this embodiment.

The present invention can be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures and the like that perform specific tasks or implement specific abstract data types. The present invention can also be practiced in distributed computing environments, in which tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules may be located in local and remote computer storage media, including storage devices.

The above are only specific embodiments of the present invention. It should be noted that those of ordinary skill in the art can make various improvements and modifications without departing from the principle of the present invention, and these improvements and modifications shall also be regarded as falling within the scope of protection of the present invention.

Claims

1. A document recommendation method, characterized by including:

in a preset document collection, clustering documents around a document A according to the similarity of document content;

determining, according to existing document association information, the documents associated with the documents in the same cluster as document A, where the document association information is association information between different documents established from user behavior records related to the documents, or association information between different documents established according to the categories to which the documents belong;

2. The method according to claim 1, characterized in that clustering documents around document A according to the similarity of document content includes:

performing duplicate detection on document content, and grouping the documents whose content overlap with document A exceeds a preset threshold into one document cluster.

3. The method according to claim 1, characterized in that clustering documents according to the similarity of document content includes:

performing retrieval using document A, and, according to the retrieval results, grouping the documents whose content relevance to document A exceeds a preset threshold into one document cluster.

4. The method according to claim 1, characterized in that the method further includes:

5. A document recommendation device, characterized by including:

a clustering unit, configured to cluster documents around a document A in a preset document collection according to the similarity of document content;

an association unit, configured to determine, according to existing document association information, the documents associated with the documents in the same cluster as document A, where the document association information is association information between different documents established from user behavior records related to the documents, or association information between different documents established according to the categories to which the documents belong;

a recommendation unit, configured to use the determined associated documents of the same-cluster documents of document A to form a first recommendation result for document A.

6. The device according to claim 5, characterized in that the clustering unit is specifically configured to:

7. The device according to claim 5, characterized in that the clustering unit is specifically configured to:

8. The device according to claim 5, the recommendation unit being further configured to: