CN100545847C - Method and system for sorting blog articles - Google Patents

Method and system for sorting blog articles

Info

Publication number
CN100545847C
CN100545847C CNB2007101236257A CN200710123625A
Authority
CN
China
Prior art keywords
blog articles
weights
blog
text
index
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CNB2007101236257A
Other languages
Chinese (zh)
Other versions
CN101127046A (en)
Inventor
邵荣防
谢海劝
董亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Tencent Computer Systems Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CNB2007101236257A priority Critical patent/CN100545847C/en
Publication of CN101127046A publication Critical patent/CN101127046A/en
Priority to PCT/CN2008/072319 priority patent/WO2009046649A1/en
Application granted granted Critical
Publication of CN100545847C publication Critical patent/CN100545847C/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor

Abstract

The present invention relates to the communications field and provides a method and system for sorting blog articles. The method comprises building an index and performing query sorting according to user input, and is characterized in that the index-building step comprises: A. extracting correlation factors from a blog system; B. calculating the correlation weights between search terms and each blog article according to the correlation factors, while identifying blog articles with text cheating and down-weighting them; C. building the index between search terms and each blog article according to the down-weighted correlation weights. Because the invention identifies and handles text-cheating blog articles during the calculation of the correlation weights, it builds an accurate index and sorts the retrieved blog articles based on that index, which improves the objectivity and accuracy of the sorting and safeguards the user's retrieval quality.

Description

Method and system for sorting blog articles
Technical field
The present invention relates to the communications field and, more specifically, to a method and system for sorting blog articles.
Background
With the development of the Internet, the weblog (abbreviated as "blog") has become a common network service. A large number of Internet companies have released their own blog search engines. These engines sort retrieved blog articles in different ways, but all of them process the query string entered by the user, find the most relevant group of results and return it, so that the user can find the blog articles most relevant to what he or she is looking for. Two sorting modes are currently widespread: sorting by relevance and sorting by time. Sorting by relevance is the more typical of the two.
Sorting by relevance proceeds as follows. First, the correlation weights between the query string and each blog are calculated, including numerical correlation weights and text correlation weights. Generally the query string is split into several search terms, so that the correlation between the query string and a blog is decomposed into the correlations between the individual terms and the blog, and an index between the query string and the blog articles is built from the correlation weights. When a user searches, the query string entered by the user is looked up in the index, the blog articles are sorted by the size of their correlation weights, and the sorted result is finally sent to the user for display.
Although this sorting method can, to some extent, provide the user with accurate sorting results, it has the drawback that some low-quality articles tend to be ranked near the top. This patent defines blog articles with a rich vocabulary and substantive content as high-quality articles, and articles with a poor vocabulary and heavily repeated statements as low-quality articles. In some low-quality articles only a few words appear over and over, throughout the article or in part of it, yet under the above sorting method such articles can obtain a higher rank simply by repeating and piling up words. This is a typical form of text cheating. Most current blog search engines that sort by relevance cannot eliminate the influence of this text cheating on their sorting results.
A new method for sorting blog articles is therefore needed, one that avoids the effect of text cheating on the objectivity and accuracy of the sorting results and thereby improves the user's retrieval quality.
Summary of the invention
An object of the present invention is to provide a system for sorting blog articles, intended to solve the problem that the prior art cannot eliminate the influence of text cheating when sorting blog articles, which lowers the user's retrieval quality.
A further object of the present invention is to provide a method for sorting blog articles that better solves the above problem in the prior art.
To achieve the object of the invention, the system for sorting blog articles comprises an indexer for building an index and a searcher for performing query sorting according to user input, the indexer comprising:
a device for extracting correlation factors from a blog system;
a device for calculating the correlation weights between search terms and each blog article according to the correlation factors, while identifying articles with text cheating and down-weighting them;
a device for building the index between search terms and each blog article according to the down-weighted correlation weights.
The device for calculating the correlation weights between search terms and each blog article according to the correlation factors, while identifying blog articles with text cheating and down-weighting them, further comprises:
a device for traversing a blog article with a sliding window of fixed maximum capacity and variable length and recording the maximum length the sliding window reaches, where capacity means the number of distinct words the sliding window holds and length means the total number of words in the sliding window;
a device for comparing, when the traversal ends, the maximum length reached by the sliding window with a threshold and, if the threshold is exceeded, judging that the blog article contains text cheating;
a device for down-weighting the correlation weights of the blog article that contains text cheating.
Preferably, the correlation factors are text correlation factors and the correlation weights are text correlation weights.
Preferably, the correlation factors extracted by the device for extracting correlation factors from the blog system comprise numerical correlation factors and text correlation factors, and the correlation weights are composite correlation weights obtained by superposing the numerical correlation weights and the text correlation weights.
Preferably, the searcher for performing query sorting according to user input comprises:
a device for receiving a search term entered by the user; a device for querying, according to the search term, the correlation weights between that term and each blog article from the index that has been built;
a device for sorting the blog articles related to the search term by the size of the correlation weights and feeding the sorting result back to the user.
Preferably, the searcher for performing query sorting according to user input comprises: a device for receiving a query string entered by the user and splitting the query string into several search terms;
a device for querying the correlation weights between each search term and each blog article from the index that has been built, and calculating the multiple correlation weights between the query string and the blog articles;
a device for sorting the blog articles related to the query string by the size of the multiple correlation weights and feeding the sorting result back to the user.
To better achieve the object of the invention, the method for sorting blog articles comprises building an index and performing query sorting according to user input, the index-building step comprising:
A. extracting correlation factors from a blog system;
B. calculating the correlation weights between search terms and each blog article according to the correlation factors, while identifying blog articles with text cheating and down-weighting them;
C. building the index between search terms and each blog article according to the down-weighted correlation weights.
Preferably, the correlation factors in step A are text correlation factors, and the correlation weights are text correlation weights.
Preferably, the correlation factors in step A comprise numerical correlation factors and text correlation factors, and the correlation weights are composite correlation weights obtained by superposing the numerical correlation weights and the text correlation weights.
Preferably, the step of identifying and down-weighting blog articles with text cheating in step B further comprises:
B1. traversing a blog article with a sliding window of fixed maximum capacity and variable length and recording the maximum length the sliding window reaches, where capacity means the number of distinct words the sliding window holds and length means the total number of words in the sliding window;
B2. when the traversal ends, comparing the maximum length reached by the sliding window with a threshold and, if the threshold is exceeded, judging that the blog article contains text cheating;
B3. down-weighting the correlation weights of the blog article that contains text cheating.
Preferably, the step of performing query sorting according to user input comprises:
D. receiving a search term entered by the user;
E. querying, according to the search term, the correlation weights between that term and each blog article from the index that has been built;
F. sorting the blog articles related to the search term by the size of the correlation weights and feeding the sorting result back to the user.
Preferably, the step of performing query sorting according to user input comprises:
D'. receiving a query string entered by the user and splitting the query string into several search terms;
E'. querying the correlation weights between each search term and each blog article from the index that has been built, and calculating the multiple correlation weights between the query string and the blog articles;
F'. sorting the blog articles related to the query string by the size of the multiple correlation weights and feeding the sorting result back to the user.
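The two query flows can be sketched in a few lines of Python. This is an illustration only: the dictionary layout of the index, the function names, the whitespace tokenizer and the sample data are all assumptions. The text also does not specify how the multiple correlation weight is derived from the per-term weights, so a plain sum is used here purely as a placeholder.

```python
from collections import defaultdict

def rank_single_term(index, term):
    """Steps D-F: look up one search term, sort articles by descending weight."""
    postings = index.get(term, {})
    return sorted(postings.items(), key=lambda item: item[1], reverse=True)

def rank_query_string(index, query, tokenize=str.split):
    """Steps D'-F': split the query string, combine per-term weights, sort."""
    combined = defaultdict(float)
    for term in tokenize(query):
        for article_id, weight in index.get(term, {}).items():
            # Assumption: the multiple correlation weight is a plain sum.
            combined[article_id] += weight
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)

# Hypothetical index: term -> {article id -> down-weighted correlation weight}
index = {
    "blog": {"a1": 0.9, "a2": 0.4},
    "search": {"a2": 0.7, "a3": 0.2},
}
single = rank_single_term(index, "blog")
multi = rank_query_string(index, "blog search")
```

In a real deployment the splitting step would be done by a proper word segmenter rather than `str.split`, which is used here only to keep the sketch self-contained.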
Because the invention identifies and handles text-cheating blog articles during the calculation of the correlation weights, it builds an accurate index and sorts the retrieved blog articles based on that index, which improves the objectivity and accuracy of the sorting and safeguards the user's retrieval quality.
Description of the drawings
Fig. 1 is a structural diagram of the system for sorting blog articles according to the present invention;
Fig. 2 is a structural diagram of the indexer in one embodiment of the invention;
Fig. 3 is a structural diagram of the searcher in one embodiment of the invention;
Fig. 4 is a flowchart of the method by which the invention builds the index in the process of sorting blog articles;
Fig. 5 is a flowchart of the method by which one embodiment of the invention builds the index in the process of sorting blog articles;
Fig. 6 is a flowchart of the method by which the invention identifies and handles text cheating while building the index;
Fig. 7 is a flowchart of the method by which one embodiment of the invention identifies and handles text cheating while building the index as in Fig. 4 or Fig. 5;
Fig. 8 is a flowchart of the method by which one embodiment of the invention sorts blog articles based on the index built in Fig. 4 or Fig. 5;
Fig. 9 is a flowchart of the method by which another embodiment of the invention sorts blog articles based on the index built in Fig. 4 or Fig. 5.
Detailed description of the embodiments
To make the purpose, technical solution and advantages of the present invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the invention and are not intended to limit it.
The present invention sorts blog articles based on an index, and the index is built by calculating the correlation weights between search terms and blog articles. During the calculation of the correlation weights, the invention uses a spam-post ("water post") recognition algorithm to identify blog articles with text cheating and down-weights them. It can therefore build a more accurate index, which improves the objectivity and accuracy of sorting based on that index and safeguards the quality of the user's blog retrieval.
Fig. 1 shows the structure of the system for sorting blog articles according to the present invention. The system comprises blog system 100, indexer 200, searcher 300, proxy server 400 and client 500. Note that in all drawings of the invention the connections between devices are shown in order to explain their information exchange and control processes clearly; they should therefore be regarded as logical connections and are not limited to physical connections. In this system:
(1) Blog system 100 provides blog-related services to users, including storing and managing blog articles. In the present invention it also provides correlation factors to indexer 200, comprising text correlation factors (for example, the category, title and body of the text, the nickname and the space name) and numerical correlation factors (for example, the activity factor, the repost-rate factor, the reply-rate factor and the publication-time factor). The core of blog system 100 may be a website server, but the invention does not limit its concrete form.
(2) Indexer 200 builds the index from the data of blog system 100, so that searcher 300 can sort the retrieved blog articles based on this index.
In one embodiment, as shown in Fig. 2, indexer 200 further comprises numerical correlation judging unit 201, text correlation judging unit 202, text cheating recognition unit 203, superposition computing unit 204 and index construction unit 205. Numerical correlation judging unit 201 calculates the numerical correlation weights between search terms and each blog article from the numerical correlation factors extracted from blog system 100. Text correlation judging unit 202 calculates the text correlation weights between search terms and each blog article from the text correlation factors extracted from blog system 100. While text correlation judging unit 202 calculates the text correlation weights between search terms and blog articles, text cheating recognition unit 203 identifies and down-weights blog articles with text cheating and sends the down-weighted text correlation weights to text correlation judging unit 202. Superposition computing unit 204 superposes the numerical correlation weights and the text correlation weights to obtain the composite correlation weight of the search term and passes it to index construction unit 205. Index construction unit 205 builds the index from this composite correlation weight.
In another embodiment, indexer 200 comprises only text correlation judging unit 202, text cheating recognition unit 203 and index construction unit 205. Text correlation judging unit 202 calculates the text correlation weights between search terms and each blog article from the text correlation factors extracted from blog system 100. Text cheating recognition unit 203 identifies blog articles with text cheating, down-weights their text correlation weights, and sends the processed text correlation weights to text correlation judging unit 202 for forwarding. Index construction unit 205 builds the index between search terms and each blog article from the text correlation weights it receives. Although this embodiment is workable, its index-building process considers only the text correlation factors, so the index is not accurate enough. The structure of indexer 200 in the previous embodiment is therefore the more widely used and typical one in current applications.
(3) Searcher 300 performs queries according to the search terms entered by the user and sorts the blog articles.
In one embodiment, as shown in Fig. 3, searcher 300 further comprises query unit 301, multiple correlation computing unit 302 and sorting unit 303. In this embodiment the user initially enters a query string containing several search terms, which proxy server 400 splits into search terms and sends to searcher 300. Searcher 300 processes the search terms after receiving them: query unit 301 queries the correlation weights (text correlation weights or composite correlation weights) between each search term and each blog article from the index built by indexer 200 and passes them on; multiple correlation computing unit 302 calculates the multiple correlation weights between the query string and each blog article from the correlation weights of the individual search terms and sends them to sorting unit 303; sorting unit 303 sorts the blog articles related to the query string according to the multiple correlation weights.
In another embodiment, searcher 300 comprises only query unit 301 and sorting unit 303. This embodiment applies when the user enters a single search term rather than a query string, so in this embodiment searcher 300 can connect to and communicate with client 500 directly. Query unit 301 queries, according to the search term entered by the user, the correlation weights (text correlation weights or composite correlation weights) between that term and each blog article from the index built by indexer 200 and sends them to sorting unit 303. Sorting unit 303 sorts the blog articles related to the search term by the size of the correlation weights it receives. Note that because most users currently enter query strings containing several search terms, the structure of searcher 300 in the previous embodiment is the more widely used and typical one in current applications.
(4) Proxy server 400 receives the query string sent by client 500, splits it into search terms, sends them to searcher 300, and forwards the results retrieved and sorted by searcher 300 to client 500. In an embodiment of the invention in which searcher 300 has the structure shown in Fig. 3, proxy server 400 is a necessary part of the system.
(5) A user is logged in on client 500, which receives the search term or query string entered by the user. If the user enters a search term, the client can send it directly to searcher 300 and, after receiving the blog article sorting result fed back by searcher 300, render and display the result in the user interface. If the user enters a query string, the client must send it to proxy server 400 for splitting and, after receiving the blog article sorting result fed back by proxy server 400, render and display the result in the user interface. Client 500 can typically be any terminal device that can log in to the Internet, for example a personal computer (PC), a personal digital assistant (PDA) or a mobile phone (MP), so the scope of protection of the invention should not be limited to a particular type of client.
Fig. 4 shows the method flow by which the invention builds the index in the process of sorting blog articles. It comprises the following steps:
In step S401, indexer 200 extracts correlation factors from blog system 100 and formats the data. The correlation factors referred to in the invention comprise text correlation factors (for example, the category, title and body of the text, the nickname and the space name) and numerical correlation factors (for example, the activity factor, the repost-rate factor, the reply-rate factor and the publication-time factor). The values of most correlation factors are mapped to a fixed interval, for example [0, 100], while a small number keep the original value of the data. When indexer 200 builds the index, these correlation factors serve as the input parameters for calculating the correlation weights.
In step S402, indexer 200 calculates the correlation weights between search terms and each blog article, while identifying and down-weighting blog articles with text cheating.
In one embodiment, indexer 200 considers only the text correlation factors. It calculates the text correlation weights of the search term from the text correlation factors, identifies the blog articles with text cheating, and then applies a suitable down-weighting to the text correlation weight between the search term and each such article so that the article is ranked lower.
In another embodiment, indexer 200 considers not only the text correlation factors but also the numerical correlation factors. It calculates the text correlation weights and the numerical correlation weights separately, identifies the blog articles with text cheating, applies a suitable down-weighting to the text correlation weight between the search term and each such article, and finally superposes the text correlation weights and the numerical correlation weights to obtain the composite correlation weights. The previous embodiment thus down-weights only the text correlation weights, whereas in this embodiment the down-weighting of the text correlation weights in effect also acts on the composite correlation weights. Because this embodiment also takes the numerical correlation factors into account, it further improves the accuracy of the data.
In step S403, indexer 200 builds the index between search terms and each blog article from the down-weighted correlation weights. The index records each search term, the blog articles corresponding to that term, and the correlation weights between the term and those articles. When a user enters a search term, the retrieved blog articles can then be sorted according to the data in the index, so that the user can quickly find the most relevant blog articles.
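As a rough illustration of what step S403 records, the index can be pictured as a mapping from each search term to the articles it matches and their weights. The sketch below assumes the down-weighted weights have already been computed; the function name, the tuple layout and the sample values are hypothetical and not taken from the patent.

```python
def build_index(weighted_pairs):
    """Build a term -> {article id -> correlation weight} mapping.

    `weighted_pairs` is an iterable of (term, article_id, weight) tuples,
    where `weight` is assumed to be the already down-weighted value.
    """
    index = {}
    for term, article_id, weight in weighted_pairs:
        index.setdefault(term, {})[article_id] = weight
    return index

# Hypothetical sample data
pairs = [
    ("blog", "a1", 0.9),
    ("blog", "a2", 0.4),
    ("search", "a2", 0.7),
]
index = build_index(pairs)
```

Keeping the weights inside the index means that sorting at query time only needs a lookup and a sort, with no recomputation of the correlation weights.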
Fig. 5 shows the method flow by which one embodiment of the invention builds the index in the process of sorting blog articles. The flow is based on the structures shown in Fig. 1 and Fig. 2 and comprises:
In step S501, indexer 200 extracts correlation factors from blog system 100 and formats the data. The correlation factors referred to in the invention comprise text correlation factors (for example, the category, title and body of the text, the nickname and the space name) and numerical correlation factors (for example, the activity factor, the repost-rate factor, the reply-rate factor and the publication-time factor).
In step S502, indexer 200 uses its numerical correlation judging unit 201 to calculate the numerical correlation weights between search terms and each blog article.
In one embodiment, the numerical correlation factors comprise four kinds: the activity factor W_PO, the repost-rate factor W_DU, the reply-rate factor W_RE and the publication-time factor W_PT. The activity factor W_PO is calculated by blog system 100 and ranges over [0, 100]. It takes into account how often the user of the blog personal space logs in, how often blog articles are published and similar factors, and is a composite measure of how active the personal space is; the higher the activity, the higher the priority of the blog article in the sorting result. The repost-rate factor W_DU is calculated from the number of duplicates of the blog article obtained from the deduplication system and ranges over [0, 100]; the higher the repost rate, the higher the priority of the blog article in the sorting result. The reply-rate factor W_RE is calculated from the number of replies to the blog article and ranges over [0, 100]; the higher W_RE, the higher the priority of the blog article in the sorting result. The publication-time factor W_PT is the publication time of the blog article and may be expressed as UNIX time; the more recently an article was published, the higher its priority in the sorting result. The numerical correlation weight is obtained from all the correlation factors listed above by linear calculation and normalization. It lies in the interval [0, 1] and is calculated as follows:
$$W_{NUM} = \frac{\sum_i \lambda_i \times W_i}{\mathrm{MAX\_VALUE}} \qquad (1)$$
where W_i are the correlation factors listed above and λ_i are the corresponding correction factors, used to increase or decrease the effect of each correlation factor; reasonably good values of λ_i can be determined while tuning the sorting results. MAX_VALUE is the largest value the weighted sum can take. Note that the above formula is only an example and is not intended to limit the scope of protection of the invention; similar formulas may also be used.
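A small numeric sketch of formula (1) may help. It assumes that all four factors have already been mapped to [0, 100] and that the correction factors sum to 1, so that MAX_VALUE is 100; the function name, the dictionary keys and all numbers are illustrative assumptions rather than values from the patent.

```python
def numerical_weight(factors, corrections, max_value):
    """Formula (1): weighted linear sum of the factors, normalized by max_value."""
    total = sum(corrections[name] * value for name, value in factors.items())
    return total / max_value

# Hypothetical factor values, each assumed to be mapped to [0, 100] already
factors = {"PO": 80.0, "DU": 50.0, "RE": 20.0, "PT": 100.0}
# Hypothetical correction factors; they sum to 1, so the maximum sum is 100
corrections = {"PO": 0.4, "DU": 0.2, "RE": 0.2, "PT": 0.2}

w_num = numerical_weight(factors, corrections, max_value=100.0)
```

With these inputs the weighted sum is 32 + 10 + 4 + 20 = 66, so the normalized weight is 0.66, which lies in [0, 1] as the text requires.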
In step S503, indexer 200 uses its text correlation judging unit 202 to calculate the text correlation weights between search terms and each blog article, and uses text cheating recognition unit 203 to down-weight blog articles with text cheating. In the present invention, the text correlation factors are the text fields that can be used for retrieval.
In one embodiment, these text fields are the five fields category, title, body, nickname and space name. Each field has a fixed weight W and a correction factor λ, as listed in the following table:
Field name     Correction factor     Weight
Category       λ_CA                  W_CA
Title          λ_TI                  W_TI
Body           λ_CO                  W_CO
Nickname       λ_NI                  W_NI
Space name     λ_ZO                  W_ZO
The text correlation weight is calculated as follows:
$$W_{TEXT} = \lambda_{CA} \times W_{CA} + \lambda_{TI} \times W_{TI} + \lambda_{CO} \times W_{CO} + \lambda_{NI} \times W_{NI} + \lambda_{ZO} \times W_{ZO} \qquad (2)$$
where λ_CA + λ_TI + λ_CO + λ_NI + λ_ZO = 1. Note that the above formula is only an example and is not intended to limit the scope of protection of the invention; similar formulas may also be used.
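Formula (2) can be sketched as a weighted sum over the five fields. The text does not say how the per-field weights W are obtained for a given search term, so they are simply passed in here; the function name and all numbers are hypothetical.

```python
FIELDS = ("CA", "TI", "CO", "NI", "ZO")  # category, title, body, nickname, space name

def text_weight(field_weights, corrections):
    """Formula (2): sum of correction factor times field weight over all fields."""
    if abs(sum(corrections[f] for f in FIELDS) - 1.0) > 1e-9:
        raise ValueError("correction factors must sum to 1")
    return sum(corrections[f] * field_weights[f] for f in FIELDS)

# Hypothetical values: the term matches the title strongly and the body partly
field_weights = {"CA": 0.0, "TI": 1.0, "CO": 0.5, "NI": 0.0, "ZO": 0.0}
corrections = {"CA": 0.1, "TI": 0.4, "CO": 0.3, "NI": 0.1, "ZO": 0.1}

w_text = text_weight(field_weights, corrections)
```

Here the result is 0.4 × 1.0 + 0.3 × 0.5 = 0.55. Giving the title the largest correction factor is just one plausible tuning choice for this example.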
After the text correlation weights have been obtained, text cheating recognition unit 203 further identifies the blog articles with text cheating. As shown in Fig. 6, the process comprises: S601, traversing the blog article with a sliding window and recording the maximum length the sliding window reaches; S602, comparing the maximum length of the sliding window with a threshold and, if the threshold is exceeded, judging that the blog article contains text cheating; S603, applying a suitable down-weighting to the correlation weight of the blog article, for example an amplitude adjustment that reduces the text correlation weight to 60% of its previous value. The detailed process of identifying and handling text cheating is described with reference to Fig. 7.
In step S504, indexer 200 uses its superposition computing unit 204 to superpose the numerical correlation weights and the text correlation weights and obtain the composite correlation weights. In one embodiment, the superposition formula is as follows:
$$\mathrm{Weight}(q, d) = \lambda_{text} \times W_{text} + \lambda_{num} \times W_{num} \qquad (3)$$
where λ_text and λ_num are the correction factors of the two kinds of correlation weights in the superposition. Their sizes can be adjusted flexibly, with λ_text + λ_num = 1. Note that the above formula is only an example and is not intended to limit the scope of protection of the invention; similar formulas may also be used.
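The following sketch combines formula (3) with the 60% down-weighting mentioned as an example for step S603, to show how a penalty on the text correlation weight carries over into the composite weight. The function name, the default penalty argument and the numbers are assumptions made for illustration.

```python
def composite_weight(w_text, w_num, lambda_text, lambda_num,
                     is_cheating=False, penalty=0.6):
    """Formula (3), with an optional down-weighting of the text weight.

    `penalty=0.6` mirrors the "60% of its previous value" example; the
    actual factor is a tuning choice.
    """
    if abs(lambda_text + lambda_num - 1.0) > 1e-9:
        raise ValueError("lambda_text + lambda_num must equal 1")
    if is_cheating:
        w_text = w_text * penalty  # down-weight only the text part
    return lambda_text * w_text + lambda_num * w_num

# Hypothetical inputs
normal = composite_weight(0.5, 0.66, lambda_text=0.7, lambda_num=0.3)
cheating = composite_weight(0.5, 0.66, lambda_text=0.7, lambda_num=0.3,
                            is_cheating=True)
```

With these numbers the composite weight drops from 0.548 to 0.408 for the article flagged as text cheating, while its numerical correlation weight is left untouched.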
In step S505, indexer 200 uses its index construction unit 205 to build the index from the composite correlation weights and store it, so that it can be consulted when a user searches.
Fig. 7 shows the method flow by which one embodiment of the invention uses the spam-post ("water post") recognition algorithm to identify and handle text cheating while building the index. The algorithm traverses the entire article from left to right with a sliding window of fixed maximum capacity and variable length, and records the maximum length the window ever reaches. The "capacity" of the window is defined as the number of distinct words the window holds; the "length" of the window is defined as the total number of words in the window, that is, the distance between its left and right boundaries. The window always extends as far as possible (the right boundary moves right) and shortens only when the maximum capacity is exceeded (the left boundary moves right). For a fixed window capacity, an article with a poor vocabulary produces a longer window. The larger the maximum window length of a blog article, therefore, the more likely it is to be a low-quality article with text cheating.
In this algorithm, let the capacity of the sliding window be C, with its maximum set to C_max. An array that grows by one entry at a time (C' = C + 1) stores the distinct words in the sliding window and is referred to as the "window vocabulary". Let the length of the sliding window be L, with its threshold set to L_T.
In step S701, the first word of the blog article is read into the moving window, and capacity C = 1 and length L = 1 are recorded.
In step S702, it is judged whether a next word can be read: if so, step S703 is executed; if not, the flow goes to step S710.
In step S703, the right boundary of the moving window moves right, and the newly read word is included in the moving window.
In step S704, it is judged whether this word is already present in the window vocabulary: if so, step S705 is executed; if not, step S706 is executed.
In step S705, the window vocabulary and capacity C remain unchanged and length L is incremented; after this step, the flow returns to step S702 to continue reading.
In step S706, since the word is not present in the window vocabulary, it is added to the window vocabulary, and capacity C and length L are both incremented.
In step S707, it is judged whether the window capacity C exceeds the maximum $C_{max}$: if so, step S708 is executed; if not, the flow returns to step S702 to continue reading.
In step S708, because the window capacity C exceeds the maximum $C_{max}$, the left boundary of the window moves right, and the window shrinks to contain only the most recently read word.
In step S709, it is judged whether the traversal of this blog article is complete: if so, step S710 is executed; if not, the flow returns to step S702 to continue reading.
In step S710, when the traversal of the blog article is complete, the article is assessed according to the recorded maximum length of the moving window: if that maximum length is greater than threshold $L_T$, the blog article contains text cheating, and its text relevance weight needs to be down-weighted.
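The flow of steps S701 to S710 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the patented implementation: the function names `max_window_length` and `has_text_cheating` are hypothetical, the article is assumed to be already segmented into a list of words, a Python set stands in for the window vocabulary array, and the maximum length is recorded only for windows that stay within the maximum capacity (one possible reading of the description).

```python
def max_window_length(words, max_capacity):
    """Traverse the words left to right and return the longest window length seen.

    Capacity = number of distinct words in the window; length = total words in it.
    """
    vocabulary = set()   # the "window vocabulary" (distinct words in the window)
    length = 0           # current window length L
    longest = 0          # maximum window length recorded so far
    for word in words:   # S702/S703: read the next word, right boundary moves right
        if word in vocabulary:
            # S705: vocabulary and capacity unchanged, length grows
            length += 1
        elif len(vocabulary) + 1 > max_capacity:
            # S706-S708: capacity would exceed the maximum, so the window
            # shrinks to contain only the most recently read word
            vocabulary = {word}
            length = 1
        else:
            # S706: new word fits, capacity and length both grow
            vocabulary.add(word)
            length += 1
        longest = max(longest, length)
    return longest


def has_text_cheating(words, max_capacity, length_threshold):
    """S710: the article is flagged if the maximum window length exceeds the threshold."""
    return max_window_length(words, max_capacity) > length_threshold
```

With a maximum capacity of 3, the repetitive word list `["buy", "cheap"] * 4` produces a maximum window length of 8, whereas the nine words of "the quick brown fox jumps over the lazy dog" produce a maximum window length of 3, which shows how a poor vocabulary stretches the window.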
Fig. 8 shows the method flow, in one embodiment of the present invention, for sorting blog articles based on the index built in Fig. 4 or Fig. 5. This embodiment covers the case where the user enters a term, and comprises:
In step S801, searcher 300 receives the term entered by the user at client 500.
In step S802, searcher 300 extracts, from the index built by indexer 200, the correlativity weights between the term and each blog article. These correlativity weights may be text relevance weights, or comprehensive correlativity weights obtained by superposing text relevance weights and numerical correlativity weights.
In step S803, searcher 300 sorts the retrieved blog articles according to the correlativity weights and feeds the sorted result back to client 500.
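Steps S801 to S803 can be sketched as a lookup followed by a sort. The index layout used here (a dictionary mapping each term to a dictionary of article identifiers and weights) and the function name `rank_for_term` are assumptions made for illustration; the patent does not prescribe a storage format.

```python
def rank_for_term(index, term):
    """Return (article_id, weight) pairs for one term, highest weight first.

    index is assumed to map term -> {article_id: correlativity weight}.
    """
    postings = index.get(term, {})   # S802: extract weights from the index
    # S803: sort the hit articles by descending correlativity weight
    return sorted(postings.items(), key=lambda item: item[1], reverse=True)


# Hypothetical index contents, used only to show the call.
example_index = {"python": {"article_a": 0.2, "article_b": 0.9, "article_c": 0.5}}
ranking = rank_for_term(example_index, "python")
```

Here `ranking` lists `article_b` first, then `article_c`, then `article_a`; a term that is absent from the index yields an empty list.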
Fig. 9 shows the method flow, in another embodiment of the present invention, for sorting blog articles based on the index built in Fig. 4 or Fig. 5. This embodiment covers the case where the user enters a retrieval string, and specifically comprises:
In step S901, proxy server 400 segments the retrieval string entered by the user at client 500 into terms and sends them to searcher 300.
In step S902, searcher 300 extracts, from the index built by indexer 200, the correlativity weights between each term and each blog article. These correlativity weights may be text relevance weights, or comprehensive correlativity weights obtained by superposing text relevance weights and numerical correlativity weights.
In step S903, searcher 300 calculates the composite correlativity weights between the retrieval string and the blog articles.
In the present invention, the correlation between a user-entered retrieval string and a blog article can be regarded as the combined result of the correlations between each individual term and that blog article. Therefore, in one embodiment, the composite correlativity weight is calculated by simple addition followed by averaging. For a retrieval string $Q=\{q_1, q_2, \ldots, q_n\}$, where n is the number of terms obtained by segmenting the retrieval string and d is any blog article hit by these terms, the composite correlativity weight between retrieval string Q and blog article d is calculated as:
$$\mathrm{Weight}(Q,d)=\frac{\sum_{i=1}^{n}\mathrm{Weight}(q_i,d)}{n} \qquad (4)$$
It should be noted that the above formula is only an example and is not intended to limit the scope of protection of the present invention; other similar formulas may also be used.
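The averaging model of formula (4) can be sketched as follows. The index layout (term mapped to a dictionary of article identifiers and weights), the function names, and the choice to count a term that does not hit an article as contributing a weight of 0 are all assumptions made for this illustration.

```python
def composite_weights(index, terms):
    """Average the per-term weights of each article over the n query terms.

    index is assumed to map term -> {article_id: Weight(q_i, d)}.
    A term that does not hit an article contributes 0 (an assumption).
    """
    n = len(terms)
    if n == 0:
        return {}
    totals = {}
    for term in terms:
        for article_id, weight in index.get(term, {}).items():
            totals[article_id] = totals.get(article_id, 0.0) + weight
    return {article_id: total / n for article_id, total in totals.items()}


def rank_for_query(index, terms):
    """Sort the hit articles by composite weight, highest first."""
    weights = composite_weights(index, terms)
    return sorted(weights.items(), key=lambda item: item[1], reverse=True)


# Hypothetical index contents, used only to show the call.
example_index = {
    "blog": {"a": 0.6, "b": 0.2},
    "ranking": {"a": 0.4, "c": 0.8},
}
order = [article for article, _ in rank_for_query(example_index, ["blog", "ranking"])]
```

For the two terms "blog" and "ranking", article `a` scores (0.6 + 0.4) / 2 = 0.5, article `c` scores 0.8 / 2 = 0.4 and article `b` scores 0.2 / 2 = 0.1, so `order` is `a`, `c`, `b`.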
In step S904, searcher 300 sorts the retrieved blog articles according to the composite correlativity weights and sends the sorted result to proxy server 400.
In step S905, proxy server 400 forwards the sorted result to client 500, where it is displayed on the user interface.
The above are only preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (10)

1. A system for sorting blog articles, comprising an indexer for building an index and a searcher for querying and sorting according to user input, characterized in that the indexer comprises:
a device for extracting correlation factors from a blog system;
a device for calculating the correlativity weights between terms and each blog article according to the correlation factors, and at the same time identifying and down-weighting blog articles with text cheating;
a device for building the index between terms and each blog article according to the down-weighted correlativity weights;
wherein the device for calculating the correlativity weights between terms and each blog article according to the correlation factors, and at the same time identifying and down-weighting blog articles with text cheating, further comprises:
a device for traversing the blog article with a moving window of fixed maximum capacity and variable length and recording the maximum length reached by the moving window, wherein the capacity is the number of different words the moving window holds and the length is the total number of words in the moving window;
a device for comparing, when the traversal ends, the maximum length reached by the moving window with a threshold, and judging that the blog article contains text cheating if the threshold is exceeded;
a device for down-weighting the correlativity weights of blog articles that contain text cheating.
2. The system for sorting blog articles according to claim 1, characterized in that the correlation factors are text relevance factors and the correlativity weights are text relevance weights.
3. The system for sorting blog articles according to claim 1, characterized in that the correlation factors extracted by the device for extracting correlation factors from the blog system comprise numerical correlation factors and text relevance factors, and the correlativity weights are comprehensive correlativity weights obtained by superposing numerical correlativity weights and text relevance weights.
4. The system for sorting blog articles according to any one of claims 1 to 3, characterized in that the searcher for querying and sorting according to user input comprises:
a device for receiving a term entered by the user;
a device for querying, according to the term, the correlativity weights between the term and each blog article from the index that has been built;
a device for sorting the blog articles relevant to the term according to the size of the correlativity weights and feeding the sorted result back to the user.
5. The system for sorting blog articles according to any one of claims 1 to 3, characterized in that the searcher for querying and sorting according to user input comprises:
a device for receiving a retrieval string entered by the user and segmenting the retrieval string into a plurality of terms;
a device for querying the correlativity weights between each term and each blog article from the index that has been built, and calculating the composite correlativity weights between the retrieval string and the blog articles;
a device for sorting the blog articles relevant to the retrieval string according to the size of the composite correlativity weights and feeding the sorted result back to the user.
6. A method for sorting blog articles, comprising building an index and querying and sorting according to user input, characterized in that the step of building the index comprises:
A. extracting correlation factors from a blog system;
B. calculating the correlativity weights between terms and each blog article according to the correlation factors, and at the same time identifying and down-weighting blog articles with text cheating;
C. building the index between terms and each blog article according to the down-weighted correlativity weights;
wherein the step in step B of identifying and down-weighting blog articles with text cheating further comprises:
B1. traversing the blog article with a moving window of fixed maximum capacity and variable length and recording the maximum length reached by the moving window, wherein the capacity is the number of different words the moving window holds and the length is the total number of words in the moving window;
B2. when the traversal ends, comparing the maximum length reached by the moving window with a threshold, and judging that the blog article contains text cheating if the threshold is exceeded;
B3. down-weighting the correlativity weights of blog articles that contain text cheating.
7. The method for sorting blog articles according to claim 6, characterized in that the correlation factors in step A are text relevance factors and the correlativity weights are text relevance weights.
8. The method for sorting blog articles according to claim 6, characterized in that the correlation factors in step A comprise numerical correlation factors and text relevance factors, and the correlativity weights are comprehensive correlativity weights obtained by superposing numerical correlativity weights and text relevance weights.
9. The method for sorting blog articles according to any one of claims 6 to 8, characterized in that the step of querying and sorting according to user input comprises:
D. receiving a term entered by the user;
E. querying, according to the term, the correlativity weights between the term and each blog article from the index that has been built;
F. sorting the blog articles relevant to the term according to the size of the correlativity weights, and feeding the sorted result back to the user.
10. The method for sorting blog articles according to any one of claims 6 to 8, characterized in that the step of querying and sorting according to user input comprises:
D'. receiving a retrieval string entered by the user, and segmenting the retrieval string into a plurality of terms;
E'. querying the correlativity weights between each term and each blog article from the index that has been built, and calculating the composite correlativity weights between the retrieval string and the blog articles;
F'. sorting the blog articles relevant to the retrieval string according to the size of the composite correlativity weights, and feeding the sorted result back to the user.
CNB2007101236257A 2007-09-25 2007-09-25 A kind of method and system that blog articles is sorted Active CN100545847C (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CNB2007101236257A CN100545847C (en) 2007-09-25 2007-09-25 A kind of method and system that blog articles is sorted
PCT/CN2008/072319 WO2009046649A1 (en) 2007-09-25 2008-09-10 Method and device of text sorting and method and device of text cheating recognizing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CNB2007101236257A CN100545847C (en) 2007-09-25 2007-09-25 A kind of method and system that blog articles is sorted

Publications (2)

Publication Number Publication Date
CN101127046A CN101127046A (en) 2008-02-20
CN100545847C true CN100545847C (en) 2009-09-30

Family

ID=39095078

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB2007101236257A Active CN100545847C (en) 2007-09-25 2007-09-25 A kind of method and system that blog articles is sorted

Country Status (2)

Country Link
CN (1) CN100545847C (en)
WO (1) WO2009046649A1 (en)

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN100545847C (en) * 2007-09-25 2009-09-30 腾讯科技(深圳)有限公司 A kind of method and system that blog articles is sorted
CN102385585A (en) * 2010-08-27 2012-03-21 阿里巴巴集团控股有限公司 Establishing method of webpage database, webpage searching method and relative device
CN101984422B (en) * 2010-10-18 2013-05-29 百度在线网络技术(北京)有限公司 Fault-tolerant text query method and equipment
CN102841908A (en) * 2011-06-21 2012-12-26 富士通株式会社 Micro-blog content ordering method and micro-blog content ordering device
CN103324637B (en) * 2012-03-23 2017-12-12 深圳市世纪光速信息技术有限公司 A kind of hot information method for digging and system
CN103365845B (en) * 2012-03-26 2018-07-27 腾讯科技(北京)有限公司 A kind of searching method in microblogging and system
CN103049511B (en) * 2012-03-28 2016-02-03 温州大学 The display packing of a kind of microblogging concern list, content of microblog and client thereof
CN103257982A (en) * 2012-06-13 2013-08-21 苏州大学 Blog search result ranking algorithm based on follow relationship
CN102880665A (en) * 2012-09-05 2013-01-16 常州嘴馋了信息科技有限公司 Webpage blog showing system
CN103218443A (en) * 2013-04-22 2013-07-24 中山大学 Blogging webpage retrieval system and retrieval method
CN103810251B (en) * 2014-01-21 2017-05-10 南京财经大学 Method and device for extracting text
CN104899310B (en) * 2015-06-12 2018-01-19 百度在线网络技术(北京)有限公司 Information sorting method, the method and device for generating information sorting model
CN105138573A (en) * 2015-07-28 2015-12-09 沈阳化工大学 PHP based multi-user light blog system
CN106446087A (en) * 2016-09-12 2017-02-22 福建中金在线信息科技有限公司 Method and device for acquiring thematic information

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1529263A (en) * 2003-09-18 2004-09-15 北京邮电大学 Chinese text auto-segmenting and text plagiarism discrimination device and method
US8244720B2 (en) * 2005-09-13 2012-08-14 Google Inc. Ranking blog documents
CN100520767C (en) * 2007-05-31 2009-07-29 腾讯科技(深圳)有限公司 Method and system for judging article importance in network, and sliding window
CN100545847C (en) * 2007-09-25 2009-09-30 腾讯科技(深圳)有限公司 A kind of method and system that blog articles is sorted

Also Published As

Publication number Publication date
WO2009046649A1 (en) 2009-04-16
CN101127046A (en) 2008-02-20

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C41 Transfer of patent application or patent right or utility model
TR01 Transfer of patent right

Effective date of registration: 20151221

Address after: Floors 5-10, Fiyta Building, South Road, High-tech Zone, Nanshan District, Shenzhen, Guangdong Province, 518057

Patentee after: Shenzhen Tencent Computer System Co., Ltd.

Address before: Floors 5-10, Fiyta High-tech Building, High-tech South Road, High-tech Park, Shenzhen, Guangdong Province, 518057

Patentee before: Tencent Technology (Shenzhen) Co., Ltd.