WO2009046649A1

WO2009046649A1 - Method and device of text sorting and method and device of text cheating recognizing

Info

Publication number: WO2009046649A1
Application number: PCT/CN2008/072319
Authority: WO
Inventors: Rongfang Shao; Haiquan Xie; Liang Dong
Original assignee: Tencent Technology (Shenzhen) Company Limited
Priority date: 2007-09-25
Filing date: 2008-09-10
Publication date: 2009-04-16
Also published as: CN101127046A; CN100545847C

Abstract

A text sorting method and device are disclosed. The text sorting method includes: recognizing text that exhibits cheating; and correcting the position of the cheating text in the sorting queue according to the recognition result. A text cheating recognition method and device are also disclosed.

Description

The present invention relates to the field of computers, and more particularly to a text sorting method and apparatus, and a text cheating recognition method and apparatus.

BACKGROUND

With the development of the Internet, weblogs (blogs) have become a common network service. At present, a large number of Internet companies have launched their own blog search engines. These blog search engines sort the retrieved blog posts in different ways, but they all process the search string entered by the user, compute the most relevant set of results, and return it to the user so that the user can find the blog post that best matches their expectations. The two sorting methods currently prevalent are sorting by relevance and sorting by time, with sorting by relevance being the more typical.

The specific process of sorting by relevance is as follows: first, the text relevance weight between the search string and each blog post, and the numerical relevance weight of each blog post, are calculated, and an index between search strings and blog posts is established according to these relevance weights. When the user performs a search, retrieval is carried out according to the search string entered by the user, the blog posts are sorted by the size of their relevance weights, and the sorted result is finally sent to the user for display. When calculating the text relevance weight between the search string and each blog post, the search string is generally decomposed into a plurality of search terms, so that the text relevance weight between the search string and the blog post is decomposed into the text relevance weights between the individual search terms and the blog post.

Although the above sorting method can provide users with reasonably credible rankings of blog posts, some low-quality articles that contain only a few distinct words, in whole or in part, can nevertheless obtain a higher ranking under this method by repeating and stacking words. This is a typical text cheating phenomenon.

This kind of text cheating also affects other text sorting processes besides blog posts, such as the web page sorting process during the search process.

Therefore, a new text sorting method is needed to reduce the impact of text cheating on the sorting result.

SUMMARY OF THE INVENTION

Embodiments of the present invention provide a text sorting method and apparatus to reduce the impact of text cheating on sorting results.

The embodiment of the invention also provides a text cheating recognition method and device to identify text that exhibits cheating.

A text sorting method provided by the embodiment of the invention includes:

An embodiment of the present invention provides a text sorting apparatus, in which text quality serves as a basis for sorting, and the apparatus includes:

a first module, configured to identify text that exhibits cheating;

and a second module, configured to correct, according to the recognition result of the first module, the position of the cheating text in a sorting queue.

A text cheating recognition method provided by the embodiment of the invention includes:

The text is traversed by a moving sliding window, wherein the sliding window moves as follows: the window length of the sliding window is gradually increased from the initial value, and the window capacity of the sliding window is recorded each time the window length is increased; when the window capacity reaches the maximum value, the window length of the sliding window is restored to the initial value, and the sliding window is moved so that it contains only the most recently traversed word;

This movement of the sliding window is repeated, and either during the traversal or after the entire text has been traversed, whether the text exhibits cheating is determined according to the relationship between the window length and a preset threshold, wherein the window capacity is the number of different words accommodated by the sliding window;

The window length is the total number of words accommodated by the sliding window;

The preset threshold is set according to a maximum value of the window capacity.

A text cheating recognition device provided by the embodiment of the invention includes:

a first unit, configured to control the movement of the sliding window over the text and the change of the window length: when the window capacity recorded by the third unit is less than the maximum value, the window length is gradually increased; when the window capacity recorded by the third unit reaches the maximum value, the window length of the sliding window is restored to the initial value and the sliding window is moved so that it contains only the most recently traversed word; and when the fourth unit determines that the text exhibits cheating, the movement of the sliding window over the text is stopped;

a second unit, configured to record the window length of the sliding window each time the window length is increased;

a third unit, configured to record the window capacity of the sliding window each time the window length is increased, to notify the first unit of the value of the window capacity, and to restart counting from the initial value;

And a fourth unit, configured to determine that the text has a cheating phenomenon according to a relationship between a window length and a preset threshold; wherein the preset threshold is set according to a maximum value of the window capacity.

Another text cheating recognition method provided by the embodiment of the invention includes:

The text is traversed by a moving sliding window, wherein the sliding window moves as follows: the window length of the sliding window is gradually increased from the initial value, and the window capacity of the sliding window is recorded each time the window length is increased; when the window length reaches the maximum value, the window length is restored to the initial value, and the sliding window is moved so that it contains only the most recently traversed word;

This movement of the sliding window is repeated, and either during the traversal or after the entire text has been traversed, whether the text exhibits cheating is determined according to the relationship between the window capacity and a preset threshold, wherein the window capacity is the number of different words accommodated by the sliding window;

Another text cheating recognition device provided by the embodiment of the invention includes:

a first unit, configured to control the movement of the sliding window over the text and the change of the window length: when the window length recorded by the second unit is less than the maximum value, the window length is gradually increased; when the window length recorded by the second unit reaches the maximum value, the window length of the sliding window is restored to the initial value and the sliding window is moved so that it contains only the most recently traversed word; and when the fourth unit determines that the text exhibits cheating, the movement of the sliding window over the text is stopped;

a second unit, configured to record the window length of the sliding window each time the window length is increased, and to notify the first unit and the third unit of the value of the window length;

a third unit, configured to record the window capacity of the sliding window each time the window length is increased, and to restart counting from the initial value when the window length recorded by the second unit reaches the maximum value.

The text sorting method provided by the embodiment of the present invention identifies text that exhibits cheating and corrects the sorting result according to the recognition result. For sorting methods that use text quality as an important basis for sorting, this reduces the effect of text cheating on the sorting result and improves the objectivity of the ranking.

The text cheating recognition method provided by the embodiment of the present invention either calculates the window length needed to accommodate a given window capacity and compares it with a preset threshold, or calculates the window capacity within a given window length and compares it with a preset threshold. In both cases the process of text cheating recognition is quantified, which makes the recognition more objective.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a flow chart of a text sorting method in an embodiment of the present invention.

FIG. 2 is a flow chart of a method for identifying cheat text in an embodiment of the present invention.

FIG. 3 is a flow chart of a method for identifying cheat text in an embodiment of the present invention.

FIG. 4 is a flow chart of a method for identifying cheat text in an embodiment of the present invention.

FIG. 5 is a flow chart of a method for identifying cheat text in an embodiment of the present invention.

FIG. 6 is a structural diagram of a text sorting apparatus in an embodiment of the present invention.

FIG. 7 is a structural diagram of a text cheat recognition apparatus in the embodiment of the present invention.

FIG. 8 is a structural diagram of a blog article retrieval sorting system in an embodiment of the present invention.

FIG. 9 is a structural diagram of an indexer in a blog article retrieval sorting system according to an embodiment of the present invention.

FIG. 10 is a structural diagram of a retriever in a blog article retrieval sorting system according to an embodiment of the present invention.

FIG. 11 is a flowchart of a method for establishing an index in blog article retrieval sorting according to an embodiment of the present invention.

FIG. 12 is a flowchart of a method for establishing an index in blog article retrieval sorting according to an embodiment of the present invention.

FIG. 13 is a flowchart of a method for retrieving a blog article in an embodiment of the present invention.

FIG. 14 is a flowchart of a method for retrieving a blog article in an embodiment of the present invention.

The present invention will be further described in detail below with reference to the accompanying drawings and embodiments. It is understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

FIG. 1 is a flow chart of a text sorting method in an embodiment of the present invention. As shown in FIG. 1, the method includes:

Step S101: Identify a text with cheating behavior;

Step S102: Correct the position of the text with the cheating behavior in the sorting queue according to the recognition result.

An embodiment of the present invention provides a method for recognizing cheat text, which traverses the text to be detected using a moving sliding window. The sliding window moves as follows: the window length of the sliding window is increased from an initial value, and each time the window length is increased, the window capacity of the sliding window is recorded; when the window capacity reaches the maximum value, the window length of the sliding window is restored to the initial value and the sliding window is moved so that it contains only the most recently traversed word. This movement is repeated, and either during the traversal or after the entire text has been traversed, whether the text exhibits cheating is determined according to the relationship between the window length and a preset threshold. Here the window capacity is the number of different words accommodated by the window, and the window length is the total number of words in the window, that is, the distance between its left and right boundaries. In this case, the threshold is set according to the maximum value of the window capacity.

After the entire text has been traversed, the process of determining whether the text exhibits cheating according to the relationship between the window length and a preset threshold may be: recording the window length each time the window capacity reaches the maximum value; comparing the largest recorded window length with the preset threshold; and, if the threshold is exceeded, determining that the text exhibits cheating. During the traversal, the process may be: comparing each recorded window length with the preset threshold, and, if the threshold is exceeded, determining that the text exhibits cheating. In this case, the threshold is proportional to the maximum value of the window capacity; that is, the larger the maximum window capacity, the longer the corresponding window length of a text without cheating should be.

FIG. 2 is a flow chart of a method for identifying cheat text in an embodiment of the present invention. In this embodiment, the method for recognizing cheat text may be referred to as a water signature recognition algorithm, which traverses the entire text from left to right using a sliding window of fixed maximum capacity and variable length, and records the maximum length the window has reached. The larger the maximum window length of a text, the more likely the text is a low-quality article with text cheating.

In the algorithm, let the capacity of the sliding window be C, with its maximum value set to Cmax; use a growing array (C' = C+1 each time a new word is added) to store the different words in the sliding window, referred to as the "window vocabulary"; and let the length of the sliding window be L.

Read the first word from the text and continue with the following steps:

Step S201: Record the capacity C=1 and the length L=1. At this point, the window vocabulary contains only the word currently in the sliding window.

Step S202: Determine whether the next word is successfully read: If yes, execute S203; if no, go to step S210.

Step S203: The right border of the sliding window is shifted to the right, and the read new word is included in the sliding window.

Step S204: It is determined whether the word already exists in the window vocabulary: if yes, step S205 is performed; if not, step S206 is performed.

Step S205: The window vocabulary and the capacity C are unchanged, and the length L is incremented. After the step ends, the process proceeds to step S202 to continue reading.

Step S206: The word is added to the window vocabulary, the capacity C is incremented, and the length L is incremented.

Step S207: Determine whether the window capacity C exceeds the set maximum value Cmax: If yes, execute step S208; if no, proceed to step S202 to continue reading.

Step S208: The left border of the window is shifted to the right, and the window is shortened to include only the newly read words.

Step S209: It is judged whether the text has been traversed: If yes, step S210 is performed; if no, then step S202 is continued to continue reading.

Step S210: When the text traversal is completed, one or more lengths L have been recorded, and the text is evaluated according to the maximum recorded length: if the maximum length L is greater than the set threshold, the text exhibits cheating; otherwise, there is no cheating in the text.
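The flow of steps S201 to S210 can be sketched in Python as follows. This is a minimal illustration rather than the patented implementation: the function names, the use of a `set` for the window vocabulary, and the choice to record the window length just before the word that overflows Cmax are assumptions made for the example.

```python
def max_window_length(words, c_max):
    """Longest window length reached while holding at most c_max distinct words."""
    vocab = set()   # "window vocabulary": distinct words in the current window
    length = 0      # L: total number of words in the current window
    longest = 0     # largest L recorded so far
    for word in words:
        if word in vocab:
            length += 1              # S205: repeated word, C unchanged, L grows
            continue
        vocab.add(word)              # S206: new word, C and L both grow
        length += 1
        if len(vocab) > c_max:       # S207: capacity exceeds Cmax
            # Assumption: record the length before the overflowing word arrived.
            longest = max(longest, length - 1)
            vocab = {word}           # S208: window shrinks to the newest word only
            length = 1
    return max(longest, length)      # S210: the final window also counts


def is_cheating_by_length(words, c_max, threshold):
    """S210: cheating is reported when the maximum length exceeds the threshold."""
    return max_window_length(words, c_max) > threshold
```

For example, with `c_max=3` and `threshold=6`, the eight-word text "buy cheap buy cheap buy cheap buy cheap" never fills the vocabulary, so its window grows to length 8 and the text is flagged, whereas eight different words never yield a window longer than 3.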

FIG. 3 is a flow chart of a method for identifying cheat text in an embodiment of the present invention. As shown in FIG. 3, the method includes:

Read the first word from the text and continue with the following steps:

Step S301: Record the capacity C=1 and the length L=1. At this point, the window vocabulary contains only the word currently in the sliding window.

Step S302: Read the next word.

Step S303: The right border of the sliding window is shifted to the right, and the read new word is included in the sliding window.

Step S304: It is determined whether the word already exists in the window vocabulary: if yes, step S305 is performed; if no, step S306 is performed.

Step S305: The window vocabulary and the capacity C are unchanged, and the length L is incremented. After the step ends, the process proceeds to step S307.

Step S306: The word is added to the window vocabulary, the capacity C is incremented, and the length L is incremented. After the step ends, the process proceeds to step S307.

Step S307: It is judged whether the length L exceeds the threshold; if yes, the process proceeds to step S311, otherwise, the process proceeds to step S308.

Step S308: determining whether the window capacity C exceeds the set maximum value Cmax; if yes, executing step S309; if not, proceeding to step S310.

Step S309: The left border of the window is shifted to the right, and the window is shortened to include only the newly read words.

Step S310: Determine whether the text has been traversed: If yes, go to step S312; if no, go to step S302 to continue reading.

Step S311: It is determined that the text has a cheating phenomenon.

Step S312: It is determined that the text does not have a cheating phenomenon.
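The early-exit variant of steps S301 to S312 can be sketched as follows. The function name and the `set`-based vocabulary are assumptions for illustration. The sketch keeps the step order given above, so the length test (S307) runs before the capacity test (S308) and therefore counts the word just read.

```python
def has_cheating_early_exit(words, c_max, threshold):
    """Stop traversing as soon as the window length exceeds the threshold."""
    vocab = set()   # window vocabulary
    length = 0      # L
    for word in words:
        if word not in vocab:
            vocab.add(word)          # S306: new word, C grows
        length += 1                  # S305 / S306: L grows in both branches
        if length > threshold:       # S307: length exceeds the threshold
            return True              # S311: cheating detected, stop early
        if len(vocab) > c_max:       # S308: capacity exceeds Cmax
            vocab = {word}           # S309: window shrinks to the newest word
            length = 1
    return False                     # S312: traversal finished without detection
```

Compared with scanning the whole text and checking the maximum length afterwards, this form returns on the first offending window, which matters mainly for long texts.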

In the embodiment shown in FIG. 2 and FIG. 3, the threshold values are all closely related to the maximum value of the window capacity C, that is, when the maximum value of the window capacity C is larger, the set threshold value can also be larger. Conversely, the smaller the maximum value of the window capacity C, the corresponding threshold should be reduced accordingly.

The embodiment of the invention further provides a method for recognizing cheat text, which traverses the text to be detected using a moving sliding window. The sliding window moves as follows: the window length of the sliding window is increased from the initial value, and each time the window length is increased, the window capacity of the sliding window is recorded; when the window length reaches the maximum value, the window length of the sliding window is restored to the initial value and the sliding window is moved so that it contains only the most recently traversed word. This movement is repeated, and either during the traversal or after the entire text has been traversed, whether the text exhibits cheating is determined according to the relationship between the window capacity and a preset threshold. In this case, the threshold is set according to the maximum value of the window length.

After the entire text has been traversed, the process of determining whether the text exhibits cheating according to the relationship between the window capacity and a preset threshold may be: recording the window capacity each time the window length reaches the maximum value; comparing the minimum recorded window capacity with the preset threshold; and, if it is less than the threshold, determining that the text exhibits cheating. During the traversal, the process may be: comparing each recorded window capacity with the preset threshold, and, if it is less than the threshold, determining that the text exhibits cheating. In this case, the threshold is proportional to the maximum value of the window length: the smaller the window capacity, the greater the probability that the text is cheating, but if the maximum window length is increased, the corresponding window capacity threshold may be increased accordingly.

FIG. 4 is a flow chart of a method for identifying cheat text in an embodiment of the present invention. In this embodiment, the entire text is traversed from left to right using a sliding window whose maximum length is set, i.e., the window cannot exceed the maximum length. Thus, at a fixed window length, the smaller the window capacity, the more likely the text is cheating.

In the algorithm, let the capacity of the sliding window be C; use a growing array (C' = C+1 each time a new word is added) to store the different words in the sliding window, referred to as the "window vocabulary"; and let the length of the sliding window be L, with its maximum value set to Lmax.

Read the first word from the text and continue with the following steps:

Step S401: Record the capacity C=1 and the length L=1. At this point, the window vocabulary contains only the word currently in the sliding window.

Step S402: Determine whether the next word is successfully read: If yes, execute S403; if no, go to step S410.

Step S403: The right border of the sliding window is shifted to the right, and the read new word is included in the sliding window.

Step S404: It is determined whether the word already exists in the window vocabulary: if yes, step S405 is performed; if not, step S406 is performed.

Step S405: The window vocabulary and the capacity C are unchanged, and the length L is incremented. After the step ends, the process proceeds to step S402 to continue reading.

Step S406: The word is added to the window vocabulary, the capacity C is incremented, and the length L is incremented.

Step S407: Determine whether the window length L exceeds the set maximum value Lmax: If yes, execute step S408; if no, proceed to step S402 to continue reading.

Step S408: The left border of the window is shifted to the right, and the window is shortened to include only the newly read words.

Step S409: It is judged whether the text has been traversed: If yes, step S410 is performed; if no, then step S402 is continued to continue reading.

Step S410: When the text traversal is completed, one or more window capacities C have been recorded, and the text is evaluated according to the minimum recorded capacity: if the minimum capacity C is less than the set threshold, the text exhibits cheating; otherwise, there is no cheating in the text.
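The length-bounded variant of steps S401 to S410 can be sketched as follows. The function names are hypothetical, and two simplifications are assumed for the example: the length check is applied after every word (repeated or new), and only windows that actually reach Lmax contribute a recorded capacity, so a short trailing window is ignored.

```python
def min_window_capacity(words, l_max):
    """Smallest distinct-word count among windows of exactly l_max words.

    Returns None when the text never fills a window of l_max words.
    """
    vocab = set()     # window vocabulary
    length = 0        # L
    smallest = None   # minimum recorded capacity C
    for word in words:
        vocab.add(word)              # adding a repeated word leaves the set unchanged
        length += 1
        if length > l_max:           # S407: length exceeds Lmax
            vocab = {word}           # S408: window shrinks to the newest word
            length = 1
        if length == l_max:          # window is full: record its capacity
            capacity = len(vocab)
            smallest = capacity if smallest is None else min(smallest, capacity)
    return smallest


def is_cheating_by_capacity(words, l_max, threshold):
    """S410: cheating is reported when the minimum capacity is below the threshold."""
    smallest = min_window_capacity(words, l_max)
    return smallest is not None and smallest < threshold
```

With `l_max=5` and `threshold=3`, a text that alternates two words yields full windows containing only 2 distinct words and is flagged, while ten words of ordinary prose with 5 distinct words per window are not.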

FIG. 5 is a flow chart of a method for identifying cheat text in an embodiment of the present invention. As shown in Figure 5, the method includes:

Read the first word from the text and continue with the following steps:

Step S501: Record the capacity C=1 and the length L=1. At this point, the window vocabulary contains only the word currently in the sliding window.

Step S502: Read the next word.

Step S503: The right border of the sliding window is shifted to the right, and the read new word is included in the sliding window.

Step S504: It is determined whether the word already exists in the window vocabulary: if yes, step S505 is performed; if not, step S506 is performed.

Step S505: The window vocabulary and the capacity C are unchanged, and the length L is incremented. After the step ends, the process proceeds to step S507.

Step S506: the word is added to the window vocabulary, the capacity C is incremented, and the length L is incremented. After the end, the process proceeds to step S507.

Step S507: It is judged whether the capacity C is smaller than the threshold; if it is less than the threshold, the process proceeds to step S511, otherwise, the process proceeds to step S508.

Step S508: determining whether the window length L exceeds the set maximum value Lmax; if yes, executing step S509; if not, proceeding to step S510.

Step S509: The left border of the window is shifted to the right, and the window is shortened to include only the newly read words.

Step S510: Determine whether the text has been traversed: If yes, execute step S512; if no, proceed to step S502 to continue reading.

Step S511: It is determined that the text has a cheating phenomenon.

Step S512: It is determined that the text does not have a cheating phenomenon.
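The early-exit variant of steps S501 to S512 can be sketched as follows. One interpretive assumption is made loudly here: step S507 as written compares C with the threshold after every word, which would flag almost any text within its first few words, so the sketch applies the comparison only once the window has reached Lmax. The function name is hypothetical.

```python
def has_cheating_fixed_length(words, l_max, threshold):
    """Stop as soon as a full window of l_max words holds too few distinct words."""
    vocab = set()   # window vocabulary
    length = 0      # L
    for word in words:
        vocab.add(word)              # S505 / S506: set is unchanged for a repeated word
        length += 1
        if length > l_max:           # S508: length exceeds Lmax
            vocab = {word}           # S509: window shrinks to the newest word
            length = 1
        # Assumption: S507 is only meaningful for a window that has reached Lmax.
        if length == l_max and len(vocab) < threshold:
            return True              # S511: cheating detected, stop early
    return False                     # S512: traversal finished without detection
```

As with the capacity-bounded case, the benefit over a full scan is that traversal ends at the first window whose capacity falls below the threshold.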

In the embodiment shown in FIG. 4 and FIG. 5, the threshold values are all closely related to the maximum value of the window length L, that is, when the maximum value of the window length L is larger, the set threshold value can also be larger. Conversely, the smaller the maximum value of the window length L, the smaller the set threshold should be.

In the above four embodiments, the text is traversed from beginning to end, so when the window length increases from the initial value, the right boundary of the window moves to the right, and when the window length is restored to the initial value, the left boundary of the window is shifted to the right. The text can equally be traversed from end to beginning: when the window length increases from the initial value, the left boundary of the window moves to the left, and when the window length is restored to the initial value, the right boundary of the window is shifted to the left. The text can of course also be traversed in other orders, with the basic principle remaining the same.

When the text having the cheating phenomenon is identified according to the above method, the method of correcting the position of the text having the cheating behavior in the queue according to the recognition result may be as follows.

According to the recognition result, all texts with cheating behavior in the queue may be given the same uniform adjustment, for example by shifting the position of every cheating text in the queue by the same two places. Alternatively, in a queue sorted according to certain parameters, the sorting parameter corresponding to every text with cheating behavior may be corrected by a fixed amplitude. For example, in text retrieval, texts are generally sorted according to the relevance weight between the search term and the text; the relevance weights of all cheating texts can then be reduced by the same amplitude.
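The fixed-amplitude correction of a sorting parameter can be sketched as follows. The function name, the `(doc_id, weight)` tuple layout, and the penalty value are hypothetical choices for the example; the description above does not specify how the amplitude is chosen.

```python
def rerank_with_fixed_penalty(results, cheating_ids, penalty):
    """Lower the relevance weight of every cheating text by the same amount, then re-sort.

    results:      list of (doc_id, relevance_weight) pairs
    cheating_ids: set of doc_ids recognized as cheating
    penalty:      fixed amount subtracted from each cheating text's weight
    """
    adjusted = [
        (doc_id, weight - penalty if doc_id in cheating_ids else weight)
        for doc_id, weight in results
    ]
    # Highest adjusted weight first, as in a relevance-sorted queue.
    return sorted(adjusted, key=lambda item: item[1], reverse=True)
```

For instance, with weights 0.9, 0.8 and 0.5 for texts "a", "b" and "c", flagging "a" and subtracting 0.3 moves "a" from first place to second, behind "b".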

According to the relationship between the two parameters of the text cheating recognition algorithm, window capacity and window length, the degree of cheating of different texts can be evaluated more precisely, so that texts with different degrees of cheating are handled differently and more severe cheating receives stricter treatment. For example, a text with severe cheating is moved further back in the queue, or the sorting parameter corresponding to a text with severe cheating is corrected by a larger amplitude.

To decide which of two texts cheats more severely, the window capacity and window length of the sliding window at the moment cheating is detected can be recorded for each text. If the two sliding windows have the same window capacity, the text whose sliding window has the larger window length cheats more severely. If the two sliding windows have the same window length, the text whose sliding window has the smaller window capacity cheats more severely. Most generally, the ratio of window capacity to window length can be calculated for each of the two sliding windows; the text whose sliding window has the smaller ratio cheats more severely.
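The ratio comparison can be sketched as follows. The function names and the dictionary layout are hypothetical; the only rule taken from the description is that a smaller ratio of window capacity to window length indicates more severe cheating.

```python
def distinct_ratio(capacity, length):
    """Ratio of window capacity (distinct words) to window length (total words)."""
    return capacity / length


def rank_by_severity(windows):
    """Order text ids from most to least severe cheating.

    windows: dict mapping text id -> (capacity, length) recorded at detection time
    """
    return sorted(windows, key=lambda text_id: distinct_ratio(*windows[text_id]))
```

For example, given windows of (3, 30), (3, 12) and (6, 12) for texts "a", "b" and "c", the ratios are 0.1, 0.25 and 0.5, so "a" is treated as the most severe case. This agrees with both special cases above: "a" and "b" share a capacity and "a" has the longer window, while "b" and "c" share a length and "b" has the smaller capacity.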

Figure 6 is a structural diagram of a text sorting apparatus in an embodiment of the present invention. As shown in Figure 6, the device includes:

a text recognition module 601, configured to identify text with cheating behavior;

The sorting correction module 602 is configured to correct the position of the text with the cheating behavior in the sorting queue according to the recognition result.

FIG. 7 is a structural diagram of a text cheat recognition apparatus in an embodiment of the present invention. As shown in FIG. 7, the apparatus includes a window length control unit 701, a window length recording unit 702, a window capacity recording unit 703, and a threshold comparison unit 704.

The window length control unit 701 is configured to control the movement of the sliding window over the text and the change of the window length: when the window capacity recorded by the window capacity recording unit 703 is less than the maximum value, the window length is gradually increased; when the window capacity recorded by the window capacity recording unit 703 reaches the maximum value, the window length of the sliding window is restored to the initial value and the sliding window is moved so that it contains only the most recently traversed word; and when the threshold comparison unit 704 determines that the text is cheating, the movement of the sliding window over the text is stopped.

The window length recording unit 702 is configured to record the window length of the sliding window each time the window length is increased.

The window capacity recording unit 703 is configured to record the window capacity of the sliding window each time the window length is increased; and notify the window length control unit 701 of the value of the window capacity, and start counting again from the initial value.

The threshold comparison unit 704 is configured to determine that the text is cheating according to the relationship between the window length and a preset threshold; wherein the preset threshold is set according to the maximum value of the window capacity.

The function of the four units of the device can also be:

The window length control unit 701 is configured to control the movement of the sliding window over the text and the change of the window length: when the window length recorded by the window length recording unit 702 is less than the maximum value, the window length is gradually increased; when the window length recorded by the window length recording unit 702 reaches the maximum value, the window length of the sliding window is restored to the initial value and the sliding window is moved so that it contains only the most recently traversed word; and when the threshold comparison unit 704 determines that the text is cheating, the movement of the sliding window over the text is stopped.

The window length recording unit 702 is configured to record the window length of the sliding window each time the window length is increased, and notify the window length control unit 701 and the window capacity recording unit 703 of the value of the window length.

The window capacity recording unit 703 is configured to record the window capacity of the sliding window each time the window length is increased; and when the window length recorded by the window length recording unit 702 reaches the maximum value, counting from the initial value.

The threshold comparison unit 704 is configured to determine that the text is cheating according to the relationship between the window capacity and a preset threshold; wherein the preset threshold is set according to the maximum value of the window length.

It is to be noted here that although the internal structure of the text recognition module 601 is not further described in Fig. 6, it is apparent that the description in Fig. 7 can be considered as the internal structure of the text recognition module 601 in Fig. 6.

In the following embodiments, the text is a blog post, and the sorting is a sort of search. The method and the device in the embodiment of the present invention are illustrated. In actual operation, the text in the embodiment of the present invention may also be a webpage text and the like, and all other sorting operations are required. Text, sorted scenes are not limited to search sorting.

Since the retrieval order of the blog article is based on the established index, and the indexing is performed by calculating the text relevance weight of the search term and the blog text, the embodiment of the present invention identifies cheating in the calculation of the text relevance weight. The text, and its weight reduction processing, can establish a more accurate index, thereby improving the objective accuracy of sorting based on this index, and ensuring the quality of text retrieval by users.

FIG. 8 is a structural diagram of a blog article retrieval sorting system in an embodiment of the present invention. As shown in FIG. 8, the system includes a blog system 100, an indexer 200, a retriever 300, an agent 400, and a client 500. It should be noted that the connection relationships between the devices in all the diagrams of the present invention are shown for the purpose of clearly explaining their information interaction and control processes; they should therefore be regarded as logical connection relationships and should not be limited to physical connections. Specifically:

The blog system 100 is used to provide blog-related services for users, including storing and managing blog posts, and, in the present invention, provides the indexer 200 with relevance factors, including text relevance factors (e.g., category, title, body, nickname, space name) and numerical correlation factors (e.g., activity factor, repost rate factor, reply rate factor, publication time factor). The core of the blog system 100 can be a web server, but the invention is not limited to this specific form.

The indexer 200 is configured to build an index based on data in the blog system 100, so that the retriever 300 can sort the retrieved blog posts based on the index.

The retriever 300 queries and sorts the blog articles based on the search terms input by the user.

The agent 400 is configured to receive the search string sent by the client 500, divide the search string into search terms, send them to the retriever 300, and forward the retrieved and sorted results of the retriever 300 to the client 500.

The client 500 receives the search term or the search string input by the user. If the user inputs a search term, the client can send it directly to the retriever 300 and, after receiving the blog article sorting result fed back by the retriever 300, renders the sorting result and displays it on the user interface. If the user inputs a search string, it must be sent to the agent 400 for segmentation; after receiving the blog article sorting result fed back by the agent 400, the client renders the sorting result and displays it on the user interface.

The client 500 is typically any of a variety of terminal devices capable of logging in to the Internet, such as a personal computer (PC), a personal digital assistant (PDA), or a mobile phone (MP), and thus the scope of protection of the present invention should not be limited to a particular type of client.

FIG. 9 is a structural diagram of an indexer in a blog article retrieval sorting system according to an embodiment of the present invention. As shown in FIG. 9, the indexer 200 includes: a numerical correlation determination unit 201, a text correlation determination unit 202, a text cheat recognition unit 203, a superposition calculation unit 204, and an index construction unit 205.

The numerical correlation determining unit 201 is configured to calculate a numerical correlation weight of the search term and each blog post based on the numerical correlation factors extracted from the blog system. The text relevance determining unit 202 is configured to calculate a text relevance weight of the search term and each blog post based on the text relevance factors extracted from the blog system. The text cheat recognition unit 203 is configured to recognize blog articles in which the text is cheated while the text relevance determining unit 202 calculates the text relevance weight between the search term and the blog post.

The superposition calculation unit 204 is configured to perform a superposition calculation on the foregoing numerical correlation weight and text relevance weight to obtain a comprehensive correlation weight of the search term, and to send it to the index construction unit 205. The index construction unit 205 builds an index based on the comprehensive correlation weight.

In another embodiment of the present invention, the indexer 200 includes a text relevance determination unit 202, a text cheat recognition unit 203, and an index construction unit 205. The text relevance determining unit 202 is configured to calculate a text relevance weight of the search term and each blog post based on the text relevance factors extracted from the blog system. The text cheat recognition unit 203 is used to identify blog posts in which the text is cheated [...]

The index construction unit 205 constructs an index between the search term and each blog post according to the text relevance weight.

Although the above embodiment can be implemented, the process of constructing the index considers only the text correlation factors, so the accuracy of the index is not high enough. In contrast, the indexer 200 shown in FIG. 9 has a higher index accuracy.

Here, it can be understood that the text cheat recognition unit 203 includes the text sorting device shown in FIG. 6.

FIG. 10 is a structural diagram of a retriever in a blog article retrieval sorting system according to an embodiment of the present invention. As shown in FIG. 10, the retriever 300 includes a query unit 301, a compound correlation calculation unit 302, and a sorting unit 303. In this embodiment, the user initially inputs a search string containing a plurality of search terms; the search string is divided into search terms and sent to the retriever 300, and the retriever 300 receives the search terms and then processes them. The query unit 301 queries the relevance weight (text relevance weight, or comprehensive relevance weight) between each search term and each blog post from the index that has been established by the indexer, and sends it to the sorting unit. The compound correlation calculation unit 302 calculates the composite correlation weight between the search string and each blog post based on the correlation weight of each search term, and sends it to the sorting unit 303. The sorting unit 303 sorts each blog post related to the search string according to the composite correlation weight.

In another embodiment of the present invention, a retriever 300 is provided that can be directly connected to and communicate with a client, and is suitable for situations in which a user inputs a search term rather than a search string. The retriever 300 in this case includes a query unit 301 and a sorting unit 303. The query unit 301 queries, according to the search term input by the user, the correlation weight (text relevance weight, or comprehensive relevance weight) between the search term and each blog post from the index that has been established by the indexer 200, and sends it to the sorting unit 303. The sorting unit 303 sorts each blog post related to the search term according to the size of the received correlation weight. It should be noted that since most users currently input a search string containing a plurality of search terms, the structure of the retriever 300 shown in FIG. 10 is more widely used and more typical.

FIG. 11 is a flowchart of a method for establishing an index in a blog article retrieval order according to an embodiment of the present invention. As shown in FIG. 11, the method includes the following steps:

Step S1101: Extract the correlation factors from the blog system, and format the data. The formatting mentioned here includes normalizing some correlation factors and applying other processing, such as log processing, to some correlation factors, so as to map the values of most correlation factors into a fixed interval, for example [0, 100]. Of course, some correlation factors keep their original values.
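The formatting in step S1101 can be illustrated with a minimal Python sketch that maps a raw factor value into [0, 100], optionally after log processing. The function name `format_factor`, the `max_value` parameter, and the specific use of `log1p` are assumptions made for illustration; the embodiment does not specify the exact mapping.

```python
import math


def format_factor(value, max_value, use_log=True):
    """Map a raw, non-negative correlation factor into the interval [0, 100].

    Hypothetical helper: max_value is an assumed upper bound for the raw value.
    """
    if max_value <= 0:
        raise ValueError("max_value must be positive")
    value = min(max(value, 0), max_value)  # clamp into [0, max_value]
    if use_log:
        # Log processing compresses large raw values before scaling.
        return 100.0 * (math.log1p(value) / math.log1p(max_value))
    # Plain linear normalization.
    return 100.0 * (value / max_value)
```

For example, a raw repost count could be passed through `format_factor` with log processing so that a few heavily reposted articles do not dominate the scale.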

The correlation factors referred to in the present invention may include only text correlation factors, or both text correlation factors and numerical correlation factors. These correlation factors are used as input parameters for the correlation weight calculation when the indexer builds the index.

Step S1102: Calculate the correlation weight of the search term and each blog post, and identify blog articles with text cheating and reduce their weight.

In one embodiment, only the text relevance factors are considered: the text relevance weight of the search term is calculated based on the text relevance factors, blog posts in which the text is cheated are identified, and the text relevance weight between the search term and such a blog post is then appropriately reduced.

In another embodiment, the indexer considers not only the text correlation factors but also the numerical correlation factors. It calculates the text relevance weight and the numerical correlation weight separately, identifies blog posts with text cheating, appropriately reduces the text relevance weight between the search term and such a blog article, and finally superimposes the text correlation weight and the numerical correlation weight to obtain the comprehensive correlation weight. As can be seen from the above, the previous embodiment only applies the weight reduction to the text correlation weight, whereas this embodiment also considers the numerical correlation factors, which further increases the accuracy of the data.

Step S1103: Construct an index between the search term and each blog post according to the correlation weight after the weight reduction. The index records each search term, the blog posts corresponding to the search term, and the relevance weight between the search term and each blog post, so that when the user inputs a search term, the blog articles can be sorted according to the data in the index and the user can quickly find the most relevant blog posts.
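One possible in-memory shape for such an index is sketched below in Python: each search term maps to a list of (blog post, relevance weight) pairs kept in descending order of weight. The function name `build_index` and the input layout are hypothetical and chosen only to illustrate what the index records; the weights passed in are assumed to have already undergone any weight reduction.

```python
from collections import defaultdict


def build_index(weights):
    """Build a term -> [(post_id, weight), ...] index.

    weights: assumed mapping {(search_term, post_id): relevance_weight},
    where the weights have already been reduced for cheating posts.
    """
    index = defaultdict(list)
    for (term, post_id), weight in weights.items():
        index[term].append((post_id, weight))
    # Keep each posting list sorted so the most relevant post comes first.
    for postings in index.values():
        postings.sort(key=lambda item: item[1], reverse=True)
    return dict(index)
```

Sorting the posting lists at build time is a design choice of this sketch: it moves work from query time to indexing time, at the cost of re-sorting whenever weights change.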

FIG. 12 is a flowchart of a method for establishing an index in a blog article retrieval order according to an embodiment of the present invention. As shown in Figure 12, the process specifically includes:

Step S1201: Extract a correlation factor from the blog system, and format the data. The correlation factor at this time includes a text correlation factor and a numerical correlation factor.

Step S1202: The indexer calculates a numerical correlation weight of the search term and each blog post.

In one embodiment, the numerical correlation factors include four factors: the activity factor, the repost rate factor, the reply rate factor, and the publication time factor. The activity factor is calculated by the blog system and has a value range of [0, 100]. It comprehensively considers the user registration frequency of the blog personal space, the frequency of blog post publication, and so on, and is a comprehensive measure of the activity level of the blog personal space; the higher the activity, the higher the priority of the blog post in the sorting result. The repost rate factor is calculated by the blog system from the number of times the blog article has been reposted and has a value range of [0, 100]; the higher the repost rate, the higher the priority of the blog post in the sorting result. The reply rate factor is calculated from the number of replies to the blog post and has a value range of [0, 100]; the higher the reply rate factor, the higher the priority of the blog post in the sorting result. The publication time factor is the publication time of the blog post, which can be expressed in UNIX time; more recently published blog posts have a higher priority in the sorting result. The numerical correlation weight is calculated from all the correlation factors listed above and normalized, so that its value lies in the interval [0, 1]. Its calculation formula is as follows:

$$W_{NUM} = \frac{\sum_{i} \lambda_i \times W_i}{MAX\_VALUE} \qquad (1)$$

where $W_{NUM}$ is the numerical correlation weight, $W_i$ denotes each of the correlation factors listed above, $\lambda_i$ is the corresponding correction coefficient, used to increase or decrease the effect of that correlation factor (its ideal value can be determined while tuning the sorting results), and MAX_VALUE is the maximum possible value of the numerical correlation weight. It should be noted that the above calculation formula is only an example and is not intended to limit the scope of protection of the present invention; a similar formula can also be used.
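The normalized weighted sum described for the numerical correlation weight can be sketched in Python as follows. This is a minimal illustration that assumes every factor has already been mapped into [0, 100] (including the publication time) and that all correction coefficients are non-negative; the function name `numeric_weight`, the factor names, and the coefficient values are hypothetical.

```python
def numeric_weight(factors, coefficients, factor_max=100.0):
    """Weighted sum of correlation factors, normalized into [0, 1].

    factors: assumed mapping of factor name -> value in [0, factor_max].
    coefficients: assumed mapping of factor name -> non-negative coefficient.
    """
    weighted_sum = sum(coefficients[name] * value for name, value in factors.items())
    # Largest value the weighted sum can take, used as MAX_VALUE.
    max_value = sum(coefficients[name] * factor_max for name in factors)
    if max_value == 0:
        return 0.0
    return weighted_sum / max_value
```

Raising one coefficient relative to the others increases the influence of that factor on the final ordering, which is the tuning knob the description refers to.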

Step S1203: The indexer calculates a text relevance weight of the search term and each blog post, identifies blog posts with text cheating, and reduces the weight of the blog posts with text cheating. In an embodiment of the invention, the text relevance factors are the text fields available for retrieval.

In one embodiment, the text fields include five categories: a category, a title, a body, a nickname, and a space name. Each field has a fixed weight value W and a correction coefficient λ, as shown in Table 1.

Field name | Correction coefficient | Weight
Category | $\lambda_{CA}$ | $W_{CA}$
Title | $\lambda_{TI}$ | $W_{TI}$
Body | $\lambda_{CO}$ | $W_{CO}$
Nickname | $\lambda_{NI}$ | $W_{NI}$
Space name | $\lambda_{ZO}$ | $W_{ZO}$

Table 1

The formula for calculating the text relevance weight is as follows:

$$W_{TEXT} = \lambda_{CA} \times W_{CA} + \lambda_{TI} \times W_{TI} + \lambda_{CO} \times W_{CO} + \lambda_{NI} \times W_{NI} + \lambda_{ZO} \times W_{ZO} \qquad (2)$$

where [...] = 1. It should be noted that the above calculation formula is only an example and is not intended to limit the scope of protection of the present invention; a similar formula can also be used.

After the text relevance weight is obtained, blog articles with text cheating are further identified, including: traversing the blog article with a sliding window and recording the maximum length reached by the sliding window; and comparing the maximum length of the sliding window with a threshold and, if the threshold is exceeded, determining that the blog post involves text cheating. The text relevance weight of a blog post with text cheating is then appropriately reduced; for example, an amplitude adjustment can be applied so that the text relevance weight is reduced to 60% of its previous value.
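The per-field weighted sum of formula (2) and the subsequent weight reduction can be illustrated with a short Python sketch. The field keys follow the subscripts used in formula (2) (CA, TI, CO, NI, ZO); the function names, the example coefficient values, and the default reduction factor of 0.6 (taken from the 60% example in the text) are assumptions for illustration only.

```python
FIELDS = ("CA", "TI", "CO", "NI", "ZO")  # category, title, body, nickname, space name


def text_weight(field_weights, coefficients):
    """Sum of coefficient * weight over the five text fields."""
    return sum(coefficients[field] * field_weights[field] for field in FIELDS)


def reduce_if_cheating(weight, cheating, factor=0.6):
    """Scale the text relevance weight down when the post is flagged as cheating."""
    return weight * factor if cheating else weight
```

A typical use would compute `text_weight` for a search term and a blog post, then pass the result through `reduce_if_cheating` with the outcome of the sliding-window check before the weight is written to the index.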

Step S1204: The indexer uses the superposition calculation unit to perform superposition calculation on the numerical correlation weight and the text correlation weight to obtain the comprehensive correlation weight. In one embodiment, the superposition calculation formula is as follows:

where $\lambda$ is the correction coefficient used when the two correlation weights are superimposed; its size can be flexibly adjusted, and [...]

Step S1205: The indexer builds and stores the index based on the comprehensive correlation weights, for use when users perform searches.

FIG. 13 is a flowchart of a method for retrieving a blog article in an embodiment of the present invention. This embodiment is a case where a user inputs a search term, including:

Step S1301: The retriever receives the search term input by the user in the client.

Step S1302: The retriever extracts, from the index that has been constructed by the indexer, the correlation weight between the search term and each blog article. The correlation weight may be a text relevance weight, or may be the comprehensive correlation weight obtained by superimposing the text correlation weight and the numerical correlation weight.

Step S1303: The retriever sorts the retrieved blog articles according to the correlation weights, and feeds the sorting result back to the client.
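Steps S1301 to S1303 amount to a lookup followed by a sort, which the following Python sketch illustrates. It assumes the index is a plain dictionary mapping each search term to a list of (blog post, weight) pairs; the function name `search_term` and the `limit` parameter are hypothetical.

```python
def search_term(index, term, limit=10):
    """Return the post ids for one search term, most relevant first.

    index: assumed mapping {search_term: [(post_id, weight), ...]}.
    """
    postings = index.get(term, [])  # an unknown term yields no results
    ranked = sorted(postings, key=lambda item: item[1], reverse=True)
    return [post_id for post_id, _ in ranked[:limit]]
```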

FIG. 14 is a flowchart of a method for retrieving a blog article in an embodiment of the present invention. This embodiment is a case where the user inputs a search string, and specifically includes:

Step S1401: The agent divides the search string input by the user in the client into search terms and sends them to the retriever.

Step S1402: The retriever extracts, from the index constructed by the indexer, the correlation weight between each search term and each blog article. The correlation weight may be a text relevance weight, or may be the comprehensive correlation weight obtained by superimposing the text correlation weight and the numerical correlation weight.

Step S1403: The retriever calculates a composite correlation weight of the search string and the blog article.

In the present invention, the relevance of the search string input by the user to a blog post can be regarded as the combined result of the relevance of each individual search term to that blog post. Therefore, in one embodiment, a simple sum-then-average model is used to calculate the composite correlation weight. Let the search string be $Q = \{q_1, q_2, \ldots, q_n\}$, where $n$ is the number of search terms obtained after the search string is segmented, and let $d$ be a blog article hit by a search term $q_i$. The formula for calculating the composite correlation weight between the search string $Q$ and the blog post is:

$$Weight(Q, d) = \frac{\sum_{i=1}^{n} Weight(q_i, d)}{n} \qquad (4)$$

It should be noted that the above calculation formula is only an example and is not intended to limit the protection range of the present invention, and can also be calculated by a similar formula.
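The sum-then-average model can be sketched in Python as follows. The sketch assumes the index maps each search term to a list of (blog post, weight) pairs, treats a blog post that a term does not hit as contributing 0 for that term, and uses whitespace splitting as a stand-in for the agent's segmentation; all function names are hypothetical.

```python
def composite_weights(index, terms):
    """Average, over all n search terms, of each post's per-term weight."""
    n = len(terms)
    if n == 0:
        return {}
    totals = {}
    for term in terms:
        for post_id, weight in index.get(term, []):
            totals[post_id] = totals.get(post_id, 0.0) + weight
    # Posts not hit by a term simply add nothing for that term.
    return {post_id: total / n for post_id, total in totals.items()}


def rank_posts(index, search_string):
    """Segment the search string, then sort posts by composite weight."""
    terms = search_string.split()  # assumed whitespace segmentation
    weights = composite_weights(index, terms)
    return sorted(weights, key=weights.get, reverse=True)
```

Because the sum is divided by the number of search terms rather than the number of hits, a post matched by only one of several terms is ranked below a post matched equally well by all of them.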

Step S1404: The retriever sorts the searched blog articles according to the composite correlation weights, and sends the sorting result to the agent.

Step S1405: The agent forwards the sorting result to the client, which displays the sorting result on the user interface.

The above is only the preferred embodiment of the present invention and is not intended to limit the present invention. Any modifications, equivalent substitutions, and improvements made within the spirit and principles of the present invention should be included within the scope of protection of the present invention.

Claims

1. A text sorting method, comprising:

identifying text that exhibits cheating; and

correcting, according to the recognition result, the position of the cheating text in the sorting queue.

2. The method according to claim 1, wherein the identifying the text having the cheating phenomenon comprises:

3. The method according to claim 2, wherein, after the traversal of the entire text, determining that the text is cheating according to the relationship between the window length and a preset threshold comprises: recording the corresponding window length each time the window capacity reaches the maximum value;

comparing the maximum value of the window length with the preset threshold, and determining that the text is cheating if the threshold is exceeded, wherein the threshold is proportional to the maximum value of the window capacity.

4. The method according to claim 2, wherein, during the text traversal, determining that the text is cheating according to the relationship between the window length and a preset threshold comprises:

comparing the window length recorded each time with the preset threshold, and determining that the text is cheating if the threshold is exceeded, wherein the threshold is proportional to the maximum value of the window capacity.

5. The method according to claim 1, wherein the identifying the text having the cheating phenomenon comprises:

traversing the text with a moving sliding window, wherein the sliding window moves by: gradually increasing the window length of the sliding window from the initial value, and recording the window capacity of the sliding window each time the window length is increased; and, when the window length reaches the maximum value, restoring the window length to the initial value and moving the sliding window so that it contains only the last traversed word;

repeating the moving process of the sliding window in sequence, and, during the text traversal or after the entire text has been traversed, determining that the text is cheating according to the relationship between the window capacity and a preset threshold, wherein the window capacity is the number of different words accommodated by the sliding window, and the window length is the total number of words accommodated by the sliding window;

The preset threshold is set according to a maximum value of the window length.

6. The method according to claim 5, wherein, after the traversal of the entire text, determining that the text is cheating according to the relationship between the window capacity and a preset threshold comprises: recording the corresponding window capacity each time the window length reaches the maximum value;

comparing the minimum value of the window capacity with the preset threshold, and determining that the text is cheating if it is less than the threshold, wherein the threshold is proportional to the maximum value of the window length.

7. The method according to claim 5, wherein, during the text traversal, determining that the text is cheating according to the relationship between the window capacity and a preset threshold comprises:

comparing the window capacity recorded each time with the preset threshold, and determining that the text is cheating if it is less than the threshold, wherein the threshold is proportional to the maximum value of the window length.

8. The method according to any one of claims 2 to 7, wherein recording the window capacity of the sliding window comprises:

looking up, in the window vocabulary, the word last traversed by the sliding window; if the word does not exist in the window vocabulary, increasing the window capacity by 1 and adding the word to the window vocabulary; otherwise leaving the window capacity unchanged; wherein the different words in the sliding window are stored in the window vocabulary, and each time the window capacity is restored to the initial value, the window vocabulary contains only the words in the current sliding window.

9. The method according to any one of claims 2 to 8, wherein the sliding window traverses the text in order from the beginning to the end of the text; gradually increasing the window length from the initial value is: gradually shifting the right border of the sliding window to the right; and restoring the window length to the initial value and moving the sliding window so that it contains only the last traversed word is: shifting the left border of the sliding window to the right until the sliding window contains only the last traversed word.

10. The method according to any one of claims 2 to 9, characterized in that the initial values of the window length and the window capacity are both 1.

11. The method according to claim 1, wherein correcting the position of the text having the cheating phenomenon in the sorting queue according to the recognition result comprises:

applying, according to the recognition result, the same processing uniformly to all the cheating texts in the queue.

12. The method according to claim 11, wherein the positions in the queue of all the texts having the cheating behavior are uniformly corrected by a set number of places; or

the parameter on which the sorting of all the cheating texts is based is corrected by the same magnitude.

13. The method according to any one of claims 2 to 9, wherein correcting the position of the text having the cheating phenomenon in the sorting queue according to the recognition result comprises:

evaluating the degree of cheating of the text according to the recognition result, and applying different processing to texts with different degrees of cheating.

14. The method according to claim 13, wherein applying different processing to texts having different degrees of cheating comprises:

adjusting the position in the queue of text with more severe cheating by a greater number of places, or

correcting by a larger magnitude the parameter on which the sorting of text with more severe cheating is based.

15. The method according to claim 13, wherein the evaluation text cheating degree comprises:

The ratio of the corresponding window capacity to the window length of the two sliding windows in the two texts, and the relationship between the window capacity and the window length during the text cheating recognition process

16. The method according to claim 12 or 14, wherein the text sorting is the sorting of text in a search process, and the parameter on which the sorting is based is the relevance weight between the search string and the text.

17. A text sorting apparatus, wherein text quality is a basis for text sorting, and is characterized by:

a first module for identifying text with cheating behavior;

And a second module, configured to correct, according to the recognition result of the first module, a position of the text having the cheating behavior in the sorting queue.

18. A text cheating recognition method, characterized in that it comprises:

19. The method according to claim 18, wherein, after the traversal of the entire text, determining that the text is cheating according to the relationship between the window length and a preset threshold comprises:

recording the corresponding window length each time the window capacity reaches the maximum value;

20. The method according to claim 18, wherein, during the text traversal, determining that the text is cheating according to the relationship between the window length and a preset threshold comprises: comparing with the preset threshold, and determining that the text is cheating if the threshold is exceeded, wherein the threshold is proportional to the maximum value of the window capacity.

21. The method according to any one of claims 18 to 20, wherein recording the window capacity of the sliding window comprises:

22. The method according to any one of claims 18 to 21, wherein the sliding window traverses the text in order from the beginning to the end of the text; gradually increasing the window length from the initial value is: gradually shifting the right border of the sliding window to the right; and restoring the window length to the initial value and moving the sliding window so that it contains only the last traversed word is: shifting the left border of the sliding window to the right until the sliding window contains only the last traversed word.

23. The method according to any one of claims 18 to 22, characterized in that the initial values of the window length and the window capacity are both 1.

24. A text cheating recognition device, comprising:

25. A text cheating recognition method, characterized in that it comprises:

The preset threshold is set according to a maximum value of the window length.

26. The method according to claim 25, wherein, after the traversal of the entire text is completed, determining that the text is cheating according to the relationship between the window capacity and a preset threshold comprises:

Record the corresponding window capacity each time the window length reaches the minimum value;

27. The method according to claim 25, wherein, during the text traversal, determining that the text is cheating according to the relationship between the window capacity and a preset threshold comprises: comparing with the preset threshold, and determining that the text is cheating if the threshold is exceeded, wherein the threshold is proportional to the maximum value of the window length.

28. The method according to any one of claims 25 to 27, wherein recording the window capacity of the sliding window comprises:

29. The method according to any one of claims 25 to 28, wherein the sliding window traverses the text in order from the beginning to the end of the text; gradually increasing the window length from the initial value is: gradually shifting the right border of the sliding window to the right; and restoring the window length to the initial value and moving the sliding window so that it contains only the last traversed word is: shifting the left border of the sliding window to the right until the sliding window contains only the last traversed word.

30. The method according to any one of claims 25 to 29, characterized in that the initial values of the window length and the window capacity are both 1.

31. A text cheating recognition device, comprising:

And a fourth unit, configured to determine that the text has a cheating phenomenon according to a relationship between a window length and a preset threshold; wherein the preset threshold is set according to a maximum value of the window length.