WO2009046649A1 - Method and device of text sorting and method and device of text cheating recognizing - Google Patents

Method and device of text sorting and method and device of text cheating recognizing

Info

Publication number
WO2009046649A1
Authority
WO
WIPO (PCT)
Prior art keywords
window
text
length
capacity
sliding window
Prior art date
Application number
PCT/CN2008/072319
Other languages
French (fr)
Chinese (zh)
Inventor
Rongfang Shao
Haiquan Xie
Liang Dong
Original Assignee
Tencent Technology (Shenzhen) Company Limited
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology (Shenzhen) Company Limited filed Critical Tencent Technology (Shenzhen) Company Limited
Publication of WO2009046649A1 publication Critical patent/WO2009046649A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor

Definitions

  • The present invention relates to the field of computers, and more particularly to a text sorting method and apparatus, and a text cheating recognition method and apparatus.
  • BACKGROUND With the development of the Internet, weblogs (blogs) have become a common network service. At present, a large number of Internet companies have launched their own blog search engines. These blog search engines sort the retrieved blog posts in different ways, but they all process the search string input by the user to compute the most relevant set of results and return it to the user, so that the user can find the blog posts that best match their expectations.
  • The two sorting methods currently prevalent are sorting by relevance and sorting by time, with sorting by relevance being the more typical.
  • The specific process of sorting by relevance is: first calculate the text relevance weight between the search string and each blog post, and the numerical relevance weight of each blog post, and thereby establish an index between the search string and the blog posts according to the relevance weights.
  • The search is then performed according to the search string input by the user, the blog posts are sorted by the size of the relevance weight, and finally the sorted result is sent to the user for display.
  • The search string is generally decomposed into a plurality of search terms, so that the text relevance weight between the search string and a blog post is decomposed into the text relevance weights between the individual search terms and the blog post.
  • Although the above sorting method can provide users with reasonably credible blog post sorting results, some low-quality articles that contain only a few distinct words, in whole or in part, can nevertheless obtain a higher rank under this sorting method by repeating and stacking words. This is a typical text cheating phenomenon.
  • This kind of text cheating also affects other text sorting processes besides blog posts, such as the web page sorting process during the search process.
  • Embodiments of the present invention provide a text sorting method and apparatus to reduce the impact of text cheating on sorting results.
  • Embodiments of the present invention also provide a text cheating recognition method and apparatus to identify text in which cheating occurs.
  • The present invention provides a text sorting apparatus in which text quality is a basis for sorting, the apparatus including:
  • a second module configured to correct, according to the recognition result of the first module, the position in a sorting queue of the text in which cheating occurs.
  • the text is traversed by a moving sliding window, wherein the sliding window moves by: gradually increasing the window length of the sliding window from the initial value, and recording the window capacity of the sliding window each time the window length is increased; when the window capacity reaches the maximum value, restoring the window length of the sliding window to the initial value and moving the sliding window so that it contains only the last word traversed;
  • the window length is the total number of words accommodated by the sliding window;
  • the preset threshold is set according to a maximum value of the window capacity.
  • a first unit configured to control the movement of the sliding window over the text and the change of the window length: when the window capacity recorded by the third unit is less than the maximum value, the window length is gradually increased; when the window capacity recorded by the third unit reaches the maximum value, the window length of the sliding window is restored to the initial value and the sliding window is moved so that it contains only the last word traversed; and when the fourth unit determines that the text exhibits cheating, the movement of the sliding window over the text is stopped;
  • a second unit configured to record the window length of the sliding window each time the window length is increased;
  • a third unit configured to record the window capacity of the sliding window each time the window length is increased, to notify the first unit of the value of the window capacity, and to start counting again from the initial value;
  • a fourth unit configured to determine that the text has a cheating phenomenon according to a relationship between a window length and a preset threshold; wherein the preset threshold is set according to a maximum value of the window capacity.
  • the text is traversed by a moving sliding window, wherein the sliding window moves by: gradually increasing the window length of the sliding window from the initial value, and recording the window capacity of the sliding window each time the window length is increased; when the window length reaches the maximum value, restoring the window length to the initial value and moving the sliding window so that it contains only the last word traversed;
  • the window length is the total number of words accommodated by the sliding window
  • the preset threshold is set according to a maximum value of the window length.
  • a first unit configured to control the movement of the sliding window over the text and the change of the window length: when the window length recorded by the second unit is less than the maximum value, the window length is gradually increased; when the window length recorded by the second unit reaches the maximum value, the window length of the sliding window is restored to the initial value and the sliding window is moved so that it contains only the last word traversed; and when the fourth unit determines that the text exhibits cheating, the movement of the sliding window over the text is stopped.
  • a second unit configured to record a window length of the sliding window each time the window length is increased, and notify the first unit and the third unit of the value of the window length.
  • a third unit configured to record the window capacity of the sliding window each time the window length is increased, and to start counting again from the initial value when the window length recorded by the second unit reaches the maximum value.
  • a fourth unit configured to determine that the text exhibits cheating according to a relationship between the window capacity and a preset threshold, wherein the preset threshold is set according to a maximum value of the window length.
  • The text sorting method provided by the embodiment of the present invention identifies text with cheating behavior and corrects the sorting result according to the recognition result. This reduces the effect of text cheating on the sorting result and improves the objectivity of the ranking.
  • The text cheating recognition method provided by the embodiment of the present invention either calculates the window length needed to accommodate a given window capacity, or calculates the window capacity within a given window length, and compares the result with a preset threshold. This quantifies the process of text cheating recognition and makes the recognition more objective.
  • BRIEF DESCRIPTION OF THE DRAWINGS Figure 1 is a flow chart of a text sorting method in an embodiment of the present invention.
  • FIG. 2 is a flow chart of a method for identifying cheat text in an embodiment of the present invention.
  • FIG. 3 is a flow chart of a method for identifying cheat text in an embodiment of the present invention.
  • FIG. 4 is a flow chart of a method for identifying cheat text in an embodiment of the present invention.
  • FIG. 5 is a flow chart of a method for identifying cheat text in an embodiment of the present invention.
  • Figure 6 is a structural diagram of a text sorting apparatus in an embodiment of the present invention.
  • Fig. 7 is a structural diagram of a text cheat recognition apparatus in the embodiment of the present invention.
  • FIG. 8 is a structural diagram of a blog article retrieval sorting system in an embodiment of the present invention.
  • FIG. 9 is a structural diagram of an indexer in a blog article retrieval sorting system according to an embodiment of the present invention.
  • FIG. 10 is a structural diagram of a searcher in a blog article retrieval sorting system according to an embodiment of the present invention.
  • FIG. 11 is a flowchart of a method for establishing an index in a blog article retrieval order according to an embodiment of the present invention.
  • FIG. 12 is a flowchart of a method for establishing an index in a blog article retrieval order according to an embodiment of the present invention.
  • FIG. 13 is a flowchart of a method for retrieving a blog article in an embodiment of the present invention.
  • FIG. 14 is a flowchart of a method for retrieving a blog article in an embodiment of the present invention.
  • the present invention will be further described in detail below with reference to the accompanying drawings and embodiments. It is understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
  • FIG. 1 is a flow chart of a text sorting method in an embodiment of the present invention. As shown in Figure 1, the method includes:
  • Step S101 Identify a text with cheating behavior
  • Step S102 Correct the position of the text with the cheating behavior in the sorting queue according to the recognition result.
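  • Steps S101 and S102 can be sketched as follows. This is only a minimal illustration, assuming each text is a list of words with a precomputed relevance weight and that the correction is a fixed multiplicative penalty. The names `sort_with_cheat_correction`, `is_cheating`, and `penalty` are hypothetical and do not appear in the embodiment.

```python
def sort_with_cheat_correction(texts, weights, is_cheating, penalty=0.5):
    """Return indices of `texts` ordered from highest to lowest corrected weight.

    texts       -- list of texts, each a list of words
    weights     -- relevance weight of each text (same order as `texts`)
    is_cheating -- hypothetical predicate implementing step S101
    penalty     -- assumed fixed factor applied to flagged texts (step S102)
    """
    corrected = [
        weight * penalty if is_cheating(text) else weight
        for text, weight in zip(texts, weights)
    ]
    # Stable sort: texts with equal corrected weight keep their input order.
    return sorted(range(len(texts)), key=lambda i: corrected[i], reverse=True)


if __name__ == "__main__":
    texts = [["a", "a", "a"], ["a", "b", "c"]]
    weights = [0.9, 0.6]
    # Toy detector used only for this demonstration.
    print(sort_with_cheat_correction(texts, weights, lambda t: len(set(t)) == 1))
```

  • Passing the detector in as a parameter keeps the sorting step independent of which recognition variant is used.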
  • An embodiment of the present invention provides a method for recognizing cheat text, which traverses the text to be detected using a moving sliding window. The sliding window moves as follows: the window length of the sliding window is increased from an initial value, and each time the window length is increased, the window capacity of the sliding window is recorded; when the window capacity reaches the maximum value, the window length of the sliding window is restored to the initial value and the sliding window is moved so that it contains only the last word traversed.
  • The window capacity is the number of different words accommodated by the window; the window length is the total number of words in the window, that is, the distance between its left and right borders.
  • the threshold is set according to the maximum value of the window capacity.
  • The process of determining whether the text exhibits cheating according to the relationship between the window length and a preset threshold may be: recording the window length each time the window capacity reaches the maximum value; comparing the maximum recorded window length with the preset threshold; and, if the threshold is exceeded, determining that the text exhibits cheating.
  • Alternatively, the process may be: comparing each recorded window length with the preset threshold, and, if the threshold is exceeded, determining that the text exhibits cheating.
  • The threshold is proportional to the maximum value of the window capacity: the larger the maximum window capacity, the longer the window length that a text without cheating can be expected to reach.
  • FIG. 2 is a flow chart of a method for identifying cheat text in an embodiment of the present invention.
  • The method for recognizing cheat text may be referred to as a water signature recognition algorithm. It traverses the entire text from left to right using a sliding window of fixed maximum capacity and variable length, and records the maximum length the window has reached. The larger the maximum window length of a text, the more likely it is to be a low-quality article with text cheating.
  • Step S202 Determine whether the next word is successfully read: If yes, execute S203; if no, go to step S210.
  • Step S203 The right border of the sliding window is shifted to the right, and the read new word is included in the sliding window.
  • Step S205 The window vocabulary and the capacity C are unchanged, and the length L is incremented. After the step ends, the process proceeds to step S202 to continue reading.
  • Step S206 The word is added to the window vocabulary, the capacity C is incremented, and the length L is incremented.
  • Step S207 Determine whether the window capacity C exceeds the set maximum value Cmax: If yes, execute step S208; if no, proceed to step S202 to continue reading.
  • Step S208 The left border of the window is shifted to the right, and the window is shortened to include only the newly read words.
  • Step S209 It is judged whether the text has been traversed: If yes, step S210 is performed; if no, proceed to step S202 to continue reading.
  • Step S210 When the text traversal is completed, one or more lengths L have been recorded, and the importance of the text is determined according to the maximum recorded length: if the maximum length L is greater than the set threshold, the text exhibits cheating; otherwise, there is no cheating in the text.
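  • The flow of FIG. 2 can be sketched in Python as follows. This is a minimal illustration rather than the embodiment itself. It assumes the text is already split into words, that the length recorded when the capacity exceeds Cmax is the length of the window just before the overflowing word, and that the window still open at the end of the text is also recorded. The function names are hypothetical.

```python
def max_window_length(words, c_max):
    """Traverse `words` with a capacity-bounded window and return the
    largest window length L observed."""
    vocab = set()   # window vocabulary (distinct words in the window)
    length = 0      # window length L (total words in the window)
    longest = 0     # largest recorded L
    for word in words:
        length += 1                  # right border moves right
        if word not in vocab:
            vocab.add(word)          # capacity C grows by one
            if len(vocab) > c_max:
                # Assumption: record the window as it was before this word.
                longest = max(longest, length - 1)
                # Left border moves right: keep only the newly read word.
                vocab = {word}
                length = 1
    # Assumption: the window still open at the end of the text also counts.
    return max(longest, length)


def is_cheating_by_length(words, c_max, threshold):
    """Flag the text when the maximum window length exceeds the threshold."""
    return max_window_length(words, c_max) > threshold


if __name__ == "__main__":
    spam = "buy cheap buy cheap buy cheap buy cheap".split()
    print(max_window_length(spam, c_max=3))
    print(is_cheating_by_length(spam, c_max=3, threshold=6))
```

  • A text that keeps repeating a few words never fills the vocabulary, so its window keeps growing; a text with varied wording resets the window frequently and stays short.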
  • FIG. 3 is a flow chart of a method for identifying cheat text in an embodiment of the present invention. As shown in Figure 3, the method includes:
  • Step S302 Read the next word.
  • Step S303 The right border of the sliding window is shifted to the right, and the read new word is included in the sliding window.
  • Step S305 The window vocabulary and the capacity C are unchanged, and the length L is incremented. After the step ends, the process proceeds to step S307.
  • Step S306 The word is added to the window vocabulary, the capacity C is incremented, and the length L is incremented. After the step ends, the process proceeds to step S307.
  • Step S307 It is judged whether the length L exceeds the threshold; if yes, the process proceeds to step S311, otherwise, the process proceeds to step S308.
  • Step S308 determining whether the window capacity C exceeds the set maximum value Cmax; if yes, executing step S309; if not, proceeding to step S310.
  • Step S309 The left border of the window is shifted to the right, and the window is shortened to include only the newly read words.
  • Step S310 Determine whether the text has been traversed: If yes, go to step S312; if no, go to step S302 to continue reading.
  • Step S311 It is determined that the text has a cheating phenomenon.
  • Step S312 It is determined that the text does not have a cheating phenomenon.
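  • The variant of FIG. 3, which stops as soon as the window length exceeds the threshold, can be sketched as follows. This is an illustrative sketch under the assumption that the text is already split into words; the function name `has_cheating_early_exit` is hypothetical.

```python
def has_cheating_early_exit(words, c_max, threshold):
    """Return True as soon as the window length L exceeds `threshold`."""
    vocab = set()   # window vocabulary
    length = 0      # window length L
    for word in words:               # S302/S303: read word, extend window
        vocab.add(word)              # S305/S306: C grows only for new words
        length += 1
        if length > threshold:       # S307 -> S311: cheating detected
            return True
        if len(vocab) > c_max:       # S308 -> S309: shrink to the new word
            vocab = {word}
            length = 1
    return False                     # S310 -> S312: text fully traversed


if __name__ == "__main__":
    print(has_cheating_early_exit(["spam"] * 5, c_max=3, threshold=4))
    print(has_cheating_early_exit("a b c d e f g h".split(), c_max=3, threshold=4))
```

  • Compared with recording every length and checking at the end, this form avoids traversing the remainder of a text once cheating has been established.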
  • The threshold values are all closely related to the maximum value of the window capacity C: the larger the maximum value of the window capacity C, the larger the threshold can be set; conversely, the smaller the maximum value of the window capacity C, the smaller the threshold should be.
  • The embodiment of the invention further provides a method for recognizing cheat text, which traverses the text to be detected using a moving sliding window. The sliding window moves as follows: the window length of the sliding window is increased from the initial value, and each time the window length is increased, the window capacity of the sliding window is recorded; when the window length reaches the maximum value, the window length of the sliding window is restored to the initial value and the sliding window is moved so that it contains only the last word traversed.
  • The moving process of the sliding window is repeated in turn. During the traversal, or after the entire text has been traversed, whether the text exhibits cheating is judged according to the relationship between the window capacity and a preset threshold. Here the threshold is set according to the maximum value of the window length.
  • The process of determining whether the text exhibits cheating according to the relationship between the window capacity and a preset threshold may be: recording the window capacity each time the window length reaches the maximum value; comparing the minimum recorded window capacity with the preset threshold; and, if it is less than the threshold, determining that the text exhibits cheating.
  • Alternatively, the process may be: comparing each recorded window capacity with the preset threshold, and, if it is less than the threshold, determining that the text exhibits cheating.
  • The threshold is proportional to the maximum value of the window length: the smaller the window capacity, the greater the probability that the text exhibits cheating; but when the maximum window length is increased, the window capacity required of a text without cheating may be allowed to increase accordingly.
  • FIG. 4 is a flow chart of a method for identifying cheat text in an embodiment of the present invention.
  • The entire text is traversed from left to right using a sliding window whose maximum length is set, i.e., the window cannot exceed the maximum length. The smaller the window capacity, the more likely it is that the text exhibits cheating.
  • Step S402 Determine whether the next word is successfully read: If yes, execute S403; if no, go to step S410.
  • Step S403 The right border of the sliding window is shifted to the right, and the read new word is included in the sliding window.
  • Step S405 The window vocabulary and the capacity C are unchanged, and the length L is incremented. After the step ends, the process proceeds to step S402 to continue reading.
  • Step S406 The word is added to the window vocabulary, the capacity C is incremented, and the length L is incremented.
  • Step S407 Determine whether the window length L exceeds the set maximum value Lmax: If yes, execute step S408; if no, proceed to step S402 to continue reading.
  • Step S408 The left border of the window is shifted to the right, and the window is shortened to include only the newly read words.
  • Step S409 It is judged whether the text has been traversed: If yes, step S410 is performed; if no, proceed to step S402 to continue reading.
  • Step S410 When the text traversal is completed, one or more window capacities C have been recorded, and the importance of the text is determined according to the minimum recorded capacity: if the minimum capacity C is less than the set threshold, the text exhibits cheating; otherwise, there is no cheating in the text.
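  • The flow of FIG. 4 can be sketched in Python as follows. This is a minimal illustration with several assumptions that are not stated in the flow chart: the window length is checked after every word, a capacity is recorded only when the window reaches exactly Lmax words, and a trailing window shorter than Lmax is ignored. The function names are hypothetical.

```python
def min_window_capacity(words, l_max):
    """Traverse `words` with a length-bounded window and return the smallest
    capacity C seen in any full window, or None if no window was filled."""
    vocab = set()     # window vocabulary
    length = 0        # window length L
    smallest = None   # smallest recorded capacity
    for word in words:
        vocab.add(word)
        length += 1
        if length >= l_max:          # window is full: record its capacity
            capacity = len(vocab)
            if smallest is None or capacity < smallest:
                smallest = capacity
            # Left border moves right: keep only the newly read word.
            vocab = {word}
            length = 1
    return smallest


def is_cheating_by_capacity(words, l_max, threshold):
    """Flag the text when some full window holds fewer than `threshold`
    distinct words."""
    smallest = min_window_capacity(words, l_max)
    return smallest is not None and smallest < threshold


if __name__ == "__main__":
    words = "a b c d a a a a a".split()
    print(min_window_capacity(words, l_max=4))
    print(is_cheating_by_capacity(words, l_max=4, threshold=3))
```

  • Ignoring the trailing partial window avoids flagging a text merely because its last few words happen to fall into a short window.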
  • FIG. 5 is a flow chart of a method for identifying cheat text in an embodiment of the present invention. As shown in Figure 5, the method includes:
  • Step S502 Read the next word.
  • Step S503 The right border of the sliding window is shifted to the right, and the read new word is included in the sliding window.
  • Step S504 determining whether the word already exists in the window vocabulary: if yes, executing step S505; if not, executing step S506.
  • Step S505 The window vocabulary and the capacity C are unchanged, and the length L is incremented. After the step ends, the process proceeds to step S507.
  • Step S506 the word is added to the window vocabulary, the capacity C is incremented, and the length L is incremented. After the end, the process proceeds to step S507.
  • Step S507 It is judged whether the capacity C is smaller than the threshold; if it is less than the threshold, the process proceeds to step S511, otherwise, the process proceeds to step S508.
  • Step S508 determining whether the window length L exceeds the set maximum value Lmax; if yes, executing step S509; if not, proceeding to step S510.
  • Step S509 The left border of the window is shifted to the right, and the window is shortened to include only the newly read words.
  • Step S510 Determine whether the text has been traversed: If yes, execute step S512; if no, proceed to step S502 to continue reading.
  • Step S511 It is determined that the text has a cheating phenomenon.
  • Step S512 It is determined that the text does not have a cheating phenomenon.
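  • The variant of FIG. 5 can be sketched in the same way. Note one assumption that departs from a literal reading of step S507: the capacity is compared with the threshold only once the window has reached Lmax words, since a window holding a single word would otherwise always fall below the threshold. The function name `has_low_capacity_window` is hypothetical.

```python
def has_low_capacity_window(words, l_max, threshold):
    """Return True as soon as a full window of `l_max` words contains fewer
    than `threshold` distinct words."""
    vocab = set()   # window vocabulary
    length = 0      # window length L
    for word in words:               # S502/S503: read word, extend window
        vocab.add(word)              # S505/S506: C grows only for new words
        length += 1
        if length >= l_max:          # window is full
            if len(vocab) < threshold:
                return True          # S511: cheating detected
            vocab = {word}           # S509: shrink to the newly read word
            length = 1
    return False                     # S512: text fully traversed


if __name__ == "__main__":
    print(has_low_capacity_window(["x"] * 10, l_max=5, threshold=3))
    print(has_low_capacity_window("a b c d e f g h i j".split(), l_max=5, threshold=3))
```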
  • the threshold values are all closely related to the maximum value of the window length L, that is, when the maximum value of the window length L is larger, the set threshold value can also be larger. Conversely, the smaller the maximum value of the window length L, the smaller the set threshold should be.
  • The order of traversing the text described above is from beginning to end, so when the window length increases from the initial value, the right border of the window moves to the right, and when the window length is restored to the initial value, the left border of the window is shifted to the right.
  • The order of traversing the text can also be from end to beginning. In that case, when the window length increases from the initial value, the left border of the window moves to the left, and when the window length is restored to the initial value, the right border of the window is shifted to the left.
  • The position in the queue of the text with cheating behavior may be corrected according to the recognition result as follows.
  • The sorting parameter corresponding to every text with cheating behavior may be corrected by a fixed amplitude.
  • For example, when sorting is performed according to the relevance weight between the search term and the text, the relevance weight between the search term and the text is generally reduced by a fixed amplitude.
  • Alternatively, the window capacity and window length obtained from the text cheating recognition algorithm may be used to evaluate the degree of cheating of different texts more accurately, so that texts with different degrees of cheating are processed differently and texts with more severe cheating are processed more strictly. For example, the position of a text with severe cheating is moved further back in the queue, or the sorting parameter corresponding to a text with severe cheating is corrected by a larger amplitude.
  • To compare two texts, consider the window capacity and window length of the sliding window of each text. If the sliding windows of the two texts have the same window capacity, the text whose sliding window has the larger window length has the greater degree of cheating. If the sliding windows of the two texts have the same window length, the text whose sliding window has the smaller window capacity has the greater degree of cheating.
  • More generally, the ratio of window capacity to window length can be calculated for the sliding window of each text: the text whose sliding window has the smaller ratio of window capacity to window length has the greater degree of cheating.
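  • The ratio comparison can be sketched as follows. This is only an illustration, assuming each sliding window is summarized as a (capacity, length) pair of positive integers. The function name `more_cheated` and its return convention are hypothetical. The two ratios are compared by cross-multiplication, which is equivalent for positive lengths and avoids floating-point rounding.

```python
def more_cheated(window_a, window_b):
    """Compare two (capacity, length) pairs and return "a" or "b" for the
    window with the smaller capacity-to-length ratio, or None on a tie."""
    (cap_a, len_a), (cap_b, len_b) = window_a, window_b
    if len_a <= 0 or len_b <= 0:
        raise ValueError("window length must be positive")
    # cap_a / len_a < cap_b / len_b  <=>  cap_a * len_b < cap_b * len_a
    left = cap_a * len_b
    right = cap_b * len_a
    if left < right:
        return "a"
    if right < left:
        return "b"
    return None


if __name__ == "__main__":
    print(more_cheated((3, 30), (3, 10)))   # same capacity, longer window
    print(more_cheated((5, 20), (2, 20)))   # same length, smaller capacity
```

  • The two special cases described above (equal capacity, equal length) fall out of the same comparison.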
  • Figure 6 is a structural diagram of a text sorting apparatus in an embodiment of the present invention. As shown in Figure 6, the device includes:
  • a text recognition module 601 configured to identify text with cheating behavior
  • the sorting correction module 602 is configured to correct the position of the text with the cheating behavior in the sorting queue according to the recognition result.
  • Figure 7 is a structural diagram of a text cheat recognition apparatus in an embodiment of the present invention. As shown in Fig. 7, the apparatus includes a window length control unit 701, a window length recording unit 702, a window capacity recording unit 703, and a threshold comparison unit 704.
  • The window length control unit 701 is configured to control the movement of the sliding window over the text and the change of the window length: when the window capacity recorded by the window capacity recording unit 703 is less than the maximum value, the window length is gradually increased; when the window capacity recorded by the window capacity recording unit 703 reaches the maximum value, the window length of the sliding window is restored to the initial value and the sliding window is moved so that it contains only the last word traversed; and when the threshold comparison unit 704 determines that the text exhibits cheating, the movement of the sliding window over the text is stopped.
  • the window length recording unit 702 is configured to record the window length of the sliding window each time the window length is increased.
  • The window capacity recording unit 703 is configured to record the window capacity of the sliding window each time the window length is increased, to notify the window length control unit 701 of the value of the window capacity, and to start counting again from the initial value.
  • the threshold comparison unit 704 is configured to determine that the text is cheating according to the relationship between the window length and a preset threshold; wherein the preset threshold is set according to the maximum value of the window capacity.
  • The functions of the four units of the device can alternatively be:
  • The window length control unit 701 is configured to control the movement of the sliding window over the text and the change of the window length: when the window length recorded by the window length recording unit 702 is less than the maximum value, the window length is gradually increased; when the window length recorded by the window length recording unit 702 reaches the maximum value, the window length of the sliding window is restored to the initial value and the sliding window is moved so that it contains only the last word traversed; and when the threshold comparison unit 704 determines that the text exhibits cheating, the movement of the sliding window over the text is stopped.
  • the window length recording unit 702 is configured to record the window length of the sliding window each time the window length is increased, and notify the window length control unit 701 and the window capacity recording unit 703 of the value of the window length.
  • the window capacity recording unit 703 is configured to record the window capacity of the sliding window each time the window length is increased; and when the window length recorded by the window length recording unit 702 reaches the maximum value, counting from the initial value.
  • The threshold comparison unit 704 is configured to determine that the text exhibits cheating according to the relationship between the window capacity and a preset threshold, wherein the preset threshold is set according to the maximum value of the window length.
  • In the following, the method and the device in the embodiment of the present invention are illustrated by taking the case where the text is a blog post and the sorting is search sorting.
  • The text in the embodiment of the present invention may also be web page text or any other text that requires a sorting operation, and the sorting scenario is not limited to search sorting.
  • The embodiment of the present invention identifies text with cheating when calculating the text relevance weight and reduces the weight of that text. This allows a more accurate index to be established, thereby improving the objective accuracy of sorting based on this index and ensuring the quality of text retrieval by users.
  • FIG. 8 is a structural diagram of a blog article retrieval sorting system in an embodiment of the present invention.
  • the system includes a blog system 100, an indexer 200, a retriever 300, an agent 400, and a client 500.
  • The connection relationships between the devices in all the figures of the present invention are shown for the purpose of clearly explaining their information interaction and control processes; they should therefore be regarded as logical connections and are not limited to physical connections. Specifically:
  • The blog system 100 is used to provide blog-related services for users, including storing and managing blog posts, and, in the present invention, provides relevance factors for the indexer 200, including text relevance factors (e.g., text classification, title, body, nickname, space name) and numerical correlation factors (e.g., activity factor, reload factor, response rate factor, publication time factor).
  • the core of the blog system 100 can be a web server, but the invention is not limited to its specific form.
  • The indexer 200 is configured to build an index based on data in the blog system 100, so that the retriever 300 can sort the retrieved blog posts based on the index.
  • the retriever 300 queries and sorts the blog articles based on the search terms input by the user.
  • The agent 400 is configured to receive the search string sent by the client 500, divide the search string into search terms, send them to the retriever 300, and forward the retrieved and sorted results of the retriever 300 to the client 500.
  • The client 500 receives the search term or search string input by the user. If the user inputs a search term, the client can send it directly to the retriever 300 and, after receiving the blog post sorting result fed back by the retriever 300, render the sorting result and display it on the user interface. If the user inputs a search string, it must be sent to the agent 400 for segmentation; after receiving the blog post sorting result fed back by the agent 400, the client renders the sorting result and displays it on the user interface.
  • The client 500 is typically any of a variety of terminal devices capable of logging in to the Internet, such as a personal computer (PC), a personal digital assistant (PDA), or a mobile phone (MP), and thus the scope of protection of the present invention should not be limited to a particular type of client.
  • FIG. 9 is a structural diagram of an indexer in a blog article retrieval sorting system according to an embodiment of the present invention.
  • the indexer 200 includes: a numerical correlation determination unit 201, a text correlation determination unit 202, a text cheat recognition unit 203, a superposition calculation unit 204, and an index construction unit 205.
  • the numerical correlation determining unit 201 is configured to calculate a numerical correlation weight of the search term and each blog post based on the numerical correlation factor extracted from the blog system.
  • The text relevance determining unit 202 is configured to calculate a text relevance weight of the search term and each blog post based on the text relevance factor extracted from the blog system.
  • the text cheat recognition unit 203 is configured to recognize the blog article in which the text is cheated when the text relevance determination unit 202 calculates the text relevance weight between the search term and the blog post.
  • the superposition calculation unit 204 is configured to perform superposition calculation on the foregoing numerical correlation weight and text relevance weight to obtain a comprehensive correlation weight of the search term, and send it to the index construction unit 205.
  • the index construction unit 205 builds an index based on the comprehensive correlation weight.
  • In another embodiment, the indexer 200 includes a text relevance determination unit 202, a text cheat recognition unit 203, and an index construction unit 205.
  • the text relevance determining unit 202 is configured to calculate a text relevance weight of the search term and each blog post based on the text relevance factor extracted from the blog system.
  • the text cheat recognition unit 203 is used to identify blog posts in which the text is cheated, and the index construction unit 205 constructs an index between the search term and each blog post according to the text relevance weight.
  • because the process of constructing an index in this embodiment considers only the text correlation factor, the accuracy of the index is not high enough.
  • by comparison, the indexer 200 shown in Figure 2 has a higher index accuracy.
  • FIG. 10 is a structural diagram of a searcher in a blog article retrieval sorting system according to an embodiment of the present invention.
  • the retriever 300 includes a query unit 301, a composite correlation calculation unit 302, and a sorting unit 303.
  • the user initially inputs a search string containing a plurality of search terms; the search string is divided into search terms and sent to the retriever 300, and the retriever 300 receives the search terms and then processes them.
  • the query unit 301 queries the relevance weight (text relevance weight, or comprehensive relevance weight) between each search term and each blog post from the index that has been established by the indexer, and sends it to the sorting unit.
  • the compound correlation calculation unit 302 calculates the composite correlation weight between the search string and each blog post based on the correlation weight of each search term, and sends it to the sorting unit 303.
  • the sorting unit 303 sorts each blog post related to the search string according to the composite correlation weight.
  • a retriever 300 is provided that can be directly connected to and communicated with a client, and is suitable for situations in which a user inputs a search term rather than a search string.
  • the retriever 300 at this time includes a query unit 301 and a sorting unit 303.
  • the query unit 301 queries, according to the search term input by the user, the correlation weight (text relevance weight, or comprehensive relevance weight) between the search term and each blog post from the index that has been established by the indexer 200, and sends it to the sorting unit 303.
  • the sorting unit 303 sorts each blog post related to the search term according to the size of the received correlation weight. It should be noted that since most users currently input a search string containing a plurality of search terms, the structure of the retriever 300 shown in Fig. 3 is more widely used and more typical.
  • FIG. 11 is a flowchart of a method for establishing an index in a blog article retrieval order according to an embodiment of the present invention. As shown in FIG. 11, the method includes the following steps:
  • Step S1101 Extract a correlation factor from the blog system, and format the data.
  • the formatting mentioned here includes normalizing some correlation factors and applying further processing to some correlation factors, such as Log processing, so that the values of most correlation factors are mapped into a fixed interval, for example [0, 100]. Of course, some correlation factors keep their original values.
  • the correlation factor referred to in the present invention may include only a text correlation factor, or both a text correlation factor and a numerical correlation factor. These correlation factors are used as input parameters for the correlation weight calculation when the indexer builds the index.
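The formatting step can be illustrated with a minimal sketch. The source does not give the exact mapping, so the function name `log_scale`, the use of a natural logarithm, and the parameter `max_value` (the largest raw value expected for a factor) are all assumptions made for illustration only.

```python
import math


def log_scale(value, max_value, upper=100.0):
    """Map a raw, non-negative factor value into [0, upper] on a log scale.

    Hypothetical helper: the source only says that some factors receive
    "Log processing" and end up in a fixed interval such as [0, 100].
    """
    if max_value <= 0:
        return 0.0
    # Clip so that out-of-range inputs still land inside the interval.
    clipped = min(max(value, 0), max_value)
    return upper * math.log1p(clipped) / math.log1p(max_value)


# Example: a reply count of 10 out of an assumed maximum of 1000.
print(round(log_scale(10, 1000), 2))
```

A logarithmic mapping compresses very large counts, so that a handful of extremely popular posts does not dominate the interval.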
  • Step S1102 Calculate the correlation weight of the search term and each blog post, and identify and degrade the blog article with text cheating.
  • in one embodiment, only the text relevance factor is considered: the indexer calculates the text relevance weight of the search term based on the text relevance factor, identifies the blog posts in which the text is cheated, and then appropriately reduces the text relevance weight between the search term and those blog posts.
  • in another embodiment, the indexer considers not only the text correlation factor but also the numerical correlation factor: it calculates the text relevance weight and the numerical correlation weight separately, identifies the blog posts with text cheating, appropriately reduces the text relevance weight between the search term and those blog posts, and finally superimposes the text correlation weight and the numerical correlation weight to obtain the comprehensive correlation weight.
  • as in the previous embodiment, the weight reduction is applied only to the text correlation weight. Because this embodiment also considers the numerical correlation factor, it further increases the accuracy of the data.
  • Step S1103 Construct an index between the search term and each blog post according to the correlation weight after the weight reduction.
  • the index records each search term, the blog posts corresponding to the search term, and the relevance weight between the search term and each blog post, so that when the user inputs the search term for searching, the retrieved blog articles can be sorted according to the data in the index and the user can quickly find the most relevant blog posts.
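One possible in-memory layout for such an index is sketched below. The nested-dictionary structure and the names `build_index` and `ranked_posts` are illustrative assumptions; the source does not specify how the index is stored.

```python
from collections import defaultdict


def build_index(triples):
    """Build term -> {post_id: weight} from (term, post_id, weight) triples.

    Hypothetical structure: the weights are assumed to be the already
    demoted correlation weights produced in step S1102.
    """
    index = defaultdict(dict)
    for term, post_id, weight in triples:
        index[term][post_id] = weight
    return index


def ranked_posts(index, term):
    """Return the post ids hit by a term, highest weight first."""
    postings = index.get(term, {})
    return sorted(postings, key=postings.get, reverse=True)


index = build_index([
    ("cat", "p1", 0.2),
    ("cat", "p2", 0.9),
    ("dog", "p1", 0.5),
])
print(ranked_posts(index, "cat"))
```

Because the weights are computed when the index is built, a lookup at query time only needs to read and sort the stored values.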
  • FIG. 12 is a flowchart of a method for establishing an index in a blog article retrieval order according to an embodiment of the present invention. As shown in Figure 12, the process specifically includes:
  • Step S1201 Extract a correlation factor from the blog system, and format the data.
  • the correlation factor at this time includes a text correlation factor and a numerical correlation factor.
  • Step S1202 The indexer calculates a numerical correlation weight of the search term and each blog post.
  • the numerical correlation factors comprise the following four: the activity factor, the reprint rate factor, the reply rate factor, and the publication time factor.
  • the activity factor is calculated by the blog system, and its value range is [0, 100]. It comprehensively considers the user registration frequency of the blog personal space, the frequency of blog post publication, and so on, and is a comprehensive measure of the activity level of the blog personal space; the higher the activity, the higher the priority of the blog post in the ranking results.
  • the reprint rate factor is calculated by the blog system from the number of times the blog post has been reprinted, and its value range is [0, 100]; the higher the reprint rate, the higher the priority of the blog post in the ranking results.
  • the reply rate factor is calculated from the number of replies to the blog post, and its value range is [0, 100]; the higher the reply rate factor, the higher the priority of the blog post in the ranking results.
  • the publication time factor is the publication time of the blog post, which can be expressed in UNIX time; more recently published blog posts have a higher priority in the ranking results.
  • the numerical correlation weight is calculated from all the correlation factors listed above and normalized so that its value lies in the interval [0, 1]; its calculation formula is as follows:
  • Step S1203 The indexer calculates a text relevance weight of the search term and each blog post, and identifies a blog post with text cheats, and performs a demotion process on the blog post with the text cheat.
  • the text relevance factor is also the text field available for retrieval.
  • the text fields include five categories: a category, a title, a body, a nickname, and a space name.
  • Each field has a fixed weight value W and a correction coefficient, as shown in Table 1.
  • identifying the blog articles with text cheating further includes: traversing the blog article with a sliding window and recording the maximum length reached by the sliding window; and comparing the maximum length of the sliding window with a threshold, where the blog post is determined to be a text cheat if the threshold is exceeded. A blog post with text cheating is then appropriately demoted, for example by a proportional adjustment that corrects the text correlation weight to 60% of its previous value.
  • Step S1204 The indexer uses the superposition calculation unit to perform superposition calculation on the numerical correlation weight and the text correlation weight to obtain the comprehensive correlation weight.
  • Step S1205 The indexer builds the index based on the comprehensive correlation weights and stores the data so that it can be extracted and used when the user searches.
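The demotion described in step S1203 can be sketched as follows. The 60% factor comes from the example in the text; the function name `demote_if_cheating`, its parameters, and the strict "greater than" comparison are assumptions for illustration.

```python
def demote_if_cheating(text_weight, max_window_length, threshold, factor=0.6):
    """Scale down the text correlation weight of a post flagged as cheating.

    Hypothetical helper: max_window_length is assumed to be the largest
    sliding-window length recorded while traversing the post.
    """
    if max_window_length > threshold:
        return text_weight * factor
    return text_weight


# A post whose window grew to 30 words against a threshold of 20 is demoted.
print(demote_if_cheating(0.5, max_window_length=30, threshold=20))
```

Only the text correlation weight is scaled here, which matches the statement that the numerical correlation weight is superimposed afterwards in step S1204.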
  • FIG. 13 is a flowchart of a method for retrieving a blog article in an embodiment of the present invention. This embodiment is a case where a user inputs a search term, including:
  • Step S1301 The retriever receives the search term input by the user in the client.
  • Step S1302 The retriever extracts, from the index that has been constructed by the indexer, the correlation weight between the search term and each blog article; the correlation weight may be a text relevance weight, or may be the comprehensive correlation weight obtained by superimposing the text correlation weight and the numerical correlation weight.
  • Step S1303 The retriever sorts the retrieved blog articles according to the correlation weights, and feeds the sorting result back to the client.
  • FIG. 14 is a flowchart of a method for retrieving a blog article in an embodiment of the present invention. This embodiment is a case where the user inputs a search string, and specifically includes:
  • Step S1401 The agent divides the search string input by the user in the client into search terms and sends them to the retriever.
  • Step S1402 The retriever extracts, from the index constructed by the indexer, the correlation weight between each search term and the blog articles; the correlation weight may be a text relevance weight, or may be the comprehensive correlation weight obtained by superimposing the text correlation weight and the numerical correlation weight.
  • Step S1403 The retriever calculates a composite correlation weight of the search string and the blog article.
  • the relevance of the search string input by the user to a blog post can be regarded as the combined result of the relevance of each individual search term to that blog post. Therefore, in one embodiment, the composite correlation weight is calculated with a model of simple summation followed by averaging. Let the search string be Q = {q1, q2, ..., qn}, where n is the number of search terms after the search string is segmented, and let d be a blog article hit by a search term qn; then the formula for calculating the composite correlation weight between the search string Q and the blog article is:
  • Step S1404 The retriever sorts the searched blog articles according to the composite correlation weights, and sends the sorting result to the agent.
  • Step S1405 The agent forwards the sort result to the client, and displays the sort result on the user interface.
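Steps S1402 to S1404 can be sketched under the "sum, then average" model mentioned above. The formula itself is not legible in this text, so the sketch assumes that a search term which does not hit a post contributes a weight of 0 for that post; the names `composite_weights` and `rank_for_query` and the dictionary-based index are likewise illustrative.

```python
def composite_weights(index, terms):
    """Average the per-term weights of every post hit by at least one term.

    index is assumed to map term -> {post_id: weight}; a term that does
    not hit a post is assumed to contribute 0 to that post's sum.
    """
    totals = {}
    for term in terms:
        for post_id, weight in index.get(term, {}).items():
            totals[post_id] = totals.get(post_id, 0.0) + weight
    n = len(terms)
    return {post_id: total / n for post_id, total in totals.items()}


def rank_for_query(index, terms):
    """Sort the hit posts by composite weight, highest first."""
    scores = composite_weights(index, terms)
    return sorted(scores, key=scores.get, reverse=True)


index = {"cat": {"p1": 0.2, "p2": 0.8}, "dog": {"p1": 0.9}}
print(rank_for_query(index, ["cat", "dog"]))
```

Dividing by the number of search terms keeps the composite weight on the same scale as the per-term weights, whatever the length of the search string.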

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A text sorting method and apparatus are disclosed. The text sorting method includes: recognizing text that exhibits cheating; and correcting the position of the cheating text in the sorting queue according to the recognition result. A text cheating recognition method and apparatus are also disclosed.

Description

Text Sorting Method and Apparatus, and Text Cheating Recognition Method and Apparatus
TECHNICAL FIELD The present invention relates to the field of computers, and more particularly to a text sorting method and apparatus and to a text cheating recognition method and apparatus.
BACKGROUND With the development of the Internet, the weblog (abbreviated "blog") has become a common network service. A large number of Internet companies have launched their own blog search engines. These engines sort the retrieved blog posts in different ways, but all of them process the search string entered by the user to find the most relevant set of results and return it to the user, so that the user can find the blog posts that best match his or her expectations. The two prevalent sorting methods are sorting by relevance and sorting by time, of which sorting by relevance is the more typical.
Sorting by relevance proceeds as follows. First, the text relevance weight between the search string and each blog post, and the numerical relevance weight of each blog post, are calculated, and an index between search strings and blog posts is built from these relevance weights. When the user searches, the index is searched with the search string entered by the user, the blog posts are sorted by relevance weight, and the sorted result is sent to the user for display. When the text relevance weight between the search string and each blog post is calculated, the search string is generally decomposed into several search terms, so that the text relevance weight between the search string and a blog post decomposes into the text relevance weights between the individual search terms and the blog post.
Although this sorting method gives users blog post rankings that are credible to some degree, certain low-quality posts consist, in whole or in part, of only a few words repeated over and over. Under this sorting method such posts can obtain a high ranking through the repetition and stacking of words, which is a typical form of text cheating.
Text cheating likewise affects the sorting of texts other than blog posts, such as the sorting of web pages during a search.
A new text sorting method is therefore needed to reduce the influence of text cheating on sorting results.
SUMMARY OF THE INVENTION Embodiments of the present invention provide a text sorting method and apparatus to reduce the influence of text cheating on sorting results.
Embodiments of the present invention also provide a text cheating recognition method and apparatus to identify text that exhibits cheating behavior.
A text sorting method provided by an embodiment of the present invention comprises:
identifying text that exhibits cheating behavior;
A text sorting apparatus provided by an embodiment of the present invention, in which text quality is a basis for text sorting, comprises:
a first module, configured to identify text that exhibits cheating;
a second module, configured to correct, according to the recognition result of the first module, the position of the text that exhibits cheating in the sorting queue.
A text cheating recognition method provided by an embodiment of the present invention comprises:
traversing the text with a moving sliding window, where the sliding window moves as follows: the window length of the sliding window is increased step by step from its initial value, and the window capacity of the sliding window is recorded each time the window length is increased; when the window capacity reaches its maximum value, the window length is restored to its initial value and the sliding window is moved so that it contains only the most recently traversed word;
repeating this movement of the sliding window until, during traversal of the text or after the entire text has been traversed, the text is determined to exhibit cheating according to the relationship between the window length and a preset threshold;
wherein the window capacity is the number of distinct words contained in the sliding window;
the window length is the total number of words contained in the sliding window;
the preset threshold is set according to the maximum value of the window capacity.
A text cheating recognition apparatus provided by an embodiment of the present invention comprises:
a first unit, configured to control the movement of the sliding window over the text and the change of the window length: while the window capacity recorded by the third unit is below its maximum value, the first unit increases the window length step by step; when the window capacity recorded by the third unit reaches its maximum value, the first unit restores the window length to its initial value and moves the sliding window so that it contains only the most recently traversed word; and when the fourth unit determines that the text exhibits cheating, the first unit stops the movement of the sliding window over the text;
a second unit, configured to record the window length of the sliding window each time the window length is increased;
a third unit, configured to record the window capacity of the sliding window each time the window length is increased, to notify the first unit of the window capacity value, and to restart counting from the initial value;
a fourth unit, configured to determine that the text exhibits cheating according to the relationship between the window length and a preset threshold, where the preset threshold is set according to the maximum value of the window capacity.
A text cheating recognition method provided by an embodiment of the present invention comprises:
traversing the text with a moving sliding window, where the sliding window moves as follows: the window length of the sliding window is increased step by step from its initial value, and the window capacity of the sliding window is recorded each time the window length is increased; when the window length reaches its maximum value, the window length is restored to its initial value and the sliding window is moved so that it contains only the most recently traversed word;
repeating this movement of the sliding window until, during traversal of the text or after the entire text has been traversed, the text is determined to exhibit cheating according to the relationship between the window capacity and a preset threshold;
wherein the window capacity is the number of distinct words contained in the sliding window;
the window length is the total number of words contained in the sliding window;
the preset threshold is set according to the maximum value of the window capacity.
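The fixed-length variant above can be sketched as follows. The text does not state how the window capacity is compared with the threshold, so the sketch only computes the smallest capacity seen over all full windows; treating a small value as a sign of cheating is an assumption, as are the name `min_window_capacity` and the choice to ignore a trailing partial window.

```python
def min_window_capacity(words, l_max):
    """Return the fewest distinct words seen in any full window of l_max words.

    Assumed reading of the method: each window grows to l_max words, its
    capacity (number of distinct words) is recorded, and the next window
    starts at the last word just traversed. Returns None if the text is
    shorter than one full window.
    """
    if l_max < 2:
        raise ValueError("l_max must be at least 2")
    capacities = []
    start = 0
    while start + l_max <= len(words):
        capacities.append(len(set(words[start:start + l_max])))
        start += l_max - 1  # the new window keeps only the last traversed word
    return min(capacities) if capacities else None


words = ["a", "b", "c", "d", "spam", "spam", "spam", "spam", "e", "f"]
print(min_window_capacity(words, 4))
```

Under this reading, a caller would flag the text when the returned value falls below a threshold chosen for the given maximum window length.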
A text cheating recognition apparatus provided by an embodiment of the present invention comprises:
a first unit, configured to control the movement of the sliding window over the text and the change of the window length: while the window length recorded by the second unit is below its maximum value, the first unit increases the window length step by step; when the window length recorded by the second unit reaches its maximum value, the first unit restores the window length to its initial value and moves the sliding window so that it contains only the most recently traversed word; and when the fourth unit determines that the text exhibits cheating, the first unit stops the movement of the sliding window over the text;
a second unit, configured to record the window length of the sliding window each time the window length is increased, and to notify the first unit and the third unit of the window length value;
a third unit, configured to record the window capacity of the sliding window each time the window length is increased, and to restart counting from the initial value when the window length recorded by the second unit reaches its maximum value;
a fourth unit, configured to determine that the text exhibits cheating according to the relationship between the window length and a preset threshold, where the preset threshold is set according to the maximum value of the window capacity.
With the text sorting method provided by the embodiments of the present invention, text that exhibits cheating behavior is identified and the sorting result is corrected according to the recognition result. For sorting methods in which text quality is an important basis for sorting, this reduces the influence of text cheating on the sorting result and improves the objectivity of the sorting.
With the text cheating recognition method provided by the embodiments of the present invention, the process of recognizing text cheating is quantified, either by calculating the window length needed to hold a given window capacity and comparing it with a preset threshold, or by calculating the window capacity within a given window length and comparing it with a preset threshold. This makes the recognition of text cheating more objective.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a flowchart of a text sorting method in an embodiment of the present invention.
FIG. 2 is a flowchart of a method for identifying cheating text in an embodiment of the present invention.
FIG. 3 is a flowchart of a method for identifying cheating text in an embodiment of the present invention.
FIG. 4 is a flowchart of a method for identifying cheating text in an embodiment of the present invention.
FIG. 5 is a flowchart of a method for identifying cheating text in an embodiment of the present invention.
FIG. 6 is a structural diagram of a text sorting apparatus in an embodiment of the present invention.
FIG. 7 is a structural diagram of a text cheating recognition apparatus in an embodiment of the present invention.
FIG. 8 is a structural diagram of a blog article retrieval and sorting system in an embodiment of the present invention.
FIG. 9 is a structural diagram of the indexer in the blog article retrieval and sorting system in an embodiment of the present invention.
FIG. 10 is a structural diagram of the retriever in the blog article retrieval and sorting system in an embodiment of the present invention.
FIG. 11 is a flowchart of a method for building an index in blog article retrieval and sorting in an embodiment of the present invention.
FIG. 12 is a flowchart of a method for building an index in blog article retrieval and sorting in an embodiment of the present invention.
FIG. 13 is a flowchart of a blog article retrieval and sorting method in an embodiment of the present invention.
FIG. 14 is a flowchart of a blog article retrieval and sorting method in an embodiment of the present invention.
MODE FOR CARRYING OUT THE INVENTION To make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the present invention and are not intended to limit it.
FIG. 1 is a flowchart of a text sorting method in an embodiment of the present invention. As shown in FIG. 1, the method includes:
Step S101: Identify text that exhibits cheating behavior.
Step S102: According to the recognition result, correct the position of the text that exhibits cheating behavior in the sorting queue.
An embodiment of the present invention provides a method for identifying cheating text, in which the text to be checked is traversed with a moving sliding window. The sliding window moves as follows: the window length is increased step by step from its initial value, and the window capacity is recorded each time the window length is increased; when the window capacity reaches its maximum value, the window length is restored to its initial value and the sliding window is moved so that it contains only the most recently traversed word. This movement is repeated until, during traversal of the text or after the entire text has been traversed, the text is determined to exhibit cheating according to the relationship between the window length and a preset threshold. Here the window capacity is the number of distinct words contained in the window, and the window length is the total number of words in the window, that is, the distance between its left and right boundaries. In this case the threshold is set according to the maximum value of the window capacity.
After the entire text has been traversed, whether the text exhibits cheating can be determined from the relationship between the window length and a preset threshold as follows: the window length reached each time the window capacity reaches its maximum value is recorded; the largest of these window lengths is compared with the preset threshold, and if it exceeds the threshold, the text is determined to exhibit cheating. During traversal of the text, the determination can instead be made as follows: each recorded window length is compared with the preset threshold, and if it exceeds the threshold, the text is determined to exhibit cheating. In both cases the threshold is proportional to the maximum value of the window capacity: the larger the maximum window capacity, the longer the corresponding window length should be for a text that does not exhibit cheating.
FIG. 2 is a flowchart of a method for identifying cheating text in an embodiment of the present invention. In this embodiment, the method may be called a water-post recognition algorithm. The algorithm traverses the entire text from left to right with a sliding window whose maximum capacity is fixed and whose length is variable, and records the maximum length that the window has reached. The larger the maximum window length of a text, the more likely the text is a low-quality article that exhibits text cheating.
In this algorithm, let the capacity of the sliding window be C, with its maximum value set to Cmax; an ascending array of size C' = C + 1 stores the distinct words in the sliding window and is referred to as the "window vocabulary"; and let the length of the sliding window be L.
The first word is read from the text, and the following steps are then performed:
步骤 S201 : 记录容量 C=l , 长度 L=l , 此时窗口词表中仅包括此时滑动 窗口内的词。  Step S201: The recording capacity C=l and the length L=l. At this time, only the words in the sliding window at this time are included in the window vocabulary.
步骤 S202: 判断是否成功读取到下一个词: 若是, 则执行 S203; 若否, 则转步骤 S210。  Step S202: Determine whether the next word is successfully read: If yes, execute S203; if no, go to step S210.
步骤 S203:滑动窗口的右边界右移,将读取到的新词包含在滑动窗口内。 步骤 S204:判断该词是否已存在于窗口词表中:若是,则执行步骤 S205; 若否, 则执行步骤 S206。  Step S203: The right border of the sliding window is shifted to the right, and the read new word is included in the sliding window. Step S204: It is determined whether the word already exists in the window vocabulary: if yes, step S205 is performed; if not, step S206 is performed.
步骤 S205: 窗口词表及容量 C不变, 长度 L递增, 该步骤结束后转步 骤 S202继续读取。  Step S205: The window vocabulary and the capacity C are unchanged, and the length L is incremented. After the step ends, the process proceeds to step S202 to continue reading.
步骤 S206: 该词加入到窗口词表, 容量 C递增, 长度 L递增。  Step S206: The word is added to the window vocabulary, the capacity C is incremented, and the length L is incremented.
步骤 S207: 判断窗口容量 C是否超过设定的最大值 Cmax: 若是, 则执 行步骤 S208; 若否, 则转步骤 S202继续读取。  Step S207: Determine whether the window capacity C exceeds the set maximum value Cmax: If yes, execute step S208; if no, proceed to step S202 to continue reading.
步骤 S208: 窗口的左边界右移, 窗口缩短至只包含最新读取的词。  Step S208: The left border of the window is shifted to the right, and the window is shortened to include only the newly read words.
步骤 S209: 判断该文本是否已遍历完毕: 若是, 则执行步骤 S210; 若 否, 则转步骤 S202继续读取。  Step S209: It is judged whether the text has been traversed: If yes, step S210 is performed; if no, then step S202 is continued to continue reading.
步骤 S210: 当文本遍历完毕时, 会记录一个或多个的长度 L, 根据记录 的最大长度, 判断该文本的重要性: 若最大长度 L大于设定的阈值, 则说明 该文本存在作弊现象, 否则表明该文本不存在作弊现象。  Step S210: When the text traversal is completed, one or more lengths L are recorded, and the importance of the text is determined according to the maximum length of the record: If the maximum length L is greater than the set threshold, the text is cheated. Otherwise it indicates that there is no cheating in the text.
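The Figure 2 procedure can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `max_window_length`, the use of a `set` as the window vocabulary, and the decision to also count the final, unfinished window are not stated in the text. The length recorded for a window is the length it had before the word that overflowed the capacity was read.

```python
def max_window_length(words, c_max):
    """Scan `words` with a window holding at most `c_max` distinct words.

    Returns the largest window length observed (0 for an empty text).
    """
    vocab = set()   # window vocabulary: distinct words in the current window
    length = 0      # current window length L
    longest = 0     # largest length recorded so far
    for word in words:
        if word not in vocab and len(vocab) == c_max:
            # The new word would push capacity past c_max (S207/S208):
            # record the finished window, then restart from this word alone.
            longest = max(longest, length)
            vocab = {word}
            length = 1
        else:
            # Repeated word (S205) or new word that still fits (S206).
            vocab.add(word)
            length += 1
    # Assumption: the last window is recorded even if it never overflowed.
    return max(longest, length)


words = "a b a b a b c d".split()
print(max_window_length(words, c_max=2))  # 6: "a b a b a b" holds only two distinct words
```

A text would then be flagged when the returned value exceeds a threshold chosen in proportion to `c_max`.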
Figure 3 is a flowchart of a method for identifying cheating text in an embodiment of the present invention. As shown in Figure 3, the method includes:
The first word is read from the text, and the following steps are performed:
Step S301: Record capacity C = 1 and length L = 1. At this point the window vocabulary contains only the word currently in the sliding window.
Step S302: Read the next word.
Step S303: Move the right boundary of the sliding window to the right so that the window includes the newly read word.
Step S304: Determine whether the word already exists in the window vocabulary. If so, perform step S305; if not, perform step S306.
Step S305: The window vocabulary and capacity C remain unchanged, and length L is incremented. After this step, go to step S307.
Step S306: Add the word to the window vocabulary; capacity C and length L are both incremented. After this step, go to step S307.
Step S307: Determine whether length L exceeds the threshold. If so, go to step S311; otherwise go to step S308.
Step S308: Determine whether window capacity C exceeds the set maximum value Cmax. If so, perform step S309; if not, go to step S310.
Step S309: Move the left boundary of the window to the right, shortening the window so that it contains only the most recently read word.
Step S310: Determine whether the text has been fully traversed. If so, perform step S312; if not, return to step S302 to continue reading.
Step S311: Determine that the text exhibits cheating.
Step S312: Determine that the text does not exhibit cheating.
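The early-exit variant of Figure 3 might look like the sketch below, which follows the flowchart order literally: the length check (S307) runs before the capacity check (S308), so the word that overflows the capacity is still counted in L at the moment of the comparison. The function name and the `set`-based vocabulary are illustrative assumptions.

```python
def has_cheating_early_exit(words, c_max, threshold):
    """Return True as soon as the window length exceeds `threshold`."""
    it = iter(words)
    first = next(it, None)
    if first is None:
        return False            # empty text: nothing to flag
    vocab = {first}             # S301: C = 1, L = 1
    length = 1
    for word in it:             # S302: read the next word
        length += 1             # S303, S305/S306: window grows by one word
        vocab.add(word)         # S304/S306: no-op if the word is already present
        if length > threshold:  # S307
            return True         # S311: cheating detected, stop scanning
        if len(vocab) > c_max:  # S308
            vocab = {word}      # S309: keep only the newest word
            length = 1
    return False                # S312: traversal finished without a hit


print(has_cheating_early_exit("a b a b a b c d".split(), c_max=2, threshold=5))  # True
print(has_cheating_early_exit("a b c d e f".split(), c_max=2, threshold=5))      # False
```

Stopping at the first hit avoids scanning the rest of a long post once the verdict is known.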
In the embodiments shown in Figures 2 and 3, the threshold is closely tied to the maximum value of window capacity C: the larger the maximum value of C, the larger the threshold may be set; conversely, the smaller the maximum value of C, the smaller the threshold should be.
An embodiment of the present invention further provides a method for identifying cheating text in which a moving sliding window traverses the text to be examined. The sliding window moves as follows: the window length is increased step by step from an initial value, and the window capacity is recorded each time the window length is increased; when the window length reaches its maximum value, the window length is restored to the initial value and the sliding window is moved so that it contains only the most recently traversed word. This movement is repeated until, either during the traversal or after the entire text has been traversed, the text is determined to exhibit cheating according to the relationship between the window capacity and a preset threshold. Here the threshold is set according to the maximum window length.
After the entire text has been traversed, whether the text exhibits cheating may be determined from the relationship between the window capacity and a preset threshold as follows: record the window capacity each time the window length reaches its maximum value; compare the smallest recorded window capacity with the preset threshold; if it is below the threshold, the text is determined to exhibit cheating. During the traversal, the determination may instead be made as follows: compare each recorded window capacity with the preset threshold, and if it is below the threshold, determine that the text exhibits cheating. In both cases the threshold is proportional to the maximum window length: although a smaller window capacity indicates a higher probability that the text exhibits cheating, a larger maximum window length allows the corresponding window capacity to be larger as well.
Figure 4 is a flowchart of a method for identifying cheating text in an embodiment of the present invention. In this embodiment, a sliding window traverses the entire text from left to right, and a maximum length is set for the window, which the window may not exceed. For a fixed window length, the smaller the window capacity, the more likely the text exhibits cheating.
In this algorithm, let the capacity of the sliding window be C; store the distinct words in the sliding window in an array that grows by one entry at a time (C' = C + 1), referred to as the "window vocabulary"; and let the length of the sliding window be L, with its maximum value set to Lmax.
The first word is read from the text, and the following steps are performed:
Step S401: Record capacity C = 1 and length L = 1. At this point the window vocabulary contains only the word currently in the sliding window.
Step S402: Determine whether the next word has been read successfully. If so, perform step S403; if not, go to step S410.
Step S403: Move the right boundary of the sliding window to the right so that the window includes the newly read word.
Step S404: Determine whether the word already exists in the window vocabulary. If so, perform step S405; if not, perform step S406.
Step S405: The window vocabulary and capacity C remain unchanged, and length L is incremented. After this step, return to step S402 to continue reading.
Step S406: Add the word to the window vocabulary; capacity C and length L are both incremented.
Step S407: Determine whether window length L exceeds the set maximum value Lmax. If so, perform step S408; if not, return to step S402 to continue reading.
Step S408: Move the left boundary of the window to the right, shortening the window so that it contains only the most recently read word.
Step S409: Determine whether the text has been fully traversed. If so, perform step S410; if not, return to step S402 to continue reading.
Step S410: When the traversal is complete, one or more window capacities C have been recorded, and the importance of the text is judged from the smallest recorded capacity: if the minimum capacity C is less than the set threshold, the text exhibits cheating; otherwise it does not.
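Because the Figure 4 window restarts from the newest word whenever its length would exceed Lmax, the recorded windows are in effect consecutive, non-overlapping runs of Lmax words. The sketch below relies on that reading. It is an illustration only: the function name is hypothetical, `words` is assumed to be a list, and a trailing run shorter than `l_max` is ignored on the assumption that capacity is recorded only when the length actually reaches the maximum.

```python
def min_window_capacity(words, l_max):
    """Smallest number of distinct words in any full window of `l_max` words.

    Returns None when the text is shorter than one full window.
    """
    capacities = []
    # Step through the text one full window at a time.
    for start in range(0, len(words) - l_max + 1, l_max):
        window = words[start:start + l_max]
        capacities.append(len(set(window)))  # window capacity C
    return min(capacities) if capacities else None


words = "buy now buy now cheap shoes for sale".split()
print(min_window_capacity(words, l_max=4))  # 2: the window "buy now buy now"
```

The text would be flagged when the returned capacity falls below a threshold chosen in proportion to `l_max`.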
Figure 5 is a flowchart of a method for identifying cheating text in an embodiment of the present invention. As shown in Figure 5, the method includes:
The first word is read from the text, and the following steps are performed:
Step S501: Record capacity C = 1 and length L = 1. At this point the window vocabulary contains only the word currently in the sliding window.
Step S502: Read the next word.
Step S503: Move the right boundary of the sliding window to the right so that the window includes the newly read word.
Step S504: Determine whether the word already exists in the window vocabulary. If so, perform step S505; if not, perform step S506.
Step S505: The window vocabulary and capacity C remain unchanged, and length L is incremented. After this step, go to step S507.
Step S506: Add the word to the window vocabulary; capacity C and length L are both incremented. After this step, go to step S507.
Step S507: Determine whether capacity C is less than the threshold. If so, go to step S511; otherwise go to step S508.
Step S508: Determine whether window length L exceeds the set maximum value Lmax. If so, perform step S509; if not, go to step S510.
Step S509: Move the left boundary of the window to the right, shortening the window so that it contains only the most recently read word.
Step S510: Determine whether the text has been fully traversed. If so, perform step S512; if not, return to step S502 to continue reading.
Step S511: Determine that the text exhibits cheating.
Step S512: Determine that the text does not exhibit cheating.
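A possible early-exit sketch for Figure 5 is shown below. It departs from the flowchart in one respect, which is an assumption of this example: read literally, step S507 compares C with the threshold after every word, which would flag almost any text within its first few words whenever the threshold exceeds 2. The sketch therefore compares the capacity only when the window has reached `l_max` words, in line with the earlier statement that capacity is recorded when the window length reaches its maximum. The function name is hypothetical.

```python
def has_low_capacity_window(words, l_max, threshold):
    """Return True as soon as a full window holds fewer than `threshold` distinct words."""
    vocab = set()   # window vocabulary
    length = 0      # current window length L
    for word in words:
        vocab.add(word)
        length += 1
        if length == l_max:
            # Assumption: judge the window only once it is full.
            if len(vocab) < threshold:
                return True
            vocab.clear()   # start a fresh window with the next word
            length = 0
    return False


spam = "buy now buy now cheap shoes for sale".split()
print(has_low_capacity_window(spam, l_max=4, threshold=3))  # True
```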
In the embodiments shown in Figures 4 and 5, the threshold is closely tied to the maximum value of window length L: the larger the maximum value of L, the larger the threshold may be set; conversely, the smaller the maximum value of L, the smaller the threshold should be.
In the above four embodiments the text is traversed from beginning to end, so as the window length increases from the initial value the right boundary of the window moves step by step to the right, and when the window length is restored to the initial value the left boundary moves to the right. The text may equally be traversed from end to beginning, in which case the left boundary moves step by step to the left as the window length increases, and the right boundary moves to the left when the window length is restored to the initial value. The text may also be traversed in other orders, but the basic principle remains the same.
After text exhibiting cheating has been identified by the above methods, the position of that text in the queue may be corrected according to the recognition result in the following ways.
According to the recognition result, the same treatment is applied uniformly to all text in the queue that exhibits cheating, for example moving every such text back two positions in the queue. Alternatively, for a queue sorted by certain parameters, the sorting parameter of every cheating text may be adjusted by a fixed margin. For example, in retrieval ranking, texts are usually sorted by the relevance weight between the search term and the text; in that case the relevance weights of all cheating texts may be uniformly reduced by 60%.
Alternatively, the relationship between the two parameters of the text cheating recognition algorithm, window capacity and window length, is used to assess more precisely the degree of cheating in different texts, and texts with different degrees of cheating are treated differently, with stricter treatment for texts that cheat more heavily. For example, a heavily cheating text is moved back more positions in the queue, or its sorting parameter is adjusted by a larger margin.
For example, to decide which of two texts cheats more heavily, the window capacity and window length of the sliding window at the moment each text is determined to exhibit cheating may be recorded. If the two recorded window capacities are equal, the text whose window length is larger cheats more heavily. If the two recorded window lengths are equal, the text whose window capacity is smaller cheats more heavily. Most generally, the ratio of window capacity to window length may be computed for each text; the text with the smaller ratio cheats more heavily.
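The capacity-to-length comparison described above can be illustrated with a short sketch. The function names, the post identifiers, and the dictionary layout are hypothetical; only the ratio itself comes from the text.

```python
def cheating_ratio(capacity, length):
    """Window capacity divided by window length; a smaller ratio means heavier cheating."""
    return capacity / length


def rank_by_severity(observations):
    """Order text ids from heaviest to lightest cheating.

    `observations` maps a text id to the (capacity, length) pair recorded
    when that text was flagged.
    """
    return sorted(observations, key=lambda text_id: cheating_ratio(*observations[text_id]))


flagged = {
    "post_a": (3, 30),  # ratio 0.10
    "post_b": (3, 12),  # ratio 0.25, same capacity as post_a but a shorter window
    "post_c": (5, 30),  # ratio ~0.17, same length as post_a but a larger capacity
}
print(rank_by_severity(flagged))  # ['post_a', 'post_c', 'post_b']
```

Both special cases in the text fall out of the ratio: with equal capacities the longer window ranks as heavier cheating, and with equal lengths the smaller capacity does.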
Figure 6 is a structural diagram of a text sorting apparatus in an embodiment of the present invention. As shown in Figure 6, the apparatus includes:
a text recognition module 601, configured to identify text exhibiting cheating behavior;
a sorting correction module 602, configured to correct, according to the recognition result, the position of the text exhibiting cheating behavior in the sorted queue.
Figure 7 is a structural diagram of a text cheating recognition apparatus in an embodiment of the present invention. As shown in Figure 7, the apparatus includes a window length control unit 701, a window length recording unit 702, a window capacity recording unit 703, and a threshold comparison unit 704.
The window length control unit 701 is configured to control the movement of the sliding window over the text and the change of the window length: while the window capacity recorded by the window capacity recording unit 703 is less than the maximum value, it increases the window length step by step; when the recorded window capacity reaches the maximum value, it restores the window length to the initial value and moves the sliding window so that it contains only the most recently traversed word; and when the threshold comparison unit 704 determines that the text exhibits cheating, it stops moving the sliding window over the text.
The window length recording unit 702 is configured to record the window length of the sliding window each time the window length is increased.
The window capacity recording unit 703 is configured to record the window capacity of the sliding window each time the window length is increased, to notify the window length control unit 701 of the window capacity value, and to restart counting from the initial value.
The threshold comparison unit 704 is configured to determine that the text exhibits cheating according to the relationship between the window length and a preset threshold, where the preset threshold is set according to the maximum window capacity.
Alternatively, the four units of the apparatus may function as follows:
The window length control unit 701 is configured to control the movement of the sliding window over the text and the change of the window length: while the window length recorded by the window length recording unit 702 is less than the maximum value, it increases the window length step by step; when the recorded window length reaches the maximum value, it restores the window length to the initial value and moves the sliding window so that it contains only the most recently traversed word; and when the threshold comparison unit 704 determines that the text exhibits cheating, it stops moving the sliding window over the text.
The window length recording unit 702 is configured to record the window length of the sliding window each time the window length is increased, and to notify the window length control unit 701 and the window capacity recording unit 703 of the window length value.
The window capacity recording unit 703 is configured to record the window capacity of the sliding window each time the window length is increased, and to restart counting from the initial value when the window length recorded by the window length recording unit 702 reaches the maximum value.
The threshold comparison unit 704 is configured to determine that the text exhibits cheating according to the relationship between the window capacity and a preset threshold, where the preset threshold is set according to the maximum window length.
It should be noted that although Figure 6 does not further describe the internal structure of the text recognition module 601, the structure described for Figure 7 can be regarded as the internal structure of the text recognition module 601 in Figure 6.
The following embodiments describe the method and apparatus of the embodiments of the present invention using blog posts as the text and retrieval ranking as the sorting. In practice, the text may also be web page text or any other text that needs to be sorted, and the sorting scenario is not limited to retrieval ranking.
Because the retrieval ranking of blog posts is based on an established index, and the index is built by computing text relevance weights between search terms and blog posts, the embodiments of the present invention identify cheating text during the computation of the text relevance weights and reduce its weight. A more accurate index can therefore be built, which improves the objectivity and accuracy of ranking based on that index and safeguards the quality of the user's text retrieval.
Figure 8 is a structural diagram of a blog post retrieval and ranking system in an embodiment of the present invention. As shown in Figure 8, the system includes a blog system 100, an indexer 200, a retriever 300, an agent 400, and a client 500. It should be noted that the connections between devices in all figures of the present invention are drawn to explain the information exchange and control process clearly; they should therefore be regarded as logical connections and not limited to physical ones. Specifically:
The blog system 100 provides blog-related services to users, including storing and managing blog posts, and in the present invention supplies relevance factors to the indexer 200. These include text relevance factors (for example, the category, title, body, nickname, and space name of the text) and numerical relevance factors (for example, an activity factor, a repost rate factor, a reply rate factor, and a publication time factor). The core of the blog system 100 may be a web server, but the present invention does not limit its specific form.
The indexer 200 builds an index from the data in the blog system 100, and the retriever 300 ranks the retrieved blog posts based on that index.
The retriever 300 performs queries according to the search terms entered by the user and ranks the blog posts.
The agent 400 receives the search string sent by the client 500, splits the search string into search terms, sends them to the retriever 300, and forwards the results retrieved and ranked by the retriever 300 to the client 500.
The client 500 receives a search term or search string entered by the user. If the user enters a search term, the client may send it directly to the retriever 300 and, after receiving the ranked blog posts returned by the retriever 300, render and display the ranking in the user interface. If the user enters a search string, the client must send it to the agent 400 for splitting and, after receiving the ranked blog posts returned by the agent 400, render and display the ranking in the user interface.
The client 500 may typically be any terminal device capable of accessing the Internet, such as a personal computer (PC), a personal digital assistant (PDA), or a mobile phone (MP); the scope of protection of the present invention should therefore not be limited to any particular type of client.
Figure 9 is a structural diagram of the indexer in the blog post retrieval and ranking system in an embodiment of the present invention. As shown in Figure 9, the indexer 200 includes a numerical relevance determination unit 201, a text relevance determination unit 202, a text cheating recognition unit 203, a superposition calculation unit 204, and an index construction unit 205.
The numerical relevance determination unit 201 computes the numerical relevance weight between the search term and each blog post from the numerical relevance factors extracted from the blog system. The text relevance determination unit 202 computes the text relevance weight between the search term and each blog post from the text relevance factors extracted from the blog system. While the text relevance determination unit 202 is computing the text relevance weight between a search term and a blog post, the text cheating recognition unit 203 identifies blog posts that exhibit text cheating and applies weight reduction to the text relevance weights computed by the text relevance determination unit 202. The superposition calculation unit 204 superimposes the numerical relevance weight and the text relevance weight to obtain the comprehensive relevance weight of the search term, and passes it to the index construction unit 205. The index construction unit 205 builds the index from the comprehensive relevance weight.
In another embodiment of the present invention, the indexer 200 includes a text relevance determination unit 202, a text cheating recognition unit 203, and an index construction unit 205. The text relevance determination unit 202 computes the text relevance weight between the search term and each blog post from the text relevance factors extracted from the blog system. The text cheating recognition unit 203 identifies blog posts that exhibit text cheating and applies weight reduction to their text relevance weights. The index construction unit 205 builds the index between the search term and each blog post from the resulting text relevance weights.
Although the above embodiment is workable, the index it produces is less accurate because the index is built from text relevance factors alone. By comparison, the indexer 200 shown in Figure 9 yields a more accurate index.
Here it can be understood that the text cheating recognition unit 203 includes the text sorting apparatus shown in Figure 6.
Figure 10 is a structural diagram of the retriever in the blog post retrieval and ranking system in an embodiment of the present invention. As shown in Figure 10, the retriever 300 includes a query unit 301, a composite relevance calculation unit 302, and a sorting unit 303. In this embodiment, the user initially enters a search string containing several search terms; the agent splits it into search terms and passes them to the retriever 300, which processes them on receipt. The query unit 301 looks up, in the index built by the indexer, the relevance weight (text relevance weight or comprehensive relevance weight) between each search term and each blog post, and passes it on. The composite relevance calculation unit 302 computes the composite relevance weight between the search string and each blog post from the relevance weights of the individual search terms, and passes it to the sorting unit 303. The sorting unit 303 ranks the blog posts related to the search string according to the composite relevance weight.
Another embodiment of the present invention provides a retriever 300 that can connect to and communicate with the client directly, which suits the case where the user enters a search term rather than a search string. In this case the retriever 300 includes a query unit 301 and a sorting unit 303. The query unit 301 looks up, in the index built by the indexer 200, the relevance weight (text relevance weight or comprehensive relevance weight) between the search term entered by the user and each blog post, and passes it to the sorting unit 303. The sorting unit 303 ranks the blog posts related to the search term by the size of the received relevance weights. It should be noted that because most users today enter search strings containing several search terms, the structure of the retriever 300 shown in Figure 10 is more widely applicable and more typical.
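The retriever's query, composite-weight, and ranking stages might be sketched as follows. The text does not say how the composite relevance weight is derived from the per-term weights, so the plain sum used here is purely an assumption, as are the function name and the index layout (a dictionary from search term to a list of `(post_id, weight)` pairs).

```python
def rank_posts(index, terms):
    """Rank post ids by composite relevance weight for a list of search terms."""
    composite = {}
    for term in terms:
        # Query stage: look up the stored weights for this term.
        for post_id, weight in index.get(term, []):
            # Composite stage (assumed): add up the per-term weights.
            composite[post_id] = composite.get(post_id, 0.0) + weight
    # Sorting stage: highest composite weight first.
    return sorted(composite, key=composite.get, reverse=True)


index = {
    "shoes": [("p1", 40.0), ("p2", 10.0)],
    "cheap": [("p2", 50.0), ("p3", 5.0)],
}
print(rank_posts(index, ["cheap", "shoes"]))  # ['p2', 'p1', 'p3']
```

With a single search term the composite stage reduces to the stored weight, which corresponds to the simpler retriever that has only a query unit and a sorting unit.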
Figure 11 is a flowchart of a method for building the index in blog post retrieval ranking in an embodiment of the present invention. As shown in Figure 11, the method includes the following steps:
Step S1101: Extract the relevance factors from the blog system and format the data. Formatting here includes normalizing some relevance factors and applying an algorithmic transformation to others, such as taking a logarithm, so that the values of most relevance factors are mapped into a fixed interval, for example [0, 100]. Some relevance factors keep their original values.
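One way the logarithmic formatting in step S1101 could be realized is sketched below. The formula, the function name, and the `max_value` parameter are assumptions for illustration; the text only says that a logarithm may be used to map values into a fixed interval such as [0, 100].

```python
import math


def log_scale(value, max_value, upper=100.0):
    """Map a non-negative raw factor into [0, upper] on a logarithmic scale.

    `max_value` is the largest raw value expected for this factor (assumed known).
    """
    if value <= 0:
        return 0.0
    scaled = upper * math.log1p(value) / math.log1p(max_value)
    return min(upper, scaled)  # clamp values that exceed max_value


# Example: a reply count of 10 when the busiest post has 1000 replies.
print(round(log_scale(10, 1000), 1))
```

The logarithm compresses large counts, so a post with ten times the replies of another does not receive ten times the factor value.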
本发明中所称的相关性因子, 可以只包括文本相关性因子, 也可以是文 本相关性因子以及数值相关性因子。 这些相关性因子在索引器构建索引时, 将作为相关性权值计算时的输入参数。  The correlation factor referred to in the present invention may include only a text correlation factor, a text correlation factor, and a numerical correlation factor. These correlation factors are used as input parameters when the indexer builds the index as a correlation weight calculation.
Step S1102: Calculate the relevance weight between the search term and each blog post, and at the same time identify blog posts that exhibit text cheating and down-weight them.
In one embodiment, only the text relevance factors are considered: the text relevance weight of the search term is calculated from the text relevance factors, blog posts with text cheating are identified, and the text relevance weight between the search term and each such post is then down-weighted appropriately.
In another embodiment, the indexer considers the numerical relevance factors as well as the text relevance factors. It calculates the text relevance weight and the numerical relevance weight separately, identifies blog posts with text cheating, appropriately down-weights the text relevance weight between the search term and each such post, and finally superimposes the text relevance weight and the numerical relevance weight to obtain the comprehensive relevance weight. The previous embodiment thus down-weights only the text relevance weight. Because the present embodiment also takes the numerical relevance factors into account, it further improves the accuracy of the data.
Step S1103: Build the index between the search terms and the blog posts from the down-weighted relevance weights. The index records each search term, the blog posts corresponding to that term, and the relevance weight between the term and each post. When a user searches for a term, the blog posts found can therefore be sorted using the data in the index, so that the user quickly finds the most relevant posts.
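One possible in-memory shape for such an index is sketched below. The record layout `(term, post_id, weight)`, the function name `build_index`, and the sample values are all assumptions made for illustration; the text does not specify a storage format.

```python
from collections import defaultdict


def build_index(records):
    """Group (term, post_id, weight) records into {term: {post_id: weight}}."""
    index = defaultdict(dict)
    for term, post_id, weight in records:
        index[term][post_id] = weight
    return dict(index)


# Hypothetical, already down-weighted relevance weights.
records = [
    ("camera", "post-1", 0.42),
    ("camera", "post-2", 0.91),
    ("lens", "post-2", 0.30),
]
index = build_index(records)
```

Keying by term first means a query for one term touches only the posts that term hits.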
Fig. 12 is a flowchart of a method for building the index used in blog post retrieval and sorting according to an embodiment of the present invention. As shown in Fig. 12, the flow specifically includes:
Step S1201: Extract the relevance factors from the blog system and format the data. Here the relevance factors include both text relevance factors and numerical relevance factors.
Step S1202: The indexer calculates the numerical relevance weight between the search term and each blog post.
In one embodiment, there are four numerical relevance factors: the activity factor, the reprint rate factor $W_{du}$, the reply rate factor, and the publication time factor. The activity factor is calculated by the blog system and takes values in [0, 100]. It combines factors such as how often users log in to the blog's personal space and how often blog posts are published, and is an overall measure of how active the personal space is; the higher the activity, the higher the post ranks. The reprint rate factor is calculated from the number of duplicates of the blog post found by the deduplication system and takes values in [0, 100]; the higher the reprint rate, the higher the post ranks. The reply rate factor is calculated from the number of replies to the blog post and takes values in [0, 100]; the higher the reply rate factor, the higher the post ranks. The publication time factor is the time at which the blog post was published and can be expressed as UNIX time; more recently published posts rank higher. The numerical relevance weight is obtained from all of the relevance factors listed above by linear calculation followed by normalization, takes values in the interval [0, 1], and is calculated as follows:
$$W_{NUM} = \frac{\sum_i \lambda_i \times W_i}{MAX\_VALUE} \qquad (1)$$
where $W_{NUM}$ is the numerical relevance weight, $W_i$ ranges over all of the relevance factors listed above, and $\lambda_i$ is the corresponding correction coefficient, used to increase or decrease the influence of that factor; suitable values of $\lambda_i$ can be determined while tuning the sorting results. MAX_VALUE is the maximum possible value of the numerical relevance weight, that is, of the weighted sum before normalization. It should be noted that the above formula is only an example and does not limit the scope of protection of the present invention; similar formulas may also be used.
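Formula (1) can be tried out with a short sketch. The factor names, coefficient values, and the choice to leave out the publication time factor (which is not confined to [0, 100]) are assumptions made only for this example.

```python
def numeric_relevance(factors, coefficients, max_value):
    """Weighted linear sum of the factors, divided by its largest possible value."""
    total = sum(coefficients[name] * value for name, value in factors.items())
    return total / max_value


# Hypothetical inputs: three factors already scaled to [0, 100].
factors = {"activity": 80.0, "reprint": 40.0, "reply": 20.0}
coefficients = {"activity": 0.5, "reprint": 0.3, "reply": 0.2}

# Largest possible weighted sum when every factor is at its upper bound of 100.
max_value = sum(c * 100.0 for c in coefficients.values())

weight = numeric_relevance(factors, coefficients, max_value)
```

With these values the weighted sum is 56 out of a possible 100, so the weight lands near 0.56 and stays inside [0, 1].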
Step S1203: The indexer calculates the text relevance weight between the search term and each blog post, identifies blog posts with text cheating, and down-weights them. In the embodiments of the present invention, the text relevance factors are the text fields that can be searched.
In one embodiment, there are five such text fields: category, title, body, nickname, and space name. Each field has a fixed weight W and a correction coefficient λ, as listed in Table 1:
Field name    Correction coefficient    Weight
Category      λ_CA                      W_CA
Title         λ_TI                      W_TI
Body          λ_CO                      W_CO
Nickname      λ_NI                      W_NI
Space name    λ_ZO                      W_ZO
Table 1
The text relevance weight is calculated as follows:
$$W_{TEXT} = \lambda_{CA} \times W_{CA} + \lambda_{TI} \times W_{TI} + \lambda_{CO} \times W_{CO} + \lambda_{NI} \times W_{NI} + \lambda_{ZO} \times W_{ZO} \qquad (2)$$
where the expression shown in Figure imgf000016_0001 equals 1. It should be noted that the above formula is only an example and does not limit the scope of protection of the present invention; similar formulas may also be used.
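Formula (2) can be exercised with a minimal sketch. The constraint accompanying the formula survives only as an image, so the check that the five correction coefficients sum to 1 is an assumption, as are the function name and all numeric values.

```python
# Field codes as used in formula (2):
# category, title, body, nickname, space name.
FIELDS = ("CA", "TI", "CO", "NI", "ZO")


def text_relevance(field_weights, coefficients):
    """Weighted sum of per-field weights, one correction coefficient per field."""
    # Assumed constraint: the correction coefficients sum to 1.
    if abs(sum(coefficients[f] for f in FIELDS) - 1.0) > 1e-9:
        raise ValueError("correction coefficients are assumed to sum to 1")
    return sum(coefficients[f] * field_weights[f] for f in FIELDS)


# Hypothetical values: the term matches the title fully and the body partly.
coefficients = {"CA": 0.1, "TI": 0.4, "CO": 0.3, "NI": 0.1, "ZO": 0.1}
field_weights = {"CA": 0.0, "TI": 1.0, "CO": 0.5, "NI": 0.0, "ZO": 0.0}
w_text = text_relevance(field_weights, coefficients)
```

Here the title contributes 0.4 and the body 0.15, giving a text relevance weight of about 0.55.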
After the text relevance weight has been obtained, blog posts with text cheating are identified as follows: traverse the blog post with a sliding window and record the maximum length the sliding window reaches; compare that maximum length with a threshold and, if it exceeds the threshold, judge the blog post to be text cheating; then down-weight the blog posts judged to be cheating appropriately, for example by scaling the text relevance weight to 60% of its previous value.
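The detection and down-weighting just described can be sketched as follows. The sketch assumes the variant in which the window restarts from the last word once it holds a fixed maximum number of distinct words, and a threshold proportional to that maximum; the function names and the values `max_capacity=4` and `ratio=3.0` are hypothetical, and the input is assumed to be already segmented into words.

```python
def longest_window(words, max_capacity):
    """Return the greatest window length reached while traversing words.

    The window grows one word at a time; once it holds max_capacity
    distinct words it restarts so that it contains only the word just read.
    """
    distinct = set()  # distinct words currently inside the window
    length = 0        # total number of words currently inside the window
    longest = 0
    for word in words:
        length += 1
        distinct.add(word)
        longest = max(longest, length)
        if len(distinct) >= max_capacity:
            distinct = {word}
            length = 1
    return longest


def is_text_cheating(words, max_capacity=4, ratio=3.0):
    """Flag a text whose window grows past ratio * max_capacity words."""
    return longest_window(words, max_capacity) > ratio * max_capacity


def demote(text_weight, cheating, factor=0.6):
    """Scale the text relevance weight of a cheating post (60% in the example)."""
    return text_weight * factor if cheating else text_weight


normal = "the quick brown fox jumps over the lazy dog".split()
stuffed = ["cheap", "camera"] * 20
```

A keyword-stuffed text repeats a few words, so its window keeps growing without ever collecting enough distinct words to restart; ordinary prose restarts the window after a handful of words.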
Step S1204: The indexer uses its superposition calculation unit to superimpose the numerical relevance weight and the text relevance weight, obtaining the comprehensive relevance weight. In one embodiment, the superposition formula is the one shown in Figure imgf000016_0002. It applies two correction coefficients, one to each of the two relevance weights being superimposed; their values can be adjusted flexibly, subject to the expression shown in Figure imgf000016_0003 being equal to 1. It should be noted that the above formula is only an example and does not limit the scope of protection of the present invention; similar formulas may also be used.
Step S1205: The indexer builds the index from the comprehensive relevance weights and stores it, so that it can be retrieved and applied when users search.
Fig. 13 is a flowchart of a blog post retrieval and sorting method according to an embodiment of the present invention. This embodiment covers the case where the user inputs a search term, and includes:
Step S1301: The retriever receives the search term input by the user at the client.
Step S1302: The retriever extracts, from the index already built by the indexer, the relevance weight between the search term and each blog post. This may be a text relevance weight, or a comprehensive relevance weight obtained by superimposing the text relevance weight and the numerical relevance weight.
Step S1303: The retriever sorts the blog posts found by their relevance weights and returns the sorted results to the client.
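Steps S1301 to S1303 amount to a lookup followed by a sort. The sketch below assumes the index is a nested mapping `{term: {post_id: weight}}`; that layout, the function name, and the sample data are illustrative assumptions.

```python
def search_term(index, term):
    """Return the post ids hit by term, highest relevance weight first."""
    postings = index.get(term, {})
    return sorted(postings, key=lambda post_id: postings[post_id], reverse=True)


# Hypothetical index contents.
index = {
    "camera": {"post-1": 0.42, "post-2": 0.91, "post-3": 0.33},
}
ranked = search_term(index, "camera")
```

Because the weights are computed at indexing time, the query path only reads and sorts them.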
Fig. 14 is a flowchart of a blog post retrieval and sorting method according to an embodiment of the present invention. This embodiment covers the case where the user inputs a search string, and specifically includes:
Step S1401: The agent segments the search string input by the user at the client into search terms and passes them to the retriever.
Step S1402: The retriever extracts, from the index built by the indexer, the relevance weight between each search term and the blog posts. This may be a text relevance weight, or a comprehensive relevance weight obtained by superimposing the text relevance weight and the numerical relevance weight.
Step S1403: The retriever calculates the composite relevance weight between the search string and each blog post.
In the present invention, the relevance between a search string input by the user and a blog post can be regarded as the combined result of the relevance between each individual search term and that blog post. In one embodiment, therefore, the composite relevance weight is calculated with a simple model that sums the per-term weights and takes their average. Let the search string be $Q = \{q_1, q_2, \ldots, q_n\}$, where n is the number of search terms obtained by segmenting the search string, and let d denote a blog post hit by a search term. The composite relevance weight between the search string Q and the blog post is then:
$$Weight(Q, d) = \frac{\sum_{i=1}^{n} Weight(q_i, d)}{n} \qquad (4)$$
It should be noted that the above formula is only an example and does not limit the scope of protection of the present invention; similar formulas may also be used.
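Formula (4) and the sort that follows it can be sketched as below. The index layout `{term: {post_id: weight}}` and the function names are assumptions, and so is the choice to let a term that does not hit a post contribute 0 to that post's average, which the text leaves open.

```python
def composite_relevance(index, terms, post_id):
    """Average of the per-term relevance weights for one post, as in formula (4)."""
    total = sum(index.get(term, {}).get(post_id, 0.0) for term in terms)
    return total / len(terms)


def search_string(index, terms):
    """Rank every post hit by at least one term by its composite weight."""
    candidates = set().union(*(index.get(term, {}) for term in terms))
    # Sort by descending weight; break ties by post id for a stable order.
    return sorted(
        candidates,
        key=lambda post_id: (-composite_relevance(index, terms, post_id), post_id),
    )


# Hypothetical index and an already segmented search string.
index = {
    "digital": {"p1": 0.8, "p2": 0.2},
    "camera": {"p1": 0.4, "p2": 0.9, "p3": 0.6},
}
terms = ["digital", "camera"]
ranking = search_string(index, terms)
```

With these values the averages are roughly 0.6, 0.55, and 0.3, so a post matching both terms moderately well outranks a post matching only one.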
Step S1404: The retriever sorts the blog posts found by their composite relevance weights and passes the sorted results to the agent.
Step S1405: The agent forwards the sorted results to the client, where they are displayed in the user interface.
The above are only preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims

1. A text sorting method, characterized by comprising:
identifying text that exhibits cheating;
correcting, according to the identification result, the position in the sorting queue of the text that exhibits cheating.
2. The method according to claim 1, characterized in that identifying text that exhibits cheating comprises:
traversing the text with a moving sliding window, wherein the sliding window moves as follows: the window length of the sliding window is increased step by step from its initial value, and the window capacity of the sliding window is recorded each time the window length is increased; when the window capacity reaches its maximum value, the window length of the sliding window is restored to its initial value and the sliding window is moved so that it contains only the last traversed word;
repeating this movement of the sliding window and, during the traversal of the text or after the entire text has been traversed, determining from the relationship between the window length and a preset threshold that the text exhibits cheating; wherein the window capacity is the number of distinct words contained in the sliding window;
the window length is the total number of words contained in the sliding window;
the preset threshold is set according to the maximum value of the window capacity.
3. The method according to claim 2, characterized in that determining, after the entire text has been traversed, from the relationship between the window length and a preset threshold that the text exhibits cheating comprises: recording the window length each time the window capacity reaches its maximum value;
comparing the maximum of the recorded window lengths with the preset threshold and, if it exceeds the threshold, determining that the text exhibits cheating, wherein the threshold is proportional to the maximum value of the window capacity.
4. The method according to claim 2, characterized in that determining, during the traversal of the text, from the relationship between the window length and a preset threshold that the text exhibits cheating comprises:
comparing each recorded window length with the preset threshold and, if it exceeds the threshold, determining that the text exhibits cheating, wherein the threshold is proportional to the maximum value of the window capacity.
5. The method according to claim 1, characterized in that identifying text that exhibits cheating comprises:
traversing the text with a moving sliding window, wherein the sliding window moves as follows: the window length of the sliding window is increased step by step from its initial value, and the window capacity of the sliding window is recorded each time the window length is increased; when the window length reaches its maximum value, the window length is restored to its initial value and the sliding window is moved so that it contains only the last traversed word;
repeating this movement of the sliding window and, during the traversal of the text or after the entire text has been traversed, determining from the relationship between the window capacity and a preset threshold that the text exhibits cheating; wherein the window capacity is the number of distinct words contained in the sliding window; the window length is the total number of words contained in the sliding window;
the preset threshold is set according to the maximum value of the window length.
6. The method according to claim 5, characterized in that determining, after the entire text has been traversed, from the relationship between the window capacity and a preset threshold that the text exhibits cheating comprises: recording the window capacity each time the window length reaches its maximum value;
comparing the minimum of the recorded window capacities with the preset threshold and, if it is below the threshold, determining that the text exhibits cheating, wherein the threshold is proportional to the maximum value of the window length.
7. The method according to claim 5, characterized in that determining, during the traversal of the text, from the relationship between the window capacity and a preset threshold that the text exhibits cheating comprises:
comparing each recorded window capacity with the preset threshold and, if it is below the threshold, determining that the text exhibits cheating, wherein the threshold is proportional to the maximum value of the window length.
8. The method according to any one of claims 2 to 7, characterized in that recording the window capacity of the sliding window comprises:
looking up in a window vocabulary the word last traversed by the sliding window; if the word is not in the window vocabulary, increasing the window capacity by 1 and adding the word to the window vocabulary; otherwise leaving the window capacity unchanged; wherein the window vocabulary stores the distinct words in the sliding window and, each time the window capacity is restored to its initial value, contains only the words in the current sliding window.
9. The method according to any one of claims 2 to 8, characterized in that the sliding window traverses the text from the beginning of the text to its end; increasing the window length of the sliding window step by step from its initial value means moving the right boundary of the sliding window to the right step by step; and restoring the window length to its initial value and moving the sliding window so that it contains only the last traversed word means moving the left boundary of the sliding window to the right until the sliding window contains only the last traversed word.
10. The method according to any one of claims 2 to 9, characterized in that the initial values of the window length and of the window capacity are both 1.
11. The method according to claim 1, characterized in that correcting, according to the identification result, the position in the sorting queue of the text that exhibits cheating comprises:
applying, according to the identification result, the same processing uniformly to all texts in the queue that exhibit cheating.
12. The method according to claim 11, characterized in that the positions in the queue of all texts that exhibit cheating are uniformly corrected by a set number of places; or
the sorting-basis parameters corresponding to all texts that exhibit cheating are corrected by the same magnitude.
13. The method according to any one of claims 2 to 9, characterized in that correcting, according to the identification result, the position in the sorting queue of the text that exhibits cheating comprises:
evaluating, according to the identification result, the degree of cheating of the text, and applying different processing to texts with different degrees of cheating.
14. The method according to claim 13, characterized in that applying different processing to texts with different degrees of cheating comprises:
moving a text with a more severe degree of cheating back by more positions in the queue, or
correcting the sorting-basis parameter corresponding to a text with a more severe degree of cheating by a larger magnitude.
15. The method according to claim 13, characterized in that evaluating the degree of cheating of the text comprises:
calculating the relationship between the window capacity and the window length during text cheating recognition, namely the ratio of the window capacity to the window length that the sliding windows of two texts have at that point.
16. The method according to claim 12 or 14, characterized in that the text sorting is the sorting of texts during retrieval, and the sorting-basis parameter is the relevance weight between the search string and the text.
17. A text sorting apparatus, in which text quality is the basis for text sorting, characterized by comprising:
a first module, configured to identify text that exhibits cheating;
a second module, configured to correct, according to the identification result of the first module, the position in the sorting queue of the text that exhibits cheating.
18. A text cheating recognition method, characterized by comprising:
traversing the text with a moving sliding window, wherein the sliding window moves as follows: the window length of the sliding window is increased step by step from its initial value, and the window capacity of the sliding window is recorded each time the window length is increased; when the window capacity reaches its maximum value, the window length of the sliding window is restored to its initial value and the sliding window is moved so that it contains only the last traversed word;
repeating this movement of the sliding window and, during the traversal of the text or after the entire text has been traversed, determining from the relationship between the window length and a preset threshold that the text exhibits cheating; wherein the window capacity is the number of distinct words contained in the sliding window;
the window length is the total number of words contained in the sliding window;
the preset threshold is set according to the maximum value of the window capacity.
19. The method according to claim 18, characterized in that determining, after the entire text has been traversed, from the relationship between the window length and a preset threshold that the text exhibits cheating comprises:
recording the window length each time the window capacity reaches its maximum value;
comparing the maximum of the recorded window lengths with the preset threshold and, if it exceeds the threshold, determining that the text exhibits cheating, wherein the threshold is proportional to the maximum value of the window capacity.
20. The method according to claim 18, characterized in that determining, during the traversal of the text, from the relationship between the window length and a preset threshold that the text exhibits cheating comprises: comparing each recorded window length with the preset threshold and, if it exceeds the threshold, determining that the text exhibits cheating, wherein the threshold is proportional to the maximum value of the window capacity.
21. The method according to any one of claims 18 to 20, characterized in that recording the window capacity of the sliding window comprises:
looking up in a window vocabulary the word last traversed by the sliding window; if the word is not in the window vocabulary, increasing the window capacity by 1 and adding the word to the window vocabulary; otherwise leaving the window capacity unchanged; wherein the window vocabulary stores the distinct words in the sliding window and, each time the window capacity is restored to its initial value, contains only the words in the current sliding window.
22. The method according to any one of claims 18 to 21, characterized in that the sliding window traverses the text from the beginning of the text to its end; increasing the window length of the sliding window step by step from its initial value means moving the right boundary of the sliding window to the right step by step; and restoring the window length to its initial value and moving the sliding window so that it contains only the last traversed word means moving the left boundary of the sliding window to the right until the sliding window contains only the last traversed word.
23. The method according to any one of claims 18 to 22, characterized in that the initial values of the window length and of the window capacity are both 1.
24. A text cheating recognition device, characterized by comprising:
a first unit, configured to control the movement of the sliding window over the text and the change of the window length: while the window capacity recorded by the third unit is below its maximum value, to increase the window length step by step; when the window capacity recorded by the third unit reaches its maximum value, to restore the window length of the sliding window to its initial value and move the sliding window so that it contains only the last traversed word; and, when the fourth unit determines that the text exhibits cheating, to stop the movement of the sliding window over the text;
a second unit, configured to record the window length of the sliding window each time the window length is increased; a third unit, configured to record the window capacity of the sliding window each time the window length is increased, to notify the first unit of the value of the window capacity, and to restart counting from the initial value;
a fourth unit, configured to determine from the relationship between the window length and a preset threshold that the text exhibits cheating, wherein the preset threshold is set according to the maximum value of the window capacity.
25. A text cheating recognition method, characterized by comprising:
traversing the text with a moving sliding window, wherein the sliding window moves as follows: the window length of the sliding window is increased step by step from its initial value, and the window capacity of the sliding window is recorded each time the window length is increased; when the window length reaches its maximum value, the window length is restored to its initial value and the sliding window is moved so that it contains only the last traversed word;
repeating this movement of the sliding window and, during the traversal of the text or after the entire text has been traversed, determining from the relationship between the window capacity and a preset threshold that the text exhibits cheating; wherein the window capacity is the number of distinct words contained in the sliding window;
the window length is the total number of words contained in the sliding window;
the preset threshold is set according to the maximum value of the window length.
26. The method according to claim 25, characterized in that determining, after the entire text has been traversed, from the relationship between the window capacity and a preset threshold that the text exhibits cheating comprises:
recording the window capacity each time the window length reaches its maximum value;
comparing the minimum of the recorded window capacities with the preset threshold and, if it is below the threshold, determining that the text exhibits cheating, wherein the threshold is proportional to the maximum value of the window length.
27. The method according to claim 25, characterized in that determining, during the traversal of the text, from the relationship between the window capacity and a preset threshold that the text exhibits cheating comprises: comparing each recorded window capacity with the preset threshold and, if it is below the threshold, determining that the text exhibits cheating, wherein the threshold is proportional to the maximum value of the window length.
28. The method according to any one of claims 25 to 27, characterized in that recording the window capacity of the sliding window comprises:
looking up in a window vocabulary the word last traversed by the sliding window; if the word is not in the window vocabulary, increasing the window capacity by 1 and adding the word to the window vocabulary; otherwise leaving the window capacity unchanged; wherein the window vocabulary stores the distinct words in the sliding window and, each time the window capacity is restored to its initial value, contains only the words in the current sliding window.
29. The method according to any one of claims 25 to 28, wherein the sliding window traverses the text in order from the beginning of the text to its end; gradually increasing the window length of the sliding window from the initial value comprises: moving the right boundary of the sliding window rightward step by step; and restoring the window length to the initial value and moving the sliding window so that it contains only the most recently traversed word comprises: moving the left boundary of the sliding window rightward until the sliding window contains only the most recently traversed word.
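For illustration only (not part of the claims), the boundary movement described in claim 29 could be sketched in Python as follows. The generator name `window_positions`, the use of inclusive word indices, and treating the reset as a separate step are assumptions made for this sketch.

```python
from typing import Iterator, Tuple


def window_positions(num_words: int, max_length: int) -> Iterator[Tuple[int, int]]:
    """Yield (left, right) inclusive word indices of the sliding window."""
    if max_length < 2:
        # With a maximum of 1 the window could never grow, so reject it.
        raise ValueError("max_length must be at least 2")
    if num_words == 0:
        return
    left = right = 0  # initial window length is 1
    yield left, right
    while right < num_words - 1:
        if right - left + 1 >= max_length:
            # Maximum reached: move the left boundary right until the window
            # contains only the most recently traversed word.
            left = right
        else:
            # Otherwise grow the window by moving the right boundary right.
            right += 1
        yield left, right


print(list(window_positions(5, 3)))
# [(0, 0), (0, 1), (0, 2), (2, 2), (2, 3), (2, 4)]
```

In this sketch the word at which the window resets belongs to both the old and the new window, which follows from the window being moved so that it contains only the last traversed word.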
30. The method according to any one of claims 25 to 29, wherein the initial values of the window length and of the window capacity are both 1.
31. A text cheating device, comprising:
a first unit, configured to control the movement of a sliding window over the text and the change of the window length; to control the window length to increase gradually while the window length recorded by a second unit is less than a maximum value; to restore the window length of the sliding window to an initial value and move the sliding window so that it contains only the most recently traversed word when the window length recorded by the second unit reaches the maximum value; and to stop the movement of the sliding window over the text when a fourth unit determines that the text exhibits cheating;
the second unit, configured to record the window length of the sliding window each time the window length is increased, and to notify the first unit and a third unit of the value of the window length;
the third unit, configured to record the window capacity of the sliding window each time the window length is increased, and to restart counting from the initial value when the window length recorded by the second unit reaches the maximum value; and
the fourth unit, configured to determine that the text exhibits cheating according to the relationship between the window length and a preset threshold, wherein the preset threshold is set according to the maximum value of the window length.
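For illustration only (not part of the claims), the cooperation of the four units could be sketched as a single Python function. Several points are assumptions: the function name `detect_cheating`, the default `max_length` and `ratio`, and above all the fourth unit's test, which here compares the window capacity with the threshold in the manner of claim 27, whereas claim 31 as written refers to the window length.

```python
from typing import List


def detect_cheating(words: List[str], max_length: int = 5, ratio: float = 0.6) -> bool:
    """Return True as soon as a window's capacity exceeds the threshold."""
    if max_length < 2:
        raise ValueError("max_length must be at least 2")
    if not words:
        return False
    threshold = ratio * max_length  # assumed: proportional to the maximum length
    length = 1                      # second unit: current window length
    capacity = 1                    # third unit: current window capacity
    vocabulary = {words[0]}         # distinct words in the current window
    for word in words[1:]:
        # First unit: move the right boundary, so the window grows by one word.
        length += 1
        # Third unit: a repeated word raises the capacity, a new word is stored.
        if word in vocabulary:
            capacity += 1
        else:
            vocabulary.add(word)
        # Fourth unit (assumed test): capacity above the threshold means cheating,
        # and the first unit stops moving the window.
        if capacity > threshold:
            return True
        # First unit: at maximum length, shrink the window to the last word only.
        if length >= max_length:
            length, capacity, vocabulary = 1, 1, {word}
    return False


print(detect_cheating("cheap cheap cheap cheap cheap".split()))  # True
print(detect_cheating("the quick brown fox jumps over the lazy dog".split()))  # False
```

Returning at the first violation mirrors the first unit stopping the window once cheating has been determined, so a stuffed text is rejected without traversing it to the end.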
PCT/CN2008/072319 2007-09-25 2008-09-10 Method and device of text sorting and method and device of text cheating recognizing WO2009046649A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN200710123625.7 2007-09-25
CNB2007101236257A CN100545847C (en) 2007-09-25 2007-09-25 A kind of method and system that blog articles is sorted

Publications (1)

Publication Number Publication Date
WO2009046649A1 (en) 2009-04-16

Family

ID=39095078

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2008/072319 WO2009046649A1 (en) 2007-09-25 2008-09-10 Method and device of text sorting and method and device of text cheating recognizing

Country Status (2)

Country Link
CN (1) CN100545847C (en)
WO (1) WO2009046649A1 (en)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN100545847C (en) * 2007-09-25 2009-09-30 腾讯科技(深圳)有限公司 A kind of method and system that blog articles is sorted
CN102385585A (en) * 2010-08-27 2012-03-21 阿里巴巴集团控股有限公司 Establishing method of webpage database, webpage searching method and relative device
CN101984422B (en) * 2010-10-18 2013-05-29 百度在线网络技术(北京)有限公司 Fault-tolerant text query method and equipment
CN102841908A (en) * 2011-06-21 2012-12-26 富士通株式会社 Micro-blog content ordering method and micro-blog content ordering device
CN103324637B (en) * 2012-03-23 2017-12-12 深圳市世纪光速信息技术有限公司 A kind of hot information method for digging and system
CN103365845B (en) * 2012-03-26 2018-07-27 腾讯科技(北京)有限公司 A kind of searching method in microblogging and system
CN103049511B (en) * 2012-03-28 2016-02-03 温州大学 The display packing of a kind of microblogging concern list, content of microblog and client thereof
CN103257982A (en) * 2012-06-13 2013-08-21 苏州大学 Blog search result ordering algorithm based on attention relationship
CN102880665A (en) * 2012-09-05 2013-01-16 常州嘴馋了信息科技有限公司 Webpage blog showing system
CN103218443A (en) * 2013-04-22 2013-07-24 中山大学 Blogging webpage retrieval system and retrieval method
CN103810251B (en) * 2014-01-21 2017-05-10 南京财经大学 Method and device for extracting text
CN104899310B (en) * 2015-06-12 2018-01-19 百度在线网络技术(北京)有限公司 Information sorting method, the method and device for generating information sorting model
CN105138573A (en) * 2015-07-28 2015-12-09 沈阳化工大学 PHP based multi-user light blog system
CN106446087A (en) * 2016-09-12 2017-02-22 福建中金在线信息科技有限公司 Method and device for acquiring thematic information
CN113011167B (en) * 2021-02-09 2024-04-23 腾讯科技(深圳)有限公司 Cheating identification method, device, equipment and storage medium based on artificial intelligence

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1529263A (en) * 2003-09-18 2004-09-15 北京邮电大学 Chinese text auto-segmenting and text plagiarism discrimination device and method
WO2007033202A1 (en) * 2005-09-13 2007-03-22 Google Inc. Ranking blog documents
CN101071419A (en) * 2007-05-31 2007-11-14 腾讯科技(深圳)有限公司 Method and system for judging article importance in network, and sliding window
CN101127046A (en) * 2007-09-25 2008-02-20 腾讯科技(深圳)有限公司 Method and system for sequencing to blog article

Also Published As

Publication number Publication date
CN101127046A (en) 2008-02-20
CN100545847C (en) 2009-09-30

Similar Documents

Publication Publication Date Title
WO2009046649A1 (en) Method and device of text sorting and method and device of text cheating recognizing
KR101557294B1 (en) Search results ranking using editing distance and document information
JP5984917B2 (en) Method and apparatus for providing suggested words
US9384214B2 (en) Image similarity from disparate sources
CN108388582B (en) Method, system and apparatus for identifying related entities
US8356035B1 (en) Association of terms with images using image similarity
US20100082653A1 (en) Event media search
US8527564B2 (en) Image object retrieval based on aggregation of visual annotations
US10152478B2 (en) Apparatus, system and method for string disambiguation and entity ranking
CN111310023B (en) Personalized search method and system based on memory network
CN110569496A (en) Entity linking method, device and storage medium
US9529908B2 (en) Tiering of posting lists in search engine index
CN111078931B (en) Song list pushing method, device, computer equipment and storage medium
WO2008084930A1 (en) Method for offering result of search and system for executing the method
JP2020512651A (en) Search method, device, and non-transitory computer-readable storage medium
CN104778284A (en) Spatial image inquiring method and system
CN111708942A (en) Multimedia resource pushing method, device, server and storage medium
JP4375626B2 (en) Search service system and method for providing input order of keywords by category
CN106033417B (en) Method and device for sequencing series of video search
CN111950267B (en) Text triplet extraction method and device, electronic equipment and storage medium
KR101175194B1 (en) Method, apparatus, server, and computer-readable recording medium for searching image
KR101649146B1 (en) Method and server for searching
KR101615164B1 (en) Query processing method and apparatus based on n-gram
CN107609006B (en) Search optimization method based on local log research
CN110008407A (en) A kind of information retrieval method and device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 08800831

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 1644/CHENP/2010

Country of ref document: IN

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112 (1) EPC (EPO FORM 1205A DATED 01/09/2010)

122 Ep: pct application non-entry in european phase

Ref document number: 08800831

Country of ref document: EP

Kind code of ref document: A1