WO2009046649A1 - Text sorting method and device, and text cheating recognition method and device - Google Patents

Text sorting method and device, and text cheating recognition method and device

Info

Publication number
WO2009046649A1
WO2009046649A1 PCT/CN2008/072319 CN2008072319W WO2009046649A1 WO 2009046649 A1 WO2009046649 A1 WO 2009046649A1 CN 2008072319 W CN2008072319 W CN 2008072319W WO 2009046649 A1 WO2009046649 A1 WO 2009046649A1
Authority
WO
WIPO (PCT)
Prior art keywords
window
text
length
capacity
sliding window
Prior art date
Application number
PCT/CN2008/072319
Other languages
English (en)
Chinese (zh)
Inventor
Rongfang Shao
Haiquan Xie
Liang Dong
Original Assignee
Tencent Technology (Shenzhen) Company Limited
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology (Shenzhen) Company Limited

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor

Definitions

  • The present invention relates to the field of computers, and more particularly to a text sorting method and apparatus, and a text cheating recognition method and apparatus.
  • BACKGROUND With the development of the Internet, weblogs (blogs) have become a common network service. At present, a large number of Internet companies have launched their own blog search engines. These blog search engines sort the retrieved blog posts in different ways, but they all process the search string input by the user, compute the set of most relevant results, and return it to the user so that the user can find the blog posts most relevant to their expectations.
  • The two sorting methods currently prevalent are sorting by relevance and sorting by time, with sorting by relevance being the more typical.
  • The specific process of sorting by relevance is: first calculate the text relevance weight between the search string and each blog post, together with the numerical relevance weight of the blog post, and establish an index between the search string and the blog posts according to these relevance weights.
  • A search is then performed according to the search string input by the user, the blog posts are sorted by the size of the relevance weight, and finally the sorted result is sent to the user for display.
  • The search string is generally decomposed into a plurality of search words, so that the text relevance weight between the search string and a blog post is decomposed into the text relevance weights between each search word and the blog post.
  • Although the above sorting method can provide users with reasonably credible sorting results for blog posts, some low-quality articles that contain only a few distinct words, in part or in whole, can still obtain a higher ranking under this method by repeating and stacking words. This is a typical text cheating phenomenon.
  • This kind of text cheating also affects text sorting processes other than blog posts, such as web page sorting during search.
  • Embodiments of the present invention provide a text sorting method and apparatus to reduce the impact of text cheating on sorting results.
  • Embodiments of the present invention also provide a text cheating recognition method and device to identify text with cheating behavior.
  • The present invention provides a text sorting apparatus, wherein text quality is a basis for text sorting, and the apparatus includes:
  • a second module configured to correct, according to the recognition result of the first module, the position of the text with cheating behavior in a sorting queue.
  • The text is traversed by a moving sliding window, wherein the sliding window moves as follows: the window length of the sliding window is gradually increased from the initial value, and the window capacity of the sliding window is recorded each time the window length is increased; when the window capacity reaches the maximum value, the window length of the sliding window is restored to the initial value, and the sliding window is moved so that it contains only the last traversed word;
  • the window length is the total number of words accommodated by the sliding window
  • the preset threshold is set according to a maximum value of the window capacity.
  • a first unit configured to control the movement of the sliding window on the text and the change of the window length: when the window capacity recorded by the third unit is less than the maximum value, the window length is gradually increased; when the window capacity recorded by the third unit reaches the maximum value, the window length of the sliding window is restored to the initial value and the sliding window is moved so that it contains only the last traversed word; and when the fourth unit determines that the text is cheating, the movement of the sliding window on the text is stopped;
  • a second unit configured to record the window length of the sliding window each time the window length is increased;
  • a third unit configured to record the window capacity of the sliding window each time the window length is increased, to notify the first unit of the value of the window capacity, and to restart counting from the initial value.
  • a fourth unit configured to determine that the text has a cheating phenomenon according to a relationship between a window length and a preset threshold; wherein the preset threshold is set according to a maximum value of the window capacity.
  • The text is traversed by a moving sliding window, wherein the sliding window moves as follows: the window length of the sliding window is gradually increased from the initial value, and the window capacity of the sliding window is recorded each time the window length is increased; when the window length reaches the maximum value, the window length is restored to the initial value, and the sliding window is moved so that it contains only the last traversed word;
  • the window length is the total number of words accommodated by the sliding window
  • the preset threshold is set according to a maximum value of the window capacity.
  • a first unit configured to control the movement of the sliding window on the text and the change of the window length: when the window length recorded by the second unit is less than the maximum value, the window length is gradually increased; when the window length recorded by the second unit reaches the maximum value, the window length of the sliding window is restored to the initial value and the sliding window is moved so that it contains only the last traversed word; and when the fourth unit determines that the text has a cheating phenomenon, the movement of the sliding window on the text is stopped.
  • a second unit configured to record a window length of the sliding window each time the window length is increased, and notify the first unit and the third unit of the value of the window length.
  • a third unit configured to record the window capacity of the sliding window each time the window length is increased, and to restart counting from the initial value when the window length recorded by the second unit reaches the maximum value.
  • a fourth unit configured to determine that the text has a cheating phenomenon according to a relationship between a window length and a preset threshold; wherein the preset threshold is set according to a maximum value of the window capacity.
  • The text sorting method provided by the embodiment of the present invention identifies text with cheating behavior and corrects the sorting result according to the recognition result. This reduces the effect of text cheating on the sorting results and improves the objectivity of the ranking.
  • The text cheat recognition method provided by the embodiment of the present invention either calculates the window length needed to accommodate a certain window capacity and compares it with a preset threshold, or calculates the window capacity within a certain window length and compares it with a preset threshold. The process of text cheating recognition is thereby quantified, which makes the recognition more objective.
  • BRIEF DESCRIPTION OF THE DRAWINGS Figure 1 is a flow chart of a text sorting method in an embodiment of the present invention.
  • FIG. 2 is a flow chart of a method for identifying cheat text in an embodiment of the present invention.
  • FIG. 3 is a flow chart of a method for identifying cheat text in an embodiment of the present invention.
  • FIG. 4 is a flow chart of a method for identifying cheat text in an embodiment of the present invention.
  • FIG. 5 is a flow chart of a method for identifying cheat text in an embodiment of the present invention.
  • Figure 6 is a structural diagram of a text sorting apparatus in an embodiment of the present invention.
  • Fig. 7 is a structural diagram of a text cheat recognition apparatus in the embodiment of the present invention.
  • FIG. 8 is a structural diagram of a blog article retrieval sorting system in an embodiment of the present invention.
  • FIG. 9 is a structural diagram of an indexer in a blog article retrieval sorting system according to an embodiment of the present invention.
  • FIG. 10 is a structural diagram of a searcher in a blog article retrieval sorting system according to an embodiment of the present invention.
  • FIG. 11 is a flowchart of a method for establishing an index in a blog article retrieval order according to an embodiment of the present invention.
  • FIG. 12 is a flowchart of a method for establishing an index in a blog article retrieval order according to an embodiment of the present invention.
  • FIG. 13 is a flowchart of a method for retrieving a blog article in an embodiment of the present invention.
  • FIG. 14 is a flowchart of a method for retrieving a blog article in an embodiment of the present invention.
  • the present invention will be further described in detail below with reference to the accompanying drawings and embodiments. It is understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
  • FIG. 1 is a flow chart of a text sorting method in an embodiment of the present invention. As shown in Figure 1, the method includes:
  • Step S101 Identify a text with cheating behavior
  • Step S102 Correct the position of the text with the cheating behavior in the sorting queue according to the recognition result.
  • An embodiment of the present invention provides a method for recognizing cheat text, in which the text to be detected is traversed by a moving sliding window. The sliding window moves as follows: the window length of the sliding window is increased from an initial value, and the window capacity of the sliding window is recorded each time the window length is increased; when the window capacity reaches the maximum value, the window length of the sliding window is restored to the initial value, and the sliding window is moved so that it contains only the last traversed word.
  • The window capacity is the number of different words accommodated by the window, and the window length is the total number of words in the window, that is, the distance between the left and right borders.
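  • As a small illustration of these two definitions, the sketch below computes both quantities for a window given as a list of words. The function names and the sample words are hypothetical and chosen only for this example.

```python
def window_length(window):
    """Total number of words in the window."""
    return len(window)


def window_capacity(window):
    """Number of different words in the window."""
    return len(set(window))


window = ["buy", "cheap", "buy", "cheap", "now"]
print(window_length(window))    # 5 words in total
print(window_capacity(window))  # 3 different words
```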
  • the threshold is set according to the maximum value of the window capacity.
  • The process of determining whether the text has a cheating phenomenon according to the relationship between the window length and a preset threshold may be: recording the window length each time the window capacity reaches the maximum value; and comparing the maximum of the recorded window lengths with the preset threshold; if the threshold is exceeded, the text is determined to be cheating.
  • Alternatively, the process may be: comparing each recorded window length with the preset threshold, and determining that the text has a cheating phenomenon if the threshold is exceeded.
  • The threshold is proportional to the maximum value of the window capacity; that is, the larger the maximum window capacity, the longer the window length that a text without cheating is expected to reach.
  • FIG. 2 is a flow chart of a method for identifying cheat text in an embodiment of the present invention.
  • The method for recognizing cheat text may be referred to as a water signature recognition algorithm. It traverses the entire text from left to right using a sliding window of fixed maximum capacity and variable length, and records the maximum length that the window has reached.
  • The larger the maximum window length of a text, the more likely the text is a low-quality article with text cheating.
  • Step S202 Determine whether the next word is successfully read: If yes, execute S203; if no, go to step S210.
  • Step S203 The right border of the sliding window is shifted to the right, and the read new word is included in the sliding window.
  • Step S205 The window vocabulary and the capacity C are unchanged, and the length L is incremented. After the step ends, the process proceeds to step S202 to continue reading.
  • Step S206 The word is added to the window vocabulary, the capacity C is incremented, and the length L is incremented.
  • Step S207 Determine whether the window capacity C exceeds the set maximum value Cmax: If yes, execute step S208; if no, proceed to step S202 to continue reading.
  • Step S208 The left border of the window is shifted to the right, and the window is shortened to include only the newly read words.
  • Step S209 Determine whether the text has been traversed: if yes, execute step S210; if no, return to step S202 to continue reading.
  • Step S210 When the text traversal is completed, one or more lengths L have been recorded, and the text is judged according to the maximum recorded length: if the maximum length L is greater than the set threshold, the text has a cheating phenomenon; otherwise, there is no cheating in the text.
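  • The steps of FIG. 2 can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: it assumes the text is already split into a list of words, it records the window length without the word that caused the capacity to overflow, and the function names and sample values of Cmax and the threshold are hypothetical.

```python
def window_lengths(words, c_max):
    """Return the window lengths recorded while traversing `words`.

    The window grows one word at a time; once it holds more than `c_max`
    different words it is closed and restarted at the newly read word.
    """
    lengths = []
    vocab = set()   # window vocabulary
    length = 0      # window length L
    for word in words:                   # S202: read the next word
        length += 1                      # S203: right border moves right
        vocab.add(word)                  # S205/S206: capacity grows only for a new word
        if len(vocab) > c_max:           # S207: capacity C exceeds Cmax
            lengths.append(length - 1)   # assumption: record L without the overflow word
            vocab = {word}               # S208: keep only the newly read word
            length = 1
    if length:
        lengths.append(length)           # S210: record the last window as well
    return lengths


def is_cheating(words, c_max, threshold):
    """S210: cheating if the longest recorded window exceeds the threshold."""
    lengths = window_lengths(words, c_max)
    return bool(lengths) and max(lengths) > threshold


spam = "buy cheap buy cheap buy cheap buy cheap now".split()
normal = "the quick brown fox jumps".split()
print(window_lengths(spam, c_max=2))              # [8, 1]
print(window_lengths(normal, c_max=2))            # [2, 2, 1]
print(is_cheating(spam, c_max=2, threshold=6))    # True
print(is_cheating(normal, c_max=2, threshold=6))  # False
```

  • In the repetitive sample, eight words fit into a window of only two different words, so the recorded length is large; in the ordinary sample the window is closed after two words each time.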
  • FIG. 3 is a flow chart of a method for identifying cheat text in an embodiment of the present invention. As shown in Figure 3, the method includes:
  • Step S302 Read the next word.
  • Step S303 The right border of the sliding window is shifted to the right, and the read new word is included in the sliding window.
  • Step S305 The window vocabulary and the capacity C are unchanged, and the length L is incremented. After the step ends, the process proceeds to step S307.
  • Step S306 The word is added to the window vocabulary, the capacity C is incremented, and the length L is incremented. After the step ends, the process proceeds to step S307.
  • Step S307 It is judged whether the length L exceeds the threshold; if yes, the process proceeds to step S311, otherwise, the process proceeds to step S308.
  • Step S308 determining whether the window capacity C exceeds the set maximum value Cmax; if yes, executing step S309; if not, proceeding to step S310.
  • Step S309 The left border of the window is shifted to the right, and the window is shortened to include only the newly read words.
  • Step S310 Determine whether the text has been traversed: If yes, go to step S312; if no, go to step S302 to continue reading.
  • Step S311 It is determined that the text has a cheating phenomenon.
  • Step S312 It is determined that the text does not have a cheating phenomenon.
  • the threshold values are all closely related to the maximum value of the window capacity C, that is, when the maximum value of the window capacity C is larger, the set threshold value can also be larger. Conversely, the smaller the maximum value of the window capacity C, the corresponding threshold should be reduced accordingly.
  • The embodiment of the invention further provides a method for recognizing cheat text, in which the text to be detected is traversed by a moving sliding window. The sliding window moves as follows: the window length of the sliding window is increased from the initial value, and the window capacity of the sliding window is recorded each time the window length is increased; when the window length reaches the maximum value, the window length of the sliding window is restored to the initial value, and the sliding window is moved so that it contains only the last traversed word.
  • The moving process of the sliding window is repeated in turn. During the traversal, or after the entire text has been traversed, whether the text is cheating is determined according to the relationship between the window capacity and a preset threshold. In this case, the threshold is set according to the maximum value of the window length.
  • The process of determining whether the text has a cheating phenomenon according to the relationship between the window capacity and a preset threshold may be: recording the window capacity each time the window length reaches the maximum value; and comparing the minimum of the recorded window capacities with the preset threshold; if it is less than the threshold, the text is judged to be cheating.
  • Alternatively, the process may be: comparing each recorded window capacity with the preset threshold, and determining that the text is cheating if the capacity is less than the threshold.
  • The threshold is proportional to the maximum value of the window length: the smaller the window capacity, the greater the probability that the text is cheating, but when the maximum window length is increased, the threshold for the window capacity can be increased accordingly.
  • FIG. 4 is a flow chart of a method for identifying cheat text in an embodiment of the present invention.
  • The entire text is traversed from left to right using a sliding window, and a maximum length is set for the window, i.e., the window cannot exceed the maximum length.
  • The smaller the window capacity, the more likely the text is cheating.
  • Step S402 Determine whether the next word is successfully read: If yes, execute S403; if no, go to step S410.
  • Step S403 The right border of the sliding window is shifted to the right, and the read new word is included in the sliding window.
  • Step S405 The window vocabulary and the capacity C are unchanged, and the length L is incremented. After the step ends, the process proceeds to step S402 to continue reading.
  • Step S406 The word is added to the window vocabulary, the capacity C is incremented, and the length L is incremented.
  • Step S407 Determine whether the window length L exceeds the set maximum value Lmax: If yes, execute step S408; if no, proceed to step S402 to continue reading.
  • Step S408 The left border of the window is shifted to the right, and the window is shortened to include only the newly read words.
  • Step S409 Determine whether the text has been traversed: if yes, execute step S410; if no, return to step S402 to continue reading.
  • Step S410 When the text traversal is completed, one or more window capacities C have been recorded, and the text is judged according to the minimum recorded capacity: if the minimum capacity C is less than the set threshold, the text has a cheating phenomenon; otherwise, there is no cheating in the text.
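  • A rough sketch of the fixed-length variant of FIG. 4 follows. It makes several assumptions that the text does not state: the word list is pre-split, the length check is applied after every word, the window is closed as soon as it holds Lmax words (which gives the same windows as restarting at the next word), and a trailing window shorter than Lmax is ignored because it would trivially hold few different words. Function names and sample values are hypothetical.

```python
def window_capacities(words, l_max):
    """Return the capacity of each full window of `l_max` words."""
    capacities = []
    vocab = set()   # window vocabulary
    length = 0      # window length L
    for word in words:                      # S402: read the next word
        vocab.add(word)                     # S405/S406: capacity grows only for a new word
        length += 1                         # S403: right border moves right
        if length == l_max:                 # S407: window has reached Lmax
            capacities.append(len(vocab))   # record capacity C of the full window
            vocab = set()                   # S408: restart the window
            length = 0
    return capacities


def is_cheating(words, l_max, threshold):
    """S410: cheating if the smallest recorded capacity is below the threshold."""
    capacities = window_capacities(words, l_max)
    return bool(capacities) and min(capacities) < threshold


spam = ("buy cheap " * 5).split()
normal = "a b c d e f g h i j".split()
print(window_capacities(spam, l_max=5))           # [2, 2]
print(window_capacities(normal, l_max=5))         # [5, 5]
print(is_cheating(spam, l_max=5, threshold=3))    # True
print(is_cheating(normal, l_max=5, threshold=3))  # False
```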
  • FIG. 5 is a flow chart of a method for identifying cheat text in an embodiment of the present invention. As shown in Figure 5, the method includes:
  • Step S502 Read the next word.
  • Step S503 The right border of the sliding window is shifted to the right, and the read new word is included in the sliding window.
  • Step S504 determining whether the word already exists in the window vocabulary: if yes, executing step S505; if not, executing step S506.
  • Step S505 The window vocabulary and the capacity C are unchanged, and the length L is incremented. After the step ends, the process proceeds to step S507.
  • Step S506 the word is added to the window vocabulary, the capacity C is incremented, and the length L is incremented. After the end, the process proceeds to step S507.
  • Step S507 It is judged whether the capacity C is smaller than the threshold; if it is less than the threshold, the process proceeds to step S511, otherwise, the process proceeds to step S508.
  • Step S508 determining whether the window length L exceeds the set maximum value Lmax; if yes, executing step S509; if not, proceeding to step S510.
  • Step S509 The left border of the window is shifted to the right, and the window is shortened to include only the newly read words.
  • Step S510 Determine whether the text has been traversed: If yes, execute step S512; if no, proceed to step S502 to continue reading.
  • Step S511 It is determined that the text has a cheating phenomenon.
  • Step S512 It is determined that the text does not have a cheating phenomenon.
  • the threshold values are all closely related to the maximum value of the window length L, that is, when the maximum value of the window length L is larger, the set threshold value can also be larger. Conversely, the smaller the maximum value of the window length L, the smaller the set threshold should be.
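  • The early-exit variant of FIG. 5 can be sketched in the same spirit. One assumption is made loudly here: the capacity comparison of step S507 is applied only once the window has grown to Lmax words, since a newly restarted window always holds few different words. The word list is assumed to be pre-split, and the function name and sample values are hypothetical.

```python
def has_low_capacity_window(words, l_max, threshold):
    """Stop at the first full window whose capacity is below `threshold`."""
    vocab = set()   # window vocabulary
    length = 0      # window length L
    for word in words:                   # S502: read the next word
        vocab.add(word)                  # S505/S506: capacity grows only for a new word
        length += 1                      # S503: right border moves right
        if length == l_max:              # S508: window has reached Lmax
            if len(vocab) < threshold:   # S507 (assumed to apply to full windows only)
                return True              # S511: the text has a cheating phenomenon
            vocab = set()                # S509: restart the window
            length = 0
    return False                         # S512: no cheating found


spam = ("buy cheap " * 5).split()
normal = "a b c d e f g h i j".split()
print(has_low_capacity_window(spam, l_max=5, threshold=3))    # True
print(has_low_capacity_window(normal, l_max=5, threshold=3))  # False
```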
  • In the above, the text is traversed from beginning to end, so when the window length increases from the initial value, the right border of the window moves to the right, and when the window length is restored to the initial value, the left border of the window is shifted to the right.
  • The text can also be traversed from end to beginning. In that case, when the window length increases from the initial value, the left border of the window moves to the left, and when the window length is restored to the initial value, the right border of the window is shifted to the left.
  • The position of the text having the cheating behavior in the queue may be corrected according to the recognition result as follows.
  • The ordering parameter corresponding to every text with cheating behavior may be corrected by a fixed amplitude.
  • For example, when texts are sorted according to the relevance weight between the search term and the text, this relevance weight is reduced by a fixed amplitude.
  • Alternatively, the window capacity and window length obtained by the text cheating recognition algorithm can be used to evaluate the degree of cheating of different texts more accurately, so that texts with different degrees of cheating are processed differently, with more rigorous processing for texts with more serious cheating. For example, the text with severe cheating is moved further back in the queue, or the ordering parameter corresponding to the text with severe cheating is corrected by a larger margin.
  • To compare the degree of cheating of two texts, the window capacity and window length of the sliding windows of the two texts can be compared. If the sliding windows of the two texts have the same window capacity, the text whose sliding window has the larger window length has the greater degree of cheating. If the sliding windows of the two texts are equal in length, the text whose sliding window has the smaller window capacity has the greater degree of cheating.
  • More generally, the ratio of window capacity to window length can be calculated for the sliding windows of the two texts: the text whose sliding window has the smaller ratio of window capacity to window length has the greater degree of cheating.
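  • The ratio comparison can be illustrated with the short sketch below. Each window is represented as a hypothetical (capacity, length) pair, and the function name and return values are assumptions made for this example; cross-multiplication is used so that the comparison stays in integers.

```python
def more_cheated(window_a, window_b):
    """Compare two (capacity, length) pairs by their capacity/length ratio.

    Returns "a" or "b" for the window with the smaller ratio (the greater
    degree of cheating), or "equal" when the ratios are the same.
    """
    cap_a, len_a = window_a
    cap_b, len_b = window_b
    # cap_a / len_a < cap_b / len_b  is equivalent to  cap_a * len_b < cap_b * len_a
    left = cap_a * len_b
    right = cap_b * len_a
    if left < right:
        return "a"
    if left > right:
        return "b"
    return "equal"


print(more_cheated((2, 8), (2, 4)))  # a: same capacity, longer window
print(more_cheated((4, 6), (2, 6)))  # b: same length, smaller capacity
print(more_cheated((3, 6), (2, 4)))  # equal: both ratios are 0.5
```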
  • Figure 6 is a structural diagram of a text sorting apparatus in an embodiment of the present invention. As shown in Figure 6, the device includes:
  • a text recognition module 601 configured to identify text with cheating behavior
  • the sorting correction module 602 is configured to correct the position of the text with the cheating behavior in the sorting queue according to the recognition result.
  • Figure 7 is a structural diagram of a text cheat recognition apparatus in an embodiment of the present invention. As shown in Fig. 7, the apparatus includes a window length control unit 701, a window length recording unit 702, a window capacity recording unit 703, and a threshold comparison unit 704.
  • The window length control unit 701 is configured to control the movement of the sliding window on the text and the change of the window length: when the window capacity recorded by the window capacity recording unit 703 is less than the maximum value, the window length is gradually increased; when the window capacity recorded by the window capacity recording unit 703 reaches the maximum value, the window length of the sliding window is restored to the initial value and the sliding window is moved so that it contains only the last traversed word; and when the threshold comparison unit 704 determines that the text is cheating, the movement of the sliding window over the text is stopped.
  • the window length recording unit 702 is configured to record the window length of the sliding window each time the window length is increased.
  • the window capacity recording unit 703 is configured to record the window capacity of the sliding window each time the window length is increased; and notify the window length control unit 701 of the value of the window capacity, and start counting again from the initial value.
  • the threshold comparison unit 704 is configured to determine that the text is cheating according to the relationship between the window length and a preset threshold; wherein the preset threshold is set according to the maximum value of the window capacity.
  • the function of the four units of the device can also be:
  • The window length control unit 701 is configured to control the movement of the sliding window on the text and the change of the window length: when the window length recorded by the window length recording unit 702 is less than the maximum value, the window length is gradually increased; when the window length recorded by the window length recording unit 702 reaches the maximum value, the window length of the sliding window is restored to the initial value and the sliding window is moved so that it contains only the last traversed word; and when the threshold comparison unit 704 determines that the text is cheating, the movement of the sliding window over the text is stopped.
  • the window length recording unit 702 is configured to record the window length of the sliding window each time the window length is increased, and notify the window length control unit 701 and the window capacity recording unit 703 of the value of the window length.
  • the window capacity recording unit 703 is configured to record the window capacity of the sliding window each time the window length is increased; and when the window length recorded by the window length recording unit 702 reaches the maximum value, counting from the initial value.
  • The threshold comparison unit 704 is configured to determine that the text is cheating according to the relationship between the window capacity and a preset threshold, wherein the preset threshold is set according to the maximum value of the window length.
  • In the following, the method and the device in the embodiment of the present invention are illustrated with the text being a blog post and the sorting being search sorting.
  • The text in the embodiment of the present invention may also be webpage text or any other text that requires sorting, and the sorting scenarios are not limited to search sorting.
  • By identifying cheating text during the calculation of the text relevance weight and reducing its weight, the embodiment of the present invention can establish a more accurate index, thereby improving the objectivity and accuracy of sorting based on this index and ensuring the quality of text retrieval for users.
  • FIG. 8 is a structural diagram of a blog article retrieval sorting system in an embodiment of the present invention.
  • the system includes a blog system 100, an indexer 200, a retriever 300, an agent 400, and a client 500.
  • The connection relationships between the devices in all the diagrams of the present invention are intended to clearly explain their information interaction and control processes, and should therefore be regarded as logical connections rather than being limited to physical connections. Specifically:
  • The blog system 100 is used to provide blog-related services for users, including storing and managing blog posts, and in the present invention provides relevance factors to the indexer 200, including text relevance factors (e.g., text classification, title, body, nickname, space name, etc.) and numerical correlation factors (e.g., activity factor, reload factor, response rate factor, publication time factor, etc.).
  • the core of the blog system 100 can be a web server, but the invention is not limited to its specific form.
  • The indexer 200 is configured to build an index based on data in the blog system 100, so that the retriever 300 can sort the retrieved blog posts based on the index.
  • the retriever 300 queries and sorts the blog articles based on the search terms input by the user.
  • the agent 400 is configured to receive the search string sent by the client 500, divide the search string into search terms, send it to the searcher 300, and forward the retrieved and sorted results of the retriever 300 to the client 500.
  • The client 500 receives the search term or the search string input by the user. If the user inputs a search term, the client can send it directly to the retriever 300 and, after receiving the blog article sorting result fed back by the retriever 300, draw the sorting result and display it on the user interface. If the user inputs a search string, it must be sent to the agent 400 for segmentation; after receiving the blog article sorting result fed back by the agent 400, the client draws the sorting result and displays it on the user interface.
  • The client 500 can be any of a variety of terminal devices capable of logging in to the Internet, such as a personal computer (PC), a personal digital assistant (PDA), or a mobile phone (MP), and thus the scope of protection of the present invention should not be limited to a particular type of client.
  • FIG. 9 is a structural diagram of an indexer in a blog article retrieval sorting system according to an embodiment of the present invention.
  • the indexer 200 includes: a numerical correlation determination unit 201, a text correlation determination unit 202, a text cheat recognition unit 203, a superposition calculation unit 204, and an index construction unit 205.
  • the numerical correlation determining unit 201 is configured to calculate a numerical correlation weight of the search term and each blog post based on the numerical correlation factor extracted from the blog system.
  • the text relevance determining unit 202 is configured to calculate a text relevance weight of the search term and each blog post based on the text relevance factor extracted from the blog system.
  • the text cheat recognition unit 203 is configured to identify blog articles with text cheating when the text relevance determination unit 202 calculates the text relevance weight between the search term and the blog post.
  • the superposition calculation unit 204 is configured to perform superposition calculation on the foregoing numerical correlation weight and text relevance weight to obtain a comprehensive correlation weight of the search term, and send it to the index construction unit 205.
  • the index construction unit 205 builds an index based on the comprehensive correlation weight.
  • alternatively, the indexer 200 may include only a text relevance determination unit 202, a text cheat recognition unit 203, and an index construction unit 205.
  • the text relevance determining unit 202 is configured to calculate a text relevance weight of the search term and each blog post based on the text relevance factor extracted from the blog system.
  • the text cheat recognition unit 203 is used to identify blog posts with text cheating, and the index construction unit 205 constructs an index between the search term and each blog post according to the text relevance weight.
  • because this process of constructing an index considers only the text correlation factor, the accuracy of the index is not high enough.
  • the indexer 200 shown in Figure 2 has a higher index accuracy.
  • FIG. 10 is a structural diagram of a retriever in a blog article retrieval sorting system according to an embodiment of the present invention.
  • the retriever 300 includes a query unit 301, a composite correlation calculation unit 302, and a sorting unit 303.
  • the user initially inputs a search string containing a plurality of search terms; the search string is divided into search terms and sent to the retriever 300, and the retriever 300 receives the search terms and then processes them.
  • the query unit 301 queries the relevance weight (text relevance weight, or comprehensive relevance weight) between each search term and each blog post from the index that has been established by the indexer, and sends it to the sorting unit.
  • the composite correlation calculation unit 302 calculates the composite correlation weight between the search string and each blog post based on the correlation weight of each search term, and sends it to the sorting unit 303.
  • the sorting unit 303 sorts each blog post related to the search string according to the composite correlation weight.
  • a retriever 300 is provided that can be directly connected to and communicated with a client, and is suitable for situations in which a user inputs a search term rather than a search string.
  • the retriever 300 at this time includes a query unit 301 and a sorting unit 303.
  • the query unit 301 queries, according to the search term input by the user, the correlation weight (text relevance weight, or comprehensive relevance weight) between the search term and each blog post from the index that has been established by the indexer 200, and sends it to the sorting unit 303.
  • the sorting unit 303 sorts each blog post related to the search term according to the size of the received correlation weight. It should be noted that since most users currently input a search string containing a plurality of search terms, the structure of the retriever 300 shown in Fig. 3 is more widely used and more typical.
  • FIG. 11 is a flowchart of a method for establishing an index in a blog article retrieval order according to an embodiment of the present invention. As shown in FIG. 11, the method includes the following steps:
  • Step S1101 Extract a correlation factor from the blog system, and format the data.
  • the formatting mentioned here includes normalizing some correlation factors and applying further processing, such as Log processing, to others, so as to map the values of most correlation factors into a fixed interval, for example [0, 100]. Of course, some correlation factors keep their original values.
  • the correlation factor referred to in the present invention may include only a text correlation factor, or both a text correlation factor and a numerical correlation factor. These correlation factors are used as input parameters for the correlation weight calculation when the indexer builds the index.
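The formatting in Step S1101 can be illustrated with a minimal Python sketch. It assumes min-max normalization for bounded factors and logarithmic compression for raw counts; the function names, the clipping behaviour, and the choice of `log1p` are illustrative assumptions and are not specified by the text, which only says that factors are normalized or Log-processed into a fixed interval such as [0, 100].

```python
import math


def scale_linear(value, lo, hi, top=100.0):
    """Min-max normalize a value from [lo, hi] into [0, top] (assumed scheme)."""
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    clipped = min(max(value, lo), hi)  # keep out-of-range inputs inside the interval
    return (clipped - lo) / (hi - lo) * top


def scale_log(count, max_count, top=100.0):
    """Log-compress a non-negative count into [0, top] (assumed scheme)."""
    if max_count <= 0:
        return 0.0
    clipped = min(max(count, 0), max_count)
    # log1p keeps a count of 0 at 0 and flattens very large counts
    return math.log1p(clipped) / math.log1p(max_count) * top
```

Log compression is a common choice for counts such as replies or reprints, because a few very popular posts would otherwise dominate a linear scale.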
  • Step S1102 Calculate the correlation weight of the search term and each blog post, and identify and degrade the blog article with text cheating.
  • in one embodiment, only the text relevance factor is considered: the indexer calculates the text relevance weight of the search term based on the text relevance factor, identifies blog posts with text cheating, and then appropriately reduces the text relevance weight between the search term and those blog posts.
  • in another embodiment, the indexer considers not only the text correlation factor but also the numerical correlation factor: it calculates the text relevance weight and the numerical correlation weight separately, identifies blog posts with text cheating, appropriately reduces the text relevance weight between the search term and those blog posts, and finally superimposes the text relevance weight and the numerical correlation weight to obtain the comprehensive correlation weight.
  • the previous embodiment only applies weight reduction to the text relevance weight and uses it directly; this embodiment also considers the numerical correlation factor, which further increases the accuracy of the data.
  • Step S1103 Construct an index between the search term and each blog post according to the correlation weight after the weight reduction.
  • the index records each search term, the blog posts corresponding to the search term, and the relevance weight between the search term and each blog post, so that when the user inputs a search term, the blog articles can be sorted according to the data in the index and users can quickly find the most relevant blog posts.
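One possible in-memory shape for the index built in Step S1103 is sketched below in Python. The nested-dictionary layout and the name `build_index` are assumptions made for illustration; the text only states that the index records search terms, the corresponding blog posts, and their relevance weights.

```python
from collections import defaultdict


def build_index(weights):
    """Build a term -> {post_id: weight} mapping.

    `weights` is an iterable of (term, post_id, weight) triples whose
    weights are assumed to have already been demoted where cheating
    was detected.
    """
    index = defaultdict(dict)
    for term, post_id, weight in weights:
        index[term][post_id] = weight
    return dict(index)
```

Keying the index by search term means a query only touches the posts that actually contain that term.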
  • FIG. 12 is a flowchart of a method for establishing an index in a blog article retrieval order according to an embodiment of the present invention. As shown in Figure 12, the process specifically includes:
  • Step S1201 Extract a correlation factor from the blog system, and format the data.
  • the correlation factor at this time includes a text correlation factor and a numerical correlation factor.
  • Step S1202 The indexer calculates a numerical correlation weight of the search term and each blog post.
  • the numerical correlation factors include four factors: the activity factor, the reprint rate factor, the reply rate factor, and the publication time factor.
  • the activity factor is calculated by the blog system, and its value range is [0, 100]. It is a comprehensive measure of the activity level of the blog personal space, taking into account the user registration frequency of the personal space, the frequency of blog post publication, and so on. The higher the activity, the higher the priority of the blog post in the sorting results.
  • the reprint rate factor is calculated by the blog system from the number of times the blog article has been reprinted, and its value range is [0, 100]. The higher the reprint rate, the higher the priority of the blog post in the sorting results.
  • the reply rate factor is calculated according to the number of replies to the blog post, and its value range is [0, 100]. The higher the reply rate factor, the higher the priority of the blog post in the sorting results.
  • the publishing time factor is the publishing time of the blog post. It can be expressed as UNIX time, and more recently published blog posts have higher priority in the sorting results.
  • the numerical correlation weight is calculated from all the correlation factors listed above and normalized, so that its value lies in the interval [0, 1]. Its calculation formula is as follows:
  • Step S1203 The indexer calculates a text relevance weight of the search term and each blog post, and identifies a blog post with text cheats, and performs a demotion process on the blog post with the text cheat.
  • the text relevance factor is also the text field available for retrieval.
  • the text fields include five categories: a category, a title, a body, a nickname, and a space name.
  • Each field has a fixed weight value W and a correction coefficient ⁇ , as shown in Table 1.
  • further, identifying the blog articles with text cheating includes: traversing the blog article with a sliding window and recording the maximum length reached by the sliding window; comparing the maximum length of the sliding window with a threshold; and, if the threshold is exceeded, determining that the blog post is a text cheat. A blog post with text cheating is appropriately demoted; for example, its weight can be reduced by a fixed proportion, correcting the text relevance weight to 60% of its previous value.
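The threshold comparison and demotion described above can be sketched as follows. This Python sketch takes the maximum sliding-window length as an input and does not model how the window grows during traversal; the function name, the return shape, and treating 60% as a default parameter are illustrative assumptions.

```python
def demote_if_cheating(text_weight, max_window_length, threshold, factor=0.6):
    """Return (possibly demoted weight, cheating flag).

    A post is treated as a text cheat when the maximum length reached by
    the sliding window exceeds the threshold; its text relevance weight
    is then scaled to `factor` (60% in the example from the text).
    """
    if max_window_length > threshold:
        return text_weight * factor, True
    return text_weight, False
```

For instance, with a hypothetical threshold of 30, a post whose window reached length 40 would have a weight of 0.5 reduced to about 0.3, while a post whose window reached only 10 would keep its weight.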
  • Step S1204 The indexer uses the superposition calculation unit to perform superposition calculation on the numerical correlation weight and the text correlation weight to obtain the comprehensive correlation weight.
  • Step S1205 The indexer builds the index based on the comprehensive correlation weights and stores the data for extraction and use when the user searches.
  • FIG. 13 is a flowchart of a method for retrieving a blog article in an embodiment of the present invention. This embodiment is a case where a user inputs a search term, including:
  • Step S1301 The retriever receives the search term input by the user in the client.
  • Step S1302 The retriever extracts, from the index that has been constructed by the indexer, the correlation weight between the search term and each blog article; the correlation weight may be a text relevance weight, or the comprehensive correlation weight obtained by superimposing the text relevance weight and the numerical correlation weight.
  • Step S1303 The retriever sorts the retrieved blog articles according to the correlation weights, and feeds the sorting result back to the client.
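Steps S1301 to S1303 amount to a lookup followed by a descending sort. A minimal Python sketch, assuming the index is a dictionary that maps each search term to a `{post_id: weight}` dictionary (a hypothetical layout, not one given in the text):

```python
def rank_posts(index, term):
    """Return (post_id, weight) pairs for `term`, highest weight first."""
    postings = index.get(term, {})  # unknown terms yield an empty result
    return sorted(postings.items(), key=lambda item: item[1], reverse=True)
```

Because the weights were computed when the index was built, the query path does no scoring work beyond the sort.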
  • FIG. 14 is a flowchart of a method for retrieving a blog article in an embodiment of the present invention. This embodiment is a case where the user inputs a search string, and specifically includes:
  • Step S1401 The agent divides the search string input by the user in the client into search terms and sends them to the retriever.
  • Step S1402 The retriever extracts, from the index constructed by the indexer, the correlation weight between each search term and each blog article; the correlation weight may be a text relevance weight, or the comprehensive correlation weight obtained by superimposing the text relevance weight and the numerical correlation weight.
  • Step S1403 The retriever calculates a composite correlation weight of the search string and the blog article.
  • the relevance of the search string input by the user to a blog post can be regarded as the combined result of the relevance of each individual search term to the blog post. Therefore, in one embodiment, a model of simple addition followed by averaging is used to calculate the composite correlation weight. Let the search string be Q = {q1, q2, ..., qn}, where n is the number of search terms after the search string is segmented and d is all the blog articles hit by a search term qn; then the formula for calculating the composite correlation weight between the search string Q and the blog posts is:
  • Step S1404 The retriever sorts the searched blog articles according to the composite correlation weights, and sends the sorting result to the agent.
  • Step S1405 The agent forwards the sorting result to the client, which displays the sorting result on the user interface.
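The add-then-average model used in Step S1403 can be sketched in Python as below. Two points are assumptions rather than statements from the text: a blog post that is not hit by a given search term contributes 0 for that term, and the sum is divided by the total number of search terms n. The index layout (term to `{post_id: weight}`) and the function names are also hypothetical.

```python
def composite_weights(index, terms):
    """Average the per-term weights of each post over all n search terms."""
    n = len(terms)
    if n == 0:
        return {}
    totals = {}
    for term in terms:
        for post_id, weight in index.get(term, {}).items():
            # posts missing for a term simply add nothing (assumed to be 0)
            totals[post_id] = totals.get(post_id, 0.0) + weight
    return {post_id: total / n for post_id, total in totals.items()}


def rank_for_search_string(index, terms):
    """Sort posts by composite weight, highest first (Step S1404)."""
    scores = composite_weights(index, terms)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Under these assumptions, a post that matches every search term tends to rank above a post that matches only one of them with a similar per-term weight.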

Abstract

The present invention relates to a text sorting method and device. The text sorting comprises: recognizing cheating in a text; and modifying the position of the cheating text in the sorting queue according to the recognition result. The invention also relates to a text cheating recognition method and device.
PCT/CN2008/072319 2007-09-25 2008-09-10 Text sorting method and device, and text cheating recognition method and device WO2009046649A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN200710123625.7 2007-09-25
CNB2007101236257A CN100545847C (zh) 2007-09-25 2007-09-25 一种对博客文章进行排序的方法及系统

Publications (1)

Publication Number Publication Date
WO2009046649A1 (fr) 2009-04-16

Family

ID=39095078

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2008/072319 WO2009046649A1 (fr) 2007-09-25 2008-09-10 Text sorting method and device, and text cheating recognition method and device

Country Status (2)

Country Link
CN (1) CN100545847C (fr)
WO (1) WO2009046649A1 (fr)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN100545847C (zh) * 2007-09-25 2009-09-30 腾讯科技(深圳)有限公司 一种对博客文章进行排序的方法及系统
CN102385585A (zh) * 2010-08-27 2012-03-21 阿里巴巴集团控股有限公司 网页数据库的建立方法、网页搜索方法以及相关装置
CN101984422B (zh) * 2010-10-18 2013-05-29 百度在线网络技术(北京)有限公司 一种容错文本查询的方法和设备
CN102841908A (zh) * 2011-06-21 2012-12-26 富士通株式会社 微博内容排序方法和微博内容排序装置
CN103324637B (zh) * 2012-03-23 2017-12-12 深圳市世纪光速信息技术有限公司 一种热点信息挖掘方法和系统
CN103365845B (zh) * 2012-03-26 2018-07-27 腾讯科技(北京)有限公司 一种微博中的搜索方法及系统
CN103049511B (zh) * 2012-03-28 2016-02-03 温州大学 一种微博关注列表、微博内容的显示方法及其客户端
CN103257982A (zh) * 2012-06-13 2013-08-21 苏州大学 基于关注关系的Blog搜索结果排序算法
CN102880665A (zh) * 2012-09-05 2013-01-16 常州嘴馋了信息科技有限公司 网页博客展示系统
CN103218443A (zh) * 2013-04-22 2013-07-24 中山大学 一种面向博客网页的网页检索系统及方法
CN103810251B (zh) * 2014-01-21 2017-05-10 南京财经大学 一种文本提取方法及装置
CN104899310B (zh) * 2015-06-12 2018-01-19 百度在线网络技术(北京)有限公司 信息排序方法、用于生成信息排序模型的方法及装置
CN105138573A (zh) * 2015-07-28 2015-12-09 沈阳化工大学 基于php的多用户轻博客系统
CN106446087A (zh) * 2016-09-12 2017-02-22 福建中金在线信息科技有限公司 专题信息获取方法及装置
CN113011167B (zh) * 2021-02-09 2024-04-23 腾讯科技(深圳)有限公司 基于人工智能的作弊识别方法、装置、设备及存储介质

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1529263A (zh) * 2003-09-18 2004-09-15 北京邮电大学 中文文本自动分词和判别文本抄袭的装置和方法
WO2007033202A1 (fr) * 2005-09-13 2007-03-22 Google Inc. Classement de documents de blogs
CN101071419A (zh) * 2007-05-31 2007-11-14 腾讯科技(深圳)有限公司 在网络上判断文章重要性的方法和系统、及滑动窗口
CN101127046A (zh) * 2007-09-25 2008-02-20 腾讯科技(深圳)有限公司 一种对博客文章进行排序的方法及系统

Also Published As

Publication number Publication date
CN100545847C (zh) 2009-09-30
CN101127046A (zh) 2008-02-20

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 08800831

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 1644/CHENP/2010

Country of ref document: IN

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112 (1) EPC (EPO FORM 1205A DATED 01/09/2010)

122 Ep: pct application non-entry in european phase

Ref document number: 08800831

Country of ref document: EP

Kind code of ref document: A1