CN115687579A - Document tag generation and matching method and device and computer equipment - Google Patents

Info

Publication number
CN115687579A
Authority
CN
China
Prior art keywords
tag
document
label
score
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211158183.0A
Other languages
Chinese (zh)
Other versions
CN115687579B (en
Inventor
丘文波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Shirong Information Technology Co ltd
Guangzhou Shiyuan Electronics Thecnology Co Ltd
Original Assignee
Guangzhou Shirong Information Technology Co ltd
Guangzhou Shiyuan Electronics Thecnology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Shirong Information Technology Co ltd, Guangzhou Shiyuan Electronics Thecnology Co Ltd filed Critical Guangzhou Shirong Information Technology Co ltd
Priority to CN202211158183.0A priority Critical patent/CN115687579B/en
Publication of CN115687579A publication Critical patent/CN115687579A/en
Application granted granted Critical
Publication of CN115687579B publication Critical patent/CN115687579B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The application belongs to the technical field of the internet, and particularly relates to a document tag generation and matching method, a document tag generation and matching device and computer equipment. The document tag generation method comprises the following steps: collecting a search text input by a user and the clicked document name text corresponding to the search text; integrating the records that have the same search text but different corresponding clicked document name texts to obtain a first integration result; obtaining the longest common character string of the search text and each document name text according to the first integration result; obtaining, according to the longest common character strings and the clicked times, the longest common character string with the largest click frequency; setting the longest common character string with the largest click frequency as a label candidate word; and setting at least one of the label candidate words as a document label. The method simplifies the creation of document tags and improves how well the document tags match the user's search intention.

Description

Document tag generation and matching method and device and computer equipment
Technical Field
The application relates to the technical field of internet, in particular to a document tag generating and matching method, a document tag generating and matching device and computer equipment.
Background
In vertical-domain content search, such as academic search and community forum search, documents need to be tagged so that the documents a user requires can be matched quickly from the user's search text; how well the search text matches the document tags also affects the final search result. Currently, document tags are usually designed by manual editing, so creating them is labor intensive, and in some cases the document tags match the user's search intent poorly.
Disclosure of Invention
The application mainly aims to provide a document tag generation and matching method, a document tag generation and matching device and computer equipment, and aims to solve the technical problems that a document tag creation process is complex and the matching degree of the document tag and a user search intention is low.
In order to achieve the above object, the present application provides a document tag generating method, including:
collecting a search text input by a user and a clicked document name text corresponding to the search text;
integrating the records that have the same search text but different corresponding clicked document name texts to obtain a first integration result, wherein the first integration result comprises the search text, the document name texts and the clicked times of the document name texts;
obtaining the longest common character string of the search text and each document name text according to the first integration result;
obtaining the longest common character string with the largest click frequency among the longest common character strings according to the longest common character strings and the clicked times;
setting the longest common character string with the largest click frequency as a label candidate word, wherein there is at least one label candidate word;
and setting at least one of the label candidate words as a document label.
The application also provides a document tag matching method, which comprises the following steps:
acquiring a search text input by a user;
generating a first tag for the search text based on a document tag library, wherein the document tag library is constructed and obtained based on the document tag generation method provided by the embodiment, and the first tag comprises at least one tag word;
generating a second tag for each document based on the document tag library, wherein the documents are stored in the document library, a plurality of documents are stored in the document library and are searched by a user, and the second tag comprises at least one tag word;
matching the first label with a second label, and setting the part of the first label, which is the same as the second label, as an effective label, wherein the effective label comprises at least one label word;
sequentially obtaining a label coverage score of each document based on the first label and the second label, wherein the label coverage score is used for representing the matching degree of the document content and the search text;
sequentially obtaining a tag compactness score of each document based on the effective tags, wherein the tag compactness score is used for representing the position closeness degree of the effective tag contents in the document contents;
obtaining an overall tag matching score for each of the documents based on the tag coverage score and the tag compactness score;
sorting the documents by overall label matching score to obtain a first sorting result;
and setting the documents that satisfy a preset rule as the documents matched with the search text according to the preset rule and the first sorting result.
In one embodiment, the step of sequentially obtaining a tag compactness score for each of the documents based on the valid tags includes:
generating a position element according to the positions of all the label words in the effective labels in the documents, wherein the position element comprises the label words and the position information of the label words;
arranging the position elements in sequence to generate a first sequence;
obtaining a first label combination based on the first sequence, wherein the first label combination comprises all label words in the effective label, and the position distances among all the label words are the closest in the document;
and obtaining a tag compactness score of the document according to the first tag combination.
In one embodiment, the step of obtaining a first tag combination based on the first sequence comprises:
sequentially setting each position element in the first sequence as a target element, and, for each target element, acquiring the position elements that lie after the target element, are closest to it, and contain the other tag words, so as to generate a plurality of position element sequences;
respectively calculating the total distance between the label words in each position element sequence;
and setting the position element sequence with the minimum total distance as the first label combination.
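The three steps above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `first_label_combination` is hypothetical, the input is assumed to be already sorted by position, and the "total distance" of a sequence is assumed to be the span between its first and last position, which the text does not define precisely.

```python
def first_label_combination(first_sequence):
    """Return (position element sequence, total distance) with the smallest span.

    first_sequence: list of (tag_word, position) tuples sorted by position.
    Assumption: total distance = last position - first position.
    """
    all_tags = {tag for tag, _ in first_sequence}
    best = None
    for index, (tag, position) in enumerate(first_sequence):
        # The target element starts a new candidate sequence.
        combo = {tag: position}
        # Take the nearest later occurrence of every other tag word.
        for other_tag, other_position in first_sequence[index + 1:]:
            if other_tag not in combo:
                combo[other_tag] = other_position
            if len(combo) == len(all_tags):
                break
        if len(combo) < len(all_tags):
            continue  # some tag word never occurs after this target
        ordered = sorted(combo.items(), key=lambda item: item[1])
        total = ordered[-1][1] - ordered[0][1]
        if best is None or total < best[1]:
            best = (ordered, total)
    return best


sequence = [("C", 2), ("A", 5), ("B", 10), ("C", 12),
            ("A", 14), ("B", 23), ("A", 33), ("C", 50)]
result = first_label_combination(sequence)
```

Under these assumptions, the candidate starting at (B, 10) spans positions 10 to 14 and is the tightest one in this sample sequence.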
In one embodiment, the overall label matching score is obtained according to the following formula:
score = score_cover * (1 + t * score_close),
wherein score is the overall label matching score, score_cover is the label coverage score, score_close is the label compactness score, and t is a weight that is set based on the label coverage score.
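As a rough illustration of how this formula could feed the sorting and selection steps, the sketch below combines the two scores and keeps the highest-ranked documents. The names `overall_score` and `rank_documents` are hypothetical, the weight t is passed in as a plain constant although the text says it is set based on the coverage score, and "keep the top_k documents" is only one assumed form of the preset rule.

```python
def overall_score(score_cover, score_close, t):
    """score = score_cover * (1 + t * score_close)."""
    return score_cover * (1 + t * score_close)


def rank_documents(doc_scores, t=0.5, top_k=2):
    """Sort documents by overall score (descending) and keep the first top_k.

    doc_scores: dict mapping a document id to (score_cover, score_close).
    t and top_k are illustrative values, not taken from the source.
    """
    ranked = sorted(
        doc_scores,
        key=lambda doc: overall_score(*doc_scores[doc], t),
        reverse=True,
    )
    return ranked[:top_k]


scores = {"doc_a": (0.9, 0.0), "doc_b": (0.8, 1.0), "doc_c": (0.2, 1.0)}
matched = rank_documents(scores)
```

Because the compactness score only scales the coverage score, a document with zero coverage stays at zero regardless of how tightly its tag words are packed.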
In one embodiment, the label coverage score is obtained according to the following formula:
[Formula image BDA0003859767290000031: definition of the label coverage score score_cover in terms of n, num_query_tag and tag_i; the image is not reproduced in this text]
wherein n is the number of label words in the first label, and num_query_tag is the number of label words in the second label;
when the i-th label word in the first label is completely the same as any label word in the second label, tag_i = 1;
when the i-th label word in the first label is partially identical to any label word in the second label, tag_i = N, N ∈ (0, 1);
and when the i-th label word in the first label is not the same as any label word in the second label, tag_i = 0.
In one embodiment, the tag compactness score is obtained according to the following formula:
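[Formula image BDA0003859767290000032: definition of the tag compactness score score_close in terms of L, M and K; the image is not reproduced in this text]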
wherein L is a total distance of each label word in the first label combination, M is a first preset distance threshold, and K is a second preset distance threshold.
The present application further provides a document tag generating apparatus, including:
the collection module is used for collecting a search text input by a user and a clicked document name text corresponding to the search text;
the integration module is used for integrating the records that have the same search text but different corresponding clicked document name texts to obtain a first integration result, wherein the first integration result comprises the search text, the document name texts and the clicked times of the document name texts;
the first acquisition module is used for acquiring the longest common character string of the search text and each document name text according to the first integration result;
the second acquisition module is used for obtaining the longest common character string with the largest click frequency among the longest common character strings according to the longest common character strings and the clicked times;
a label candidate word setting module, configured to set the longest common character string with the largest click frequency as a label candidate word, where the number of the label candidate words is at least one;
and the document tag generation module is used for setting at least one of the tag candidate words as a document tag.
The present application further provides a document tag matching apparatus, including:
the search text acquisition module is used for acquiring a search text input by a user;
a first tag generation module, configured to generate a first tag for the search text based on a document tag library, where the document tag library is constructed and obtained based on the document tag generation method provided in the foregoing embodiment, and the first tag includes at least one tag word;
the second tag generation module is used for generating a second tag for each document based on the document tag library, wherein the documents are stored in the document library, a plurality of documents are stored in the document library and are searched by a user, and the second tag comprises at least one tag word;
the effective label generating module is used for matching the first label with the second label and setting the part of the first label, which is the same as the second label, as an effective label, wherein the effective label comprises at least one label word;
a tag coverage score obtaining module, configured to sequentially obtain a tag coverage score of each document based on the first tag and the second tag, where the tag coverage score is used to characterize a matching degree between the document content and the search text;
a compactness score obtaining module, configured to sequentially obtain a label compactness score of each document based on the valid label, where the label compactness score is used to represent a position proximity of the valid label content in the document content;
an overall label matching score obtaining module, configured to obtain an overall label matching score for each document according to the label coverage score and the label compactness score;
the sorting module is used for sorting the documents by overall label matching score to obtain a first sorting result;
and the matched document setting module is used for setting the documents that satisfy a preset rule as the documents matched with the search text according to the preset rule and the first sorting result.
The present application further provides a computer device, which includes a memory and a processor, where the memory stores a computer program, and the processor implements the steps of the document tag generation method and/or the document tag matching method provided in any of the above embodiments when executing the computer program.
The present application further provides a computer-readable storage medium, on which a computer program is stored, and the computer program, when executed by a processor, implements the steps in the document tag generation method and/or the document tag matching method provided in any of the above embodiments.
The document tag generation and matching method, the document tag generation and matching device and the computer equipment collect a search text input by a user and a clicked document name text corresponding to the search text; integrate the records that have the same search text but different corresponding clicked document name texts to obtain a first integration result, wherein the first integration result comprises the search text, the document name texts and the clicked times of the document name texts; obtain the longest common character string of the search text and each document name text according to the first integration result; obtain the longest common character string with the largest click frequency among the longest common character strings according to the longest common character strings and the clicked times; set the longest common character string with the largest click frequency as a label candidate word, wherein the number of the label candidate words is at least one; and set at least one of the label candidate words as a document tag. By automatically generating document tags and setting the longest common character string with the largest click frequency as a label candidate word, the creation of document tags is simplified and the document tags match the user's search intention more closely.
Drawings
FIG. 1 is a flowchart illustrating a document tag generation method according to an embodiment of the present application;
FIG. 2 is a flowchart illustrating a document tag matching method according to an embodiment of the present application;
FIG. 3 is a flowchart illustrating a document tag library generating method according to an embodiment of the present application;
FIG. 4 is a schematic structural diagram of a prefix tree according to an embodiment of the present application;
FIG. 5 is a flowchart illustrating step S206 in the document tag matching method according to another embodiment of the present application;
FIG. 6 is a flowchart illustrating step S2063 in the document tag matching method according to an embodiment of the present application;
FIG. 7 is a schematic structural diagram of a document tag generation apparatus according to an embodiment of the present application;
FIG. 8 is a schematic structural diagram of a document tag matching apparatus according to an embodiment of the present application;
FIG. 9 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
Referring to fig. 1, an embodiment of the present application provides a document tag generation method, which includes steps S101 to S106, and the detailed description of each step of the method is as follows.
In one embodiment, the document tag generation method comprises the following steps:
S101, collecting a search text input by a user and a clicked document name text corresponding to the search text;
S102, integrating the records that have the same search text but different corresponding clicked document name texts to obtain a first integration result, wherein the first integration result comprises the search text, the document name texts and the clicked times of the document name texts;
S103, obtaining the longest common character string of the search text and each document name text according to the first integration result;
S104, obtaining the longest common character string with the largest click frequency among the longest common character strings according to the longest common character strings and the clicked times;
S105, setting the longest common character string with the largest click frequency as a label candidate word, wherein the number of the label candidate words is at least one;
S106, setting at least one of the label candidate words as a document label.
As described in step S101, the search text input by the user in the search engine and the document name text clicked in the corresponding search results may be collected from the user's search log and click log (the click log records the documents opened from search results). To expand the sample data, the search and click logs over a period of time (e.g., one month) may be collected together.
As described in step S102, the records that have the same search text but different corresponding clicked document name texts are integrated to obtain a first integration result, which includes the search text, each distinct document name text, and the number of times each distinct document name text was clicked. For example, assume that the same search text "win10 blue screen" was entered multiple times in the log, that the clicked document name texts in the search results were "win10 blue screen what to do", "computer blue screen processing method", "newly purchased MAC computer blue screen" and "blue screen reinstallation system", and that these document name texts were clicked 10, 10, 1 and 2 times respectively. Integrating the log records then gives a first integration result, one form of which is shown in Table 1 below:
TABLE 1 first integration results
User-entered search text | Clicked document name text | Number of clicks
win10 blue screen | win10 blue screen what to do | 10
win10 blue screen | computer blue screen processing method | 10
win10 blue screen | newly purchased MAC computer blue screen | 1
win10 blue screen | blue screen reinstallation system | 2
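The integration of step S102 amounts to counting clicks per (search text, document name) pair. A minimal sketch is given below, assuming the raw click log is available as a list of pairs; the function name `integrate` and the sample log are hypothetical, and the lower-casing mirrors the letter-case normalization mentioned later in the description.

```python
from collections import Counter


def integrate(click_log):
    """Count how often each (search text, document name) pair was clicked.

    click_log: iterable of (search_text, clicked_document_name) pairs,
    one entry per click. Letters are lower-cased before counting.
    """
    counts = Counter()
    for search_text, doc_name in click_log:
        counts[(search_text.lower(), doc_name.lower())] += 1
    return counts


log = [
    ("Win10 blue screen", "blue screen reinstallation system"),
    ("win10 blue screen", "Blue screen reinstallation system"),
    ("win10 blue screen", "newly purchased MAC computer blue screen"),
]
first_integration_result = integrate(log)
```

Each key of the resulting counter corresponds to one row of a table like Table 1, and its value to the click count column.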
As described in the above steps S103-S106, the longest common character string of the user's search text and each document name text is obtained according to the first integration result; the longest common character string with the largest click frequency is obtained from the longest common character strings and the clicked times of the corresponding document name texts; the longest common character string with the largest click frequency is set as a label candidate word, wherein the number of the label candidate words is at least one; and at least one of the label candidate words is set as a document label.
This is illustrated with Table 2 below. In row 2 of Table 1 (counting the header as row 1), the longest common character string of the user's search text and the document name text is "win10 blue screen", and its clicked frequency is 10. In row 3, the longest common character string is "blue screen", with a clicked frequency of 10; in row 4 it is also "blue screen", with a clicked frequency of 1; and in row 5 it is also "blue screen", with a clicked frequency of 2. The total clicked frequency of the longest common character string "blue screen" is therefore 10 + 1 + 2 = 13, so "blue screen" is set as a label candidate word. In some other embodiments, there may be longest common character strings that have the same clicked frequency but different text contents, for example "win10 blue screen" and "blue screen" each with a clicked frequency of 10; in that case both "win10 blue screen" and "blue screen" may be set as label candidate words for the document name texts in the integration result. In practical applications, the number of label candidate words is not limited to one or two, and may be more than two, which is not limited herein.
TABLE 2 first integration results post-processing
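[Table image BDA0003859767290000081: Table 2, the first integration result with the longest common character string and clicked frequency of each row; the image is not reproduced in this text]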
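Steps S103 to S105 can be sketched as follows, using the four rows of Table 1 (document names in an illustrative English rendering). This is a sketch under stated assumptions rather than the patented implementation: the function names are hypothetical, the longest common character string is computed as the longest common contiguous substring by dynamic programming, and surrounding whitespace is stripped because English text, unlike the original Chinese, separates words with spaces.

```python
from collections import defaultdict


def longest_common_substring(a, b):
    """Longest contiguous substring shared by a and b (first one on ties)."""
    best_len, best_end = 0, 0
    previous = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                current[j] = previous[j - 1] + 1
                if current[j] > best_len:
                    best_len, best_end = current[j], i
        previous = current
    return a[best_end - best_len:best_end]


def tag_candidates(rows):
    """rows: (search_text, document_name, clicks). Returns the common
    string(s) with the largest summed click count for each search text."""
    totals = defaultdict(lambda: defaultdict(int))
    for search_text, doc_name, clicks in rows:
        query = search_text.lower()
        common = longest_common_substring(query, doc_name.lower()).strip()
        if common:
            totals[query][common] += clicks
    candidates = {}
    for query, by_string in totals.items():
        top = max(by_string.values())
        candidates[query] = sorted(s for s, c in by_string.items() if c == top)
    return candidates


table_1 = [
    ("win10 blue screen", "win10 blue screen what to do", 10),
    ("win10 blue screen", "computer blue screen processing method", 10),
    ("win10 blue screen", "newly purchased MAC computer blue screen", 1),
    ("win10 blue screen", "blue screen reinstallation system", 2),
]
result = tag_candidates(table_1)
```

With these rows, "win10 blue screen" accumulates 10 clicks and "blue screen" accumulates 10 + 1 + 2 = 13, so "blue screen" is the only candidate returned; strings tied for the maximum would all be returned.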
With this approach, many related tag words can be expected in practical applications. To remove redundancy and keep the document tag words specific (for example, function words such as "of" and "ground" might otherwise be defined as document tag words even though they carry no specific meaning), some embodiments eliminate document tag words whose length does not meet preset requirements (for example, the length must not be 1 and must not exceed 10). In addition, if a long document tag word can be composed of shorter document tag words, the long tag word may be eliminated and the short tag words retained.
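One way to read these two filtering rules is sketched below. The function names are hypothetical, the length bounds follow the example in the text (length not 1 and not greater than 10), and "composed of short tag words" is assumed to mean an exact concatenation of shorter tag words, which the source does not spell out.

```python
def can_compose(word, parts):
    """True if word is an exact concatenation of strings taken from parts."""
    reachable = [True] + [False] * len(word)
    for end in range(1, len(word) + 1):
        for part in parts:
            start = end - len(part)
            if start >= 0 and reachable[start] and word[start:end] == part:
                reachable[end] = True
                break
    return reachable[len(word)]


def filter_tags(tags, min_len=2, max_len=10):
    """Drop tags of unsuitable length, then drop long tags that can be
    built from shorter surviving tags (assumed interpretation)."""
    kept = {tag for tag in tags if min_len <= len(tag) <= max_len}
    result = set()
    for tag in kept:
        shorter = [other for other in kept if len(other) < len(tag)]
        if not can_compose(tag, shorter):
            result.add(tag)
    return result


filtered = filter_tags({"a", "blue", "screen", "bluescreen", "averyveryverylongtag"})
```

Here "a" and the 20-character tag fail the length check, and "bluescreen" is dropped because it is "blue" followed by "screen".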
In some embodiments, when the user's search text and the clicked document name text contain letters or strings of letters, the letters in all texts are converted to a uniform format, for example all lower case or all upper case, to avoid text matching errors (upper-case and lower-case letters look different but generally express the same meaning in the text).
Referring to fig. 2, an embodiment of the present application further provides a document tag matching method, which includes steps S201 to S209, and details of each step of the method are described as follows.
In one embodiment, the document tag matching method comprises the following steps:
S201, acquiring a search text input by a user;
S202, generating a first tag for the search text based on a document tag library, wherein the document tag library is constructed and obtained based on the document tag generation method provided by the embodiment, and the first tag comprises at least one tag word;
S203, generating a second tag for each document based on the document tag library, wherein the documents are stored in the document library, a plurality of documents are stored in the document library and are searched by a user, and the second tag comprises at least one tag word;
S204, matching the first label with the second label, and setting the part of the first label that is the same as the second label as an effective label, wherein the effective label comprises at least one label word;
S205, sequentially obtaining a label coverage score of each document based on the first label and the second label, wherein the label coverage score is used for representing the matching degree of the document content and the search text;
S206, sequentially obtaining a tag compactness score of each document based on the effective tags, wherein the tag compactness score is used for representing the position closeness of the effective tag content in the document content;
S207, obtaining an overall label matching score of each document according to the label coverage score and the label compactness score;
S208, sorting the documents by overall label matching score to obtain a first sorting result;
S209, setting the documents that satisfy a preset rule as the documents matched with the search text according to the preset rule and the first sorting result.
As described in the above steps S201-S204, when it is detected that the user inputs a search text in the search engine, the search text input by the user is acquired; a first label is generated for the search text based on a document label library generated in advance; a second label is generated for each document in a document library (such as a Baidu library, a known cybernetics library and the like) based on the pre-generated document label library, wherein the document library stores a plurality of documents for the user to search; and the first label is matched with the second label, and the part of the first label that is the same as the second label is set as an effective label, wherein the first label, the second label and the effective label each comprise at least one label word.
Exemplarily, when the search text input by the user is "how do the windows blue screen?", a first label ("windows", "win", "blue screen") is generated for the search text based on the pre-generated document label library. If the content of one document in the document library is "win10 computer blue screen reinstallation system ……", a second label ("win", "win10", "computer", "blue screen", "reinstallation system") is generated for that document based on the pre-generated document label library, and a second label is generated for every other document in the document library in the same way. The first label ("windows", "win", "blue screen") and the second label ("win", "win10", "computer", "blue screen", "reinstallation system") are then matched, and the part of the first label that is the same as the second label is set as the effective label, wherein the first label, the second label and the effective label each include at least one label word. In this embodiment, the effective label is ("win", "blue screen").
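The example above can be reproduced with a short sketch. It assumes that a label is generated by collecting every library tag word that occurs as a substring of the lower-cased text, which is one plausible reading of "generating a label based on the document label library"; the function names and the small library are hypothetical.

```python
def generate_label(text, tag_library):
    """Collect every tag word from the library that occurs in the text."""
    lowered = text.lower()
    return {tag for tag in tag_library if tag in lowered}


def effective_label(first_label, second_label):
    """Tag words present in both the first and the second label."""
    return first_label & second_label


library = {"windows", "win", "win10", "computer",
           "blue screen", "reinstallation system"}

first = generate_label("how do the windows blue screen?", library)
second = generate_label("win10 computer blue screen reinstallation system", library)
effective = effective_label(first, second)
```

Sets are used because the order of tag words plays no role at this stage; positions only matter later, for the compactness score.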
In some embodiments, referring to fig. 3, the method for generating the document tag library includes:
S301, generating a plurality of document tags based on a plurality of user search texts and the document tag generation method provided by the embodiment;
S302, generating a document tag library based on the plurality of document tags.
As described in steps S301 to S302, to make it easier to generate tags for documents automatically and to simplify the matching of document tags, a document tag library may be generated in advance from a large amount of sample data and called in actual application, thereby improving efficiency. Specifically, a plurality of document tags are generated based on a plurality of user search texts (i.e. different search texts input by the user, the clicked document name texts corresponding to the search results, and the clicked times) and the document tag generation method provided by the above embodiment; and a document tag library is generated based on the plurality of document tags.
In order to improve the efficiency of statistics and search of character strings, a prefix tree technology can be introduced in the database generation process, a prefix tree is constructed by using a plurality of document tags, and the prefix tree constructed by the plurality of document tags is set as a document tag library (the prefix tree includes all document tags). The prefix tree is also known as a dictionary tree, a word search tree and a Trie tree, is a multi-path tree structure, is a variant of a hash tree and is a multi-branch tree structure for quick retrieval. Its typical application is for counting and ordering a large number of character strings (but not limited to character strings), so it is often used by search engine systems for text word frequency statistics, and its advantages are: the unnecessary character string comparison is reduced to the maximum extent, and the query efficiency is high. Illustratively, referring to FIG. 4, when there is a set of document tags: inn, int, at, age, adv, ant, ate, a prefix tree as shown in fig. 4 may be constructed from the set of document tags.
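A prefix tree over the tag set of FIG. 4 can be sketched with nested dictionaries as below. This is an illustrative data structure only; the class name `TagTrie`, its method names and the end-of-word marker are assumptions and do not come from the source.

```python
END = "$end"  # multi-character marker, cannot collide with a single character key


class TagTrie:
    """Minimal prefix tree holding document tags."""

    def __init__(self, tags=()):
        self.root = {}
        for tag in tags:
            self.insert(tag)

    def insert(self, tag):
        node = self.root
        for char in tag:
            node = node.setdefault(char, {})
        node[END] = True

    def _walk(self, prefix):
        """Follow prefix from the root; return the node reached or None."""
        node = self.root
        for char in prefix:
            if char not in node:
                return None
            node = node[char]
        return node

    def contains(self, tag):
        node = self._walk(tag)
        return node is not None and END in node

    def with_prefix(self, prefix):
        """All stored tags that start with prefix, in sorted order."""
        start = self._walk(prefix)
        if start is None:
            return []
        found, stack = [], [(start, prefix)]
        while stack:
            node, text = stack.pop()
            for key, child in node.items():
                if key == END:
                    found.append(text)
                else:
                    stack.append((child, text + key))
        return sorted(found)


trie = TagTrie(["inn", "int", "at", "age", "adv", "ant", "ate"])
```

Tags that share a prefix, such as "at" and "ate", share the same path from the root, so each lookup costs time proportional to the tag length rather than to the number of stored tags.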
As described in step S205, a tag coverage score of each document in the document library is sequentially obtained based on the obtained first tag and the obtained second tag, where the tag coverage score is used to represent the matching degree between the content of the document and the search text, and a higher tag coverage score indicates a higher matching degree between the content of the document and the search text. In some embodiments, the coverage score described above may be obtained by the following formula:
[Formula image BDA0003859767290000111: definition of the label coverage score score_cover in terms of n, num_query_tag and tag_i; the image is not reproduced in this text]
wherein n is the number of label words in the first label, and num_query_tag is the number of label words in the second label;
when the i-th label word in the first label is completely the same as any label word in the second label, tag_i = 1;
when the i-th label word in the first label is partially identical to any label word in the second label, tag_i = N, N ∈ (0, 1);
and when the i-th label word in the first label is not the same as any label word in the second label, tag_i = 0.
In the present embodiment, N may be 0.7. The label coverage score is calculated below for the example in which the first label is ("windows", "win", "blue screen") and the second label is ("win", "win10", "computer", "blue screen", "reinstallation system"):
[Formula image BDA0003859767290000112: worked calculation of the label coverage score for this example; the image is not reproduced in this text]
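Since the formula images are not available, the sketch below only illustrates the parts the text does define, plus two loudly labeled assumptions. The per-word values tag_i follow the three cases given above with N = 0.7. The assumptions are: "partially identical" is taken to mean that one tag word is a substring of the other, and the score is taken to be the sum of the tag_i divided by num_query_tag. Neither is confirmed by the source, and the function names are hypothetical.

```python
def tag_values(first_label, second_label, partial_weight=0.7):
    """tag_i for each tag word of the first label: 1, N or 0."""
    values = []
    for word in first_label:
        if word in second_label:
            values.append(1.0)
        elif any(word in other or other in word for other in second_label):
            # Assumption: partial identity = substring containment.
            values.append(partial_weight)
        else:
            values.append(0.0)
    return values


def coverage_score(first_label, second_label, partial_weight=0.7):
    """Assumed form: sum of tag_i divided by num_query_tag."""
    num_query_tag = len(second_label)  # as defined in the text
    if num_query_tag == 0:
        return 0.0
    return sum(tag_values(first_label, second_label, partial_weight)) / num_query_tag


first = ["windows", "win", "blue screen"]
second = ["win", "win10", "computer", "blue screen", "reinstallation system"]
values = tag_values(first, second)
score = coverage_score(first, second)
```

Under these assumptions "windows" counts as a partial match of "win", while "win" and "blue screen" are exact matches; the numeric score depends on the assumed normalization and may differ from the value in the original figure.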
As described in step S206 above, the tag compactness score of each document in the document library is obtained in turn based on the valid tags. The tag compactness score represents how close together the contents of the valid tags are located in the document content: the higher the tag compactness score, the closer the tag words of the valid tags are to one another, i.e. the better the valid tags reflect the user's real search intention.
In some embodiments, referring to fig. 5, the step of sequentially obtaining a tag compactness score of each document based on the valid tags includes:
S2061, generating position elements according to the positions of all the label words in the effective labels in the documents, wherein the position elements comprise the label words and the position information of the label words;
S2062, arranging the position elements in sequence to generate a first sequence;
S2063, acquiring a first label combination based on the first sequence, wherein the first label combination comprises all label words in the effective labels, and the position distances among all the label words are the closest in the document;
S2064, obtaining the label compactness score of the document according to the first label combination.
According to the steps S2061-S2064, generating a position element according to the positions of all the label words in the effective labels in the document content in each document, wherein the position element comprises the label words and the position information of the label words; arranging the position elements in sequence to generate a first sequence; and acquiring a first label combination based on the first sequence, wherein the first label combination comprises all label words in the effective labels, and the position distances among all the label words in the document are the closest.
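Steps S2061 and S2062 can be sketched in Python as follows. This is a minimal illustration and not the implementation of the application: the function name `position_elements` is hypothetical, and positions are assumed to be 1-based character offsets of the first character of each occurrence.

```python
def position_elements(content, valid_tag_words):
    """Build (tag word, position) elements and sort them by position."""
    elements = []
    for word in valid_tag_words:
        start = content.find(word)
        while start != -1:
            elements.append((word, start + 1))  # assumed 1-based position
            start = content.find(word, start + 1)
    # Arranging the position elements in order yields the first sequence.
    return sorted(elements, key=lambda element: element[1])


# Placeholder content: lower-case "x" stands for any non-tag character.
first_sequence = position_elements("xCxxAxxxxBxC", ["A", "B", "C"])
print(first_sequence)  # [('C', 2), ('A', 5), ('B', 10), ('C', 12)]
```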
Illustratively, suppose there is a document "document X" whose valid tag includes 3 tag words, {A, B, C}. Position elements may then be generated according to the positions of the 3 tag words in the document content; for example, (C, 2) indicates that the tag word "C" is located at the second character of the document content. Assuming that the position elements corresponding to the 3 tag words have been obtained, the first sequence generated after the position elements are arranged in order is:
[(C,2),(A,5),(B,10),(C,12),(A,14),(B,23),(A,33),(C,50)]。
Then, label combinations such as [(C, 2), (A, 5), (B, 10)], [(A, 5), (B, 10), (C, 12)], etc. each contain all the label words in the valid label. After all such label combinations are found, the label combination with the closest position distance between the label words can be selected as the first label combination. In some embodiments, referring to fig. 6, the step of obtaining a first tag combination based on the first sequence includes:
S2063a, sequentially setting each position element in the first sequence as a target element, acquiring position elements which are behind the position of the target element and have the closest distance with the target element and contain other label words, and generating a plurality of position element sequences;
S2063b, respectively calculating the total distance of each label word in each position element sequence;
S2063c, setting the position element sequence with the minimum total distance as the first label combination.
As described above in steps S2063a to S2063c, it is assumed that the first sequence is:
[(C, 2), (A, 5), (B, 10), (C, 12), (A, 14), (B, 23), (A, 33), (C, 50)]. Each position element in the first sequence is set as the target element in turn, and for each of the other tag words in the valid tag, the position element that follows the target element and is closest to it is acquired, generating a plurality of position element sequences. In this embodiment, these are: [(C, 2), (A, 5), (B, 10)], [(A, 5), (B, 10), (C, 12)], [(B, 10), (C, 12), (A, 14)], [(C, 12), (A, 14), (B, 23)], [(A, 14), (B, 23), (C, 50)], [(B, 23), (A, 33), (C, 50)].
After all the position element sequences are found, the total distance between the tag words in each position element sequence is calculated. For example, in the position element sequence [(C, 2), (A, 5), (B, 10)], the tag word C is at position 2, the tag word A is at position 5, and the tag word B is at position 10; the distance between C and A is 3 characters and the distance between A and B is 5 characters, so the total distance for [(C, 2), (A, 5), (B, 10)] is 8 characters. The total distances of the remaining position element sequences are calculated in the same manner, from which the position element sequence with the highest compactness (i.e., the smallest total distance) is found to be [(B, 10), (C, 12), (A, 14)]. This position element sequence is set as the first tag combination, which includes all tag words in the valid tag and in which the positional distances between the tag words are the closest in the document.
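Steps S2063a to S2063c can be sketched in Python as follows. This is an illustrative reading of the procedure, not the implementation of the application; the function name `first_tag_combination` is hypothetical, and the total distance is assumed to be the sum of the gaps between consecutive elements of a candidate sequence, as in the worked example.

```python
def first_tag_combination(first_sequence, tag_words):
    """Return (combination, total distance) with the smallest total distance."""
    best, best_total = None, None
    for index, target in enumerate(first_sequence):
        combo, seen = [target], {target[0]}
        # Take the nearest later occurrence of every other tag word.
        for word, pos in first_sequence[index + 1:]:
            if word not in seen:
                combo.append((word, pos))
                seen.add(word)
            if len(seen) == len(tag_words):
                break
        if len(seen) < len(tag_words):
            continue  # not every tag word occurs after this target element
        total = sum(b[1] - a[1] for a, b in zip(combo, combo[1:]))
        if best_total is None or total < best_total:
            best, best_total = combo, total
    return best, best_total


sequence = [("C", 2), ("A", 5), ("B", 10), ("C", 12),
            ("A", 14), ("B", 23), ("A", 33), ("C", 50)]
print(first_tag_combination(sequence, {"A", "B", "C"}))
# ([('B', 10), ('C', 12), ('A', 14)], 4)
```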
In some embodiments, the tag compactness score described above may be obtained according to the following formula:
Figure BDA0003859767290000131
wherein, L is a total distance of each label word in the first label combination, M is a first preset distance threshold, and K is a second preset distance threshold.
In this embodiment, M may be 5 and K may be 20. That is, when the total distance L between the tag words in the first tag combination is less than 5, the tag compactness score score_close is 1; when L is greater than 20, score_close is 0; and when L is between 5 and 20, score_close is 1/L. It should be noted that, in other embodiments, the values of M and K may be set according to actual design requirements and are not limited herein.
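The piecewise rule described above can be written as a short Python sketch. The function and parameter names are hypothetical, the defaults follow the example values M = 5 and K = 20, and treating the boundary values 5 and 20 as belonging to the 1/L range is an assumption, since the text only says "between 5 and 20".

```python
def compactness_score(total_distance, m=5, k=20):
    """Tag compactness score for total distance L (assumes m > 0)."""
    if total_distance < m:
        return 1.0  # tag words are very close together
    if total_distance > k:
        return 0.0  # tag words are too far apart
    return 1.0 / total_distance  # M <= L <= K


print(compactness_score(4))   # 1.0
print(compactness_score(10))  # 0.1
print(compactness_score(36))  # 0.0
```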
As described in step S207 above, the overall tag matching score of each document is obtained based on the tag coverage score and the tag compactness score obtained by the above calculation. In some embodiments, the overall label matching score may be obtained according to the following formula:
score=score_cover*(1+t*score_close),
wherein score is the overall label matching score, score_cover is the label coverage score, score_close is the label compactness score, and t is a weight that is set based on the label coverage score. For example, when score_cover is greater than 0.9, t is 1, and when score_cover is less than or equal to 0.9, t is 0. In other embodiments, the value of t may also be set according to actual design requirements, and is not limited herein.
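The formula and the example weighting rule can be sketched in Python as follows. The function name and the `threshold` parameter are hypothetical; the value 0.9 is the example threshold given above.

```python
def overall_score(score_cover, score_close, threshold=0.9):
    """score = score_cover * (1 + t * score_close), with t set from score_cover."""
    t = 1 if score_cover > threshold else 0
    return score_cover * (1 + t * score_close)


print(overall_score(1.0, 0.25))  # 1.25 (coverage above the threshold, t = 1)
print(overall_score(0.54, 1.0))  # 0.54 (coverage at or below the threshold, t = 0)
```

With this rule the compactness score only raises the overall score of documents whose coverage is already high.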
As described in the above steps S208 to S209, the overall tag matching scores of all documents in the document library are calculated in the manner described above, and the documents are then sorted by overall tag matching score in descending order to obtain the first ranking result. For example, the document library includes document A, document B, document C and document D, and their overall tag matching scores are ordered as follows: document A < document B < document C < document D. According to a preset rule and the first ranking result, the documents meeting the preset rule are set as the documents matched with the search text currently input by the user. Illustratively, when the preset rule is to select the documents whose overall tag matching scores rank in the top three in the document library, document B, document C and document D are selected in this embodiment as the documents matched with the search text currently input by the user, for the user to search.
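The sorting and selection in steps S208 to S209 can be sketched in Python as follows, assuming a "top three" preset rule. The function name `select_matches` and the score values are hypothetical and chosen only to reproduce the ordering document A < document B < document C < document D.

```python
def select_matches(scores, top_n=3):
    """Sort documents by overall tag matching score (descending), keep top_n."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:top_n]]


# Hypothetical overall tag matching scores.
scores = {"document A": 0.2, "document B": 0.54,
          "document C": 1.1, "document D": 1.8}
print(select_matches(scores))  # ['document D', 'document C', 'document B']
```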
By combining the label coverage score and the label compactness score into an overall label matching score, using that score as the criterion for judging how well each document matches the user's search text, and selecting the documents whose overall label matching score meets the preset rule as the documents matched with the search text currently input by the user, the matching degree between the document labels, the document content, and the user's search text (i.e., the real search intention) can be improved.
The document tag generation and document tag matching method includes: collecting a search text input by a user and a clicked document name text corresponding to the search text; integrating records that have the same search text but different clicked document name texts to obtain a first integration result, wherein the first integration result comprises the search text, each document name text and the clicked times of each document name text; obtaining the longest common character string of the search text and each document name text according to the first integration result; obtaining, according to the longest common character strings and the clicked times, the longest common character string with the largest click frequency; setting the longest common character string with the largest click frequency as a tag candidate word, wherein the number of tag candidate words is at least one; and setting at least one of the tag candidate words as a document tag. By automatically generating the document tag and setting the longest common character string with the largest click frequency as a tag candidate word, the creation of document tags is simplified and the match between the user's search intention and the document tags is improved.
Referring to fig. 7, an embodiment of the present application further provides a document tag generating apparatus, including:
a collecting module 701, configured to collect a search text input by a user and a clicked document name text corresponding to the search text;
an integration module 702, configured to integrate records that are the same in the search texts but different in the document name texts clicked correspondingly to obtain a first integration result, where the first integration result includes the search texts, each of the document name texts, and the clicked times of each of the document name texts;
a first obtaining module 703, configured to obtain the longest common character string of the search text and each of the document name texts according to the first integration result;
a second obtaining module 704, configured to obtain, according to the longest common character strings and the clicked times, the longest common character string with the largest click frequency;
a tag candidate word setting module 705, configured to set the longest common character string with the largest click frequency as a tag candidate word, where the number of the tag candidate words is at least one;
and a document tag generating module 706, configured to set at least one of the tag candidate words as a document tag.
In this embodiment, the collecting module 701 may collect, from the user's search log and click log (the click log records the documents opened by clicking on search results), the search text input by the user in the search engine and the clicked document name text corresponding to that search. To expand the sample data, the search and click logs over a period of time (e.g., one month) may be collected together.
The integration module 702 integrates the records that have the same search text but different clicked document name texts to obtain a first integration result, where the first integration result includes the search text, the different document name texts, and the clicked times of each document name text. For example, assume that the same search text entered multiple times in the log records is "win10 blue screen", that the clicked document name texts corresponding to the search results are "what win10 blue screen is", "processing method of computer blue screen", "newly purchased MAC computer blue screen", and "blue screen reinstallation system", and that the corresponding clicked times are "10", "1", and "2", respectively; integrating these log records yields the first integration result.
The first obtaining module 703 obtains the longest common character string of the user's search text and each document name text according to the first integration result; the second obtaining module 704 obtains, according to the longest common character strings and the clicked times of the corresponding document name texts, the longest common character string with the largest click frequency; the tag candidate word setting module 705 sets the longest common character string with the largest click frequency as a tag candidate word, where the number of tag candidate words is at least one; and the document tag generation module 706 sets at least one of the tag candidate words as a document tag. Continuing the above example: the longest common character strings "win10 blue screen" and "blue screen" are clicked 10 times and 13 times respectively, so "blue screen" is set as the tag candidate word. In some other embodiments, there may be longest common character strings that have the same click frequency but different text contents. For example, if the longest common character strings are "win10 blue screen" and "blue screen", both are clicked "10" times, and "10" is the largest click frequency in the integration result, then both "win10 blue screen" and "blue screen" are set as tag candidate words. In practical applications, either one of "win10 blue screen" and "blue screen" may be set as a document tag, or both may be set as document tags. The number of tag candidate words is not limited to one or two and may be more than two, which is not limited herein.
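The candidate selection performed by modules 703 to 705 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the application: the longest common character string is taken to be the longest common contiguous substring, surrounding whitespace is stripped, and texts are lower-cased before comparison. The function names are hypothetical, and the click counts in the usage example are illustrative values chosen so that the totals come to 10 and 13 as in the example above.

```python
from collections import Counter


def longest_common_substring(a, b):
    """Longest contiguous substring shared by a and b (dynamic programming)."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:
                    best_len, best_end = cur[j], i
        prev = cur
    return a[best_end - best_len:best_end]


def tag_candidates(search_text, clicked_names):
    """Return the common string(s) with the largest accumulated click count."""
    clicks = Counter()
    for name, count in clicked_names.items():
        common = longest_common_substring(search_text.lower(), name.lower()).strip()
        if common:
            clicks[common] += count
    if not clicks:
        return []
    top = max(clicks.values())
    return [text for text, count in clicks.items() if count == top]


# Illustrative click counts (assumption, see the lead-in).
records = {
    "what win10 blue screen is": 10,
    "processing method of computer blue screen": 1,
    "newly purchased MAC computer blue screen": 10,
    "blue screen reinstallation system": 2,
}
print(tag_candidates("win10 blue screen", records))  # ['blue screen']
```

When several common strings tie for the largest click count, all of them are returned as tag candidate words.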
In practical applications, a plurality of related tag words may be obtained in the above manner. To remove redundancy and ensure the specificity of the document tag words (for example, prepositions such as "of" and "ground" may also be set as document tag words although they carry no specific meaning), in some embodiments document tag words whose length does not meet preset requirements (for example, the length must not be 1 and must not exceed 10) may be eliminated. In addition, if a long document tag word can be composed of short document tag words, the long document tag word may also be eliminated, so that only the short tag words are retained and duplication is avoided.
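The two elimination rules can be sketched in Python as follows. This is one possible reading, not the implementation of the application: "composed of short document tag words" is assumed to mean that the long word is an exact concatenation of shorter retained tag words, and the function names and length limits (minimum 2, maximum 10) are illustrative.

```python
def can_compose(word, parts):
    """True if word is an exact concatenation of strings taken from parts."""
    reachable = [True] + [False] * len(word)
    for end in range(1, len(word) + 1):
        for part in parts:
            start = end - len(part)
            if start >= 0 and reachable[start] and word[start:end] == part:
                reachable[end] = True
                break
    return reachable[len(word)]


def prune_tags(tags, min_len=2, max_len=10):
    """Drop tags of unsuitable length and tags built from shorter tags."""
    kept = [t for t in dict.fromkeys(tags) if min_len <= len(t) <= max_len]
    result = []
    for tag in kept:
        shorter = [t for t in kept if len(t) < len(tag)]
        if not can_compose(tag, shorter):
            result.append(tag)
    return result


print(prune_tags(["win", "10", "win10", "x", "averyveryverylongtag"]))
# ['win', '10']
```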
In some embodiments, when the user's search text and the clicked document name text contain letters or character strings composed of letters, the letters in all texts are converted to a unified format, for example all lower case or all upper case, to avoid errors in text matching (although upper-case and lower-case letters look different, they generally express the same meaning in the text).
Referring to fig. 8, an embodiment of the present application further provides a document tag matching apparatus, including:
a search text acquisition module 801, configured to acquire a search text input by a user;
a first tag generation module 802, configured to generate a first tag for the search text based on a document tag library, where the document tag library is constructed and obtained based on the document tag generation method provided in the foregoing embodiment, and the first tag includes at least one tag word;
a second tag generating module 803, configured to generate a second tag for each document based on the document tag library, where the document is stored in a document library, a plurality of documents are stored in the document library and are searched by a user, and the second tag includes at least one tag word;
an effective tag generation module 804, configured to match the first tag with the second tag, and set a part of the first tag that is the same as the second tag as an effective tag, where the effective tag includes at least one tag word;
a tag coverage score obtaining module 805, configured to sequentially obtain a tag coverage score of each document based on the first tag and the second tag, where the tag coverage score is used to represent a matching degree between the document content and the search text;
a compactness score obtaining module 806, configured to sequentially obtain a tag compactness score for each of the documents based on the valid tags, where the tag compactness score is used to represent a position proximity of the tag content in the document content;
an overall tag matching score obtaining module 807 for obtaining an overall tag matching score for each of the documents according to the tag coverage score and the tag compactness score;
a sorting module 808, configured to sort the documents by the overall tag matching score to obtain a first sorting result;
and the matched document setting module 809 is configured to set the document meeting the preset rule as a document matched with the search text according to a preset rule and the first ranking result.
In this embodiment, when it is detected that the user inputs a search text in the search engine, the search text is acquired through the search text acquisition module 801; the first tag generation module 802 generates a first tag for the search text based on a document tag library generated in advance; the second tag generating module 803 generates a second tag for each document in a document library (such as Baidu Wenku, the CNKI library, etc.) based on the pre-generated document tag library, where the document library stores a plurality of documents for users to search; the valid tag generating module 804 matches the first tag with the second tag and sets the part of the first tag that is the same as the second tag as a valid tag, where the first tag, the second tag, and the valid tag each include at least one tag word.
Exemplarily, when the search text input by the user is "how to deal with a windows blue screen?", a first label ("windows", "win", "blue screen") is generated for the search text based on the document label library generated in advance. If the content of one document in the document library is "win10 computer blue screen reinstallation system ……", a second label ("win", "win10", "computer", "blue screen", "reinstallation system") is generated for that document based on the pre-generated document label library, and a second label is generated for every other document in the document library in the same way. The first label ("windows", "win", "blue screen") is then matched with the second label ("win", "win10", "computer", "blue screen", "reinstallation system"), and the part of the first label that is the same as the second label is set as the effective label, where the first label, the second label and the effective label each include at least one label word. In this embodiment, the effective label is ("win", "blue screen").
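The example above can be reproduced with a short Python sketch. It assumes that a label is generated by collecting every label word of the library that occurs in the text as a substring, which is one simple reading of the description; the function names and the contents of `tag_library` are hypothetical.

```python
def generate_tag(text, tag_library):
    """Collect every library label word that occurs in the text."""
    lowered = text.lower()
    return [word for word in tag_library if word in lowered]


def valid_tag(first_tag, second_tag):
    """Keep the label words of the first label that also appear in the second."""
    return [word for word in first_tag if word in second_tag]


# Hypothetical document label library.
tag_library = ["windows", "win", "win10", "computer",
               "blue screen", "reinstallation system"]

first = generate_tag("How to deal with a windows blue screen?", tag_library)
second = generate_tag("win10 computer blue screen reinstallation system", tag_library)
print(first)                     # ['windows', 'win', 'blue screen']
print(second)                    # ['win', 'win10', 'computer', 'blue screen', 'reinstallation system']
print(valid_tag(first, second))  # ['win', 'blue screen']
```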
In this embodiment, in order to facilitate subsequent automatic generation of a tag for a document and simplification of a matching process of a document tag, a document tag library may be generated based on a large amount of sample data for calling in actual application, so as to improve efficiency. Specifically, a plurality of document tags are generated based on a plurality of user search texts (that is, different search texts input by users, clicked document name texts corresponding to search results and clicked times) and the document tag generation method provided by the above embodiment; and generating a document tag library based on the plurality of document tags.
In some embodiments, in order to improve the statistics and search efficiency of the character string, in the document tag library generation process, a prefix tree technique may be introduced, where a prefix tree is constructed by using a plurality of document tags, and the prefix tree constructed by using the plurality of document tags is set as the document tag library (the prefix tree includes all document tags). The prefix tree, also known as dictionary tree, word search tree and Trie tree, is a multipath tree structure, is a variety of hash tree and is a multi-branch tree structure for quick retrieval. Its typical application is for counting and ordering a large number of character strings (but not limited to character strings), so it is often used by search engine systems for text word frequency statistics, and its advantages are: the unnecessary character string comparison is reduced to the maximum extent, and the query efficiency is high.
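A prefix tree used as a document tag library can be sketched in Python as follows. This is a minimal illustration, not the data structure of the application: the class name `TagTrie`, its methods, and the nested-dictionary representation are assumptions.

```python
_END = ""  # marker key; it cannot collide with a one-character edge label


class TagTrie:
    """Prefix tree holding document tags, stored as nested dictionaries."""

    def __init__(self, tags=()):
        self.root = {}
        for tag in tags:
            self.insert(tag)

    def insert(self, tag):
        node = self.root
        for ch in tag:
            node = node.setdefault(ch, {})
        node[_END] = True  # a complete tag ends at this node

    def find_tags(self, text):
        """Return every stored tag occurring in text, in order of discovery."""
        found = []
        for start in range(len(text)):
            node = self.root
            for end in range(start, len(text)):
                node = node.get(text[end])
                if node is None:
                    break  # no stored tag continues with this character
                if _END in node and text[start:end + 1] not in found:
                    found.append(text[start:end + 1])
        return found


trie = TagTrie(["windows", "win", "win10", "computer",
                "blue screen", "reinstallation system"])
print(trie.find_tags("how to fix windows blue screen"))
# ['win', 'windows', 'blue screen']
```

Each scan from a given start position stops as soon as no stored tag shares the current prefix, which is where the saving in string comparisons comes from.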
In this embodiment, the tag coverage score obtaining module 805 obtains a tag coverage score of each document in the document library in turn based on the obtained first tag and second tag, where the tag coverage score represents the matching degree between the content of the document and the search text, and a higher tag coverage score indicates a higher matching degree. The compactness score obtaining module 806 obtains a tag compactness score of each document in the document library in turn based on the valid tag, where the tag compactness score represents how close together the tag words of the valid tag are located in the document content; the higher the tag compactness score, the closer the tag words of the valid tag are to one another, that is, the better the valid tag matches the user's real search intention. The overall tag matching score obtaining module 807 then obtains the overall tag matching score of each document in the document library according to the tag coverage score and the tag compactness score obtained above. The sorting module 808 then sorts the documents by overall tag matching score in descending order to obtain a first sorting result; for example, the document library includes document A, document B, document C, and document D, and their overall tag matching scores are ordered as follows: document A < document B < document C < document D. Finally, the matched document setting module 809 sets, according to a preset rule and the first sorting result, the documents meeting the preset rule as the documents matched with the search text currently input by the user.
Illustratively, when the preset rule is to select the documents whose overall tag matching scores rank in the top three in the document library as the documents matched with the search text currently input by the user, document B, document C and document D are selected in this embodiment for the user to search.
It can be understood that each component of the document tag generation apparatus and the document tag matching apparatus provided in the present application may respectively implement the functions of any one of the document tag generation method, the document tag library generation method, and the document tag matching method provided in any one of the above embodiments, and the specific structures are not described in detail again.
Referring to fig. 9, an embodiment of the present application further provides a computer device, and an internal structure of the computer device may be as shown in fig. 9. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. The processor of the computer device is used to provide computing and control capabilities. The memory of the computer device comprises a storage medium and an internal memory. The storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for running the operating system and the computer program stored in the storage medium. The database of the computer device is used for storing data related to the document tag generation method. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by the processor to implement one or more of the document tag generation method, the document tag library generation method, and the document tag matching method provided by any of the above embodiments.
The embodiments of the present application further provide a computer-readable storage medium, which may be non-volatile or volatile, and on which a computer program is stored, where the computer program, when executed by a processor, implements one or more of the document tag generation method, the document tag library generation method, and the document tag matching method provided in any of the above embodiments.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium and which, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium provided herein and used in the embodiments may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), among others.
The document tag generation method, the document tag matching method, the document tag generation device and the document tag matching device collect a search text input by a user and a clicked document name text corresponding to the search text; integrate records that have the same search text but different clicked document name texts to obtain a first integration result, wherein the first integration result comprises the search text, each document name text and the clicked times of each document name text; obtain the longest common character string of the search text and each document name text according to the first integration result; obtain, according to the longest common character strings and the clicked times, the longest common character string with the largest click frequency; set the longest common character string with the largest click frequency as a tag candidate word, wherein the number of tag candidate words is at least one; and set at least one of the tag candidate words as a document tag. By automatically generating the document tag and setting the longest common character string with the largest click frequency as a tag candidate word, the creation of document tags is simplified and the match between the user's search intention and the document tags is improved.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, apparatus, article, or method. Without further limitation, an element defined by the phrase "comprising a … …" does not exclude the presence of another identical element in a process, apparatus, article, or method that comprises the element.
The above description is only a preferred embodiment of the present application, and not intended to limit the scope of the present application, and all modifications of equivalent structures and equivalent processes, which are made by the contents of the specification and the drawings of the present application, or which are directly or indirectly applied to other related technical fields, are also included in the scope of the present application.

Claims (11)

1. A document tag generation method is characterized by comprising the following steps:
collecting a search text input by a user and a clicked document name text corresponding to the search text;
integrating records, which are the same in the search texts but different in the corresponding clicked document name texts, to obtain a first integrated result, wherein the first integrated result comprises the search texts, the document name texts and the clicked times of the document name texts;
obtaining the longest common character string of the search text and each document name text according to the first integration result;
obtaining, according to the longest common character strings and the clicked times, the longest common character string with the largest click frequency;
setting the longest common character string with the largest click frequency as a label candidate word, wherein the number of the label candidate words is at least one;
and setting at least one of the label candidate words as a document label.
2. A document tag matching method is characterized by comprising the following steps:
acquiring a search text input by a user;
generating a first tag for the search text based on a document tag library, wherein the document tag library is constructed based on document tags obtained by the document tag generation method according to claim 1, and the first tag comprises at least one tag word;
generating a second tag for each document based on the document tag library, wherein the documents are stored in the document library, a plurality of documents are stored in the document library and are searched by a user, and the second tag comprises at least one tag word;
matching the first label with a second label, and setting the part of the first label, which is the same as the second label, as an effective label, wherein the effective label comprises at least one label word;
sequentially obtaining a label coverage score of each document based on the first label and the second label, wherein the label coverage score is used for representing the matching degree of the document content and the search text;
sequentially obtaining a tag compactness score of each document based on the effective tags, wherein the tag compactness score is used for representing the position closeness degree of the effective tag contents in the document contents;
obtaining an overall tag matching score for each of the documents based on the tag coverage score and the tag compactness score;
sorting the documents by the overall label matching score to obtain a first sorting result;
and setting the document meeting the preset rule as the document matched with the search text according to the preset rule and the first sequencing result.
3. The document tag matching method of claim 2, wherein the step of sequentially obtaining a tag compactness score for each of the documents based on the valid tags comprises:
generating a position element according to the positions of all the label words in the effective labels in the documents, wherein the position element comprises the label words and the position information of the label words;
arranging the position elements in sequence to generate a first sequence;
obtaining a first label combination based on the first sequence, wherein the first label combination comprises all label words in the effective label, and the position distances among all the label words are the closest in the document;
and obtaining a tag compactness score of the document according to the first tag combination.
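A minimal sketch of how the position elements and the first sequence of claim 3 could be built is given below. It assumes that the document content is a plain string, that a position is the character offset at which a tag word starts, and that the first sequence is ordered by that offset. None of these choices is fixed by the claim, and `position_elements` is a hypothetical name.

```python
def position_elements(document, valid_tag_words):
    """Build (tag_word, position) pairs for every occurrence of every tag word.

    Assumption: position = character offset of the occurrence in the document.
    The result is sorted by position, which is one way to form the first sequence.
    """
    elements = []
    for word in valid_tag_words:
        start = document.find(word)
        while start != -1:
            elements.append((word, start))
            # Continue searching after the current occurrence.
            start = document.find(word, start + 1)
    return sorted(elements, key=lambda element: element[1])


if __name__ == "__main__":
    doc = "math quiz for math class quiz"
    for element in position_elements(doc, ["math", "quiz"]):
        print(element)
```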
4. The document tag matching method of claim 3, wherein the step of obtaining a first tag combination based on the first sequence comprises:
sequentially setting each position element in the first sequence as a target element and, for each of the other tag words, acquiring the position element that is located after the target element and is closest to the target element, thereby generating a plurality of position element sequences;
respectively calculating the total distance among the tag words in each position element sequence;
and setting the position element sequence with the minimum total distance as the first tag combination.
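The search described in claim 4 can be sketched as below, under stated assumptions. The first sequence is taken to be a list of `(tag_word, position)` pairs already sorted by position, and the "total distance" is taken to be the sum of the gaps between neighbouring positions in a candidate sequence. The claim does not define the distance, so this measure, like the name `first_tag_combination`, is an assumption.

```python
def first_tag_combination(first_sequence):
    """Pick the candidate sequence that covers every tag word with minimal total distance.

    `first_sequence` is a list of (tag_word, position) pairs sorted by position.
    Returns (combination, total_distance), or (None, None) if the sequence is empty.
    """
    all_words = {word for word, _ in first_sequence}
    best, best_total = None, None
    for i, (target_word, target_pos) in enumerate(first_sequence):
        chosen = {target_word: target_pos}
        # The sequence is sorted, so the first later hit for a word is its nearest one.
        for word, pos in first_sequence[i + 1:]:
            if word not in chosen:
                chosen[word] = pos
        if len(chosen) < len(all_words):
            continue  # some tag word never occurs after this target element
        combo = sorted(chosen.items(), key=lambda element: element[1])
        # Assumed distance measure: sum of gaps between neighbouring positions.
        total = sum(b[1] - a[1] for a, b in zip(combo, combo[1:]))
        if best_total is None or total < best_total:
            best, best_total = combo, total
    return best, best_total


if __name__ == "__main__":
    sequence = [("a", 0), ("b", 10), ("a", 12), ("c", 13)]
    print(first_tag_combination(sequence))
```

Because each target element triggers one forward scan, the sketch runs in time quadratic in the number of position elements, which is acceptable for short tag lists.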
5. The document tag matching method of claim 2, wherein the overall tag matching score is obtained according to the following formula:
score=score_cover*(1+t*score_close),
wherein score is the overall tag matching score, score_cover is the tag coverage score, score_close is the tag compactness score, and t is a weight that is set based on the tag coverage score.
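The formula of claim 5 can be evaluated directly, as in the sketch below. The claim states only that t is set based on the tag coverage score, so the rule in `weight_for` (a larger weight when coverage is high) and its numeric values are hypothetical placeholders, not values taken from the patent.

```python
def weight_for(score_cover):
    """Hypothetical rule for t: weight compactness more when coverage is high."""
    return 0.5 if score_cover >= 0.8 else 0.1


def overall_score(score_cover, score_close, t=None):
    """Compute score = score_cover * (1 + t * score_close) as in claim 5."""
    if t is None:
        t = weight_for(score_cover)  # assumed default, see weight_for
    return score_cover * (1 + t * score_close)


if __name__ == "__main__":
    print(overall_score(1.0, 1.0, t=0.5))
    print(overall_score(0.0, 1.0))
```

With this form, a document with zero coverage always scores zero regardless of compactness, so compactness only reorders documents that already match some tag words.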
6. The document tag matching method of claim 2, wherein the tag coverage score is obtained according to the following formula:
Figure FDA0003859767280000031
wherein n is the number of tag words in the first tag, and num_query_tag is the number of tag words in the second tag;
when the i-th tag word in the first tag is completely the same as any tag word in the second tag, tag_i = 1;
when the i-th tag word in the first tag is partially the same as any tag word in the second tag, tag_i = N, N ∈ (0, 1);
and when the i-th tag word in the first tag is not the same as any tag word in the second tag, tag_i = 0.
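The per-word values tag_i of claim 6 might be computed as in the following sketch. It assumes that "partially the same" means one tag word is a substring of the other and uses a hypothetical value N = 0.5. How the values are combined into the coverage score is given by the formula figure, which is not reproduced in this text, so the sketch stops at the list of tag_i values.

```python
def tag_match_values(first_tag, second_tag, partial_value=0.5):
    """Return tag_i for each tag word of the first tag.

    1.0 for an exact match, partial_value (the N of claim 6, hypothetical
    default 0.5) for a partial match, 0.0 otherwise.
    Assumption: a partial match means one word contains the other.
    """
    values = []
    for word in first_tag:
        if word in second_tag:
            values.append(1.0)
        elif any(word in other or other in word for other in second_tag):
            values.append(partial_value)
        else:
            values.append(0.0)
    return values


if __name__ == "__main__":
    print(tag_match_values(["physics", "grade 8", "exam"],
                           ["physics", "grade 8 volume 1"]))
```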
7. The document tag matching method of claim 3, wherein the tag compactness score is obtained according to the following formula:
Figure FDA0003859767280000032
wherein L is the total distance among the tag words in the first tag combination, M is a first preset distance threshold, and K is a second preset distance threshold.
8. A document tag generation apparatus, comprising:
a collection module, configured to collect a search text input by a user and the name text of the clicked document corresponding to the search text;
an integration module, configured to integrate records that have the same search text but different clicked document name texts to obtain a first integration result, wherein the first integration result comprises the search text, the document name texts, and the number of clicks of each document name text;
a first acquisition module, configured to acquire the longest common character string of the search text and each document name text according to the first integration result;
a second acquisition module, configured to acquire, according to the longest common character strings and the numbers of clicks, the longest common character string with the largest number of clicks;
a tag candidate word setting module, configured to set the longest common character string with the largest number of clicks as a tag candidate word, wherein there is at least one tag candidate word;
and a document tag generation module, configured to set at least one of the tag candidate words as a document tag.
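The module chain of claim 8 can be illustrated with the sketch below, which rests on several assumptions. The click log is taken to be a list of `(search_text, clicked_document_name)` pairs, the longest common character string is computed as the longest contiguous common substring, and the clicks of all document names that yield the same common string are added together. The claim does not settle any of these details, and the function names are hypothetical.

```python
from collections import Counter, defaultdict


def longest_common_substring(a, b):
    """Longest contiguous string shared by a and b (dynamic programming)."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:
                    best_len, best_end = cur[j], i
        prev = cur
    return a[best_end - best_len:best_end]


def tag_candidates(click_log):
    """Map each search text to its most-clicked longest common string.

    Integration step: identical (search_text, doc_name) records are counted.
    Assumption: clicks are summed per common string before picking the maximum.
    """
    clicks = Counter(click_log)
    per_query = defaultdict(Counter)
    for (query, name), count in clicks.items():
        common = longest_common_substring(query, name)
        if common:
            per_query[query][common] += count
    return {query: counts.most_common(1)[0][0] for query, counts in per_query.items()}


if __name__ == "__main__":
    log = [("grade 8 physics", "physics courseware")] * 3
    log += [("grade 8 physics", "grade 8 math")]
    print(tag_candidates(log))
```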
9. A document tag matching apparatus, comprising:
a search text acquisition module, configured to acquire a search text input by a user;
a first tag generation module, configured to generate a first tag for the search text based on a document tag library, wherein the document tag library is constructed based on the document tag generation method according to claim 1, and the first tag comprises at least one tag word;
a second tag generation module, configured to generate a second tag for each document based on the document tag library, wherein the documents are stored in a document library that holds a plurality of documents available for the user to search, and the second tag comprises at least one tag word;
a valid tag generation module, configured to match the first tag with the second tag and set the part of the first tag that is the same as the second tag as a valid tag, wherein the valid tag comprises at least one tag word;
a tag coverage score obtaining module, configured to sequentially obtain a tag coverage score of each document based on the first tag and the second tag, wherein the tag coverage score represents the degree of matching between the document content and the search text;
a tag compactness score obtaining module, configured to sequentially obtain a tag compactness score of each document based on the valid tag, wherein the tag compactness score represents how close together the positions of the valid tag words are in the document content;
an overall tag matching score obtaining module, configured to obtain an overall tag matching score for each document according to the tag coverage score and the tag compactness score;
a sorting module, configured to sort the documents by overall tag matching score to obtain a first sorting result;
and a matched document setting module, configured to set, according to a preset rule and the first sorting result, the documents that satisfy the preset rule as the documents matched with the search text.
10. A computer device comprising a memory and a processor, the memory having stored therein a computer program, characterized in that the processor, when executing the computer program, implements the steps of the document tag generation method of claim 1 and/or the document tag matching method of any of claims 2-7.
11. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the document tag generation method of claim 1 and/or the document tag matching method of any one of claims 2 to 7.
CN202211158183.0A 2022-09-22 2022-09-22 Document tag generation and matching method, device and computer equipment Active CN115687579B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211158183.0A CN115687579B (en) 2022-09-22 2022-09-22 Document tag generation and matching method, device and computer equipment

Publications (2)

Publication Number Publication Date
CN115687579A true CN115687579A (en) 2023-02-03
CN115687579B CN115687579B (en) 2023-08-01

Family

ID=85061934

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211158183.0A Active CN115687579B (en) 2022-09-22 2022-09-22 Document tag generation and matching method, device and computer equipment

Country Status (1)

Country Link
CN (1) CN115687579B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116756273A (en) * 2023-07-04 2023-09-15 重庆亚利贝德科技咨询有限公司 Working system for realizing feature tag screening of massive entrusted documents

Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090226098A1 (en) * 2006-05-19 2009-09-10 Nagaoka University Of Technology Character string updated degree evaluation program
US7752222B1 (en) * 2007-07-20 2010-07-06 Google Inc. Finding text on a web page
US20110161311A1 (en) * 2009-12-28 2011-06-30 Yahoo! Inc. Search suggestion clustering and presentation
CN103886034A (en) * 2014-03-05 2014-06-25 北京百度网讯科技有限公司 Method and equipment for building indexes and matching inquiry input information of user
US20140280081A1 (en) * 2013-03-14 2014-09-18 Microsoft Corporation Part-of-speech tagging for ranking search results
CN109376309A (en) * 2018-12-28 2019-02-22 北京百度网讯科技有限公司 Document recommendation method and device based on semantic label
CN109408794A (en) * 2017-08-17 2019-03-01 阿里巴巴集团控股有限公司 A kind of frequency dictionary method for building up, segmenting method, server and client side's equipment
US20190095439A1 (en) * 2017-09-22 2019-03-28 Microsoft Technology Licensing, Llc Content pattern based automatic document classification
CN109858040A (en) * 2019-03-05 2019-06-07 腾讯科技(深圳)有限公司 Name entity recognition method, device and computer equipment
CN110795943A (en) * 2019-09-25 2020-02-14 中国科学院计算技术研究所 Topic representation generation method and system for event
CN111563207A (en) * 2020-07-14 2020-08-21 口碑(上海)信息技术有限公司 Search result sorting method and device, storage medium and computer equipment
CN113268995A (en) * 2021-07-19 2021-08-17 北京邮电大学 Chinese academy keyword extraction method, device and storage medium
CN113901173A (en) * 2021-09-23 2022-01-07 深信服科技股份有限公司 Retrieval method, retrieval device, electronic equipment and computer storage medium
CN115017879A (en) * 2022-05-27 2022-09-06 深圳证券信息有限公司 Text comparison method, computer device and computer storage medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
ALPA JAIN: "Organizing query completions for web search", PROCEEDINGS OF THE 19TH ACM CONFERENCE ON INFORMATION AND KNOWLEDGE MANAGEMENT, CIKM 2010, pages 1169 - 1178 *
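李雯; 文勇军; 唐立军: "Educational resource tag generation algorithm based on multi-feature fusion", 计算机与现代化 (Computer and Modernization), no. 09, pages 23 - 28 *
郑弘晖; 郭红: "XML keyword query algorithm based on effective lowest common ancestor", 计算机应用 (Journal of Computer Applications), no. 03, pages 261 - 266 *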

Also Published As

Publication number Publication date
CN115687579B (en) 2023-08-01

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant