CN115687579A - Document tag generation and matching method and device and computer equipment

Info

Publication number: CN115687579A
Application number: CN202211158183.0A
Authority: CN
Inventors: 丘文波
Original assignee: Guangzhou Shirong Information Technology Co ltd; Guangzhou Shiyuan Electronics Thecnology Co Ltd
Current assignee: Guangzhou Shirong Information Technology Co ltd; Guangzhou Shiyuan Electronics Thecnology Co Ltd
Priority date: 2022-09-22
Filing date: 2022-09-22
Publication date: 2023-02-03
Anticipated expiration: 2042-09-22
Also published as: CN115687579B

Abstract

The application belongs to the technical field of the internet, and particularly relates to a document tag generation and matching method, a document tag generation and matching device and computer equipment. The document tag generation method comprises the following steps: collecting a search text input by a user and a clicked document name text corresponding to the search text; integrating the records that have the same search text but different corresponding clicked document name texts to obtain a first integration result; obtaining the longest common character string of the search text and each document name text according to the first integration result; obtaining, from among the longest common character strings, the one with the highest click frequency according to the longest common character strings and the clicked times; setting the longest common character string with the highest click frequency as a tag candidate word; and setting at least one of the tag candidate words as a document tag. The method simplifies the creation of document tags and improves the degree to which the document tags match the search intention of the user.

Description

Document tag generation and matching method and device and computer equipment

Technical Field

The application relates to the technical field of internet, in particular to a document tag generating and matching method, a document tag generating and matching device and computer equipment.

Background

In content search in vertical fields, such as academic search and community forum search, the relevant documents need to be tagged so that the documents a user requires can be quickly matched according to the user's search text; how well the search text matches the document tags also affects the final search result. Currently, document tags are usually created by manual editing, so the creation process is labor intensive, and in some cases the document tags match the user's search intention poorly.

Disclosure of Invention

The application mainly aims to provide a document tag generation and matching method, a document tag generation and matching device and computer equipment, and aims to solve the technical problems that a document tag creation process is complex and the matching degree of the document tag and a user search intention is low.

In order to achieve the above object, the present application provides a document tag generating method, including:

collecting a search text input by a user and a clicked document name text corresponding to the search text;

integrating the records that have the same search text but different corresponding clicked document name texts to obtain a first integration result, wherein the first integration result comprises the search text, the document name texts and the clicked times of each document name text;

obtaining the longest common character string of the search text and each document name text according to the first integration result;

obtaining, from among the longest common character strings, the one with the highest click frequency according to the longest common character strings and the clicked times;

setting the longest common character string with the highest click frequency as a label candidate word, wherein there is at least one label candidate word;

and setting at least one of the label candidate words as a document label.

The application also provides a document tag matching method, which comprises the following steps:

acquiring a search text input by a user;

generating a first tag for the search text based on a document tag library, wherein the document tag library is constructed and obtained based on the document tag generation method provided by the embodiment, and the first tag comprises at least one tag word;

generating a second tag for each document based on the document tag library, wherein the documents are stored in the document library, a plurality of documents are stored in the document library and are searched by a user, and the second tag comprises at least one tag word;

matching the first label with a second label, and setting the part of the first label, which is the same as the second label, as an effective label, wherein the effective label comprises at least one label word;

sequentially obtaining a label coverage score of each document based on the first label and the second label, wherein the label coverage score is used for representing the matching degree of the document content and the search text;

sequentially obtaining a tag compactness score of each document based on the effective tags, wherein the tag compactness score is used for representing the position closeness degree of the effective tag contents in the document contents;

obtaining an overall tag matching score for each of the documents based on the tag coverage score and the tag compactness score;

sorting the documents according to their overall label matching scores to obtain a first sorting result;

and setting, according to a preset rule and the first sorting result, the documents meeting the preset rule as documents matched with the search text.

In one embodiment, the step of sequentially obtaining a tag compactness score for each of the documents based on the valid tags includes:

generating a position element according to the positions of all the label words in the effective labels in the documents, wherein the position element comprises the label words and the position information of the label words;

arranging the position elements in sequence to generate a first sequence;

obtaining a first label combination based on the first sequence, wherein the first label combination comprises all label words in the effective label, and the position distances among all the label words are the closest in the document;

and obtaining a tag compactness score of the document according to the first tag combination.

In one embodiment, the step of obtaining a first tag combination based on the first sequence comprises:

sequentially setting each position element in the first sequence as a target element, and acquiring position elements which are behind the position of the target element and have the closest distance to the target element and contain other tag words to generate a plurality of position element sequences;

respectively calculating the total distance of each label word in each position element sequence;

and setting the position element sequence with the minimum total distance as a first label combination.

In one embodiment, the overall label matching score is obtained according to the following formula:

score = score_cover * (1 + t * score_close),

wherein score is the overall label matching score, score_cover is the label coverage score, score_close is the label compactness score, and t is a weight that is set based on the label coverage score.
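The combination of the two scores can be sketched in Python as follows. This is only an illustration of the formula as stated: the function name `overall_tag_score` is hypothetical, and because the text does not say how t is derived from the label coverage score, t is simply passed in as a parameter.

```python
def overall_tag_score(score_cover: float, score_close: float, t: float) -> float:
    """Overall label matching score: score_cover * (1 + t * score_close).

    t is supplied by the caller; how it is set from the coverage score
    is not specified in the text and is left open here.
    """
    return score_cover * (1.0 + t * score_close)


# Illustrative values only (not taken from the source).
example_score = overall_tag_score(score_cover=0.9, score_close=0.5, t=0.2)
```

With this form, a document whose compactness score is 0 keeps exactly its coverage score, so compactness can only raise the ranking of a document that already covers the search tags.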

In one embodiment, the label coverage score is obtained according to the following formula:

wherein n is the number of label words in the first label, and num_query_tag is the number of label words in the second label;

when the i-th label word in the first label is completely the same as any label word in the second label, tag_i = 1;

when the i-th label word in the first label is partially identical to any label word in the second label, tag_i = N, N ∈ (0, 1);

and when the i-th label word in the first label is not the same as any label word in the second label, tag_i = 0.

In one embodiment, the tag compactness score is obtained according to the following formula:

wherein L is a total distance of each label word in the first label combination, M is a first preset distance threshold, and K is a second preset distance threshold.

The present application further provides a document tag generating apparatus, including:

the collection module is used for collecting a search text input by a user and a clicked document name text corresponding to the search text;

the integration module is used for integrating the records that have the same search text but different corresponding clicked document name texts to obtain a first integration result, wherein the first integration result comprises the search text, the document name texts and the clicked times of each document name text;

the first acquisition module is used for acquiring the longest common character string of the search text and each document name text according to the first integration result;

the second obtaining module is used for obtaining, from among the longest common character strings, the one with the highest click frequency according to the longest common character strings and the clicked times;

a label candidate word setting module, configured to set the longest common character string with the highest click frequency as a label candidate word, where the number of the label candidate words is at least one;

and the document tag generation module is used for setting at least one of the tag candidate words as a document tag.

The present application further provides a document tag matching apparatus, including:

the search text acquisition module is used for acquiring a search text input by a user;

a first tag generation module, configured to generate a first tag for the search text based on a document tag library, where the document tag library is constructed and obtained based on the document tag generation method provided in the foregoing embodiment, and the first tag includes at least one tag word;

the second tag generation module is used for generating a second tag for each document based on the document tag library, wherein the documents are stored in the document library, a plurality of documents are stored in the document library and are searched by a user, and the second tag comprises at least one tag word;

the effective label generating module is used for matching the first label with the second label and setting the part of the first label, which is the same as the second label, as an effective label, wherein the effective label comprises at least one label word;

a tag coverage score obtaining module, configured to sequentially obtain a tag coverage score of each document based on the first tag and the second tag, where the tag coverage score is used to characterize a matching degree between the document content and the search text;

a compactness score obtaining module, configured to sequentially obtain a label compactness score of each document based on the valid label, where the label compactness score is used to represent a position proximity of the valid label content in the document content;

an overall label matching score obtaining module, configured to obtain an overall label matching score for each document according to the label coverage score and the label compactness score;

the sorting module is used for sorting the documents according to their overall label matching scores to obtain a first sorting result;

and the matched document setting module is used for setting, according to a preset rule and the first sorting result, the documents meeting the preset rule as documents matched with the search text.

The present application further provides a computer device, which includes a memory and a processor, where the memory stores a computer program, and the processor implements the steps of the document tag generation method and/or the document tag matching method provided in any of the above embodiments when executing the computer program.

The present application further provides a computer-readable storage medium, on which a computer program is stored, and the computer program, when executed by a processor, implements the steps in the document tag generation method and/or the document tag matching method provided in any of the above embodiments.

The document tag generation and matching method, the document tag generation and matching device and the computer equipment collect a search text input by a user and a clicked document name text corresponding to the search text; integrate the records that have the same search text but different corresponding clicked document name texts to obtain a first integration result, wherein the first integration result comprises the search text, the document name texts and the clicked times of each document name text; obtain the longest common character string of the search text and each document name text according to the first integration result; obtain, from among the longest common character strings, the one with the highest click frequency according to the longest common character strings and the clicked times; set the longest common character string with the highest click frequency as a label candidate word, wherein the number of the label candidate words is at least one; and set at least one of the label candidate words as a document tag. By automatically generating the document tag and setting the longest common character string with the highest click frequency as a label candidate word, the creation of document tags is simplified, and the degree to which the document tags match the search intention of the user is improved.

Drawings

Fig. 1 is a flowchart illustrating a document tag generation method according to an embodiment of the present application;

FIG. 2 is a flowchart illustrating a document tag matching method according to an embodiment of the present application;

FIG. 3 is a flowchart illustrating a document tag library generating method according to an embodiment of the present application;

FIG. 4 is a schematic structural diagram of a prefix tree according to an embodiment of the present application;

FIG. 5 is a flowchart illustrating the step S206 in the document tag matching method according to another embodiment of the present application;

FIG. 6 is a flowchart illustrating step S2063 in the document tag matching method according to an embodiment of the present application;

FIG. 7 is a schematic structural diagram of a document tag generation apparatus according to an embodiment of the present application;

FIG. 8 is a schematic structural diagram of a document tag matching apparatus according to an embodiment of the present application;

FIG. 9 is a schematic structural diagram of a computer device according to an embodiment of the present application.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.

Referring to fig. 1, an embodiment of the present application provides a document tag generation method, which includes steps S101 to S106, and the detailed description of each step of the method is as follows.

In one embodiment, the document tag generation method comprises the following steps:

S101, collecting a search text input by a user and a clicked document name text corresponding to the search text;

S102, integrating the records that have the same search text but different corresponding clicked document name texts to obtain a first integration result, wherein the first integration result comprises the search text, the document name texts and the clicked times of each document name text;

S103, obtaining the longest common character string of the search text and each document name text according to the first integration result;

S104, obtaining, from among the longest common character strings, the one with the highest click frequency according to the longest common character strings and the clicked times;

S105, setting the longest common character string with the highest click frequency as a label candidate word, wherein the number of the label candidate words is at least one;

S106, setting at least one of the label candidate words as a document label.

As described in step S101, the search text input by the user in the search engine and the name text of the document clicked for that search may be collected from the user's search log and click log (the click log records the documents opened by clicking on search results). To expand the sample data, the search and click logs over a period of time (e.g., one month) may be collected together.

As described in step S102, the records with the same search text but different corresponding clicked document name texts are integrated to obtain a first integration result, where the first integration result includes the search text, each distinct document name text, and the clicked times of each distinct document name text. For example, assume that the same search text "win10 blue screen" is input multiple times in the log records, that the clicked document name texts corresponding to the search results are "what to do about a win10 blue screen", "computer blue screen processing method", "newly bought MAC computer blue screen" and "blue screen reinstallation system", and that the corresponding clicked times are 10, 10, 1 and 2, respectively. The log records are integrated to obtain the first integration result, one form of which may be as shown in Table 1 below:

TABLE 1 First integration result

User-entered search text	Clicked document name text	Clicked times
win10 blue screen	What to do about a win10 blue screen	10
win10 blue screen	Computer blue screen processing method	10
win10 blue screen	Newly bought MAC computer blue screen	1
win10 blue screen	Blue screen reinstallation system	2
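The integration in step S102 can be sketched in Python as follows. This is a minimal illustration rather than the applicant's implementation: the raw log format (one pair of search text and clicked document name per click), the function name `integrate` and the sample document names are assumptions made for the example.

```python
from collections import Counter

# Hypothetical raw click log: one (search_text, clicked_document_name) pair per click.
click_log = (
    [("win10 blue screen", "what to do about a win10 blue screen")] * 10
    + [("win10 blue screen", "computer blue screen processing method")] * 10
    + [("win10 blue screen", "newly bought MAC computer blue screen")] * 1
    + [("win10 blue screen", "blue screen reinstallation system")] * 2
)


def integrate(log):
    """Merge identical (search text, document name) pairs and count the clicks."""
    # Lower-casing both texts is an assumption that keeps later string matching
    # case-insensitive.
    counts = Counter((query.lower(), doc.lower()) for query, doc in log)
    return [(query, doc, clicks) for (query, doc), clicks in counts.items()]


first_integration = integrate(click_log)
for row in first_integration:
    print(row)
```

Each resulting triple corresponds to one row of the first integration result: the search text, a document name text, and the number of times that document was clicked for that search text.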

As described in the above steps S103-S106, the longest common character string of the user's search text and each document name text is obtained according to the first integration result; the clicked times of the document name texts that share the same longest common character string are accumulated to give the click frequency of that character string, and the longest common character string with the highest click frequency is obtained; the longest common character string with the highest click frequency is set as a label candidate word, wherein the number of the label candidate words is at least one; and at least one of the label candidate words is set as a document label.

The following example illustrates this, with the result shown in Table 2 below. In row 2 of Table 1 (counting the header as row 1), the longest common character string of the user's search text and the document name text is "win10 blue screen", and its clicked frequency is 10. In rows 3, 4 and 5 of Table 1, the longest common character string is "blue screen" in each case, with clicked frequencies of 10, 1 and 2, respectively, so the clicked frequency of the longest common character string "blue screen" is 10 + 1 + 2 = 13. "blue screen" therefore has the highest click frequency and is set as a label candidate word. In some other embodiments, there may be longest common character strings that have the same highest clicked frequency but different text content, for example "win10 blue screen" and "blue screen" both with a clicked frequency of 10; in that case both "win10 blue screen" and "blue screen" are set as label candidate words, and in practical applications either or both of them may be set as the document label. The number of label candidate words is not limited to one or two and may be more than two, which is not limited herein.

TABLE 2 first integration results post-processing
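The computation in steps S103 to S105 can be sketched in Python as follows. This is a hedged illustration only: the function names are hypothetical, the longest common character string is computed with a standard dynamic-programming longest-common-substring routine, and the surrounding spaces stripped from the result are an artefact of using English sample text.

```python
from collections import defaultdict


def longest_common_substring(a: str, b: str) -> str:
    """Longest contiguous run of characters shared by a and b."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1  # extend the run ending at a[i-2], b[j-2]
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]


def tag_candidates(records):
    """Return (candidate words, click frequency per common string)."""
    freq = defaultdict(int)
    for query, doc_name, clicks in records:
        common = longest_common_substring(query, doc_name).strip()
        if common:
            freq[common] += clicks  # accumulate clicks of rows sharing the string
    if not freq:
        return [], {}
    top = max(freq.values())
    # Every string that reaches the highest frequency becomes a candidate.
    return [s for s, n in freq.items() if n == top], dict(freq)


records = [
    ("win10 blue screen", "what to do about a win10 blue screen", 10),
    ("win10 blue screen", "computer blue screen processing method", 10),
    ("win10 blue screen", "newly bought mac computer blue screen", 1),
    ("win10 blue screen", "blue screen reinstallation system", 2),
]
candidates, frequencies = tag_candidates(records)
```

On these sample records the frequencies are 10 for "win10 blue screen" and 13 for "blue screen", so "blue screen" is the only candidate; strings that tie for the highest frequency would all be returned.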

In this way, a plurality of related tag words can be expected in practical applications. In order to remove redundancy and ensure the specificity of the document tag words (for example, function words such as "of" and "ground", which are literal renderings of structural particles, may also be extracted as document tag words even though they carry no specific semantic meaning), in some embodiments document tag words whose length does not meet a preset requirement (for example, that the length is not 1 and is not greater than 10) may be eliminated. In addition, if a long document tag word can be composed of short document tag words, the long document tag word may also be eliminated so that only the short tag words are retained.
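One possible reading of this filtering step is sketched below in Python. The function name `filter_tags`, the default length bounds and, in particular, the interpretation of "composed of short document tag words" as "can be split exactly into other retained tag words" are assumptions made for illustration.

```python
def filter_tags(tags, min_len=2, max_len=10):
    """Drop tags with an unsuitable length, then drop tags built from other tags."""
    kept = {t for t in tags if min_len <= len(t) <= max_len}

    def composable(word):
        parts = kept - {word}
        # reachable[i] is True when word[:i] is a concatenation of other tags.
        reachable = [True] + [False] * len(word)
        for i in range(1, len(word) + 1):
            reachable[i] = any(
                reachable[i - len(p)]
                for p in parts
                if len(p) <= i and word[i - len(p):i] == p
            )
        return reachable[len(word)]

    return sorted(t for t in kept if not composable(t))


# Illustrative tags written without spaces so that concatenation is visible.
sample = ["a", "blue", "screen", "bluescreen", "win10", "reinstallationsystem"]
filtered = filter_tags(sample)
```

Here "a" is removed for having length 1, "reinstallationsystem" for exceeding the maximum length, and "bluescreen" because it is exactly "blue" followed by "screen".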

In some embodiments, when letters or character strings composed of letters exist in the user's search text and the clicked document name text, the letters in all the texts are converted to a uniform format, for example all lower case or all upper case, in order to avoid errors in text matching (although upper-case and lower-case letters differ in appearance, they generally express the same meaning in text).

Referring to fig. 2, an embodiment of the present application further provides a document tag matching method, which includes steps S201 to S209, and details of each step of the method are described as follows.

In one embodiment, the document tag matching method comprises the following steps:

S201, acquiring a search text input by a user;

S202, generating a first tag for the search text based on a document tag library, wherein the document tag library is constructed and obtained based on the document tag generation method provided by the embodiment, and the first tag comprises at least one tag word;

S203, generating a second tag for each document based on the document tag library, wherein the documents are stored in the document library, a plurality of documents are stored in the document library and are searched by a user, and the second tag comprises at least one tag word;

S204, matching the first label with the second label, and setting the part of the first label, which is the same as the second label, as an effective label, wherein the effective label comprises at least one label word;

S205, sequentially obtaining a label coverage score of each document based on the first label and the second label, wherein the label coverage score is used for representing the matching degree of the document content and the search text;

S206, sequentially obtaining a tag compactness score of each document based on the effective tags, wherein the tag compactness score is used for representing the position closeness of the effective tag content in the document content;

S207, obtaining an overall label matching score of each document according to the label coverage score and the label compactness score;

S208, sorting the documents according to their overall label matching scores to obtain a first sorting result;

S209, setting, according to a preset rule and the first sorting result, the documents meeting the preset rule as documents matched with the search text.

As described in the above steps S201-S204, when it is detected that the user inputs a search text in the search engine, the search text input by the user is acquired; a first label is generated for the search text based on a document label library generated in advance; a second label is generated for each document in a document library (such as Baidu Wenku, CNKI and the like) based on the pre-generated document label library, wherein the document library stores a plurality of documents for the user to search; and the first label is matched with the second label, and the part of the first label that is the same as the second label is set as an effective label, wherein the first label, the second label and the effective label each comprise at least one label word.

Exemplarily, when the search text input by the user is "what to do about a windows blue screen?", a first label ("windows", "win", "blue screen") is generated for the search text based on the document label library generated in advance. If the content of one document in the document library is "win10 computer blue screen reinstallation system ……", a second label ("win", "win10", "computer", "blue screen", "reinstallation system") is generated for that document based on the pre-generated document label library, and a second label is generated for every other document in the document library in the same way. The first label ("windows", "win", "blue screen") is then matched with the second label ("win", "win10", "computer", "blue screen", "reinstallation system"), and the part of the first label that is the same as the second label is set as the effective label, wherein the first label, the second label and the effective label each include at least one label word. In this embodiment, the effective label is ("win", "blue screen").
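Steps S202 to S204 for this example can be sketched in Python as follows. The text does not state how label words are extracted from a text, so the plain substring scan in `generate_tag` (a hypothetical name) is an assumption, as are the contents of the sample tag library.

```python
def generate_tag(text, tag_library):
    """Return the library tag words that occur in the text, in library order."""
    lowered = text.lower()  # case-insensitive matching
    return [tag for tag in tag_library if tag in lowered]


# Hypothetical document tag library for the example.
tag_library = ["windows", "win", "win10", "computer", "blue screen", "reinstallation system"]

first_tag = generate_tag("What to do about a Windows blue screen?", tag_library)
second_tag = generate_tag("win10 computer blue screen reinstallation system", tag_library)

# The effective tag is the part of the first tag that also appears in the second tag.
effective_tag = [word for word in first_tag if word in second_tag]
```

For the sample inputs this yields the first label ("windows", "win", "blue screen"), the second label ("win", "win10", "computer", "blue screen", "reinstallation system") and the effective label ("win", "blue screen").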

In some embodiments, referring to fig. 3, the method for generating the document tag library includes:

S301, generating a plurality of document tags based on a plurality of user search texts and the document tag generation method provided by the embodiment;

S302, generating a document tag library based on the plurality of document tags.

As described in steps S301 to S302, in order to facilitate the subsequent automatic generation of tags for documents and to simplify the matching of document tags, a document tag library may be generated in advance from a large amount of sample data and called in actual application, thereby improving efficiency. Specifically, a plurality of document tags are generated based on a plurality of user search texts (i.e. different search texts input by users, together with the clicked document name texts corresponding to the search results and their clicked times) and the document tag generation method provided by the above embodiment; and a document tag library is generated based on the plurality of document tags.

In order to improve the efficiency of counting and searching character strings, a prefix tree can be introduced when the tag library is generated: a prefix tree is constructed from the plurality of document tags, and this prefix tree, which contains all the document tags, is set as the document tag library. The prefix tree, also known as a dictionary tree, word search tree or Trie, is a multi-way tree structure for fast retrieval and a variant of the hash tree. It is typically used for counting and sorting large numbers of character strings (though not only character strings), so it is often used by search engine systems for text word frequency statistics. Its advantage is that unnecessary character string comparisons are reduced to the greatest extent, so that queries are efficient. Illustratively, referring to FIG. 4, given the set of document tags inn, int, at, age, adv, ant, ate, a prefix tree as shown in FIG. 4 may be constructed from the set.
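A prefix tree over this set of document tags can be sketched in Python as follows. The class and method names (`TrieNode`, `TagTrie`, `tags_starting_at`) are hypothetical, and the sketch shows only insertion and lookup, not the statistics a production tag library might also store.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # next character -> child node
        self.is_tag = False  # True when the path from the root spells a tag


class TagTrie:
    def __init__(self, tags=()):
        self.root = TrieNode()
        for tag in tags:
            self.insert(tag)

    def insert(self, tag):
        node = self.root
        for ch in tag:
            node = node.children.setdefault(ch, TrieNode())
        node.is_tag = True

    def contains(self, tag):
        node = self.root
        for ch in tag:
            node = node.children.get(ch)
            if node is None:
                return False
        return node.is_tag

    def tags_starting_at(self, text, start):
        """All tags that begin at text[start], found in a single walk down the tree."""
        found, node = [], self.root
        for i in range(start, len(text)):
            node = node.children.get(text[i])
            if node is None:
                break
            if node.is_tag:
                found.append(text[start:i + 1])
        return found


trie = TagTrie(["inn", "int", "at", "age", "adv", "ant", "ate"])
```

Tags that share a prefix share the corresponding nodes, so "at" and "ate" are both found by one walk over the characters a, t, e, without comparing the text against each tag separately.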

As described in step S205, a tag coverage score of each document in the document library is sequentially obtained based on the obtained first tag and the obtained second tag, where the tag coverage score is used to represent the matching degree between the content of the document and the search text, and a higher tag coverage score indicates a higher matching degree between the content of the document and the search text. In some embodiments, the coverage score described above may be obtained by the following formula:

wherein n is the number of label words in the first label, and num_query_tag is the number of label words in the second label;

and when the i-th label word in the first label is partially identical to any label word in the second label, tag_i = N, N ∈ (0, 1).

In the present embodiment, N may be 0.7. Calculating the formula for obtaining the label coverage score by taking the examples of the first label as ("windows", "win", "blue screen") and the second label as ("win", "win10", "computer", "blue screen", "reinstallation system"), then:
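As a rough illustration of the per-word values tag_i in this example, the following Python sketch applies the three rules with N = 0.7. It deliberately stops at the individual tag_i values and their sum and does not apply any normalisation. Treating "partially identical" as one label word containing the other is an assumption, and the function name `tag_values` is hypothetical.

```python
def tag_values(first_tag, second_tag, partial_value=0.7):
    """Return tag_i for every label word of the first label."""
    values = []
    for word in first_tag:
        if word in second_tag:
            values.append(1.0)            # completely the same
        elif any(word in other or other in word for other in second_tag):
            values.append(partial_value)  # partially identical (assumed: containment)
        else:
            values.append(0.0)            # no match
    return values


first_tag = ["windows", "win", "blue screen"]
second_tag = ["win", "win10", "computer", "blue screen", "reinstallation system"]
values = tag_values(first_tag, second_tag)
total = sum(values)
```

Under these assumptions "windows" only partially matches (it contains "win"), while "win" and "blue screen" match completely, giving the values 0.7, 1 and 1.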

As described in step S206 above, based on the valid tags, the tag compactness score of each document in the document library is obtained in turn, where the tag compactness score represents how close together the content of the valid tag lies within the document content. The higher the tag compactness score, the closer the tag words of the valid tag are to one another in the document, i.e. the better the document meets the real search intention of the user.

In some embodiments, referring to fig. 5, the step of sequentially obtaining a tag compactness score of each document based on the valid tags includes:

S2061, generating position elements according to the positions of all the label words in the effective labels in the documents, wherein the position elements comprise the label words and the position information of the label words;

S2062, arranging the position elements in sequence to generate a first sequence;

S2063, acquiring a first label combination based on the first sequence, wherein the first label combination comprises all label words in the effective labels, and the position distances among all the label words are the closest in the document;

S2064, obtaining the label compactness score of the document according to the first label combination.

According to the steps S2061-S2064, generating a position element according to the positions of all the label words in the effective labels in the document content in each document, wherein the position element comprises the label words and the position information of the label words; arranging the position elements in sequence to generate a first sequence; and acquiring a first label combination based on the first sequence, wherein the first label combination comprises all label words in the effective labels, and the position distances among all the label words in the document are the closest.
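Steps S2061 and S2062 can be sketched in Python as follows. The function name `position_elements`, the use of 1-based character positions and the toy document made of single-letter label words are assumptions chosen for illustration.

```python
def position_elements(document, effective_tag):
    """Return (label word, position) pairs sorted by position (1-based offsets)."""
    elements = []
    for word in effective_tag:
        start = document.find(word)
        while start != -1:
            elements.append((word, start + 1))
            start = document.find(word, start + 1)  # next occurrence, if any
    # Arranging the elements by position yields the first sequence.
    return sorted(elements, key=lambda element: element[1])


# Toy document in which "-" stands for any character that is not a label word.
toy_document = "-C--A----B-C-A"
first_sequence = position_elements(toy_document, ["A", "B", "C"])
```

Every occurrence of every label word contributes one position element, so a word that appears several times in the document appears several times in the first sequence.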

Illustratively, for a document "document X" whose valid tag includes 3 tag words, namely {A, B, C}, position elements may be generated according to the positions of the 3 tag words. For example, (C, 2) indicates that the tag word "C" occurs at the second character of the document content. Assuming that the position elements corresponding to the 3 tag words have been obtained, the first sequence generated after arranging the position elements in order is:

[(C,2),(A,5),(B,10),(C,12),(A,14),(B,23),(A,33),(C,50)].

Then label combinations such as [(C, 2), (A, 5), (B, 10)] and [(A, 5), (B, 10), (C, 12)] each contain all the label words in the valid label. After all such label combinations are found, the one in which the position distance between the label words is smallest can be selected as the first label combination. In some embodiments, referring to fig. 6, the step of obtaining a first tag combination based on the first sequence includes:

S2063a, sequentially setting each position element in the first sequence as a target element, and acquiring, for each of the other tag words, the position element that is located after the target element and closest to the target element, so as to generate a plurality of position element sequences;

S2063b, respectively calculating the total distance of the tag words in each position element sequence;

S2063c, setting the position element sequence with the minimum total distance as the first tag combination.

As described above in steps S2063a to S2063c, it is assumed that the first sequence is:

[(C, 2), (A, 5), (B, 10), (C, 12), (A, 14), (B, 23), (A, 33), (C, 50)]. Each position element in the first sequence is sequentially set as a target element, and for each of the other tag words in the valid tags, the position element located after the target element and closest to it is acquired, so as to generate a plurality of position element sequences. In this embodiment, the position element sequences are: [(C, 2), (A, 5), (B, 10)], [(A, 5), (B, 10), (C, 12)], [(B, 10), (C, 12), (A, 14)], [(C, 12), (A, 14), (B, 23)], [(A, 14), (B, 23), (C, 50)], and [(B, 23), (A, 33), (C, 50)].

After all the position element sequences are found, the total distance of the tag words in each position element sequence is calculated. For example, in the position element sequence [(C, 2), (A, 5), (B, 10)], the position of the tag word C is 2, the position of the tag word A is 5, and the position of the tag word B is 10; the distance between the tag words C and A is 3 characters, and the distance between the tag words A and B is 5 characters, so the total distance of the tag words in this sequence is 8 characters. The total distance of the tag words in each of the remaining position element sequences is calculated in the same manner, from which the position element sequence with the highest cohesion (i.e., the smallest total distance) is found to be [(B, 10), (C, 12), (A, 14)], with a total distance of 4 characters. This position element sequence is set as the first tag combination, which includes all tag words in the valid tags and in which the position distances among the tag words are the closest in the document.
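Steps S2063a to S2063c can be sketched in Python as follows. This is an illustrative sketch under stated assumptions rather than the implementation of the application: the function names are hypothetical, the first sequence is assumed to contain only valid tag words, and the total distance is computed as the sum of the gaps between neighbouring elements, which for a position-sorted sequence equals the last position minus the first.

```python
def candidate_sequences(first_sequence, tag_words):
    """S2063a: for each target element, collect the nearest later element
    of every other tag word; keep only sequences covering all tag words."""
    wanted = set(tag_words)
    sequences = []
    for i, (word, pos) in enumerate(first_sequence):
        if word not in wanted:
            continue
        chosen = {word: pos}
        for other_word, other_pos in first_sequence[i + 1:]:
            if other_word in wanted and other_word not in chosen:
                chosen[other_word] = other_pos
            if len(chosen) == len(wanted):
                break
        if len(chosen) == len(wanted):
            sequences.append(sorted(chosen.items(), key=lambda e: e[1]))
    return sequences


def total_distance(sequence):
    """S2063b: sum of gaps between neighbouring elements (last minus first)."""
    positions = [pos for _, pos in sequence]
    return max(positions) - min(positions)


def first_tag_combination(first_sequence, tag_words):
    """S2063c: the candidate sequence with the minimum total distance."""
    candidates = candidate_sequences(first_sequence, tag_words)
    return min(candidates, key=total_distance, default=None)


first_sequence = [("C", 2), ("A", 5), ("B", 10), ("C", 12),
                  ("A", 14), ("B", 23), ("A", 33), ("C", 50)]
candidates = candidate_sequences(first_sequence, ["A", "B", "C"])
best = first_tag_combination(first_sequence, ["A", "B", "C"])
print(len(candidates), best, total_distance(best))
```

For the example first sequence, the sketch produces the six position element sequences listed above and selects [(B, 10), (C, 12), (A, 14)], whose total distance is 4.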

In some embodiments, the tag compactness score described above may be obtained according to the following formula:

$$
\text{score\_close} =
\begin{cases}
1, & L < M \\
1/L, & M \le L \le K \\
0, & L > K
\end{cases}
$$

wherein L is the total distance of the tag words in the first tag combination, M is a first preset distance threshold, and K is a second preset distance threshold.

In this embodiment, the value of M may be 5 and the value of K may be 20. That is, when the total distance L of the tag words in the first tag combination is less than 5, the tag compactness score score_close is 1; when the total distance L is greater than 20, score_close is 0; and when the total distance L is between 5 and 20, score_close is 1/L. It should be noted that, in other embodiments, the values of M and K may be set according to actual design requirements, and are not limited herein.
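The piecewise rule described above can be written as a short function. This is a sketch only: the function and parameter names are hypothetical, and the treatment of the boundary values L = M and L = K (both falling into the 1/L branch) is an assumption, since the text only states "less than", "greater than" and "between".

```python
def score_close(total_distance, m=5, k=20):
    """Tag compactness score for a total distance L with thresholds M and K."""
    if total_distance < m:
        return 1.0
    if total_distance > k:
        return 0.0
    # Assumption: the boundary values L == M and L == K use the 1/L branch.
    return 1.0 / total_distance


print(score_close(4), score_close(8), score_close(21))
```

With the example thresholds, a total distance of 4 gives a score of 1, a total distance of 8 gives 0.125, and a total distance of 21 gives 0.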

As described in step S207 above, the overall tag matching score of each document is obtained based on the tag coverage score and the tag compactness score obtained by the above calculation. In some embodiments, the overall label matching score may be obtained according to the following formula:

score = score_cover * (1 + t * score_close),

wherein score is the overall tag matching score, score_cover is the tag coverage score, score_close is the tag compactness score, and t is a weight set based on the tag coverage score. For example, when the value of score_cover is greater than 0.9, t is 1, and when the value of score_cover is less than or equal to 0.9, t is 0. In other embodiments, the value of t may also be set according to actual design requirements, and is not limited herein.
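The combination of the two scores can be sketched as follows, using the example weighting rule given above. The function name and the `threshold` parameter are hypothetical names introduced for illustration.

```python
def overall_score(score_cover, score_close, threshold=0.9):
    """score = score_cover * (1 + t * score_close), with t chosen from score_cover."""
    # Example rule from the text: t is 1 only when coverage exceeds the threshold.
    t = 1 if score_cover > threshold else 0
    return score_cover * (1 + t * score_close)


print(overall_score(1.0, 0.25))  # compactness contributes
print(overall_score(0.5, 1.0))   # compactness is ignored
```

Under this rule, the compactness score only raises the overall score of documents whose tags already cover the search text well, so a document with low coverage cannot overtake a well-covered one merely because its tag words are close together.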

As described in the above steps S208 to S209, the overall tag matching scores of all the documents in the document library are calculated in the above manner, and the documents are then sorted by overall tag matching score from high to low to obtain the first ranking result. For example, the document library includes document A, document B, document C and document D, whose overall tag matching scores are ranked as follows: document A < document B < document C < document D. According to a preset rule and the first ranking result, the documents meeting the preset rule are set as the documents matched with the search text currently input by the user. Illustratively, when the preset rule is to select the documents whose overall tag matching scores rank in the top three in the document library, document B, document C and document D are selected in this embodiment as the documents matched with the search text currently input by the user, for the user to retrieve.
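The sorting and selection steps can be sketched as follows. The score values are made up purely to reproduce the ordering document A < document B < document C < document D, and the function name `select_matches` and the `top_n` parameter are hypothetical.

```python
def select_matches(scores, top_n=3):
    """Sort documents by overall tag matching score (high to low), keep the top N."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:top_n]]


# Hypothetical scores consistent with A < B < C < D.
scores = {"document A": 0.2, "document B": 0.5,
          "document C": 0.9, "document D": 1.4}
matches = select_matches(scores)
print(matches)
```

With a top-three rule, the sketch returns document D, document C and document B, in that order.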

By combining the tag coverage score and the tag compactness score into an overall tag matching score, using the overall tag matching score as the measure of how well the tags of a document match the search text of the user, and selecting the documents whose overall tag matching scores meet the preset rule as the documents matched with the search text currently input by the user, the matching degree among the document tags, the document content and the search text of the user (namely the real search intention) can be improved.

The document tag generation method and the document tag matching method described above collect a search text input by a user and the clicked document name text corresponding to the search text; integrate the records that have the same search text but different clicked document name texts to obtain a first integration result, wherein the first integration result comprises the search text, each document name text and the clicked times of each document name text; obtain the longest common character string of the search text and each document name text according to the first integration result; obtain, according to the longest common character strings and the clicked times, the longest common character string with the largest click frequency among the longest common character strings; set the longest common character string with the largest click frequency as a tag candidate word, wherein the number of the tag candidate words is at least one; and set at least one of the tag candidate words as a document tag. By automatically generating the document tag and setting the longest common character string with the largest click frequency as a tag candidate word, the creation process of the document tag is simplified, and the matching degree between the search intention of the user and the document tag is improved.

Referring to fig. 7, an embodiment of the present application further provides a document tag generating apparatus, including:

a collecting module 701, configured to collect a search text input by a user and a clicked document name text corresponding to the search text;

an integration module 702, configured to integrate records that are the same in the search texts but different in the document name texts clicked correspondingly to obtain a first integration result, where the first integration result includes the search texts, each of the document name texts, and the clicked times of each of the document name texts;

a first obtaining module 703, configured to obtain the longest common character string of the search text and each of the document name texts according to the first integration result;

a second obtaining module 704, configured to obtain, according to the longest common character strings and the clicked times, the longest common character string with the largest click frequency among the longest common character strings;

a tag candidate word setting module 705, configured to set the longest common character string with the largest click frequency as a tag candidate word, where the number of the tag candidate words is at least one;

and a document tag generating module 706, configured to set at least one of the tag candidate words as a document tag.

In this embodiment, the collecting module 701 may collect the search text input by the user in the search engine and the clicked document name text corresponding to the search according to the search log and the click log (which records the documents clicked in the search results) of the user. To expand the sample data, the search and click logs over a period of time (e.g., one month) may be collected for centralized information collection.

The integration module 702 integrates the records that have the same search text but different clicked document name texts to obtain a first integration result, where the first integration result includes the search text, the different document name texts, and the clicked times of the different document name texts. For example, assuming that the same search text input multiple times in the log records is "win10 blue screen", the clicked document name texts corresponding to the search results are "what win10 blue screen is", "processing method of computer blue screen", "newly purchased MAC computer blue screen", and "blue screen reinstallation system", and the clicked times corresponding to the clicked document name texts are "10", "1", and "2", respectively; these log records are integrated to obtain the first integration result.

The first obtaining module 703 obtains the longest common character string of the search text of the user and each document name text according to the first integration result; the second obtaining module 704 obtains the longest common character string with the largest click frequency among the longest common character strings according to the longest common character strings and the clicked times corresponding to the document name texts; the tag candidate word setting module 705 sets the longest common character string with the largest click frequency as a tag candidate word, where the number of the tag candidate words is at least one; and the document tag generation module 706 sets at least one of the tag candidate words as a document tag. Continuing the above example: the longest common character strings "win10 blue screen" and "blue screen" are clicked 10 times and 13 times respectively, so "blue screen" is set as a tag candidate word. In some other embodiments, there may be longest common character strings that have the same clicked frequency but different text contents. For example, if the longest common character strings are "win10 blue screen" and "blue screen", the clicked frequency of each is 10, and 10 is the largest clicked frequency in the integration result, then both "win10 blue screen" and "blue screen" are set as tag candidate words. In practical applications, either one of "win10 blue screen" and "blue screen" may be set as a document tag, or both may be set as document tags. The number of tag candidate words is not limited to one or two and may be more than two, which is not limited herein.
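The processing performed by modules 703 to 705 can be sketched as follows. This is a minimal sketch under stated assumptions: the longest common character string is computed with a standard dynamic-programming longest-common-substring routine, the function names are hypothetical, the document names are loose English renderings, and the per-document click counts are invented so that the totals match the 10 and 13 clicks of the example. Surrounding spaces are stripped only because the English renderings contain spaces.

```python
from collections import Counter


def longest_common_substring(a, b):
    """Longest contiguous string shared by a and b (first one found on ties)."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]


def tag_candidates(search_text, clicked):
    """Sum clicks per longest common string; return the string(s) clicked most."""
    clicks = Counter()
    for name, count in clicked.items():
        common = longest_common_substring(search_text, name).strip()
        if common:
            clicks[common] += count
    if not clicks:
        return []
    top = max(clicks.values())
    return [text for text, count in clicks.items() if count == top]


# Hypothetical first integration result for the search text "win10 blue screen".
clicked = {
    "what to do about a win10 blue screen": 10,
    "how to handle a computer blue screen": 10,
    "newly purchased MAC computer blue screen": 1,
    "blue screen reinstall system": 2,
}
candidates = tag_candidates("win10 blue screen", clicked)
print(candidates)
```

When several longest common character strings share the largest click frequency, the sketch returns all of them, which corresponds to the case in which more than one tag candidate word is set.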

In practical applications, a plurality of related tag words may be obtained in the above manner. In order to remove redundancy and ensure the specificity of the document tag words (for example, words such as "of" and "ground" may also be set as document tag words although they carry no specific semantic meaning), in some embodiments, document tag words whose length does not meet preset requirements (for example, the preset requirements are that the length is not 1 and is not greater than 10) may be eliminated. In addition, if a long document tag word can be composed of short document tag words, the long document tag word may also be eliminated, so that only the short tag words are retained and duplication is avoided.
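The two elimination rules can be sketched as follows. The function names, the default length limits (taken from the example requirement of a length between 2 and 10) and the sample tag words are assumptions for illustration; the check for "composed of short tag words" is implemented here as an exact segmentation of the long word into shorter retained words, which is one possible reading of the text.

```python
def can_be_composed(word, shorter):
    """True if `word` can be split entirely into words taken from `shorter`."""
    reachable = [True] + [False] * len(word)
    for end in range(1, len(word) + 1):
        for piece in shorter:
            start = end - len(piece)
            if start >= 0 and reachable[start] and word[start:end] == piece:
                reachable[end] = True
                break
    return reachable[len(word)]


def filter_tag_words(tag_words, min_len=2, max_len=10):
    """Drop tag words of unsuitable length, then drop composable long words."""
    kept = [w for w in set(tag_words) if min_len <= len(w) <= max_len]
    result = []
    for word in kept:
        shorter = [w for w in kept if len(w) < len(word)]
        if not can_be_composed(word, shorter):
            result.append(word)
    return sorted(result)


# Hypothetical tag words; "bluescreen" is composed of "blue" and "screen".
tags = ["a", "blue", "screen", "bluescreen", "win10", "averyveryverylongtag"]
filtered = filter_tag_words(tags)
print(filtered)
```

In this example the single-character word and the over-long word are removed by the length rule, and "bluescreen" is removed because it can be assembled from the retained shorter words.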

Referring to fig. 8, an embodiment of the present application further provides a document tag matching apparatus, including:

a search text acquisition module 801, configured to acquire a search text input by a user;

a first tag generation module 802, configured to generate a first tag for the search text based on a document tag library, where the document tag library is constructed and obtained based on the document tag generation method provided in the foregoing embodiment, and the first tag includes at least one tag word;

a second tag generating module 803, configured to generate a second tag for each document based on the document tag library, where the document is stored in a document library, a plurality of documents are stored in the document library and are searched by a user, and the second tag includes at least one tag word;

an effective tag generation module 804, configured to match the first tag with the second tag, and set a part of the first tag that is the same as the second tag as an effective tag, where the effective tag includes at least one tag word;

a tag coverage score obtaining module 805, configured to sequentially obtain a tag coverage score of each document based on the first tag and the second tag, where the tag coverage score is used to represent a matching degree between the document content and the search text;

a compactness score obtaining module 806, configured to sequentially obtain a tag compactness score for each of the documents based on the valid tags, where the tag compactness score is used to represent a position proximity of the tag content in the document content;

an overall tag matching score obtaining module 807 for obtaining an overall tag matching score for each of the documents according to the tag coverage score and the tag compactness score;

a sorting module 808, configured to sort the documents according to the overall tag matching scores to obtain a first sorting result;

and the matched document setting module 809 is configured to set the document meeting the preset rule as a document matched with the search text according to a preset rule and the first ranking result.

In this embodiment, when it is detected that the user inputs a search text in the search engine, the search text input by the user is acquired through the search text acquisition module 801; the first tag generation module 802 generates a first tag for the search text based on a document tag library generated in advance; the second tag generating module 803 generates a second tag for each document in a document library (such as Baidu Wenku or a CNKI library) based on the document tag library generated in advance, where the document library stores a plurality of documents for users to search; and the valid tag generating module 804 matches the first tag with the second tag and sets the part of the first tag that is the same as the second tag as a valid tag, where the first tag, the second tag, and the valid tag each include at least one tag word.

Exemplarily, when the search text input by the user is "what to do about a windows blue screen?", a first tag ("windows", "win", "blue screen") is generated for the search text based on the document tag library generated in advance. If the content of one document in the document library is "win10 computer blue screen reinstallation system ……", a second tag ("win", "win10", "computer", "blue screen", "reinstallation system") is generated for the document based on the document tag library generated in advance, and a second tag is generated for each document in the document library in the same way. The first tag ("windows", "win", "blue screen") is matched with the second tag ("win", "win10", "computer", "blue screen", "reinstallation system"), and the part of the first tag that is the same as the second tag is set as the valid tag, where the first tag, the second tag and the valid tag each include at least one tag word. In this embodiment, the valid tag is ("win", "blue screen").
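The matching of the first tag against the second tag amounts to keeping the tag words they share. A minimal sketch, with the hypothetical function name `valid_tags` and with the order of the first tag preserved as an assumption:

```python
def valid_tags(first_tag, second_tag):
    """Tag words of the first tag that also appear in the second tag."""
    second = set(second_tag)
    return [word for word in first_tag if word in second]


first_tag = ["windows", "win", "blue screen"]
second_tag = ["win", "win10", "computer", "blue screen", "reinstallation system"]
valid = valid_tags(first_tag, second_tag)
print(valid)
```

For the example above, the result is the valid tag ("win", "blue screen").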

In this embodiment, in order to facilitate subsequent automatic generation of a tag for a document and simplification of a matching process of a document tag, a document tag library may be generated based on a large amount of sample data for calling in actual application, so as to improve efficiency. Specifically, a plurality of document tags are generated based on a plurality of user search texts (that is, different search texts input by users, clicked document name texts corresponding to search results and clicked times) and the document tag generation method provided by the above embodiment; and generating a document tag library based on the plurality of document tags.

In some embodiments, in order to improve the statistics and search efficiency of the character string, in the document tag library generation process, a prefix tree technique may be introduced, where a prefix tree is constructed by using a plurality of document tags, and the prefix tree constructed by using the plurality of document tags is set as the document tag library (the prefix tree includes all document tags). The prefix tree, also known as dictionary tree, word search tree and Trie tree, is a multipath tree structure, is a variety of hash tree and is a multi-branch tree structure for quick retrieval. Its typical application is for counting and ordering a large number of character strings (but not limited to character strings), so it is often used by search engine systems for text word frequency statistics, and its advantages are: the unnecessary character string comparison is reduced to the maximum extent, and the query efficiency is high.
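A prefix tree used as a document tag library can be sketched as follows. This is an illustrative sketch, not the structure used by the application: the class name `TagTrie`, the nested-dictionary representation, and the rule that a tag is generated for every library tag occurring in the text are all assumptions.

```python
class TagTrie:
    """Prefix tree holding document tags; nested dicts keyed by character."""

    _END = ""  # sentinel key marking the end of a stored tag

    def __init__(self, tags=()):
        self.root = {}
        for tag in tags:
            self.insert(tag)

    def insert(self, tag):
        node = self.root
        for ch in tag:
            node = node.setdefault(ch, {})
        node[self._END] = True

    def extract(self, text):
        """Return every stored tag occurring in `text`, in order of first match."""
        found = []
        for start in range(len(text)):
            node = self.root
            for end in range(start, len(text)):
                node = node.get(text[end])
                if node is None:
                    break
                if self._END in node:
                    tag = text[start:end + 1]
                    if tag not in found:
                        found.append(tag)
        return found


library = TagTrie(["windows", "win", "win10", "blue screen",
                   "computer", "reinstallation system"])
first_tag = library.extract("how do I deal with a windows blue screen?")
second_tag = library.extract("win10 computer blue screen reinstallation system")
print(first_tag)
print(second_tag)
```

Because tags that share a prefix (such as "win", "win10" and "windows") share a path in the tree, a single scan from each start position finds all of them without comparing the text against every tag separately.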

In this embodiment, the tag coverage score obtaining module 805 sequentially obtains a tag coverage score of each document in the document library based on the obtained first tag and second tag, where the tag coverage score is used to represent the matching degree between the document content and the search text, and a higher tag coverage score indicates a higher matching degree between the document content and the search text. The compactness score obtaining module 806 sequentially obtains a tag compactness score of each document in the document library based on the valid tag, where the tag compactness score is used to represent the position closeness of the valid tag content in the document content; the higher the tag compactness score, the closer the positions of the tag words in the valid tag, that is, the better the valid tag conforms to the real search intention of the user. The overall tag matching score obtaining module 807 then obtains the overall tag matching score of each document in the document library according to the tag coverage score and the tag compactness score obtained through the above calculation. The sorting module 808 sorts the documents by overall tag matching score from high to low to obtain a first sorting result; for example, the document library includes document A, document B, document C, and document D, whose overall tag matching scores are ranked as follows: document A < document B < document C < document D. Finally, the matched document setting module 809 sets the documents meeting the preset rule as the documents matched with the search text currently input by the user according to the preset rule and the first sorting result.
Illustratively, when the preset rule is to select the documents whose overall tag matching scores rank in the top three in the document library, document B, document C and document D are selected in this embodiment as the documents matched with the search text currently input by the user, for the user to retrieve.

It can be understood that each component of the document tag generation apparatus and the document tag matching apparatus provided in the present application may respectively implement the functions of any one of the document tag generation method, the document tag library generation method, and the document tag matching method provided in any one of the above embodiments, and the specific structures are not described in detail again.

Referring to fig. 9, an embodiment of the present application further provides a computer device, and an internal structure of the computer device may be as shown in fig. 9. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. The processor of the computer device is used to provide computing and control capabilities. The memory of the computer device comprises a storage medium and an internal memory. The storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of the operating system and the computer program in the storage medium. The database of the computer device is used for storing relevant data of the document tag generation method. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by the processor to implement one or more of the document tag generation method, the document tag library generation method, and the document tag matching method provided by any of the above embodiments.

The embodiments of the present application further provide a computer-readable storage medium, which may be non-volatile or volatile, and on which a computer program is stored, where the computer program, when executed by a processor, implements one or more of the document tag generation method, the document tag library generation method, and the document tag matching method provided in any of the above embodiments.

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and can include the processes of the embodiments of the methods described above when the computer program is executed. Any reference to memory, storage, database, or other medium provided herein and used in the examples may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), among others.

The document tag generation method, the document tag matching method, the document tag generation device and the document tag matching device collect a search text input by a user and the clicked document name text corresponding to the search text; integrate the records that have the same search text but different clicked document name texts to obtain a first integration result, wherein the first integration result comprises the search text, each document name text and the clicked times of each document name text; obtain the longest common character string of the search text and each document name text according to the first integration result; obtain, according to the longest common character strings and the clicked times, the longest common character string with the largest click frequency among the longest common character strings; set the longest common character string with the largest click frequency as a tag candidate word, wherein the number of the tag candidate words is at least one; and set at least one of the tag candidate words as a document tag. By automatically generating the document tag and setting the longest common character string with the largest click frequency as a tag candidate word, the creation process of the document tag is simplified, and the matching degree between the search intention of the user and the document tag is improved.

It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, apparatus, article, or method. Without further limitation, an element defined by the phrase "comprising a … …" does not exclude the presence of another identical element in a process, apparatus, article, or method that comprises the element.

The above description is only a preferred embodiment of the present application, and not intended to limit the scope of the present application, and all modifications of equivalent structures and equivalent processes, which are made by the contents of the specification and the drawings of the present application, or which are directly or indirectly applied to other related technical fields, are also included in the scope of the present application.

Claims

1. A document tag generation method is characterized by comprising the following steps:

obtaining the longest common character string of the search text and each document name text according to the first integration result;

setting the longest common character string with the largest click frequency as a tag candidate word, wherein the number of the tag candidate words is at least one;

and setting at least one of the label candidate words as a document label.

2. A document tag matching method is characterized by comprising the following steps:

acquiring a search text input by a user;

generating a first tag for the search text based on a document tag library, wherein the document tag library is obtained based on a document tag construction obtained by the document tag generation method according to claim 1, and the first tag comprises at least one tag word;

sorting the documents according to the overall tag matching scores to obtain a first sorting result;

and setting the document meeting the preset rule as the document matched with the search text according to the preset rule and the first sequencing result.

3. The document tag matching method of claim 2, wherein the step of sequentially obtaining a tag compactness score for each of the documents based on the valid tags comprises:

arranging the elements at each position in sequence to generate a first sequence;

4. The document tag matching method of claim 3, wherein the step of obtaining a first tag combination based on the first sequence comprises:

5. The document tag matching method of claim 2, wherein the overall tag matching score is obtained according to the following formula:

score = score_cover * (1 + t * score_close),

6. The document tag matching method of claim 2, wherein the tag coverage score is obtained according to the following formula:

7. The document tag matching method of claim 3, wherein the tag compactness score is obtained according to the following formula:

8. A document tag generation apparatus, comprising:

the integration module is used for integrating the same search texts but different records of the corresponding clicked document name texts to obtain a first integration result, wherein the first integration result comprises the search texts, the document name texts and the click times of the document name texts;

the first acquisition module is used for acquiring the longest common character string of the search text and each document name text according to the first integration result;

the second obtaining module is used for obtaining, according to the longest common character strings and the click times, the longest common character string with the largest click frequency among the longest common character strings;

9. A document tag matching apparatus, comprising:

a first tag generation module, configured to generate a first tag for the search text, where the document tag library is constructed and obtained based on the document tag generation method according to claim 1, and the first tag includes at least one tag word;

a tag coverage score obtaining module, configured to sequentially obtain a tag coverage score of each document based on the first tag and the second tag, where the coverage score is used to represent a matching degree between the document content and the search text;

a compactness score obtaining module, configured to sequentially obtain a tag compactness score of each document based on the valid tag, where the compactness score is used to represent a position proximity of the tag content in the document content;

10. A computer device comprising a memory and a processor, the memory having stored therein a computer program, characterized in that the processor, when executing the computer program, implements the steps of the document tag generation method of claim 1 and/or the document tag matching method of any of claims 2-7.

11. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the document tag generation method of claim 1 and/or the document tag matching method of any one of claims 2 to 7.