CN111552862B - Automatic template mining system and method based on cross support evaluation - Google Patents


Info

Publication number
CN111552862B
CN111552862B
Authority
CN
China
Prior art keywords
category
module
words
record
mining
Prior art date
Legal status
Active
Application number
CN201911383296.9A
Other languages
Chinese (zh)
Other versions
CN111552862A (en)
Inventor
何立华
贺小勇
Current Assignee
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN201911383296.9A priority Critical patent/CN111552862B/en
Publication of CN111552862A publication Critical patent/CN111552862A/en
Application granted granted Critical
Publication of CN111552862B publication Critical patent/CN111552862B/en


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/953: Querying, e.g. by the use of web search engines
    • G06F16/9535: Search customisation based on user profiles and personalisation
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an automatic template mining system and method based on cross support evaluation. The system comprises an intention recognition module, a category word replacement module, a frequent item set mining module and a template ordering module. The intention recognition module performs intention recognition on the user's history records and sends the recognized records to the category word replacement module; the category word replacement module segments the intention-recognized records, replaces category words, and sends the records after category-word replacement to the frequent item set mining module; the frequent item set mining module mines the records after category-word replacement with an association rule mining algorithm and screens frequent items to obtain preliminary templates; the template ordering module orders the preliminary templates by entropy and by similarity with the existing word list. The invention improves the frequent item set mining method: evaluation based on cross support yields higher-quality results than evaluation based on confidence.

Description

Automatic template mining system and method based on cross support evaluation
Technical Field
The invention relates to the field of automatic mining of search templates, in particular to an automatic template mining system and method based on cross support evaluation.
Background
In vertical search, when a user's search keyword matches a rule word in the database, the related data in the database are returned. In practice, users' search keywords are highly varied, and it is difficult to configure all matching words manually; as the number of search categories grows, manual configuration clearly becomes unrealistic, so it is necessary to design an algorithm that automatically mines the search templates commonly used by users. Current research mainly mines search templates from users' historical data; a typical representative is Baidu's search-technology patent 'Automatic mining method of demand recognition templates, demand recognition method, and corresponding devices'. Its specific steps are: determine the record set corresponding to a preset type in the search log; from that set, select the records whose click count for the preset type exceeds a preset number to form seed templates; match the words of the preset type in the seed templates against a preset dictionary and replace them with type attribute words; and obtain the templates.
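The prior-art seed-template flow just described can be sketched as follows. This is a minimal illustration with hypothetical, pre-segmented English stand-in queries and an invented click threshold, not the actual patent code:

```python
# Sketch of the prior-art seed-template flow: keep only high-click records,
# then replace dictionary words with their type attribute word.
# (Hypothetical data structures; queries are pre-segmented stand-ins.)

def mine_seed_templates(records, type_dict, min_clicks=2):
    """records: list of (query, click_count); type_dict: word -> type label."""
    templates = set()
    for query, clicks in records:
        if clicks <= min_clicks:       # low-click records are discarded
            continue
        words = query.split()
        # replace dictionary words with their type attribute word, e.g. [hotel]
        replaced = [f"[{type_dict[w]}]" if w in type_dict else w for w in words]
        if replaced != words:          # keep only records that matched the dict
            templates.add(" ".join(replaced))
    return templates

records = [("7days how-much", 5), ("homeinn how-much", 1)]
type_dict = {"7days": "hotel", "homeinn": "hotel"}
print(mine_seed_templates(records, type_dict))  # → {'[hotel] how-much'}
```

Note how the low-click 'homeinn' record is dropped even though it shares the same template, which is exactly the defect the patent criticizes.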
The main defect of that technique is that some records with potential template intent are discarded. For example, the two records 'how much is the 7 Days hotel' and 'how much is the Home Inn hotel' share the template 'how much is the [hotel]'; under the prior art, if the click counts of the two records are low, both records are removed, even though both actually carry the template intent.
Such problems can be avoided by methods based on frequent item set mining. However, frequent item set mining based on the traditional confidence measure suffers from the rare item set problem: a non-frequent item set may yield potentially valuable rules, yet be filtered out when screening by confidence. In automatic template mining, the records after intention recognition have an unbalanced item set support distribution, so the occurrence frequency of category words can far exceed that of other words.
In addition, the traditional confidence,
conf(A→B) = P(AB) / P(A),
considers only the influence of A on B while ignoring how often B occurs on its own. If the confidence threshold is set unreasonably, then even when
P(AB) = P(A) · P(B),
which shows that the two items A and B are independent, such a record is still kept because the confidence threshold is set too low. Since the support distribution is unbalanced after intention recognition, the quality of the mined results depends heavily on the confidence setting, and the most suitable confidence threshold is often difficult to find.
Disclosure of Invention
In order to solve the problems existing in the conventional confidence evaluation, the invention provides an automatic template mining system and method based on cross support evaluation.
The object of the invention is achieved by at least one of the following technical solutions.
The template automatic mining system based on the cross support evaluation comprises an intention recognition module, a category word replacement module, a frequent item set mining module and a template ordering module;
the intention recognition module is used for carrying out intention recognition on the history record of the user and sending the record subjected to the intention recognition to the category word replacement module;
the category word replacement module is used for cutting words from the records subjected to conscious recognition, replacing category words, and sending the records subjected to category word replacement to the frequent item set mining module;
the frequent item set mining module is used for mining the records after the category words are replaced by using an association rule mining algorithm and screening frequent items to obtain a preliminary template;
the template ordering module is used for ordering the preliminary templates according to the entropy value and the similarity with the existing word list.
Further, in the intention recognition module, an intention recognition model (for example, a fasttext model) is trained on relevant records, i.e., the user's search records; the trained model then performs intention recognition on the historical search records;
the training intention model is to input data with category labels, and the output of the model is corresponding category labels, for example, the training intention model is input with: 'hotel how much money', the tag is 'hotel'; the model is characterized in that the weather is the weather, the input of the model is the hotel money, the weather is the weather, the output is the hotel, the weather is the data with a large number of labels, the model learns parameters in the weather, the training is carried out, the meaning model calculates the probability that the records belong to each category according to the input records and outputs the category with the highest probability, for example, the probability that the model is newly input into the hotel nearby is the hotel, the probability of the hotel is the greatest, the model is classified into the hotel, the probability of the hotel belongs to other categories is smaller, and the model cannot be classified into other categories.
Further, in the category word replacement module, the records subjected to intention recognition are segmented using jieba word segmentation. For example, the input 'how much is the nearby hotel' is segmented into: nearby / hotel / how much. A matching algorithm then yields the words 'nearby', 'hotel' and 'how much'. Words in the record that relate to a fixed category are replaced with that category's word; for example, in template mining for the food category, 'fast food restaurant', 'food restaurant' and the like are replaced with [food]. The unified symbol here includes, but is not limited to, the form [food]. The historical search records are then replaced. For example, a word list containing words such as 'fast food shop' and 'fast food restaurant' is provided by the service provider and can be used directly; if a word in a search record is found in this word list, it is uniformly replaced with [food], i.e., 'fast food shop' and 'fast food restaurant' are replaced with [food], so that 'nearby fast food shop' becomes 'nearby [food]'.
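A minimal sketch of the replacement step, assuming the queries are already segmented (jieba would do this in practice) and using an invented English stand-in vocabulary:

```python
# Replace any word found in the provider's food word list with the
# unified category marker '[food]'. Queries are pre-segmented stand-ins.
def replace_category(words, category_vocab, tag="[food]"):
    return [tag if w in category_vocab else w for w in words]

food_vocab = {"fast-food-shop", "fast-food-restaurant", "hot-pot-restaurant"}
print(replace_category(["nearby", "fast-food-shop"], food_vocab))
# → ['nearby', '[food]']
```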
Further, in the frequent item set mining module, the records after category-word replacement are segmented. Whereas the segmentation in the category word replacement module takes 'nearby fast food shop' as input and produces 'nearby / fast food shop', the segmentation in the frequent item set mining module takes 'nearby [food]' as input and produces 'nearby / [food]'. The words obtained after segmentation are de-duplicated and punctuation marks are removed; the results form the items to be mined, which are mined with an association rule mining algorithm.
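The item-preparation step described here (drop punctuation, de-duplicate) can be sketched as:

```python
# Turn one segmented record into an item set for mining: drop
# punctuation-only tokens and de-duplicate while preserving order.
import string

def to_items(words):
    seen, items = set(), []
    for w in words:
        if all(c in string.punctuation for c in w) or w in seen:
            continue
        seen.add(w)
        items.append(w)
    return items

print(to_items(["nearby", "[food]", ",", "nearby"]))  # → ['nearby', '[food]']
```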
Further, the association rule mining algorithm comprises a modified FP-Growth algorithm in which frequent items are screened by cross support. The cross support evaluation index is:
CS(A, B) = P(AB) / (P(A) · P(B)) - 1
Here A and B denote words obtained after segmentation, P(A) denotes the frequency of occurrence of A, P(B) the frequency of occurrence of B, and P(AB) the probability that A and B occur together. For example, 'nearby hotel' segments into 'nearby / hotel'; after input to the model, A is 'nearby', B is 'hotel', and P(AB) denotes the probability that the two words 'nearby' and 'hotel' occur together. When the cross support is calculated, the corresponding category words are excluded from the calculation. The cross support threshold is set to be greater than 0, where a value greater than 0 means the two items are positively correlated. Category words occur far too frequently to be useful for screening: after intention recognition all retained records are of the same type, such as food, so many records contain the marker word [food] and its frequency is very high. Since the frequent item mining algorithm screens by frequency, such an overly frequent word would interfere with mining; therefore category words such as [food] are not considered, i.e., their occurrences are not counted when frequencies are computed.
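Under the reconstruction of the cross support index used above, CS(A, B) = P(AB)/(P(A)·P(B)) - 1, the screening step might look like the following sketch. The toy records are invented, and the category marker [food] is excluded from the computation as described:

```python
# Estimate P(A), P(B), P(AB) from a toy record set and keep word pairs
# with cross support > 0, skipping the category marker '[food]'.
# CS is 0 when A and B are statistically independent.
from itertools import combinations

def cross_support_pairs(records, skip=frozenset({"[food]"})):
    n = len(records)
    sets = [set(r) - skip for r in records]     # category words not counted
    words = set().union(*sets)
    kept = {}
    for a, b in combinations(sorted(words), 2):
        pa = sum(a in s for s in sets) / n
        pb = sum(b in s for s in sets) / n
        pab = sum(a in s and b in s for s in sets) / n
        cs = pab / (pa * pb) - 1
        if cs > 0:                              # positive correlation only
            kept[(a, b)] = cs
    return kept

records = [["nearby", "[food]", "how-much"],
           ["nearby", "[food]"],
           ["open", "[food]", "how-much"]]
print(cross_support_pairs(records))
```

In this toy data only the pair ('how-much', 'open') has positive cross support; the weakly associated pair ('how-much', 'nearby') is rejected.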
Further, in the template ordering module, the probability of each word filling the slot in templates of the same type is calculated, and the generality of a template is evaluated by its entropy: the entropy measures how many distinct fillers the fixed-category word admits and with what probabilities. The entropy S is calculated as:
S=-∑p(A)log(p(A));
For example, suppose the mined template is '[food] how much money'. Because [food] is a replacement, it may cover many cases; suppose that here [food] covers three fillers: 'fast food restaurant' appears 5 times, 'food restaurant' 3 times, and 'hot pot restaurant' 2 times. The entropy of the template '[food] how much money' (using the natural logarithm) is then:
S = -(0.5·log 0.5 + 0.3·log 0.3 + 0.2·log 0.2) ≈ 1.03.
Suppose another template is '[food] not good', and only one record type, 'fast food restaurant not good', matches it, appearing 5 times. The entropy of that template is:
S = -(1·log 1) = 0.
This entropy is lower than that of '[food] how much money', and in practice '[food] how much money' is indeed the more general template.
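The entropy computation from the worked example can be sketched as follows (English stand-in filler names, natural logarithm assumed):

```python
# Entropy-based generality score for a template: count how often each
# concrete category word filled the slot, then compute S = -sum(p*log(p)).
import math

def template_entropy(fill_counts):
    total = sum(fill_counts.values())
    return -sum((c / total) * math.log(c / total) for c in fill_counts.values())

# '[food] how much money': fast food restaurant x5, food restaurant x3, hot pot x2
s1 = template_entropy({"fast-food": 5, "restaurant": 3, "hot-pot": 2})
# '[food] not good': only 'fast food restaurant not good' matches (x5)
s2 = template_entropy({"fast-food": 5})

print(round(s1, 3))  # entropy of the three-filler template, about 1.03
print(s2 == 0)       # → True: a single-filler template has entropy 0
```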
The similarity with the existing word list is computed by cosine similarity. Template mining is an iterative task: each round, the mined words are added to the word list actually in use, and in the next round, if a newly mined word is similar to the words already in that list, it is reasonable to believe the new word is also suitable to add. In practice each word is represented by a multidimensional vector trained in advance, i.e., each word corresponds to a multidimensional vector representation, so the similarity between two words is obtained by computing the cosine value between their vectors. For example, if word A has the vector [1 0 1 1 0 1] and word B has the vector [0 1 1 1 0 1], the similarity of the two words is the cosine value of the two vectors. An LR (logistic regression) ranking model is then trained to order the templates by their entropy and their similarity to the existing word list: the model is first trained to obtain the weighting parameters, e.g. after training the entropy may account for 40 percent and the similarity for 60 percent, and the templates are then ranked by the weighted combination of entropy and similarity to the existing word list.
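The cosine similarity of the example vectors can be computed directly (the six-dimensional vectors are the made-up example values from the text):

```python
# Cosine similarity between two word vectors, as used to compare a newly
# mined word against words already in the vocabulary.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

a = [1, 0, 1, 1, 0, 1]
b = [0, 1, 1, 1, 0, 1]
print(cosine(a, b))  # → 0.75
```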
Further, the training method of the ranking model is as follows: first, historical record data are collected manually; data relevant to the training category are labeled '1' and data irrelevant to it are labeled '0'. For example, suppose 'nearby hotel' and 'recent weather' are collected: 'nearby hotel' is relevant to the hotel category and 'recent weather' to the weather category, so if the weather-related category is being trained, the input 'recent weather' is manually labeled '1' and 'nearby hotel' is manually labeled '0'. These labeled records are then fed into the ranking model for training to obtain its weighting parameters, and subsequently new records are ranked using the learned parameters.
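A sketch of the final weighted ranking, using the illustrative 40/60 split mentioned above; the feature values are invented and the weights would in practice come from the trained LR model:

```python
# Rank templates by a weighted combination of entropy and vocabulary
# similarity. The 0.4 / 0.6 weights are the illustrative figures from the
# text, standing in for parameters learned by the LR ranking model.
def rank_templates(features, w_entropy=0.4, w_sim=0.6):
    """features: {template: (entropy, similarity)} -> templates, best first."""
    def score(t):
        entropy, sim = features[t]
        return w_entropy * entropy + w_sim * sim
    return sorted(features, key=score, reverse=True)

features = {"[food] how-much": (1.03, 0.8),   # general, similar to vocab
            "[food] not-good": (0.0, 0.5)}    # narrow, less similar
print(rank_templates(features))  # most general / most similar first
```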
The automatic template mining method based on the cross support evaluation comprises the following steps:
S1, inputting the user's history records, and performing intention recognition on them with the intention recognition module to obtain the records subjected to intention recognition;
S2, using the category word replacement module to segment the intention-recognized records with jieba word segmentation and to replace the words related to the fixed category with the fixed category word, obtaining the records after category-word replacement;
S3, using the frequent item set mining module to segment the records after category-word replacement with jieba word segmentation and de-duplicate the result, taking the words obtained after segmentation as the items to be mined and mining them with the association rule mining algorithm;
s4, screening the frequent items by adopting a frequent item set mining module;
s5, traversing the records subjected to intention recognition, which are processed in the S1, and keeping the results which simultaneously accord with the frequent items screened in the S4 to obtain a preliminary template;
s6, adopting a template ordering module to order and display the templates.
Compared with the prior art and the traditional confidence-based mining method, the invention has the advantage that it dispenses with setting a confidence threshold, and the quality of the mined results is higher.
Drawings
FIG. 1 is a flow chart of steps of the template automatic mining method based on cross support evaluation of the present invention;
FIG. 2 is a schematic diagram of results obtained through an intent recognition model in an embodiment of the present invention;
FIG. 3 is a schematic diagram of a process for category word replacement in an embodiment of the present invention;
FIG. 4 is a pseudo code schematic diagram of an improvement in frequent itemset mining algorithm evaluation metrics in an embodiment of the present invention;
FIG. 5 is a pseudo code schematic of an algorithm for mining potential templates in an embodiment of the invention;
FIG. 6 is a schematic diagram of an automatic template mining method based on cross support evaluation in an embodiment of the present invention;
fig. 7 is a schematic diagram illustrating a conventional automatic template mining method based on confidence in an embodiment of the present invention.
Detailed Description
For the purpose of making the technical solutions and advantages of the present invention more apparent, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the embodiments are some embodiments of the present invention, but not all embodiments.
Examples:
In this embodiment, the mining of templates of the food category is taken as an example.
The template automatic mining system based on the cross support evaluation comprises an intention recognition module, a category word replacement module, a frequent item set mining module and a template ordering module;
the intention recognition module is used for carrying out intention recognition on the history record of the user and sending the record subjected to the intention recognition to the category word replacement module;
In the intention recognition module, an intention recognition model (for example, a fasttext model) is trained on relevant records, i.e., the user's search records; the trained model then performs intention recognition on the historical search records;
the training intention model is to input data with category labels, and the output of the model is corresponding category labels, for example, the training intention model is input with: 'hotel how much money', the tag is 'hotel'; the model is characterized in that the weather is the weather, the input of the model is the hotel money, the weather is the weather, the output is the hotel, the weather is the data with a large number of labels, the model learns parameters in the weather, the training is carried out, the meaning model calculates the probability that the records belong to each category according to the input records and outputs the category with the highest probability, for example, the probability that the model is newly input into the hotel nearby is the hotel, the probability of the hotel is the greatest, the model is classified into the hotel, the probability of the hotel belongs to other categories is smaller, and the model cannot be classified into other categories.
the category word replacement module is used for segmenting the records subjected to intention recognition, replacing category words, and sending the records after category-word replacement to the frequent item set mining module;
in the category word replacement module, the record subjected to intention recognition is segmented by adopting the bargaining segmentation, for example, the input is 'how much money is paid by a nearby hotel', and the segmented word is obtained: nearby/hotel/money. The matching algorithm is used to obtain the nearby',hoteland money of each word; words related to the fixed category in the record are replaced by words of the fixed category, such as template mining of food category, fast food restaurant, food restaurant and the like are replaced by [ food ]. The unified symbols herein include, but are not limited to, the expression form of [ food ]. The history search record is replaced, for example, a word list including words such as ' nearby fast food shops ' and ' fast food shops ' is provided in the history search record, the words are provided by service providers and can be directly taken, and then if the words in the search record can be found in the word list, the words are uniformly replaced by [ food ], namely the ' fast food shops ' and the ' fast food shops ' are replaced by [ food ] ' to obtain the ' nearby [ food ] '.
The frequent item set mining module is used for mining the records after the category words are replaced by using an association rule mining algorithm and screening frequent items to obtain a preliminary template;
in the frequent item set mining module, the record after the category word is replaced is subjected to word segmentation, the word segmentation in the category word replacement module is input as 'nearby fast food stores', the word segmentation is later as 'nearby/fast food stores', the word segmentation in the frequent item set mining module is input as 'nearby [ food ]', and the word segmentation is later as 'nearby/[ food ]'; and removing the duplication of the word obtained after the word segmentation, removing punctuation marks as the item to be mined, and mining by using an association rule mining algorithm.
The association rule mining algorithm comprises an improved FP-Growth algorithm in which frequent items are screened by cross support. The cross support evaluation index is:
CS(A, B) = P(AB) / (P(A) · P(B)) - 1
Here A and B denote words obtained after segmentation, P(A) denotes the frequency of occurrence of A, P(B) the frequency of occurrence of B, and P(AB) the probability that A and B occur together. For example, 'nearby hotel' segments into 'nearby / hotel'; after input to the model, A is 'nearby', B is 'hotel', and P(AB) denotes the probability that the two words 'nearby' and 'hotel' occur together. When the cross support is calculated, the corresponding category words are excluded from the calculation. The cross support threshold is set to be greater than 0, where a value greater than 0 means the two items are positively correlated. Category words occur far too frequently to be useful for screening: after intention recognition all retained records are of the same type, such as food, so many records contain the marker word [food] and its frequency is very high. Since the frequent item mining algorithm screens by frequency, such an overly frequent word would interfere with mining; therefore category words such as [food] are not considered, i.e., their occurrences are not counted when frequencies are computed.
In this embodiment, A and B are two items. The conventional confidence is calculated as
conf(A→B) = P(AB) / P(A).
Assuming that A and B are independent of each other, then
conf(A→B) = P(AB) / P(A) = P(B),
so A is irrelevant to B; yet if P(B) alone reaches the set threshold, AB is mined as an associated item, and the mined rule is a pseudo rule that constitutes interference. The method of the invention instead uses
CS(A, B) = P(AB) / (P(A) · P(B)) - 1.
When A and B are independent, the cross support obtained is 0, so such items are screened out and the mining quality improves. In addition, the traditional confidence method requires manually setting a suitable threshold: an overly large threshold loses some association rules, while an overly small one retains redundant information.
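The contrast above can be checked numerically. With the confidence and cross support written as in this section, an independent pair passes a typical confidence threshold but is rejected by the cross support screen (the probabilities and the 0.6 threshold are invented for illustration):

```python
# For two statistically independent words, confidence can still clear a
# threshold (a pseudo rule), while cross support is exactly 0.
def confidence(p_ab, p_a):
    return p_ab / p_a

def cross_support(p_ab, p_a, p_b):
    return p_ab / (p_a * p_b) - 1

p_a, p_b = 0.5, 0.8
p_ab = p_a * p_b            # independence: P(AB) = P(A) * P(B)

print(confidence(p_ab, p_a))          # 0.8 -- passes e.g. a 0.6 confidence threshold
print(cross_support(p_ab, p_a, p_b))  # 0.0 -- rejected by the > 0 threshold
```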
The template ordering module is used for ordering the preliminary templates according to the entropy value and the similarity with the existing word list.
In the template ordering module, the probability of each word filling the slot in templates of the same type is calculated, and the generality of a template is evaluated by its entropy: the entropy measures how many distinct fillers the fixed-category word admits and with what probabilities. The entropy S is calculated as:
S=-∑p(A)log(p(A));
For example, suppose the mined template is '[food] how much money'. Because [food] is a replacement, it may cover many cases; suppose that here [food] covers three fillers: 'fast food restaurant' appears 5 times, 'food restaurant' 3 times, and 'hot pot restaurant' 2 times. The entropy of the template '[food] how much money' (using the natural logarithm) is then:
S = -(0.5·log 0.5 + 0.3·log 0.3 + 0.2·log 0.2) ≈ 1.03.
Suppose another template is '[food] not good', and only one record type, 'fast food restaurant not good', matches it, appearing 5 times. The entropy of that template is:
S = -(1·log 1) = 0.
This entropy is lower than that of '[food] how much money', and in practice '[food] how much money' is indeed the more general template.
The similarity with the existing word list is computed by cosine similarity. Template mining is an iterative task: each round, the mined words are added to the word list actually in use, and in the next round, if a newly mined word is similar to the words already in that list, it is reasonable to believe the new word is also suitable to add. In practice each word is represented by a multidimensional vector trained in advance, i.e., each word corresponds to a multidimensional vector representation, so the similarity between two words is obtained by computing the cosine value between their vectors. For example, if word A has the vector [1 0 1 1 0 1] and word B has the vector [0 1 1 1 0 1], the similarity of the two words is the cosine value of the two vectors. An LR (logistic regression) ranking model is then trained to order the templates by their entropy and their similarity to the existing word list: the model is first trained to obtain the weighting parameters, e.g. after training the entropy may account for 40 percent and the similarity for 60 percent, and the templates are then ranked by the weighted combination of entropy and similarity to the existing word list.
The training method of the ranking model is as follows: first, historical record data are collected manually; data relevant to the training category are labeled '1' and data irrelevant to it are labeled '0'. For example, suppose 'nearby hotel' and 'recent weather' are collected: 'nearby hotel' is relevant to the hotel category and 'recent weather' to the weather category, so if the weather-related category is being trained, the input 'recent weather' is manually labeled '1' and 'nearby hotel' is manually labeled '0'. These labeled records are then fed into the ranking model for training to obtain its weighting parameters, and subsequently new records are ranked using the learned parameters.
The automatic template mining method based on cross support evaluation, as shown in fig. 1, comprises the following steps:
S1, inputting the user's history records, and performing intention recognition on them with the intention recognition module to obtain the history records with food intention, as shown in FIG. 2;
S2, using the category word replacement module to segment the food-intention records with jieba word segmentation and, as shown in FIG. 3, to replace the category words related to food with [food], obtaining the records after category-word replacement;
S3, using the frequent item set mining module to segment the records after category-word replacement with jieba word segmentation and de-duplicate the result, taking the words obtained after segmentation as the items to be mined and mining them with the association rule mining algorithm;
In this embodiment, the evaluation criterion of the FP-Growth algorithm is improved; pseudo code of the improved algorithm is shown in FIG. 4.
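The pseudo code of FIG. 4 is not reproduced here; the following is an illustrative sketch of mining frequent word pairs screened by cross support, assuming the leverage-style measure P(AB) - P(A)P(B), which is consistent with the statement that a value above 0 indicates a positive correlation. The sample records are invented:

```python
from collections import Counter
from itertools import combinations

def mine_pairs(records, min_cross_support=0.0):
    # records: deduplicated token sets with the category tag excluded.
    n = len(records)
    item_count = Counter()
    pair_count = Counter()
    for rec in records:
        items = set(rec)
        item_count.update(items)
        pair_count.update(combinations(sorted(items), 2))
    mined = {}
    for (a, b), c in pair_count.items():
        # Cross support: P(AB) - P(A)P(B); keep positively correlated pairs.
        cs = c / n - (item_count[a] / n) * (item_count[b] / n)
        if cs > min_cross_support:
            mined[(a, b)] = cs
    return mined

records = [{"x", "y"}, {"x", "y"}, {"x"}, {"z"}]
pairs = mine_pairs(records)  # ("x", "y"): 0.5 - 0.75 * 0.5 = 0.125
```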
The frequent items mined by the association rules are used in S4 to find potential templates in the original records; a specific implementation is shown as pseudo code in FIG. 5.
S4, screening the frequent items by adopting a frequent item set mining module;
S5, traversing the intention-recognized records processed in S1, and keeping the results that simultaneously match the frequent items screened in S4 to obtain preliminary templates;
S6, adopting the template ordering module to rank and display the templates.
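Step S5 above can be sketched as follows, representing screened frequent items as word sets and records as token lists; the sample data is invented for illustration:

```python
def extract_templates(records, frequent_items):
    # Keep each record that contains every word of at least one
    # screened frequent item.
    templates = []
    for rec in records:
        rec_set = set(rec)
        if any(item <= rec_set for item in frequent_items) and rec not in templates:
            templates.append(rec)
    return templates

records = [["[food]", "public", "number"], ["weather", "today"]]
frequent = [{"[food]", "public"}]
templates = extract_templates(records, frequent)
# templates == [["[food]", "public", "number"]]
```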
According to the templates extracted in S6, the mined words are added to the vocabulary of the actual application, which improves the user experience. This embodiment shows the mining results for the food category, as shown in FIG. 6; the vocabulary column indicates whether a mined word already appears in the vocabulary of the actual application: a 0 means the word does not yet appear and should be added, while words that already appear need not be added again. For example, for the mined template '[food] public number', the word 'public number' does not exist in the vocabulary of the actual application, so a user searching for 'Dai sister hotpot public number' gets no result even though information about 'Dai sister hotpot' actually exists in the database: the tag of that information is 'Dai sister hotpot', which does not match 'Dai sister hotpot public number', so no result is returned. Once the word 'public number' has been mined and added, the query is matched and results are returned.
FIG. 7 shows conventional confidence-based mining and FIG. 6 shows cross-support-based mining. The main difference concerns the word 'Shenzhen': in reality its association with hotpot is not strong, because 'Shenzhen' is a city and hotpot can equally well be paired with other cities. Conventional confidence-based mining nevertheless retains 'Shenzhen' because its co-occurrence count exceeds the set threshold, even though the correlation is weak; the cross-support-based evaluation, whose result is shown in FIG. 6, avoids this situation.
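A numeric illustration of this difference, with invented co-occurrence counts for a city word and a dish word, again assuming the cross support is P(AB) - P(A)P(B): the pair passes a typical confidence threshold even though the two words are statistically independent, so only the cross-support screen filters it out:

```python
# Hypothetical counts over n records.
n = 1000
count_city = 400   # records containing the city word, e.g. 'Shenzhen'
count_dish = 500   # records containing the dish word, e.g. 'hotpot'
count_both = 200   # records containing both words

p_a = count_city / n
p_b = count_dish / n
p_ab = count_both / n

confidence = p_ab / p_a           # 0.5: above a typical threshold, so kept
cross_support = p_ab - p_a * p_b  # 0.0: no positive correlation, filtered out
```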

Claims (2)

1. The template automatic mining system based on the cross support evaluation is characterized by comprising an intention recognition module, a category word replacement module, a frequent item set mining module and a template ordering module;
the intention recognition module is used for carrying out intention recognition on the history record of the user and sending the record subjected to the intention recognition to the category word replacement module;
the category word replacement module is used for segmenting the intention-recognized records, replacing category words, and sending the records after category-word replacement to the frequent item set mining module;
the frequent item set mining module is used for mining the records after the category words are replaced by using an association rule mining algorithm and screening frequent items to obtain a preliminary template;
the template ordering module is used for ordering the preliminary templates according to the entropy value and the similarity with the existing word list;
in the intention recognition module, training an intention recognition model by adopting a related record, wherein the related record refers to a search record of a user, the intention recognition model comprises a fasttext model, and intention recognition is carried out on a historical search record by adopting the trained intention recognition model;
the training intention recognition model inputs data with category labels, the output of the model is the corresponding category label, the training is carried out to enable the intention model to calculate the probability that the records belong to each category according to the input records, and the category with the highest probability is output;
in the category word replacement module, the intention-recognized records are segmented with the jieba word segmenter, and the words related to a fixed category in the records are replaced with the fixed category word;
in the frequent item set mining module, a record after the category words are replaced is subjected to word segmentation, the words obtained after the word segmentation are subjected to duplication removal, punctuation marks are removed as items to be mined, and mining is performed by using an association rule mining algorithm;
the association rule mining algorithm comprises an improved FP-Growth algorithm, namely, frequent items are screened by adopting cross support degree: the cross support evaluation index is as follows:
CS(A,B) = P(AB) - P(A)P(B)
wherein A, B represents a word obtained after word segmentation, P (A) represents the frequency of occurrence of A obtained after word segmentation, P (B) represents the frequency of occurrence of B obtained after word segmentation, and P (AB) represents the probability of simultaneous occurrence of A and B; when the cross support degree is calculated, the corresponding category words are not included in the calculation range; the cross support threshold is set to be greater than 0, and the fact that the cross support threshold is greater than 0 indicates that a positive correlation relationship exists between two items;
in the template ordering module, the probability of words matching templates of the same type is calculated, and the generality of a template is evaluated with an entropy value; the entropy ranking measures how many possibilities the fixed-category word covers, and the entropy value S is calculated as:
S=-∑p(A)log(p(A));
the similarity with the existing vocabulary is calculated by cosine similarity, and an LR model, i.e. the ordering model, is trained to rank the templates according to the entropy value and the similarity with the existing vocabulary; the ordering model adopts the LR algorithm: the model is first trained to obtain priority-ratio parameters, and the templates are then ranked according to the entropy value and the similarity with the existing vocabulary using these parameters; the training method of the ordering model comprises the steps of first manually collecting historical record data, labeling data related to the training category with '1' and data unrelated to the training category with '0', inputting the labeled data into the ordering model for training to obtain its priority-ratio parameters, and then inputting new records to be ranked with these parameters.
2. The automatic template mining method of the automatic template mining system based on cross support evaluation according to claim 1, characterized by comprising the steps of:
S1, inputting a history record of a user, and carrying out intention recognition on the history record of the user by adopting the intention recognition module to obtain records subjected to intention recognition;
S2, adopting the category word replacement module to segment the intention-recognized records with the jieba word segmenter, and replacing the words related to a fixed category in the records with the fixed category word to obtain records after category-word replacement;
S3, adopting the frequent item set mining module to segment the category-replaced records with the jieba word segmenter and remove duplicates, taking the words obtained after segmentation as the items to be mined, and mining them with an association rule mining algorithm;
s4, screening the frequent items by adopting a frequent item set mining module;
s5, traversing the records subjected to intention recognition, which are processed in the S1, and keeping the results which simultaneously accord with the frequent items screened in the S4 to obtain a preliminary template;
s6, adopting a template ordering module to order and display the templates.
CN201911383296.9A 2019-12-28 2019-12-28 Automatic template mining system and method based on cross support evaluation Active CN111552862B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911383296.9A CN111552862B (en) 2019-12-28 2019-12-28 Automatic template mining system and method based on cross support evaluation


Publications (2)

Publication Number Publication Date
CN111552862A CN111552862A (en) 2020-08-18
CN111552862B true CN111552862B (en) 2023-04-21

Family

ID=72007194


Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108038234A (en) * 2017-12-26 2018-05-15 众安信息技术服务有限公司 A kind of question sentence template automatic generation method and device
CN110442760A (en) * 2019-07-24 2019-11-12 银江股份有限公司 A kind of the synonym method for digging and device of question and answer searching system




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant