CN111552862B - Automatic template mining system and method based on cross support evaluation - Google Patents


Info

Publication number
CN111552862B
CN111552862B
Authority
CN
China
Prior art keywords
category
module
words
record
mining
Prior art date
Legal status
Active
Application number
CN201911383296.9A
Other languages
Chinese (zh)
Other versions
CN111552862A (en)
Inventor
何立华
贺小勇
Current Assignee
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN201911383296.9A priority Critical patent/CN111552862B/en
Publication of CN111552862A publication Critical patent/CN111552862A/en
Application granted granted Critical
Publication of CN111552862B publication Critical patent/CN111552862B/en


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/953: Querying, e.g. by the use of web search engines
    • G06F16/9535: Search customisation based on user profiles and personalisation
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an automatic template mining system and method based on cross support evaluation. The system comprises an intention recognition module, a category word replacement module, a frequent item set mining module and a template ordering module. The intention recognition module performs intention recognition on the user's history records and sends the recognized records to the category word replacement module; the category word replacement module segments the intention-recognized records, replaces category words, and sends the records after category-word replacement to the frequent item set mining module; the frequent item set mining module mines the records after category-word replacement with an association rule mining algorithm and screens frequent items to obtain preliminary templates; the template ordering module orders the preliminary templates by entropy and by similarity with the existing word list. The invention improves the frequent item set mining method: evaluation based on cross support yields higher-quality results than evaluation based on confidence.

Description

Automatic template mining system and method based on cross support evaluation
Technical Field
The invention relates to the field of automatic mining of search templates, in particular to an automatic template mining system and method based on cross support evaluation.
Background
In vertical search, when a user's search keyword matches a rule word in the database, the related data in the database are returned. In practice, users' search keywords are highly varied, and it is difficult to configure all matching words manually; as the number of search categories grows, manual configuration clearly becomes unrealistic, so it is necessary to design an algorithm that automatically mines the search templates commonly used by users. Current research mainly mines search templates from users' historical data; a typical representative is Baidu's search-technology patent 'Automatic mining method of demand recognition templates, demand recognition method, and corresponding devices'. Its specific steps are: determine the record set corresponding to a preset type in the search log; from that set, select the records whose click count for the preset type exceeds a preset number to form seed templates; match the words of the preset type in the seed templates against a preset dictionary and replace them with type attribute words; and obtain the templates.
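The prior-art seed-template flow just described can be sketched as follows. This is a minimal illustration with hypothetical, pre-segmented English stand-in queries and an invented click threshold, not the actual patent code:

```python
# Sketch of the prior-art seed-template flow: keep only high-click records,
# then replace dictionary words with their type attribute word.
# (Hypothetical data structures; queries are pre-segmented stand-ins.)

def mine_seed_templates(records, type_dict, min_clicks=2):
    """records: list of (query, click_count); type_dict: word -> type label."""
    templates = set()
    for query, clicks in records:
        if clicks <= min_clicks:       # low-click records are discarded
            continue
        words = query.split()
        # replace dictionary words with their type attribute word, e.g. [hotel]
        replaced = [f"[{type_dict[w]}]" if w in type_dict else w for w in words]
        if replaced != words:          # keep only records that matched the dict
            templates.add(" ".join(replaced))
    return templates

records = [("7days how-much", 5), ("homeinn how-much", 1)]
type_dict = {"7days": "hotel", "homeinn": "hotel"}
print(mine_seed_templates(records, type_dict))  # → {'[hotel] how-much'}
```

Note how the low-click 'homeinn' record is dropped even though it shares the same template, which is exactly the defect the patent criticizes.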
The main defect of that technique is that some records with potential template intent are discarded. For example, the two records 'how much is the 7 Days hotel' and 'how much is the Home Inn hotel' share the template 'how much is the [hotel]'; under the prior art, if the click counts of the two records are low, both records are removed, even though both actually carry the template intent.
Such problems can be avoided by methods based on frequent item set mining. However, frequent item set mining based on the traditional confidence measure suffers from the rare item set problem: a non-frequent item set may yield potentially valuable rules, yet be filtered out when screening by confidence. In automatic template mining, the records after intention recognition have an unbalanced item set support distribution, so the occurrence frequency of category words can far exceed that of other words.
In addition, the traditional confidence,
conf(A→B) = P(AB) / P(A),
considers only the influence of A on B while ignoring how often B occurs on its own. If the confidence threshold is set unreasonably, then even when
P(AB) = P(A) · P(B),
which shows that the two items A and B are independent, such a record is still kept because the confidence threshold is set too low. Since the support distribution is unbalanced after intention recognition, the quality of the mined results depends heavily on the confidence setting, and the most suitable confidence threshold is often difficult to find.
Disclosure of Invention
In order to solve the problems existing in the conventional confidence evaluation, the invention provides an automatic template mining system and method based on cross support evaluation.
The object of the invention is achieved by at least one of the following technical solutions.
The template automatic mining system based on the cross support evaluation comprises an intention recognition module, a category word replacement module, a frequent item set mining module and a template ordering module;
the intention recognition module is used for carrying out intention recognition on the history record of the user and sending the record subjected to the intention recognition to the category word replacement module;
the category word replacement module is used for cutting words from the records subjected to conscious recognition, replacing category words, and sending the records subjected to category word replacement to the frequent item set mining module;
the frequent item set mining module is used for mining the records after the category words are replaced by using an association rule mining algorithm and screening frequent items to obtain a preliminary template;
the template ordering module is used for ordering the preliminary templates according to the entropy value and the similarity with the existing word list.
Further, in the intention recognition module, an intention recognition model (for example, a fasttext model) is trained on relevant records, i.e., the user's search records; the trained model then performs intention recognition on the historical search records;
the training intention model is to input data with category labels, and the output of the model is corresponding category labels, for example, the training intention model is input with: 'hotel how much money', the tag is 'hotel'; the model is characterized in that the weather is the weather, the input of the model is the hotel money, the weather is the weather, the output is the hotel, the weather is the data with a large number of labels, the model learns parameters in the weather, the training is carried out, the meaning model calculates the probability that the records belong to each category according to the input records and outputs the category with the highest probability, for example, the probability that the model is newly input into the hotel nearby is the hotel, the probability of the hotel is the greatest, the model is classified into the hotel, the probability of the hotel belongs to other categories is smaller, and the model cannot be classified into other categories.
Further, in the category word replacement module, the records subjected to intention recognition are segmented using jieba word segmentation. For example, the input 'how much is the nearby hotel' is segmented into: nearby / hotel / how much. A matching algorithm then yields the words 'nearby', 'hotel' and 'how much'. Words in the record that relate to a fixed category are replaced with that category's word; for example, in template mining for the food category, 'fast food restaurant', 'food restaurant' and the like are replaced with [food]. The unified symbol here includes, but is not limited to, the form [food]. The historical search records are then replaced. For example, a word list containing words such as 'fast food shop' and 'fast food restaurant' is provided by the service provider and can be used directly; if a word in a search record is found in this word list, it is uniformly replaced with [food], i.e., 'fast food shop' and 'fast food restaurant' are replaced with [food], so that 'nearby fast food shop' becomes 'nearby [food]'.
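A minimal sketch of the replacement step, assuming the queries are already segmented (jieba would do this in practice) and using an invented English stand-in vocabulary:

```python
# Replace any word found in the provider's food word list with the
# unified category marker '[food]'. Queries are pre-segmented stand-ins.
def replace_category(words, category_vocab, tag="[food]"):
    return [tag if w in category_vocab else w for w in words]

food_vocab = {"fast-food-shop", "fast-food-restaurant", "hot-pot-restaurant"}
print(replace_category(["nearby", "fast-food-shop"], food_vocab))
# → ['nearby', '[food]']
```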
Further, in the frequent item set mining module, the records after category-word replacement are segmented. Whereas the segmentation in the category word replacement module takes 'nearby fast food shop' as input and produces 'nearby / fast food shop', the segmentation in the frequent item set mining module takes 'nearby [food]' as input and produces 'nearby / [food]'. The words obtained after segmentation are de-duplicated and punctuation marks are removed; the results form the items to be mined, which are mined with an association rule mining algorithm.
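The item-preparation step described here (drop punctuation, de-duplicate) can be sketched as:

```python
# Turn one segmented record into an item set for mining: drop
# punctuation-only tokens and de-duplicate while preserving order.
import string

def to_items(words):
    seen, items = set(), []
    for w in words:
        if all(c in string.punctuation for c in w) or w in seen:
            continue
        seen.add(w)
        items.append(w)
    return items

print(to_items(["nearby", "[food]", ",", "nearby"]))  # → ['nearby', '[food]']
```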
Further, the association rule mining algorithm comprises a modified FP-Growth algorithm in which frequent items are screened by cross support. The cross support evaluation index is:
CS(A, B) = P(AB) / (P(A) · P(B)) - 1
Here A and B denote words obtained after segmentation, P(A) denotes the frequency of occurrence of A, P(B) the frequency of occurrence of B, and P(AB) the probability that A and B occur together. For example, 'nearby hotel' segments into 'nearby / hotel'; after input to the model, A is 'nearby', B is 'hotel', and P(AB) denotes the probability that the two words 'nearby' and 'hotel' occur together. When the cross support is calculated, the corresponding category words are excluded from the calculation. The cross support threshold is set to be greater than 0, where a value greater than 0 means the two items are positively correlated. Category words occur far too frequently to be useful for screening: after intention recognition all retained records are of the same type, such as food, so many records contain the marker word [food] and its frequency is very high. Since the frequent item mining algorithm screens by frequency, such an overly frequent word would interfere with mining; therefore category words such as [food] are not considered, i.e., their occurrences are not counted when frequencies are computed.
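Under the reconstruction of the cross support index used above, CS(A, B) = P(AB)/(P(A)·P(B)) - 1, the screening step might look like the following sketch. The toy records are invented, and the category marker [food] is excluded from the computation as described:

```python
# Estimate P(A), P(B), P(AB) from a toy record set and keep word pairs
# with cross support > 0, skipping the category marker '[food]'.
# CS is 0 when A and B are statistically independent.
from itertools import combinations

def cross_support_pairs(records, skip=frozenset({"[food]"})):
    n = len(records)
    sets = [set(r) - skip for r in records]     # category words not counted
    words = set().union(*sets)
    kept = {}
    for a, b in combinations(sorted(words), 2):
        pa = sum(a in s for s in sets) / n
        pb = sum(b in s for s in sets) / n
        pab = sum(a in s and b in s for s in sets) / n
        cs = pab / (pa * pb) - 1
        if cs > 0:                              # positive correlation only
            kept[(a, b)] = cs
    return kept

records = [["nearby", "[food]", "how-much"],
           ["nearby", "[food]"],
           ["open", "[food]", "how-much"]]
print(cross_support_pairs(records))
```

In this toy data only the pair ('how-much', 'open') has positive cross support; the weakly associated pair ('how-much', 'nearby') is rejected.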
Further, in the template ordering module, the probability of each word filling the slot in templates of the same type is calculated, and the generality of a template is evaluated by its entropy: the entropy measures how many distinct fillers the fixed-category word admits and with what probabilities. The entropy S is calculated as:
S=-∑p(A)log(p(A));
For example, suppose the mined template is '[food] how much money'. Because [food] is a replacement, it may cover many cases; suppose that here [food] covers three fillers: 'fast food restaurant' appears 5 times, 'food restaurant' 3 times, and 'hot pot restaurant' 2 times. The entropy of the template '[food] how much money' (using the natural logarithm) is then:
S = -(0.5·log 0.5 + 0.3·log 0.3 + 0.2·log 0.2) ≈ 1.03.
Suppose another template is '[food] not good', and only one record type, 'fast food restaurant not good', matches it, appearing 5 times. The entropy of that template is:
S = -(1·log 1) = 0.
This entropy is lower than that of '[food] how much money', and in practice '[food] how much money' is indeed the more general template.
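The entropy computation from the worked example can be sketched as follows (English stand-in filler names, natural logarithm assumed):

```python
# Entropy-based generality score for a template: count how often each
# concrete category word filled the slot, then compute S = -sum(p*log(p)).
import math

def template_entropy(fill_counts):
    total = sum(fill_counts.values())
    return -sum((c / total) * math.log(c / total) for c in fill_counts.values())

# '[food] how much money': fast food restaurant x5, food restaurant x3, hot pot x2
s1 = template_entropy({"fast-food": 5, "restaurant": 3, "hot-pot": 2})
# '[food] not good': only 'fast food restaurant not good' matches (x5)
s2 = template_entropy({"fast-food": 5})

print(round(s1, 3))  # entropy of the three-filler template, about 1.03
print(s2 == 0)       # → True: a single-filler template has entropy 0
```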
The similarity with the existing word list is computed by cosine similarity. Template mining is an iterative task: each round, the mined words are added to the word list actually in use, and in the next round, if a newly mined word is similar to the words already in that list, it is reasonable to believe the new word is also suitable to add. In practice each word is represented by a multidimensional vector trained in advance, i.e., each word corresponds to a multidimensional vector representation, so the similarity between two words is obtained by computing the cosine value between their vectors. For example, if word A has the vector [1 0 1 1 0 1] and word B has the vector [0 1 1 1 0 1], the similarity of the two words is the cosine value of the two vectors. An LR (logistic regression) ranking model is then trained to order the templates by their entropy and their similarity to the existing word list: the model is first trained to obtain the weighting parameters, e.g. after training the entropy may account for 40 percent and the similarity for 60 percent, and the templates are then ranked by the weighted combination of entropy and similarity to the existing word list.
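The cosine similarity of the example vectors can be computed directly (the six-dimensional vectors are the made-up example values from the text):

```python
# Cosine similarity between two word vectors, as used to compare a newly
# mined word against words already in the vocabulary.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

a = [1, 0, 1, 1, 0, 1]
b = [0, 1, 1, 1, 0, 1]
print(cosine(a, b))  # → 0.75
```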
Further, the training method of the ranking model is as follows: first, historical record data are collected manually; data relevant to the training category are labeled '1' and data irrelevant to it are labeled '0'. For example, suppose 'nearby hotel' and 'recent weather' are collected: 'nearby hotel' is relevant to the hotel category and 'recent weather' to the weather category, so if the weather-related category is being trained, the input 'recent weather' is manually labeled '1' and 'nearby hotel' is manually labeled '0'. These labeled records are then fed into the ranking model for training to obtain its weighting parameters, and subsequently new records are ranked using the learned parameters.
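A sketch of the final weighted ranking, using the illustrative 40/60 split mentioned above; the feature values are invented and the weights would in practice come from the trained LR model:

```python
# Rank templates by a weighted combination of entropy and vocabulary
# similarity. The 0.4 / 0.6 weights are the illustrative figures from the
# text, standing in for parameters learned by the LR ranking model.
def rank_templates(features, w_entropy=0.4, w_sim=0.6):
    """features: {template: (entropy, similarity)} -> templates, best first."""
    def score(t):
        entropy, sim = features[t]
        return w_entropy * entropy + w_sim * sim
    return sorted(features, key=score, reverse=True)

features = {"[food] how-much": (1.03, 0.8),   # general, similar to vocab
            "[food] not-good": (0.0, 0.5)}    # narrow, less similar
print(rank_templates(features))  # most general / most similar first
```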
The automatic template mining method based on the cross support evaluation comprises the following steps:
S1, inputting the user's history records, and performing intention recognition on them with the intention recognition module to obtain the records subjected to intention recognition;
S2, using the category word replacement module to segment the intention-recognized records with jieba word segmentation and to replace the words related to the fixed category with the fixed category word, obtaining the records after category-word replacement;
S3, using the frequent item set mining module to segment the records after category-word replacement with jieba word segmentation and de-duplicate the result, taking the words obtained after segmentation as the items to be mined and mining them with the association rule mining algorithm;
s4, screening the frequent items by adopting a frequent item set mining module;
s5, traversing the records subjected to intention recognition, which are processed in the S1, and keeping the results which simultaneously accord with the frequent items screened in the S4 to obtain a preliminary template;
s6, adopting a template ordering module to order and display the templates.
Compared with the prior art and the traditional confidence-based mining method, the invention has the advantage that it dispenses with setting a confidence threshold, and the quality of the mined results is higher.
Drawings
FIG. 1 is a flow chart of steps of the template automatic mining method based on cross support evaluation of the present invention;
FIG. 2 is a schematic diagram of results obtained through an intent recognition model in an embodiment of the present invention;
FIG. 3 is a schematic diagram of a process for category word replacement in an embodiment of the present invention;
FIG. 4 is a pseudo code schematic diagram of an improvement in frequent itemset mining algorithm evaluation metrics in an embodiment of the present invention;
FIG. 5 is a pseudo code schematic of an algorithm for mining potential templates in an embodiment of the invention;
FIG. 6 is a schematic diagram of an automatic template mining method based on cross support evaluation in an embodiment of the present invention;
fig. 7 is a schematic diagram illustrating a conventional automatic template mining method based on confidence in an embodiment of the present invention.
Detailed Description
For the purpose of making the technical solutions and advantages of the present invention more apparent, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the embodiments are some embodiments of the present invention, but not all embodiments.
Examples:
In this embodiment, the mining of templates of the food category is taken as an example.
The template automatic mining system based on the cross support evaluation comprises an intention recognition module, a category word replacement module, a frequent item set mining module and a template ordering module;
the intention recognition module is used for carrying out intention recognition on the history record of the user and sending the record subjected to the intention recognition to the category word replacement module;
In the intention recognition module, an intention recognition model (for example, a fasttext model) is trained on relevant records, i.e., the user's search records; the trained model then performs intention recognition on the historical search records;
the training intention model is to input data with category labels, and the output of the model is corresponding category labels, for example, the training intention model is input with: 'hotel how much money', the tag is 'hotel'; the model is characterized in that the weather is the weather, the input of the model is the hotel money, the weather is the weather, the output is the hotel, the weather is the data with a large number of labels, the model learns parameters in the weather, the training is carried out, the meaning model calculates the probability that the records belong to each category according to the input records and outputs the category with the highest probability, for example, the probability that the model is newly input into the hotel nearby is the hotel, the probability of the hotel is the greatest, the model is classified into the hotel, the probability of the hotel belongs to other categories is smaller, and the model cannot be classified into other categories.
the category word replacement module is used for segmenting the records subjected to intention recognition, replacing category words, and sending the records after category-word replacement to the frequent item set mining module;
in the category word replacement module, the record subjected to intention recognition is segmented by adopting the bargaining segmentation, for example, the input is 'how much money is paid by a nearby hotel', and the segmented word is obtained: nearby/hotel/money. The matching algorithm is used to obtain the nearby',hoteland money of each word; words related to the fixed category in the record are replaced by words of the fixed category, such as template mining of food category, fast food restaurant, food restaurant and the like are replaced by [ food ]. The unified symbols herein include, but are not limited to, the expression form of [ food ]. The history search record is replaced, for example, a word list including words such as ' nearby fast food shops ' and ' fast food shops ' is provided in the history search record, the words are provided by service providers and can be directly taken, and then if the words in the search record can be found in the word list, the words are uniformly replaced by [ food ], namely the ' fast food shops ' and the ' fast food shops ' are replaced by [ food ] ' to obtain the ' nearby [ food ] '.
The frequent item set mining module is used for mining the records after the category words are replaced by using an association rule mining algorithm and screening frequent items to obtain a preliminary template;
in the frequent item set mining module, the record after the category word is replaced is subjected to word segmentation, the word segmentation in the category word replacement module is input as 'nearby fast food stores', the word segmentation is later as 'nearby/fast food stores', the word segmentation in the frequent item set mining module is input as 'nearby [ food ]', and the word segmentation is later as 'nearby/[ food ]'; and removing the duplication of the word obtained after the word segmentation, removing punctuation marks as the item to be mined, and mining by using an association rule mining algorithm.
The association rule mining algorithm comprises an improved FP-Growth algorithm in which frequent items are screened by cross support. The cross support evaluation index is:
CS(A, B) = P(AB) / (P(A) · P(B)) - 1
Here A and B denote words obtained after segmentation, P(A) denotes the frequency of occurrence of A, P(B) the frequency of occurrence of B, and P(AB) the probability that A and B occur together. For example, 'nearby hotel' segments into 'nearby / hotel'; after input to the model, A is 'nearby', B is 'hotel', and P(AB) denotes the probability that the two words 'nearby' and 'hotel' occur together. When the cross support is calculated, the corresponding category words are excluded from the calculation. The cross support threshold is set to be greater than 0, where a value greater than 0 means the two items are positively correlated. Category words occur far too frequently to be useful for screening: after intention recognition all retained records are of the same type, such as food, so many records contain the marker word [food] and its frequency is very high. Since the frequent item mining algorithm screens by frequency, such an overly frequent word would interfere with mining; therefore category words such as [food] are not considered, i.e., their occurrences are not counted when frequencies are computed.
In this embodiment, A and B are two items. The conventional confidence is calculated as
conf(A→B) = P(AB) / P(A).
Assuming that A and B are independent of each other, then
conf(A→B) = P(AB) / P(A) = P(B),
so A is irrelevant to B; yet if P(B) alone reaches the set threshold, AB is mined as an associated item, and the mined rule is a pseudo rule that constitutes interference. The method of the invention instead uses
CS(A, B) = P(AB) / (P(A) · P(B)) - 1.
When A and B are independent, the cross support obtained is 0, so such items are screened out and the mining quality improves. In addition, the traditional confidence method requires manually setting a suitable threshold: an overly large threshold loses some association rules, while an overly small one retains redundant information.
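The contrast above can be checked numerically. With the confidence and cross support written as in this section, an independent pair passes a typical confidence threshold but is rejected by the cross support screen (the probabilities and the 0.6 threshold are invented for illustration):

```python
# For two statistically independent words, confidence can still clear a
# threshold (a pseudo rule), while cross support is exactly 0.
def confidence(p_ab, p_a):
    return p_ab / p_a

def cross_support(p_ab, p_a, p_b):
    return p_ab / (p_a * p_b) - 1

p_a, p_b = 0.5, 0.8
p_ab = p_a * p_b            # independence: P(AB) = P(A) * P(B)

print(confidence(p_ab, p_a))          # 0.8 -- passes e.g. a 0.6 confidence threshold
print(cross_support(p_ab, p_a, p_b))  # 0.0 -- rejected by the > 0 threshold
```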
The template ordering module is used for ordering the preliminary templates according to the entropy value and the similarity with the existing word list.
In the template ordering module, the probability of each word filling the slot in templates of the same type is calculated, and the generality of a template is evaluated by its entropy: the entropy measures how many distinct fillers the fixed-category word admits and with what probabilities. The entropy S is calculated as:
S=-∑p(A)log(p(A));
For example, suppose the mined template is '[food] how much money'. Because [food] is a replacement, it may cover many cases; suppose that here [food] covers three fillers: 'fast food restaurant' appears 5 times, 'food restaurant' 3 times, and 'hot pot restaurant' 2 times. The entropy of the template '[food] how much money' (using the natural logarithm) is then:
S = -(0.5·log 0.5 + 0.3·log 0.3 + 0.2·log 0.2) ≈ 1.03.
Suppose another template is '[food] not good', and only one record type, 'fast food restaurant not good', matches it, appearing 5 times. The entropy of that template is:
S = -(1·log 1) = 0.
This entropy is lower than that of '[food] how much money', and in practice '[food] how much money' is indeed the more general template.
The similarity with the existing word list is computed by cosine similarity. Template mining is an iterative task: each round, the mined words are added to the word list actually in use, and in the next round, if a newly mined word is similar to the words already in that list, it is reasonable to believe the new word is also suitable to add. In practice each word is represented by a multidimensional vector trained in advance, i.e., each word corresponds to a multidimensional vector representation, so the similarity between two words is obtained by computing the cosine value between their vectors. For example, if word A has the vector [1 0 1 1 0 1] and word B has the vector [0 1 1 1 0 1], the similarity of the two words is the cosine value of the two vectors. An LR (logistic regression) ranking model is then trained to order the templates by their entropy and their similarity to the existing word list: the model is first trained to obtain the weighting parameters, e.g. after training the entropy may account for 40 percent and the similarity for 60 percent, and the templates are then ranked by the weighted combination of entropy and similarity to the existing word list.
The training method of the ranking model is as follows: first, historical record data are collected manually; data relevant to the training category are labeled '1' and data irrelevant to it are labeled '0'. For example, suppose 'nearby hotel' and 'recent weather' are collected: 'nearby hotel' is relevant to the hotel category and 'recent weather' to the weather category, so if the weather-related category is being trained, the input 'recent weather' is manually labeled '1' and 'nearby hotel' is manually labeled '0'. These labeled records are then fed into the ranking model for training to obtain its weighting parameters, and subsequently new records are ranked using the learned parameters.
The automatic template mining method based on cross support evaluation, as shown in fig. 1, comprises the following steps:
S1, inputting the user's history records, and performing intention recognition on them with the intention recognition module to obtain the history records with food intention, as shown in FIG. 2;
S2, using the category word replacement module to segment the food-intention records with jieba word segmentation and, as shown in FIG. 3, to replace the category words related to food with [food], obtaining the records after category-word replacement;
S3, using the frequent item set mining module to segment the records after category-word replacement with jieba word segmentation and de-duplicate the result, taking the words obtained after segmentation as the items to be mined and mining them with the association rule mining algorithm;
In this embodiment, the evaluation criterion of the FP-Growth algorithm is improved; pseudo code of the improved algorithm is shown in FIG. 4.
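The pseudo code of FIG. 4 is not reproduced here; the following is an illustrative sketch of mining frequent word pairs screened by cross support, assuming the leverage-style measure P(AB) - P(A)P(B), which is consistent with the statement that a value above 0 indicates a positive correlation. The sample records are invented:

```python
from collections import Counter
from itertools import combinations

def mine_pairs(records, min_cross_support=0.0):
    # records: deduplicated token sets with the category tag excluded.
    n = len(records)
    item_count = Counter()
    pair_count = Counter()
    for rec in records:
        items = set(rec)
        item_count.update(items)
        pair_count.update(combinations(sorted(items), 2))
    mined = {}
    for (a, b), c in pair_count.items():
        # Cross support: P(AB) - P(A)P(B); keep positively correlated pairs.
        cs = c / n - (item_count[a] / n) * (item_count[b] / n)
        if cs > min_cross_support:
            mined[(a, b)] = cs
    return mined

records = [{"x", "y"}, {"x", "y"}, {"x"}, {"z"}]
pairs = mine_pairs(records)  # ("x", "y"): 0.5 - 0.75 * 0.5 = 0.125
```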
The frequent items mined by the association rules are used in S4 to find potential templates in the original records; a specific implementation is shown as pseudo code in FIG. 5.
S4, screening the frequent items by adopting a frequent item set mining module;
S5, traversing the intention-recognized records processed in S1, and keeping the results that simultaneously match the frequent items screened in S4 to obtain preliminary templates;
S6, adopting the template ordering module to rank and display the templates.
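Step S5 above can be sketched as follows, representing screened frequent items as word sets and records as token lists; the sample data is invented for illustration:

```python
def extract_templates(records, frequent_items):
    # Keep each record that contains every word of at least one
    # screened frequent item.
    templates = []
    for rec in records:
        rec_set = set(rec)
        if any(item <= rec_set for item in frequent_items) and rec not in templates:
            templates.append(rec)
    return templates

records = [["[food]", "public", "number"], ["weather", "today"]]
frequent = [{"[food]", "public"}]
templates = extract_templates(records, frequent)
# templates == [["[food]", "public", "number"]]
```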
According to the templates extracted in S6, the mined words are added to the vocabulary of the actual application, which improves the user experience. This embodiment shows the mining results for the food category, as shown in FIG. 6; the vocabulary column indicates whether a mined word already appears in the vocabulary of the actual application: a 0 means the word does not yet appear and should be added, while words that already appear need not be added again. For example, for the mined template '[food] public number', the word 'public number' does not exist in the vocabulary of the actual application, so a user searching for 'Dai sister hotpot public number' gets no result even though information about 'Dai sister hotpot' actually exists in the database: the tag of that information is 'Dai sister hotpot', which does not match 'Dai sister hotpot public number', so no result is returned. Once the word 'public number' has been mined and added, the query is matched and results are returned.
FIG. 7 shows conventional confidence-based mining and FIG. 6 shows cross-support-based mining. The main difference concerns the word 'Shenzhen': in reality its association with hotpot is not strong, because 'Shenzhen' is a city and hotpot can equally well be paired with other cities. Conventional confidence-based mining nevertheless retains 'Shenzhen' because its co-occurrence count exceeds the set threshold, even though the correlation is weak; the cross-support-based evaluation, whose result is shown in FIG. 6, avoids this situation.
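A numeric illustration of this difference, with invented co-occurrence counts for a city word and a dish word, again assuming the cross support is P(AB) - P(A)P(B): the pair passes a typical confidence threshold even though the two words are statistically independent, so only the cross-support screen filters it out:

```python
# Hypothetical counts over n records.
n = 1000
count_city = 400   # records containing the city word, e.g. 'Shenzhen'
count_dish = 500   # records containing the dish word, e.g. 'hotpot'
count_both = 200   # records containing both words

p_a = count_city / n
p_b = count_dish / n
p_ab = count_both / n

confidence = p_ab / p_a           # 0.5: above a typical threshold, so kept
cross_support = p_ab - p_a * p_b  # 0.0: no positive correlation, filtered out
```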

Claims (2)

1. The template automatic mining system based on the cross support evaluation is characterized by comprising an intention recognition module, a category word replacement module, a frequent item set mining module and a template ordering module;
the intention recognition module is used for carrying out intention recognition on the history record of the user and sending the record subjected to the intention recognition to the category word replacement module;
the category word replacement module is used for segmenting the intention-recognized records, replacing category words, and sending the records after category-word replacement to the frequent item set mining module;
the frequent item set mining module is used for mining the records after the category words are replaced by using an association rule mining algorithm and screening frequent items to obtain a preliminary template;
the template ordering module is used for ordering the preliminary templates according to the entropy value and the similarity with the existing word list;
in the intention recognition module, training an intention recognition model by adopting a related record, wherein the related record refers to a search record of a user, the intention recognition model comprises a fasttext model, and intention recognition is carried out on a historical search record by adopting the trained intention recognition model;
the training intention recognition model inputs data with category labels, the output of the model is the corresponding category label, the training is carried out to enable the intention model to calculate the probability that the records belong to each category according to the input records, and the category with the highest probability is output;
in the category word replacement module, the intention-recognized records are segmented with the jieba word segmenter, and the words related to a fixed category in the records are replaced with the fixed category word;
in the frequent item set mining module, a record after the category words are replaced is subjected to word segmentation, the words obtained after the word segmentation are subjected to duplication removal, punctuation marks are removed as items to be mined, and mining is performed by using an association rule mining algorithm;
the association rule mining algorithm comprises an improved FP-Growth algorithm, namely, frequent items are screened by adopting cross support degree: the cross support evaluation index is as follows:
CS(A,B) = P(AB) - P(A)P(B)
wherein A, B represents a word obtained after word segmentation, P (A) represents the frequency of occurrence of A obtained after word segmentation, P (B) represents the frequency of occurrence of B obtained after word segmentation, and P (AB) represents the probability of simultaneous occurrence of A and B; when the cross support degree is calculated, the corresponding category words are not included in the calculation range; the cross support threshold is set to be greater than 0, and the fact that the cross support threshold is greater than 0 indicates that a positive correlation relationship exists between two items;
in the template ordering module, the probability of words matching templates of the same type is calculated, and the generality of a template is evaluated with an entropy value; the entropy ranking measures how many possibilities the fixed-category word covers, and the entropy value S is calculated as:
S=-∑p(A)log(p(A));
the similarity with the existing vocabulary is calculated by cosine similarity, and an LR model, i.e. the ordering model, is trained to rank the templates according to the entropy value and the similarity with the existing vocabulary; the ordering model adopts the LR algorithm: the model is first trained to obtain priority-ratio parameters, and the templates are then ranked according to the entropy value and the similarity with the existing vocabulary using these parameters; the training method of the ordering model comprises the steps of first manually collecting historical record data, labeling data related to the training category with '1' and data unrelated to the training category with '0', inputting the labeled data into the ordering model for training to obtain its priority-ratio parameters, and then inputting new records to be ranked with these parameters.
2. The automatic template mining method of the automatic template mining system based on cross support evaluation according to claim 1, characterized by comprising the steps of:
S1, inputting a history record of a user, and carrying out intention recognition on the history record of the user by adopting the intention recognition module to obtain records subjected to intention recognition;
S2, adopting the category word replacement module to segment the intention-recognized records with the jieba word segmenter, and replacing the words related to a fixed category in the records with the fixed category word to obtain records after category-word replacement;
S3, adopting the frequent item set mining module to segment the category-replaced records with the jieba word segmenter and remove duplicates, taking the words obtained after segmentation as the items to be mined, and mining them with an association rule mining algorithm;
s4, screening the frequent items by adopting a frequent item set mining module;
s5, traversing the records subjected to intention recognition, which are processed in the S1, and keeping the results which simultaneously accord with the frequent items screened in the S4 to obtain a preliminary template;
s6, adopting a template ordering module to order and display the templates.
CN201911383296.9A 2019-12-28 2019-12-28 Automatic template mining system and method based on cross support evaluation Active CN111552862B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911383296.9A CN111552862B (en) 2019-12-28 2019-12-28 Automatic template mining system and method based on cross support evaluation


Publications (2)

Publication Number Publication Date
CN111552862A CN111552862A (en) 2020-08-18
CN111552862B true CN111552862B (en) 2023-04-21

Family

ID=72007194


Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108038234A (en) * 2017-12-26 2018-05-15 众安信息技术服务有限公司 A kind of question sentence template automatic generation method and device
CN110442760A (en) * 2019-07-24 2019-11-12 银江股份有限公司 A kind of the synonym method for digging and device of question and answer searching system




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant