WO2023024411A1 - Association rule assessment method and apparatus based on machine learning - Google Patents

Association rule assessment method and apparatus based on machine learning

Info

Publication number
WO2023024411A1
WO2023024411A1 PCT/CN2022/071425 CN2022071425W WO2023024411A1 WO 2023024411 A1 WO2023024411 A1 WO 2023024411A1 CN 2022071425 W CN2022071425 W CN 2022071425W WO 2023024411 A1 WO2023024411 A1 WO 2023024411A1
Authority
WO
WIPO (PCT)
Prior art keywords
item
antecedent
text information
association
association rule
Prior art date
Application number
PCT/CN2022/071425
Other languages
French (fr)
Chinese (zh)
Inventor
蒋雪涵
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 filed Critical 平安科技(深圳)有限公司
Publication of WO2023024411A1 publication Critical patent/WO2023024411A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/126 Character encoding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/194 Calculation of difference between files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Definitions

  • The present application relates to the technical field of artificial intelligence, and in particular to a method, a device, a computer device and a readable storage medium for evaluating association rules based on machine learning.
  • Association analysis is a commonly used data mining technique for discovering internal associations within data, and it applies to many everyday scenarios. For example, in shopping scenarios, association rules are used to discover common patterns in the purchasing habits of groups of customers and to guide product placement in supermarkets; in medical scenarios, association rules are used to mine how likely patients are to consume particular medical items, which helps guide doctors in case diagnosis.
  • Association rules can be proposed by domain experts. Alternatively, candidate sets that satisfy certain measures, such as confidence, support, and lift, can be obtained through data mining and then confirmed as reasonable by experts.
  • For example, consider the association rule "oral anesthesia → root canal". The "oral anesthesia" may result from the patient's "tooth extraction" or "root canal treatment", whereas the "root canal" results only from the patient's "root canal treatment", so inferring "root canal" from "oral anesthesia" carries a certain bias.
  • The mining process for the above association rules has the following two deficiencies.
  • First, the mined association rules contain a large number of false positives and the rules are overly complex, which leads to weak interpretability of the association rules;
  • Second, the mining of association rules depends on expert experience, and the opinions of different experts may differ, which makes the association rules subjective.
  • The present application provides a method, a device, a computer device and a readable storage medium for evaluating association rules based on machine learning, the main purpose of which is to solve the problems of subjectivity and weak interpretability of association rules mined in the prior art.
  • A method for evaluating association rules based on machine learning, comprising:
  • using an item co-occurrence condition to mine association rules from an item collection, wherein each association rule comprises an antecedent and a consequent, and the item co-occurrence condition is that the items in the antecedent and the consequent appear simultaneously;
  • using a pre-trained text information encoder and antecedent predictor to perform feature extraction on collected item text information to obtain an encoded vector representation of the item text information, wherein the text information encoder is used to predict whether the consequent of an association rule appears, and the antecedent predictor is used to predict whether the antecedent of an association rule appears;
  • in response to an evaluation instruction for the association rules, evaluating each association rule according to the encoded vector representation of the item text information to obtain an evaluation result reflecting the causal relationship between the antecedent and the consequent in the association rule.
  • A device for evaluating association rules based on machine learning, comprising:
  • a mining unit, configured to use an item co-occurrence condition to mine association rules from the item collection, wherein each association rule includes an antecedent and a consequent, and the item co-occurrence condition is that the items in the antecedent and the consequent appear simultaneously;
  • an extraction unit, configured to perform feature extraction on the collected item text information by using the pre-trained text information encoder and antecedent predictor to obtain the encoded vector representation of the item text information, wherein the text information encoder is used to predict whether the consequent of an association rule appears, and the antecedent predictor is used to predict whether the antecedent of an association rule appears;
  • an evaluation unit, configured to, in response to an evaluation instruction for the association rules, evaluate each association rule according to the encoded vector representation of the item text information and obtain an evaluation result reflecting the causal relationship between the antecedent and the consequent in the association rule.
  • A computer device, including a memory and a processor, wherein the memory stores computer-readable instructions, and the processor, when executing the computer-readable instructions, implements the steps of the method for evaluating association rules based on machine learning.
  • A computer storage medium, on which computer-readable instructions are stored, wherein the computer-readable instructions, when executed by a processor, implement the steps of the method for evaluating association rules based on machine learning.
  • This application introduces causal correction to evaluate the causal relationships of the mined association rules, removes the features related only to the antecedent or only to the consequent of an association rule, and obtains the causal explanation of the consequent in terms of the antecedent. This increases the interpretability of association rules, thereby reducing false positives among association rules and avoiding the influence of subjective factors on association rule screening.
  • FIG. 1 shows a schematic flowchart of a method for evaluating association rules based on machine learning provided by an embodiment of the present application
  • FIG. 2 shows a schematic flowchart of another method for evaluating association rules based on machine learning provided by an embodiment of the present application
  • FIG. 3 shows a schematic structural diagram of a device for evaluating association rules based on machine learning provided by an embodiment of the present application
  • FIG. 4 shows a schematic structural diagram of another apparatus for evaluating association rules based on machine learning provided by an embodiment of the present application.
  • Artificial intelligence (AI) is the use of digital computers, or machines controlled by digital computers, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results.
  • Artificial intelligence basic technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technology, operation/interaction systems, and mechatronics.
  • Artificial intelligence software technology mainly includes computer vision technology, robotics technology, biometrics technology, speech processing technology, natural language processing technology, and machine learning/deep learning.
  • the embodiment of this application provides a method for evaluating association rules based on machine learning.
  • With this method, causal screening of association rules is realized and the interpretability of the association rules is increased.
  • the method includes:
  • An association rule has the form condition A → condition B, which means that condition B can be derived when condition A is satisfied.
  • Condition A and condition B are, respectively, the antecedent and the consequent of the association rule.
  • The item on the left of the arrow is the antecedent of the association rule,
  • and the item on the right of the arrow is the consequent of the association rule.
  • The antecedent and the consequent can each consist of one item or multiple items, and the item collection can span different fields.
  • A large amount of user text information can be obtained through a preset interface channel and aggregated to form the item collection.
  • The item co-occurrence condition is that the antecedent and the consequent of an association rule appear at the same time.
  • For example, the patient purchased item A and item B at the same time.
  • The premise of the item co-occurrence condition is that the items in the antecedent and the items in the consequent appear at the same time.
  • Mining proceeds in two stages. First, the item sets are generated, for example with the FP-growth algorithm. Then the association rules that meet the preset condition are filtered out from the full arrangement of the item set. The preset condition is that the support and the confidence are both greater than their given thresholds. The support is defined as the co-occurrence frequency of the antecedent and the consequent; the confidence is defined as the ratio of that co-occurrence frequency to the antecedent probability; and the antecedent probability is the co-occurrence frequency of all items in the antecedent.
  • As an example of the co-occurrence frequency of the antecedent and the consequent: there are 1,000 records of inspection items in the item collection, of which 800 pieces of medical visit text information include both a blood routine examination and a urine routine examination, so the co-occurrence frequency of the blood routine and urine routine items is 800/1000 = 0.8.
  • The antecedent is a set of items, which may contain one item or multiple items. If it contains one item, the antecedent probability is the occurrence probability of that item; if it contains multiple items, the antecedent probability is the frequency with which those items co-occur.
  • The given thresholds can be set according to actual project requirements. For example, if the project is a quality inspection in which a sample that violates a rule is judged to be a violation sample, the preset condition should combine a high confidence threshold with a low support threshold.
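  • The support and confidence definitions above can be illustrated with a minimal sketch. It assumes each record has already been reduced to a set of item names; the function names and the toy records (10 records instead of 1,000) are hypothetical and chosen only so that the co-occurrence frequency comes out at 0.8 as in the example.

```python
from typing import FrozenSet, Iterable, List


def frequency(items: Iterable[str], records: List[FrozenSet[str]]) -> float:
    """Fraction of records that contain every item in `items`."""
    wanted = frozenset(items)
    if not records:
        return 0.0
    return sum(1 for record in records if wanted <= record) / len(records)


def support(antecedent, consequent, records) -> float:
    """Co-occurrence frequency of the antecedent and the consequent."""
    return frequency(frozenset(antecedent) | frozenset(consequent), records)


def confidence(antecedent, consequent, records) -> float:
    """Support divided by the antecedent probability."""
    p_antecedent = frequency(antecedent, records)
    if p_antecedent == 0:
        return 0.0
    return support(antecedent, consequent, records) / p_antecedent


# Hypothetical toy records: 8 contain both items, 1 only the first, 1 only the second.
records = (
    [frozenset({"blood routine", "urine routine"})] * 8
    + [frozenset({"blood routine"})]
    + [frozenset({"urine routine"})]
)

sup = support({"blood routine"}, {"urine routine"}, records)
conf = confidence({"blood routine"}, {"urine routine"}, records)
print(f"support={sup:.3f} confidence={conf:.3f}")
```

  • A rule would then be kept only when both values exceed thresholds chosen for the project, for instance a high confidence threshold and a low support threshold in the quality inspection case.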
  • The execution subject of this embodiment can be a device that evaluates association rules based on machine learning, deployed on the server side.
  • Using the item co-occurrence condition to mine, from the item collection, the association rules that meet the preset condition serves as a preliminary screening of the association rules and identifies the associations present in the item collection.
  • The server can be an independent server, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, content delivery network (CDN), and big data and artificial intelligence platforms.
  • the text information encoder is used to predict whether the consequent appears in the association rules.
  • Natural language models such as TextCNN and BERT can be used.
  • The input of the text information encoder is the item text information, and its output is the encoded vector representation of the item text information. The encoded vector representation is then used for classification, so the encoder can also output the predicted value of whether the consequent of an association rule appears in the item text information.
  • The item text information here can be medical text data, such as an electronic healthcare record (EHR), that is, an electronic personal health record comprising medical records, electrocardiograms, medical images, and other electronic records that are valuable for future reference.
  • The text information encoder and the antecedent predictor can be trained with machine learning algorithms, using as label data whether the antecedent and the consequent of each association rule appear in the item text information, so that the network model expresses the item text information as an encoded vector representation. The text information encoder and the antecedent predictor perform adversarial learning during training, that is, their optimization goals are opposite. Through adversarial learning, the information related only to the antecedent of the association rule is removed from the item text information, and the information related to both the antecedent and the consequent is retained.
  • In response to an association rule evaluation instruction, evaluate each association rule according to the encoded vector representation of the item text information, and obtain an evaluation result reflecting the causal relationship between the antecedent and the consequent in the association rule.
  • The encoded vector representation of the item text information is used to predict the consequent of the association rule, and the consequent prediction together with the encoded vector representation is then used to predict the antecedent of the association rule. This removes the information in the item text information that is irrelevant to the consequent and achieves causal correction of the item text information. Using the corrected item text information, similar item text information is then used to evaluate the causal contribution of the antecedent to the consequent in the association rule.
  • Specifically, each association rule filters out multiple text samples, which form a text sample set. For each text sample in the set, the encoded vector representations of the item text information are traversed to find the K pieces of item text information most similar to the encoded vector representation of the text sample, which can be determined by calculating the distance between encoded vectors.
  • For each text sample, the probability that the consequent of the association rule appears in those K pieces of item text information is calculated, and the average of the probabilities calculated for all text samples is taken as the evaluation value, which represents the causal relationship between the antecedent and the consequent in the association rule.
  • The embodiment of the present application provides a method for evaluating association rules based on machine learning. An item co-occurrence condition is used to mine association rules from the item collection, where each association rule includes an antecedent and a consequent and the item co-occurrence condition is that the items in the antecedent and the consequent appear simultaneously. A pre-trained text information encoder and antecedent predictor are used to perform feature extraction on the collected item text information to obtain the encoded vector representation of the item text information; the text information encoder is used to predict whether the consequent of an association rule appears, and the antecedent predictor is used to predict whether the antecedent of an association rule appears. In response to an evaluation instruction for the association rules, each association rule is evaluated according to the encoded vector representation of the item text information, yielding an evaluation result that reflects the causal relationship between the antecedent and the consequent in the association rule.
  • This application introduces causal correction to evaluate the causal relationships of the mined association rules, removes the features related only to the antecedent or only to the consequent of an association rule, and obtains the causal explanation of the consequent in terms of the antecedent. This increases the interpretability of association rules, thereby reducing false positives among association rules and avoiding the influence of subjective factors on association rule screening.
  • the embodiment of this application provides another method for evaluating association rules based on machine learning.
  • With this method, causal screening of association rules can be realized and the interpretability of the association rules can be increased.
  • the method includes:
  • The item collection is a set of different items, each of which is an element of the collection. An item can be a customer consumption item, for example milk or biscuits, or a medical payment item, for example a blood routine or a urine test. The associations between items in the collection can, to a certain extent, guide consumption or assist medical reimbursement; for example, a customer who purchases item A and item B will also purchase item C, and a patient who pays for medical item C will also pay for medical item D and medical item E.
  • A frequent item subset contains at least one item from the item collection, and the frequency with which its items appear together in a record is greater than or equal to the minimum support.
  • Specifically, the number of items in the item collection determines the possible sizes of the item subsets. All item subsets are listed by size, and the frequent item subsets whose support is greater than the preset threshold are then filtered out from them.
  • The screening of frequent item subsets can follow two principles: if an item subset is frequent, then every subset of it is also frequent; if an item subset is infrequent, then every superset of it is also infrequent. Applying these principles saves time when generating frequent item subsets.
  • For example, if the item collection is {A,B,C,D}, first list the item subsets containing one item: {A}, {B}, {C}, {D}; then the item subsets containing two items: {A,B}, {A,C}, {A,D}, {B,C}, {B,D}, {C,D}; and then the item subsets containing three items: {A,B,C}, {A,B,D}, {A,C,D}, {B,C,D}. The frequent item subsets with support greater than 3/5 are then {A}, {B}, {C}, {A,B}, {B,C}, {A,C}, {A,B,C}; {C} is included because every subset of the frequent subset {B,C} must itself be frequent.
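  • The level-by-level enumeration and the two pruning principles can be sketched as follows. This is an Apriori-style illustration rather than the FP-growth algorithm mentioned earlier, and the five records are hypothetical data invented for the sketch, not data from the application.

```python
from itertools import combinations


def frequent_itemsets(records, min_support):
    """Enumerate item subsets by size, keeping those with support > min_support.

    A candidate of size k is only counted if all of its (k-1)-item subsets
    were frequent, which applies the two pruning principles.
    """
    n = len(records)
    items = sorted({item for record in records for item in record})
    frequent = {}
    candidates = [frozenset([item]) for item in items]
    size = 1
    while candidates:
        level = {}
        for candidate in candidates:
            sup = sum(1 for record in records if candidate <= record) / n
            if sup > min_support:
                level[candidate] = sup
        frequent.update(level)
        size += 1
        pool = sorted({item for itemset in level for item in itemset})
        candidates = [
            frozenset(combo)
            for combo in combinations(pool, size)
            if all(frozenset(sub) in level for sub in combinations(combo, size - 1))
        ]
    return frequent


# Hypothetical records over the item collection {A, B, C, D}.
records = [frozenset("ABC")] * 4 + [frozenset("BD")]
result = frequent_itemsets(records, min_support=3 / 5)
for itemset, sup in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), sup)
```

  • In this toy data {D} is infrequent, so no superset of {D} is ever counted, which is where the saving in generation time comes from.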
  • The parameter indicators include at least support and confidence, where the support is the co-occurrence frequency of the antecedent and the consequent, and the confidence is the ratio of the support to the antecedent probability.
  • The candidate association rules generated for a frequent item subset correspond to the derivation relationships among the items in that subset.
  • For example, the frequent item subset is {A, B, C}.
  • Thresholds on the parameter indicators are set as the preset condition in order to filter out weakly correlated candidate association rules and improve the reliability of the association rules.
  • The items in each frequent item subset can form multiple candidate association rules, and the confidence and support are calculated separately for each candidate association rule. If the confidence and support meet the preset condition, that is, both are greater than the set confidence threshold and support threshold, the candidate association rule is strongly correlated and can be retained; otherwise, the candidate association rule is filtered out.
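  • As an illustration of how one frequent item subset yields several candidate rules that are then filtered, the sketch below splits the subset into every non-empty antecedent and consequent pair and keeps the rules whose support and confidence both exceed their thresholds. The function names, the records, and the threshold values are hypothetical.

```python
from itertools import combinations


def candidate_rules(itemset):
    """Yield every (antecedent, consequent) split of a frequent item subset."""
    items = sorted(itemset)
    for k in range(1, len(items)):
        for antecedent in combinations(items, k):
            consequent = [item for item in items if item not in antecedent]
            yield frozenset(antecedent), frozenset(consequent)


def freq(items, records):
    """Fraction of records containing all of `items`."""
    return sum(1 for record in records if items <= record) / len(records)


def filter_rules(itemset, records, min_support, min_confidence):
    """Keep candidate rules whose support and confidence both exceed the thresholds."""
    kept = []
    for antecedent, consequent in candidate_rules(itemset):
        sup = freq(antecedent | consequent, records)
        p_antecedent = freq(antecedent, records)
        conf = sup / p_antecedent if p_antecedent else 0.0
        if sup > min_support and conf > min_confidence:
            kept.append((antecedent, consequent, sup, conf))
    return kept


# Hypothetical records; {A, B, C} is a frequent item subset here.
records = [frozenset("ABC")] * 4 + [frozenset("BD")]
kept = filter_rules(frozenset("ABC"), records, min_support=0.6, min_confidence=0.9)
for antecedent, consequent, sup, conf in kept:
    print(sorted(antecedent), "->", sorted(consequent), sup, conf)
```

  • With these toy values the rule B → {A, C} is dropped because item B also occurs in a record without A and C, which lowers its confidence below the threshold.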
  • For each association rule, use whether the antecedent and the consequent appear in the item text information, determined in advance, as label data.
  • Each piece of item text information contains at least one item. For each association rule, the antecedent appearing in the item text information means that all items of the antecedent appear in that item text information. For example, if the antecedent consists of the items blood routine and urine routine, and the item text information contains both blood routine and urine routine, the antecedent is considered to appear in the item text information. Similarly, the consequent appears in the item text information when all items of the consequent appear in it.
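  • The labelling rule can be sketched in a few lines. The sketch assumes the items mentioned in a text have already been extracted into a set, which the application does not detail; the function name and the consequent "liver function" are hypothetical.

```python
def make_labels(text_items, antecedent, consequent):
    """Return 0/1 labels for one text and one association rule.

    A side of the rule is labelled 1 only if all of its items appear in the text.
    """
    present = set(text_items)
    return {
        "antecedent": int(set(antecedent) <= present),
        "consequent": int(set(consequent) <= present),
    }


# Hypothetical items extracted from one piece of item text information.
text_items = {"blood routine", "urine routine", "chest X-ray"}
labels = make_labels(
    text_items,
    antecedent={"blood routine", "urine routine"},
    consequent={"liver function"},
)
print(labels)
```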
  • The text information encoder obtains the encoded vector representation of the item text information and uses the encoded vector representation to predict whether the consequent of an association rule appears in the item text information.
  • The optimization goal of the text information encoder is to maximize the accuracy of predicting whether the consequents of the association rules appear in the item text information. That is, for each association rule, the label data indicating whether the consequent of the rule appears in the item text information is used for training, with a multi-label loss function: each label corresponds to one cross-entropy loss function, and the cross-entropy losses of the multiple labels are summed. The specific loss function is expressed by the formula:
  • where y indicates whether the consequent of the association rule appears in the item text information, the encoder output is the predicted value of whether the consequent of the association rule appears in the item text information,
  • and x indicates whether the antecedent of the association rule appears in the item text information.
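  • The formula is not reproduced in this text. As a hedged sketch only, one standard multi-label form consistent with the description sums a binary cross-entropy term over the labels. The index i over N association rules and the symbol \hat{y}_i for the encoder's predicted value are notation assumed here, not taken from the source; because x is defined alongside y, the original formula may also contain an antecedent-related term that cannot be recovered from this text.

```latex
\mathcal{L}_{\mathrm{enc}}
  = -\sum_{i=1}^{N} \Big[\, y_i \log \hat{y}_i + (1 - y_i) \log\big(1 - \hat{y}_i\big) \Big]
```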
  • The antecedent predictor uses the encoded vector representation and the predicted value of whether the consequent of the association rule appears in the item text information to predict whether the antecedent of the association rule appears in the item text information.
  • The optimization goal of the antecedent predictor is to maximize the accuracy of predicting whether the antecedents of the association rules appear in the item text information. That is, for each association rule, the label data indicating whether the antecedent of the rule appears in the item text information is used for training, with a multi-label loss function.
  • This loss function is likewise the loss function of a multi-label problem, expressed by the formula:
  • The text information encoder and the antecedent predictor perform adversarial learning during training, so that the information related only to the antecedent of the association rule is removed from the item text information, and the information related to both the antecedent and the consequent is retained.
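  • The predictor's formula is also not reproduced in this text. The sketch below gives a multi-label cross-entropy of the same kind for the antecedent labels, followed by one common way of writing opposed objectives. The symbols \hat{x}_i (the predictor's output), the encoder's consequent-prediction loss \mathcal{L}_{\mathrm{enc}}, the parameter sets \theta_{\mathrm{enc}} and \theta_{\mathrm{pred}}, and the weight \lambda are all assumptions introduced for illustration; the source states only that the two optimization goals are opposite.

```latex
\mathcal{L}_{\mathrm{pred}}
  = -\sum_{i=1}^{N} \Big[\, x_i \log \hat{x}_i + (1 - x_i) \log\big(1 - \hat{x}_i\big) \Big]

% One possible adversarial formulation (assumed, not from the source):
\min_{\theta_{\mathrm{pred}}} \; \mathcal{L}_{\mathrm{pred}},
\qquad
\min_{\theta_{\mathrm{enc}}} \; \mathcal{L}_{\mathrm{enc}} - \lambda \, \mathcal{L}_{\mathrm{pred}}
```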
  • 207. In response to the evaluation instruction for the association rules, for each association rule, calculate an evaluation value reflecting the causal relationship between the antecedent and the consequent in the association rule according to the encoded vector representation of the item text information.
  • The item text information contains multiple texts.
  • Specifically, the texts in which the antecedent of the association rule appears can be selected from the item text information as sample texts. The encoded vector representations of the item text information are then traversed to find the texts whose encoded vector representations meet the similarity condition with respect to each sample text; these are the similar target texts of that sample text. For the similar target texts of each sample text, the evaluation value reflecting the causal relationship between the antecedent and the consequent in the association rule is calculated.
  • For example, the item text information contains 100 texts.
  • The antecedent appears in 10 of them, which are selected as the sample texts, that is, sample texts 1 to 10.
  • For a sample text, the encoded vector representations of the 100 texts are traversed to find the 5 target texts most similar to the encoded vector representation of the sample text, and the probability value a1 that the consequent of the association rule appears in those 5 target texts is calculated. A probability value of 0.8 means that the consequent appears in 4 of the 5 target texts.
  • The probability values a2, a3, a4, and a5 can be calculated in the same way, and the probability values are then averaged, (a1+a2+a3+a4+a5)/5, to obtain the evaluation value reflecting the causal relationship between the antecedent and the consequent in the association rule.
  • The evaluation value represents the causal explanation between the antecedent and the consequent of the association rule. It reflects more intuitively whether there is a causal relationship between them and increases the interpretability of the association rule. If the evaluation value is greater than the preset threshold, the causal explanatory power between the antecedent and the consequent is strong, indicating a causal relationship between them; otherwise the explanatory power of the association rule is weak, that is, although the association rule has been mined, the link between its antecedent and consequent is poorly justified.
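  • The evaluation step can be sketched as a nearest-neighbour average. The sketch assumes Euclidean distance as the similarity measure and excludes the sample text itself from its own neighbours; both choices, the function names, and the two-dimensional toy vectors are assumptions, since the application only requires that a similarity condition be met.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two encoded vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def evaluate_rule(vectors, has_antecedent, has_consequent, k):
    """Average, over sample texts containing the antecedent, of the share of
    their k most similar texts in which the consequent appears."""
    samples = [i for i, flag in enumerate(has_antecedent) if flag]
    probabilities = []
    for s in samples:
        others = [j for j in range(len(vectors)) if j != s]
        nearest = sorted(others, key=lambda j: euclidean(vectors[s], vectors[j]))[:k]
        if nearest:
            probabilities.append(sum(has_consequent[j] for j in nearest) / len(nearest))
    return sum(probabilities) / len(probabilities) if probabilities else 0.0


# Hypothetical encoded vectors for six texts, with per-text rule labels.
vectors = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
has_antecedent = [1, 0, 0, 1, 0, 0]
has_consequent = [1, 1, 0, 0, 1, 1]

score = evaluate_rule(vectors, has_antecedent, has_consequent, k=2)
print(score)
```

  • Comparing the returned value with a preset threshold then decides whether the rule is retained as causally explainable; the threshold itself is project-specific.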
  • The evaluation values of association rules can be used to filter or explain the mined association rules in order to support data collocation and data prediction, for example clothing collocation in shopping scenarios, epidemic situation judgment in livestock breeding scenarios, and business push in page access scenarios.
  • The pre-trained text information encoder and antecedent predictor are used to extract features of the pre-collected item text information, and the encoded vector representation of the item text information is used to evaluate the association rules.
  • In this way, the features related only to the antecedent or only to the consequent of an association rule can be removed and the causal explanation of the consequent in terms of the antecedent can be obtained, which reduces false positives among potential rules and reduces subjectivity in the mining of association rules.
  • Using the text information encoder and the antecedent predictor enables fast and stable iteration and improves the interpretability of association rules.
  • An embodiment of the present application provides a device for evaluating association rules based on machine learning.
  • The device includes: a mining unit 31, an extraction unit 32, and an evaluation unit 33.
  • The mining unit 31 can be used to mine association rules from the item collection using an item co-occurrence condition, where each association rule includes an antecedent and a consequent, and the item co-occurrence condition is that the items in the antecedent and the consequent appear simultaneously;
  • the extraction unit 32 can be used to perform feature extraction on the collected item text information by using a pre-trained text information encoder and antecedent predictor to obtain an encoded vector representation of the item text information, where the text information encoder is used to predict whether the consequent of an association rule appears, and the antecedent predictor is used to predict whether the antecedent of an association rule appears;
  • the evaluation unit 33 can be used to, in response to an evaluation instruction for the association rules, evaluate each association rule according to the encoded vector representation of the item text information and obtain an evaluation result reflecting the causal relationship between the antecedent and the consequent in the association rule.
  • The embodiment of the present application provides a device for evaluating association rules based on machine learning. An item co-occurrence condition is used to mine association rules from the item collection, where each association rule includes an antecedent and a consequent and the item co-occurrence condition is that the items in the antecedent and the consequent appear simultaneously. A pre-trained text information encoder and antecedent predictor are used to perform feature extraction on the collected item text information to obtain the encoded vector representation of the item text information; the text information encoder is used to predict whether the consequent of an association rule appears, and the antecedent predictor is used to predict whether the antecedent of an association rule appears. In response to an evaluation instruction for the association rules, each association rule is evaluated according to the encoded vector representation of the item text information, yielding an evaluation result that reflects the causal relationship between the antecedent and the consequent in the association rule.
  • This application introduces causal correction to evaluate the causal relationships of the mined association rules, removes the features related only to the antecedent or only to the consequent of an association rule, and obtains the causal explanation of the consequent in terms of the antecedent. This increases the interpretability of association rules, thereby reducing false positives among association rules and avoiding the influence of subjective factors on association rule screening.
  • FIG. 4 is a schematic structural diagram of another device for evaluating association rules based on machine learning according to an embodiment of the present application. As shown in FIG. 4, the item co-occurrence condition is that the antecedent and the consequent of the association rule appear together, and the mining unit 31 includes:
  • the arrangement module 311 can be used to enumerate the full permutation of the frequent-item subsets contained in the item set;
  • the selection module 312 can be used to generate candidate association rules from the frequent-item subsets and filter them using preset parameter indicators to obtain the rules that meet the preset conditions, where the parameter indicators include at least support and confidence, the support is the co-occurrence frequency of the antecedent and the consequent, and the confidence is the ratio of the support to the probability of the antecedent.
  • the device further includes:
  • the generation unit 34 can be used to, before feature extraction is performed on the collected item text information using the pre-trained text information encoder and antecedent predictor to obtain the coded vector representation of the item text information, use, for each association rule, a predetermined indication of whether the antecedent and the consequent appear in the item text information as label data;
  • the first construction unit 35 can be used to input the item text information carrying the label data into a first network model for training, and construct the text information encoder, whose optimization objective is to best predict whether the consequent of the association rule appears in the item text information;
  • the second construction unit 36 can be used to input the coded vector representation of the item text information output by the first network model, together with the predicted value of whether the consequent of the association rule appears in the item text information, into a second network model for training, and construct the antecedent predictor, whose optimization objective is to best predict whether the antecedent of the association rule appears in the item text information.
  • the text information encoder and the antecedent predictor perform adversarial learning during training, so that information related only to the antecedent of the association rule is removed from the coded vector representation of the item text information, while information related to both the antecedent and the consequent is retained.
  • the evaluation unit 33 includes:
  • the calculation module 331 can be used to calculate, for each association rule, an evaluation value that reflects the causal relationship between the antecedent and the consequent in the association rule according to the coded vector representation of the item text information;
  • the determination module 332 may be configured to determine that there is a causal relationship between the antecedent and the consequent in the association rule if the evaluation value is greater than a preset threshold.
  • the item text information includes a plurality of texts
  • the calculation module 331 includes:
  • the selection sub-module 3311 can be used to select, for each association rule, the texts in which the antecedent of the association rule appears from the item text information as sample texts;
  • the query sub-module 3312 can be used to traverse the coded vector representation of the item text information, and query the text that meets the similarity condition with the coded vector representation of each sample text, as the similar target text of each sample text;
  • the calculation sub-module 3313 can be used to calculate the evaluation value reflecting the causal relationship between the antecedent and the consequent in the association rule for the similar target text of each sample text.
  • the calculation sub-module 3313 can specifically be used to calculate the probability that the consequent of the association rule occurs in the similar target texts of each sample text, obtaining for each sample text the probability value that meets the evaluation condition;
  • the calculation sub-module 3313 can also be used to obtain the evaluation value reflecting the causal relationship between the antecedent and the consequent in the association rule by taking a weighted average of the probability values of the sample texts that meet the evaluation condition.
  • this embodiment also provides a readable storage medium, which may be non-volatile or volatile and on which computer-readable instructions are stored; when the computer-readable instructions are executed by a processor, the above-mentioned method for evaluating association rules based on machine learning as shown in FIG. 1 and FIG. 2 is implemented.
  • the technical solution of the present application can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (such as a CD-ROM, USB flash drive, or portable hard disk) and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, etc.) to execute the methods described in the various implementation scenarios of the present application.
  • the embodiment of this application also provides a computer device, which can be a personal computer, a server, a network device, etc.
  • the physical device includes a readable storage medium and a processor; the readable storage medium is used to store computer-readable instructions; the processor is used to execute the computer-readable instructions to implement the above-mentioned method for evaluating association rules based on machine learning as shown in FIG. 1 and FIG. 2.
  • the computer device may also include a user interface, a network interface, a camera, a radio frequency (Radio Frequency, RF) circuit, a sensor, an audio circuit, a WI-FI module, and the like.
  • the user interface may include a display screen (Display), an input unit such as a keyboard (Keyboard), and the like, and optional user interfaces may also include a USB interface, a card reader interface, and the like.
  • the network interface may include a standard wired interface, a wireless interface (such as a Bluetooth interface, a WI-FI interface) and the like.
  • the physical device structure of the device for evaluating association rules based on machine learning does not constitute a limitation on the physical device, and may include more or fewer components, or combine some components, or different component arrangements.
  • the readable storage medium may also include an operating system and a network communication module.
  • the operating system is a program that manages the hardware and software resources of the above-mentioned computer equipment, and supports the operation of information processing programs and other software and/or programs.
  • the network communication module is used to realize the communication among various components inside the readable storage medium, and communicate with other hardware and software in the physical device.
  • this application can be realized by means of software plus a necessary general-purpose hardware platform, or by hardware.
  • this application introduces causal correction to assess the causal relationship of the mined association rules: features related only to the antecedent or only to the consequent are removed, and a causal explanation of the consequent with respect to the antecedent is obtained. This increases the interpretability of the association rules, reduces false positives among them, and avoids the influence of subjective factors on rule screening.
  • the accompanying drawing is only a schematic diagram of a preferred implementation scenario, and the modules or processes in the accompanying drawings are not necessarily necessary for implementing the present application.
  • the modules in the devices in the implementation scenario can be distributed among the devices in the implementation scenario according to the description of the implementation scenario, or can be located in one or more devices different from the implementation scenario according to corresponding changes.
  • the modules of the above implementation scenarios can be combined into one module, or can be further split into multiple sub-modules.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present application relates to the technical field of artificial intelligence. Disclosed are an association rule assessment method and apparatus based on machine learning, and a computer device and a readable storage medium. The method comprises: mining association rules from an item set by using an item co-occurrence condition, wherein each association rule comprises an antecedent and a consequent, and the item co-occurrence condition is that the items in the antecedent and the consequent occur together; performing feature extraction on collected item text information by using a pre-trained text information encoder and an antecedent predictor, so as to obtain a coded vector representation of the item text information, wherein the text information encoder is used for predicting whether the consequent of the association rule occurs, and the antecedent predictor is used for predicting whether the antecedent of the association rule occurs; and in response to an assessment instruction for the association rules, assessing each association rule according to the coded vector representation of the item text information, so as to obtain an assessment result that reflects a causal relationship between the antecedent and the consequent in the association rule. By means of the present application, a causal relationship assessment can be performed on an association rule, thereby improving the interpretability of the association rule.

Description

Method and device for evaluating association rules based on machine learning
This application claims priority to Chinese patent application No. 202110980623.X, entitled "Method and device for evaluating association rules based on machine learning", filed with the China Patent Office on August 25, 2021, the entire contents of which are incorporated herein by reference.
Technical field
The present application relates to the technical field of artificial intelligence, and in particular to a method, a device, a computer device, and a readable storage medium for evaluating association rules based on machine learning.
Background
Association analysis is a commonly used data mining technique for uncovering intrinsic associations in data, and it applies to many everyday scenarios. For example, in shopping scenarios, association rules reveal common patterns in group purchasing habits and guide product placement in supermarkets; in medical scenarios, association rules are used to mine the likelihood that patients consume particular medical items, which guides doctors in diagnosing cases.
Usually, association rules can be proposed by domain experts, or a candidate set that satisfies certain metrics, such as confidence, support, and lift, can be obtained through data mining and then confirmed as reasonable by experts. However, the inventor realized that the items in an association rule are determined by different factors, and the combined effect of these factors biases the assessment of the relationship between items. For example, consider the association rule "oral anesthesia → root canal". Here "oral anesthesia" may result from the patient undergoing either "tooth extraction" or "root canal treatment", whereas "root canal" results only from "root canal treatment", so inferring "root canal" from "oral anesthesia" is biased. As a result, the mining process described above has two shortcomings. First, the mined association rules contain a large number of false positives and are overly complex, which makes them hard to interpret. Second, the mined association rules rely on expert experience, and the opinions of different experts may differ, which makes the rules subjective.
Summary of the invention
In view of this, the present application provides a method, a device, a computer device, and a readable storage medium for evaluating association rules based on machine learning, whose main purpose is to solve the problem that association rules mined in the prior art are subjective and weakly interpretable.
According to one aspect of the present application, a method for evaluating association rules based on machine learning is provided, the method comprising:
mining association rules from an item set using an item co-occurrence condition, where each association rule includes an antecedent and a consequent, and the item co-occurrence condition is that the items in the antecedent and the consequent appear together;
performing feature extraction on the collected item text information using a pre-trained text information encoder and antecedent predictor to obtain a coded vector representation of the item text information, where the text information encoder is used to predict whether the consequent of the association rule appears, and the antecedent predictor is used to predict whether the antecedent of the association rule appears;
in response to an evaluation instruction for the association rules, evaluating each association rule according to the coded vector representation of the item text information, to obtain an evaluation result reflecting the causal relationship between the antecedent and the consequent in the association rule.
According to another aspect of the present application, a device for evaluating association rules based on machine learning is provided, the device comprising:
a mining unit, configured to mine association rules from an item set using an item co-occurrence condition, where each association rule includes an antecedent and a consequent, and the item co-occurrence condition is that the items in the antecedent and the consequent appear together;
an extraction unit, configured to perform feature extraction on the collected item text information using a pre-trained text information encoder and antecedent predictor to obtain a coded vector representation of the item text information, where the text information encoder is used to predict whether the consequent of the association rule appears, and the antecedent predictor is used to predict whether the antecedent of the association rule appears;
an evaluation unit, configured to, in response to an evaluation instruction for the association rules, evaluate each association rule according to the coded vector representation of the item text information, to obtain an evaluation result reflecting the causal relationship between the antecedent and the consequent in the association rule.
According to yet another aspect of the present application, a computer device is provided, including a memory and a processor, where the memory stores computer-readable instructions, and the processor implements the steps of the method for evaluating association rules based on machine learning when executing the computer-readable instructions.
According to a further aspect of the present application, a computer storage medium is provided, on which computer-readable instructions are stored, and the steps of the method for evaluating association rules based on machine learning are implemented when the computer-readable instructions are executed by a processor.
When evaluating association rules, this application introduces causal correction to assess the causal relationship of the mined association rules: features related only to the antecedent or only to the consequent are removed, and a causal explanation of the consequent with respect to the antecedent is obtained. This increases the interpretability of the association rules, reduces false positives among them, and avoids the influence of subjective factors on rule screening.
Description of drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for the purpose of illustrating the preferred embodiments and are not to be considered as limiting the application. Throughout the drawings, the same reference numerals designate the same parts. In the drawings:
FIG. 1 shows a schematic flowchart of a method for evaluating association rules based on machine learning provided by an embodiment of the present application;
FIG. 2 shows a schematic flowchart of another method for evaluating association rules based on machine learning provided by an embodiment of the present application;
FIG. 3 shows a schematic structural diagram of a device for evaluating association rules based on machine learning provided by an embodiment of the present application;
FIG. 4 shows a schematic structural diagram of another device for evaluating association rules based on machine learning provided by an embodiment of the present application.
Detailed description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. Although exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited by the embodiments set forth herein. Rather, these embodiments are provided so that the present disclosure can be understood more thoroughly and its scope can be fully conveyed to those skilled in the art.
The embodiments of the present application may acquire and process relevant data based on artificial intelligence technology. Artificial intelligence (AI) is the theory, methods, technology, and application systems that use digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results.
Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technologies mainly include computer vision, robotics, biometrics, speech processing, natural language processing, and machine learning/deep learning.
The embodiment of the present application provides a method for evaluating association rules based on machine learning. By using the coded vector representation of item text information to evaluate the causal relationship of each association rule, the method performs causal screening of association rules and increases their interpretability. As shown in FIG. 1, the method includes:
101. Mine association rules from the item set using an item co-occurrence condition.
An association rule has the form condition A → condition B, meaning that condition B can be obtained when condition A is satisfied. Condition A and condition B are the antecedent and the consequent of the association rule, respectively: the items to the left of the arrow form the antecedent, and the items to the right of the arrow form the consequent. The antecedent and the consequent may each consist of one item or multiple items. The item set can involve different fields, for example drug items and test items in the medical field, or payment items and review items in the online shopping field. Specifically, a large amount of user text information can be obtained through preset interface channels and aggregated to form the item set. The item co-occurrence condition is that the antecedent and the consequent of the association rule appear together. For example, in visit text information, a patient purchased item A and item B together; the premise of the item co-occurrence condition is that the items in the antecedent and the items in the consequent appear together.
Specifically, when mining candidate association rules from the item set using the item co-occurrence condition, the first step is to generate the item sets, which can be done with the FP-growth algorithm. Association rules that satisfy a preset condition are then selected from the full permutation of the item sets. Here the preset condition is that both the support and the confidence exceed given thresholds. The support is defined as the frequency with which the antecedent and the consequent co-occur, and the confidence is defined as the ratio of that co-occurrence frequency to the antecedent probability, where the antecedent probability is the co-occurrence frequency of all items in the antecedent.
As an example of the co-occurrence frequency of the antecedent and the consequent: if the item set contains 1,000 examination items in total, among which 800 visit texts include both a routine blood test and a routine urine test, then the co-occurrence frequency of the routine blood and routine urine items is 0.8. As for the antecedent probability, the antecedent is a set of items that may contain one item or several. If it contains one item, the antecedent probability is the probability that the item occurs; if it contains several items, the antecedent probability is the frequency with which those items co-occur. The given thresholds can be set according to actual project requirements. If the requirement is quality inspection, where a sample that violates a rule is judged to be a violation, the preset condition should combine high confidence with low support.
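The support and confidence definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the applicant's implementation: it enumerates itemsets by brute force where the text suggests FP-growth, the names `support` and `mine_rules` are hypothetical, and the toy data scales the 800-of-1,000 example down to 8 of 10 visits. The single extra "blood only" visit is an assumption added so that the two confidences differ.

```python
from itertools import combinations


def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = frozenset(items)
    return sum(items <= t for t in transactions) / len(transactions)


def mine_rules(transactions, min_support, min_confidence, max_size=3):
    """Return (antecedent, consequent, support, confidence) tuples.

    Brute-force enumeration, used here only as a stand-in for FP-growth.
    """
    transactions = [frozenset(t) for t in transactions]
    universe = sorted(set().union(*transactions))
    rules = []
    for size in range(2, max_size + 1):
        for itemset in combinations(universe, size):
            sup = support(transactions, itemset)
            if sup == 0 or sup < min_support:
                continue
            # Split the frequent itemset into every antecedent/consequent pair.
            for k in range(1, size):
                for antecedent in combinations(itemset, k):
                    consequent = tuple(i for i in itemset if i not in antecedent)
                    # Confidence = co-occurrence frequency / antecedent probability.
                    conf = sup / support(transactions, antecedent)
                    if conf >= min_confidence:
                        rules.append((antecedent, consequent, sup, conf))
    return rules


# Toy data (assumed): 8 visits with both tests, 1 with blood only, 1 unrelated.
visits = [{"blood", "urine"}] * 8 + [{"blood"}, {"xray"}]
rules = mine_rules(visits, min_support=0.5, min_confidence=0.9)
for antecedent, consequent, sup, conf in rules:
    print(antecedent, "->", consequent, round(sup, 3), round(conf, 3))
```

With these thresholds only "urine → blood" survives: "blood → urine" has the same support of 0.8, but its confidence is 0.8 / 0.9, which falls below 0.9.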
In this embodiment of the application, the execution subject may be a device for evaluating association rules based on machine learning, deployed on the server side. The association rules mined from the item set using the item co-occurrence condition satisfy the preset condition; this serves as a preliminary screening of association rules and identifies the associations present in the item set.
The above server may be an independent server, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, content delivery networks (Content Delivery Network, CDN), and big data and artificial intelligence platforms.
102. Perform feature extraction on the collected item text information using a pre-trained text information encoder and antecedent predictor to obtain a coded vector representation of the item text information.
The text information encoder is used to predict whether the consequent of the association rule appears, and can use a natural language model such as TextCNN or BERT. Its input is the item text information and its output is the coded vector representation of the item text information; by further classifying this coded vector representation, it can also output a predicted value of whether the consequent of the association rule appears in the item text information. The antecedent predictor is used to predict whether the antecedent of the association rule appears, and can use a deep neural network model, namely a multi-layer perceptron, in which the input of layer $l$ is the output of layer $l-1$: $z_l = \mathrm{ReLU}(w_l z_{l-1} + b_l)$, where $w_l$ and $b_l$ are the model parameters of layer $l$ and ReLU is the activation function, $\mathrm{ReLU}(x) = \max(0, x)$. Its inputs are the coded vector representation of the item text information and the predicted value of whether the consequent appears in the item text information; its output is a predicted value of whether the antecedent of the association rule appears in the item text information.
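As a small illustration of the layer recurrence used by the antecedent predictor, the following pure-Python sketch applies ReLU layers to an input formed by concatenating a coded vector with the consequent prediction. The weights, the two-layer shape, and the function names are all hypothetical. The source does not say how the final output is turned into a prediction (a sigmoid output layer would be a common choice, but that is an assumption), so the sketch stops at the ReLU layers.

```python
def relu(x):
    """ReLU activation: max(0, x)."""
    return max(0.0, x)


def dense_relu(weights, bias, inputs):
    """One layer: z_l = ReLU(w_l z_{l-1} + b_l), with w_l stored row by row."""
    return [
        relu(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, bias)
    ]


def mlp_forward(layers, inputs):
    """Feed `inputs` through each (weights, bias) pair in turn."""
    z = list(inputs)
    for weights, bias in layers:
        z = dense_relu(weights, bias, z)
    return z


# Hypothetical values: a 2-dimensional coded vector plus the consequent prediction.
coded_vector = [0.5, -1.0]
consequent_prediction = 0.9
predictor_input = coded_vector + [consequent_prediction]

layers = [
    ([[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]], [0.0, 0.5]),  # layer 1: 3 inputs -> 2 units
    ([[0.5, 2.0]], [-0.2]),                            # layer 2: 2 inputs -> 1 unit
]
hidden = dense_relu(*layers[0], predictor_input)
output = mlp_forward(layers, predictor_input)
```

The second hidden unit receives a negative pre-activation (-1.0 + 0.5) and is clipped to zero, which is the only non-linearity in this model.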
Here the item text information can be medical text data, such as electronic healthcare records (Electronic Healthcare Record), that is, electronic personal health records, including medical cases, electrocardiograms, medical images, and other electronic records worth preserving for future reference.
Here the text information encoder and the antecedent predictor can be obtained by training network models with machine learning algorithms, using whether the antecedent and the consequent of each association rule appear in the item text information as label data, so that the item text information is vectorized into a coded vector representation. During training, the text information encoder and the antecedent predictor perform adversarial learning, that is, their optimization objectives are opposed. Through adversarial learning, information related only to the antecedent of the association rule is removed from the coded vector representation of the item text information, while information related to both the antecedent and the consequent is retained.
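The opposed objectives can be made concrete with a sketch of the two losses. The source does not give loss functions, so everything here is an assumption: binary cross-entropy is used for both predictions, the encoder's loss subtracts the predictor's loss (one common adversarial formulation), and the trade-off weight `lam` is a hypothetical hyperparameter.

```python
import math


def bce(y_true, p_pred, eps=1e-12):
    """Binary cross-entropy for a single example."""
    p = min(max(p_pred, eps), 1.0 - eps)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))


def predictor_loss(antecedent_label, antecedent_prob):
    """The antecedent predictor is trained to predict the antecedent well."""
    return bce(antecedent_label, antecedent_prob)


def encoder_loss(consequent_label, consequent_prob,
                 antecedent_label, antecedent_prob, lam=1.0):
    """The encoder is trained to predict the consequent well while making the
    antecedent predictor do badly, hence the minus sign (assumed formulation)."""
    return (bce(consequent_label, consequent_prob)
            - lam * bce(antecedent_label, antecedent_prob))


# Same text, same consequent prediction; only the predictor's success differs.
loss_when_predictor_succeeds = encoder_loss(1, 0.9, 1, 0.9)
loss_when_predictor_fails = encoder_loss(1, 0.9, 1, 0.1)
```

Under this formulation the encoder's loss is lower when the antecedent predictor fails, so minimizing it pushes antecedent-only information out of the coded vector, while the consequent term keeps information that is useful for predicting the consequent.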
103. In response to an evaluation instruction for the association rules, evaluate each association rule according to the coded vector representation of the item text information, and obtain an evaluation result reflecting the causal relationship between the antecedent and the consequent in the association rule.
It can be understood that, for each association rule, during adversarial learning between the text information encoder and the antecedent predictor, the coded vector representation of the item text information is used to predict the consequent of the association rule, and the consequent together with the coded vector representation is then used to predict the antecedent. This removes information in the item text information that is unrelated to the consequent and achieves causal correction of the item text information. The corrected item text information is then used to assess, from similar item texts, the causal contribution of the antecedent to the consequent in the association rule.
Specifically, for each association rule, the item text information contains a large number of texts, and the texts in which the antecedent occurs can be selected as a text sample set; each association rule thus selects multiple text samples. For each text sample in the set, the coded vector representations of the item text information are traversed to find the K item texts whose coded vector representations are most similar to that of the text sample, which can be done by computing the distance between coded vectors. For each text sample, the probability that the consequent of the association rule occurs among these K item texts is then calculated, and the average of these probabilities over all text samples is taken as the evaluation value reflecting the causal relationship between the antecedent and the consequent. This evaluation value characterizes the causal relationship between the antecedent and the consequent in the association rule.
In the method for evaluating association rules based on machine learning provided by the embodiment of the present application, association rules are mined from an item set using an item co-occurrence condition, where each association rule includes an antecedent and a consequent and the item co-occurrence condition is that the items of the antecedent and of the consequent appear together. A pre-trained text information encoder and antecedent predictor are used to perform feature extraction on the collected item text information to obtain its encoded vector representation, where the text information encoder is used to predict whether the consequent of an association rule appears and the antecedent predictor is used to predict whether the antecedent appears. In response to an association rule evaluation instruction, each association rule is evaluated according to the encoded vector representation of the item text information, to obtain an evaluation result reflecting the causal relationship between the antecedent and the consequent. Compared with prior-art approaches that use association rules directly as obtained by data mining, the present application introduces causal correction to evaluate the causality of the mined association rules, removes features related only to the antecedent or only to the consequent, and obtains a causal explanation of the consequent with respect to the antecedent. This increases the interpretability of the association rules, reduces false positives among them, and avoids the influence of subjective factors on association rule screening.
An embodiment of the present application provides another method for evaluating association rules based on machine learning. Each association rule is evaluated for causality using the encoded vector representation of the item text information, which enables causal screening of the association rules and increases their interpretability. As shown in FIG. 2, the method includes:
201. Fully enumerate the frequent item subsets contained in the item set.
Here, the item set is a set composed of different items. An item may be a customer consumption item, for example, milk or biscuits, or a medical payment item, for example, a routine blood test or a urine test. Associations between items in the item set can, to some extent, guide consumption or assist medical reimbursement; for example, a customer who purchases item A and item B also purchases item C, and a patient who pays for medical item C also pays for medical item D and medical item E. To capture such associations, a frequent item subset is defined as a subset that contains at least one item of the item set and whose items appear together in a single record a number of times greater than or equal to the minimum support. Specifically, during enumeration, the possible subset sizes are determined from the number of items in the item set, all item subsets of each size are listed, and the subsets whose support is greater than a preset threshold are selected as frequent item subsets. The selection can follow two principles: if an item subset is frequent, then all of its subsets are frequent; if an item subset is infrequent, then all of its supersets are infrequent. Applying these principles reduces the time needed to generate the frequent item subsets.
For example, for the item set {A,B,C,D}, first list the item subsets containing one item: {A}, {B}, {C}, {D}; then the item subsets containing two items: {A,B}, {A,C}, {A,D}, {B,C}, {B,D}, {C,D}; and then the item subsets containing three items: {A,B,C}, {A,B,D}, {A,C,D}, {B,C,D}. The frequent item subsets, whose support is greater than 3/5, are {A}, {B}, {C}, {A,B}, {B,C}, {A,C}, {A,B,C}.
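The enumeration and the two pruning principles can be sketched in Python as a level-wise search. This is a minimal illustration, not the application's implementation: the function name `frequent_itemsets` and the five transaction records are hypothetical, and the records were chosen only so that a support threshold of 3/5 has something to filter.

```python
from fractions import Fraction
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for all itemsets with support > min_support."""
    n = len(transactions)
    items = sorted(set().union(*transactions))
    frequent = {}
    current = [frozenset([item]) for item in items]  # size-1 candidates
    k = 1
    while current:
        level = {}
        for candidate in current:
            # Support = fraction of records that contain every item of the candidate.
            support = Fraction(sum(candidate <= t for t in transactions), n)
            if support > min_support:
                level[candidate] = support
        frequent.update(level)
        k += 1
        # Join step: build size-k candidates from frequent size-(k-1) subsets.
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: a superset of an infrequent subset cannot be frequent.
        current = [
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        ]
    return frequent


# Hypothetical records; each record is the set of items that occur together.
transactions = [
    {"A", "B", "C"},
    {"A", "B", "C"},
    {"A", "B", "C", "D"},
    {"A", "B", "C"},
    {"A", "D"},
]
result = frequent_itemsets(transactions, Fraction(3, 5))
for itemset in sorted(result, key=lambda s: (len(s), sorted(s))):
    print(sorted(itemset), result[itemset])
```

Exact fractions are used for support so that the comparison against 3/5 is not affected by floating-point rounding.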
202. Generate candidate association rules for the frequent item subsets, and filter the candidate association rules using preset parameter indicators to obtain candidate rules that meet a preset condition.
Here, the parameter indicators include at least support and confidence: support is the co-occurrence frequency of the antecedent and the consequent, and confidence is the ratio of the support to the probability of the antecedent. The candidate association rules generated for a frequent item subset correspond to derivation relationships among its items. For example, for the frequent item subset {A,B,C}, the derivation relationships may include: A,B => C; A,C => B; B,C => A; A => B,C; B => A,C; C => A,B.
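The derivation relationships of a frequent item subset can be generated by splitting it into every non-empty antecedent and its complementary consequent. The sketch below assumes this splitting scheme; the function name `candidate_rules` is hypothetical.

```python
from itertools import combinations


def candidate_rules(itemset):
    """List every (antecedent, consequent) split of an itemset, both parts non-empty."""
    items = sorted(itemset)
    rules = []
    for size in range(1, len(items)):
        for antecedent in combinations(items, size):
            consequent = tuple(i for i in items if i not in antecedent)
            rules.append((antecedent, consequent))
    return rules


for antecedent, consequent in candidate_rules({"A", "B", "C"}):
    print(",".join(antecedent), "=>", ",".join(consequent))
```

For a subset of n items this yields 2^n - 2 candidate rules, which is why the filtering in the next step matters.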
Further, to generate valid association rules between items, it must be determined whether each candidate association rule satisfies the parameter indicators. A candidate rule whose support or confidence does not meet the requirement indicates a weak association among the items of the frequent item subset and has little reference value. Thresholds on the parameter indicators are therefore set as the preset condition, so as to filter out weakly associated candidate rules and improve the reliability of the association rules.
Specifically, the items of each frequent item subset can form multiple candidate association rules, and support and confidence can be calculated for each of them. If both meet the preset condition, that is, the support exceeds the set support threshold and the confidence exceeds the set confidence threshold, the candidate rule is strongly associated and is retained; otherwise, the candidate rule is filtered out.
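The filtering step can be sketched as follows, using the definitions given above: support is the co-occurrence frequency of antecedent and consequent, and confidence is support divided by the frequency of the antecedent. The function names, the records, and the threshold values are illustrative assumptions.

```python
from fractions import Fraction


def support(itemset, transactions):
    """Fraction of records that contain every item of the itemset."""
    wanted = set(itemset)
    return Fraction(sum(wanted <= t for t in transactions), len(transactions))


def filter_rules(rules, transactions, min_support, min_confidence):
    """Keep rules whose support and confidence both exceed their thresholds."""
    kept = []
    for antecedent, consequent in rules:
        rule_support = support(set(antecedent) | set(consequent), transactions)
        antecedent_prob = support(antecedent, transactions)
        if antecedent_prob == 0:
            continue  # antecedent never occurs, so confidence is undefined
        confidence = rule_support / antecedent_prob
        if rule_support > min_support and confidence > min_confidence:
            kept.append((antecedent, consequent, rule_support, confidence))
    return kept


# Hypothetical records and thresholds.
transactions = [
    {"A", "B", "C"},
    {"A", "B", "C"},
    {"A", "B", "C", "D"},
    {"A", "B", "C"},
    {"A", "D"},
]
rules = [(("A",), ("B", "C")), (("B", "C"), ("A",))]
for rule in filter_rules(rules, transactions, Fraction(3, 5), Fraction(9, 10)):
    print(rule)
```

With these records, both rules have the same support, but only the rule whose antecedent is {B,C} passes the confidence threshold, because A also occurs in a record without B and C.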
203. For each association rule, use predetermined indications of whether the antecedent and the consequent appear in the item text information as label data.
In the present application, each piece of item text information contains at least one item. For each association rule, the antecedent is regarded as appearing in the item text information when all items of the antecedent occur in it. For example, if the antecedent consists of the items routine blood test and routine urine test, and the item text information contains both, the antecedent is regarded as appearing in the item text information. Likewise, the consequent appears in the item text information when all items of the consequent appear in it.
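The labeling rule can be sketched as a subset test. The sketch assumes that each item text has already been reduced to the set of item names it mentions; how items are recognized in raw text is not described here, and the function name `rule_labels` and the example items other than the two routine tests are hypothetical.

```python
def rule_labels(text_items, antecedent, consequent):
    """Return (x, y): 1 if all antecedent / consequent items occur in the text, else 0."""
    present = set(text_items)
    x = int(set(antecedent) <= present)  # antecedent label
    y = int(set(consequent) <= present)  # consequent label
    return x, y


# Hypothetical item text, already parsed into item names.
text_items = {"routine blood test", "routine urine test", "X-ray"}
antecedent = ["routine blood test", "routine urine test"]
consequent = ["CT scan"]
print(rule_labels(text_items, antecedent, consequent))
```

A label is 1 only when every item of the corresponding side of the rule is present, which matches the "all items occur" condition stated above.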
204. Input the item text information carrying the label data into a first network model for training, to construct the text information encoder.
Specifically, during training, the text information encoder obtains the encoded vector representation of the item text information and uses it to predict whether the consequent of the association rule appears in the item text information. The optimization objective of the text information encoder is to predict as accurately as possible whether the consequent of the association rule appears in the item text information. That is, for each association rule, training uses the label data indicating whether the consequent appears in the item text information, together with a multi-label loss function in which each label corresponds to one cross-entropy loss and the losses for multiple labels are summed. The loss function is expressed as:
Figure PCTCN2022071425-appb-000001
where y indicates whether the consequent of the association rule appears in the item text information; the symbol shown in Figure PCTCN2022071425-appb-000002 is the encoder's predicted value of whether the consequent appears in the item text information; x indicates whether the antecedent of the association rule appears in the item text information; and the symbol shown in Figure PCTCN2022071425-appb-000003 is the antecedent predictor's predicted value of whether the antecedent appears in the item text information.
205. Input the encoded vector representation of the item text information output by the first network model, together with the predicted value of whether the consequent of the association rule appears in the item text information, into a second network model for training, to construct the antecedent predictor.
Specifically, during training, the antecedent predictor uses the encoded vector representation and the predicted value of whether the consequent appears in the item text information to predict whether the antecedent of the association rule appears in the item text information. The optimization objective of the antecedent predictor is to predict as accurately as possible whether the antecedent of the association rule appears in the item text information. That is, for each association rule, training uses the label data indicating whether the antecedent appears in the item text information, together with a loss function that is likewise a multi-label loss, expressed as:
Figure PCTCN2022071425-appb-000004
206. Use the pre-trained text information encoder and antecedent predictor to perform feature extraction on the collected item text information, to obtain the encoded vector representation of the item text information.
It should be noted that the text information encoder and the antecedent predictor perform adversarial learning during training, so that information related only to the antecedent of the association rule is removed from the item text information, while information related to both the antecedent and the consequent is retained.
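The two training objectives of steps 204 and 205 and their adversarial coupling can be written out as a sketch. The formulas of the application are available here only as figure references, so the expressions below are an assumed, standard multi-label binary cross-entropy and not necessarily the application's exact form. The label counts m and n, the weight λ, and the combined encoder objective are illustrative assumptions; y, x and their predicted values follow the definitions given with the first loss function.

```latex
% Consequent loss (text information encoder), summed over m consequent labels
\mathcal{L}_{y} = -\sum_{i=1}^{m} \Big[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \Big]

% Antecedent loss (antecedent predictor), summed over n antecedent labels
\mathcal{L}_{x} = -\sum_{j=1}^{n} \Big[ x_j \log \hat{x}_j + (1 - x_j) \log (1 - \hat{x}_j) \Big]

% Assumed adversarial coupling: the predictor minimizes its own loss,
% while the encoder is rewarded when the predictor fails
\min_{\text{predictor}} \; \mathcal{L}_{x},
\qquad
\min_{\text{encoder}} \; \mathcal{L}_{y} - \lambda \, \mathcal{L}_{x},
\quad \lambda > 0
```

Under this reading, the opposed objectives push the encoder to keep what is needed to predict the consequent while discarding what would help recover the antecedent beyond that.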
207. In response to an association rule evaluation instruction, for each association rule, calculate, according to the encoded vector representation of the item text information, an evaluation value reflecting the causal relationship between the antecedent and the consequent of the association rule.
Here, the item text information contains multiple texts. For each association rule, the texts in which the antecedent of the rule appears are selected from the item text information as sample texts. The encoded vector representations of the item text information are then traversed to find, for each sample text, the texts whose encoded vector representations satisfy a similarity condition with respect to that of the sample text; these are the similar target texts of the sample text. Based on the similar target texts of each sample text, the evaluation value reflecting the causal relationship between the antecedent and the consequent of the association rule is calculated.
Specifically, in calculating this evaluation value, the probability that the consequent of the association rule appears among the similar target texts of each sample text is calculated, giving for each sample text a probability value of meeting the evaluation condition. The evaluation value reflecting the causal relationship between the antecedent and the consequent is then obtained as a weighted average of these probability values over the sample texts.
For example, suppose the item text information contains 100 texts and, for a given association rule, the antecedent occurs in 10 of them, which form sample texts 1 to 10. For sample text 1, the encoded vector representations of the 100 texts are traversed to find the 5 target texts whose encoded vector representations are most similar to that of the sample text, and the probability a1 that the consequent of the association rule appears among those 5 target texts is calculated; a value of 0.8 means that the consequent appears in 4 of the 5 target texts. Probability values a2, a3, ..., a10 are calculated in the same way for sample texts 2 to 10, and their average, (a1+a2+...+a10)/10, is the evaluation value reflecting the causal relationship between the antecedent and the consequent of the association rule.
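The neighbor-based score can be sketched in Python as follows. Several choices here are assumptions rather than details from the text: Euclidean distance is used as the similarity measure, a sample text is excluded from its own neighbors, all sample texts are weighted equally, and the function name `causal_score` and the toy one-dimensional vectors are hypothetical.

```python
import math


def causal_score(vectors, has_antecedent, has_consequent, k):
    """Average, over texts containing the antecedent, of the share of their
    k nearest texts (by encoded vector) that contain the consequent."""
    samples = [i for i, flag in enumerate(has_antecedent) if flag]
    probabilities = []
    for i in samples:
        others = [j for j in range(len(vectors)) if j != i]
        # Rank the remaining texts by distance to the sample's encoded vector.
        others.sort(key=lambda j: math.dist(vectors[i], vectors[j]))
        neighbors = others[:k]
        if not neighbors:
            continue  # no other text to compare with
        hits = sum(has_consequent[j] for j in neighbors)
        probabilities.append(hits / len(neighbors))
    if not probabilities:
        return None  # the antecedent never occurs, so no score is defined
    return sum(probabilities) / len(probabilities)


# Toy encoded vectors (one dimension for readability) and rule labels per text.
vectors = [(0.0,), (0.1,), (0.2,), (5.0,), (5.1,)]
has_antecedent = [1, 0, 0, 1, 0]
has_consequent = [1, 1, 0, 0, 0]
print(causal_score(vectors, has_antecedent, has_consequent, k=2))
```

In the toy data, the first sample has one neighbor with the consequent and one without, and the second sample has none, so the score is the mean of 0.5 and 0.0.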
208. If the evaluation value is greater than a preset threshold, determine that there is a causal relationship between the antecedent and the consequent of the association rule.
It can be understood that the evaluation value characterizes the causal explanation between the antecedent and the consequent of the association rule. It reflects more directly whether there is a causal relationship between them and increases the interpretability of the association rule. An evaluation value greater than the preset threshold indicates that the causal explanation between the antecedent and the consequent is strong, that is, that there is a causal relationship between them. Otherwise, the explanatory power of the association rule is weak: although the rule was mined, the pairing of its antecedent and consequent is poorly justified.
In practical application scenarios, the evaluation of association rules can be used to filter or explain the mined association rules, so as to support data matching and data prediction, for example, clothing matching in shopping scenarios, epidemic assessment in livestock breeding scenarios, and service pushing in page access scenarios.
In the present application, for each association rule, the pre-trained text information encoder and antecedent predictor are used to perform feature extraction on the pre-collected item text information, and the resulting encoded vector representation of the item text information is used to evaluate whether there is a causal relationship between the antecedent and the consequent of the rule. This removes features related only to the antecedent or only to the consequent and yields a causal explanation of the consequent with respect to the antecedent, thereby reducing false positives among potential rules and reducing subjectivity in association rule mining. Using the text information encoder and the antecedent predictor allows fast and stable iteration and improves the interpretability of the association rules.
Further, as a specific implementation of the method shown in FIG. 1, an embodiment of the present application provides an apparatus for evaluating association rules based on machine learning. As shown in FIG. 3, the apparatus includes: a mining unit 31, an extraction unit 32, and an evaluation unit 33.
The mining unit 31 may be configured to mine association rules from an item set using an item co-occurrence condition, where each association rule includes an antecedent and a consequent, and the item co-occurrence condition is that the items of the antecedent and of the consequent appear together;
The extraction unit 32 may be configured to perform feature extraction on collected item text information using a pre-trained text information encoder and antecedent predictor, to obtain an encoded vector representation of the item text information, where the text information encoder is used to predict whether the consequent of the association rule appears, and the antecedent predictor is used to predict whether the antecedent of the association rule appears;
The evaluation unit 33 may be configured to, in response to an association rule evaluation instruction, evaluate each association rule according to the encoded vector representation of the item text information, to obtain an evaluation result reflecting the causal relationship between the antecedent and the consequent of the association rule.
In the apparatus for evaluating association rules based on machine learning provided by the embodiment of the present application, association rules are mined from an item set using an item co-occurrence condition, where each association rule includes an antecedent and a consequent and the item co-occurrence condition is that the items of the antecedent and of the consequent appear together. A pre-trained text information encoder and antecedent predictor are used to perform feature extraction on the collected item text information to obtain its encoded vector representation, where the text information encoder is used to predict whether the consequent of an association rule appears and the antecedent predictor is used to predict whether the antecedent appears. In response to an association rule evaluation instruction, each association rule is evaluated according to the encoded vector representation of the item text information, to obtain an evaluation result reflecting the causal relationship between the antecedent and the consequent. Compared with prior-art approaches that use association rules directly as obtained by data mining, the present application introduces causal correction to evaluate the causality of the mined association rules, removes features related only to the antecedent or only to the consequent, and obtains a causal explanation of the consequent with respect to the antecedent. This increases the interpretability of the association rules, reduces false positives among them, and avoids the influence of subjective factors on association rule screening.
As a further description of the apparatus for evaluating association rules based on machine learning shown in FIG. 3, FIG. 4 is a schematic structural diagram of another apparatus for evaluating association rules based on machine learning according to an embodiment of the present application. As shown in FIG. 4, the item co-occurrence condition is that the antecedent and the consequent of the association rule appear together, and the mining unit 31 includes:
an enumeration module 311, which may be configured to fully enumerate the frequent item subsets contained in the item set;
a selection module 312, which may be configured to generate candidate association rules for the frequent item subsets and to filter the candidate association rules using preset parameter indicators to obtain candidate rules that meet a preset condition, where the parameter indicators include at least support and confidence, the support is the co-occurrence frequency of the antecedent and the consequent, and the confidence is the ratio of the support to the probability of the antecedent.
In a specific application scenario, as shown in FIG. 4, the apparatus further includes:
a generation unit 34, which may be configured to, before the feature extraction is performed on the collected item text information using the pre-trained text information encoder and antecedent predictor to obtain the encoded vector representation of the item text information, use, for each association rule, predetermined indications of whether the antecedent and the consequent appear in the item text information as label data;
a first construction unit 35, which may be configured to input the item text information carrying the label data into a first network model for training, to construct the text information encoder, where the optimization objective of the text information encoder is to predict as accurately as possible whether the consequent of the association rule appears in the item text information;
a second construction unit 36, which may be configured to input the encoded vector representation of the item text information output by the first network model, together with the predicted value of whether the consequent of the association rule appears in the item text information, into a second network model for training, to construct the antecedent predictor, where the optimization objective of the antecedent predictor is to predict as accurately as possible whether the antecedent of the association rule appears in the item text information.
In a specific application scenario, the text information encoder and the antecedent predictor perform adversarial learning during training, so that information related only to the antecedent of the association rule is removed from the item text information, while information related to both the antecedent and the consequent is retained.
In a specific application scenario, as shown in FIG. 4, the evaluation unit 33 includes:
a calculation module 331, which may be configured to calculate, for each association rule and according to the encoded vector representation of the item text information, an evaluation value reflecting the causal relationship between the antecedent and the consequent of the association rule;
a determination module 332, which may be configured to determine that there is a causal relationship between the antecedent and the consequent of the association rule if the evaluation value is greater than a preset threshold.
In a specific application scenario, as shown in FIG. 4, the item text information contains multiple texts, and the calculation module 331 includes:
a selection submodule 3311, which may be configured to select, for each association rule, the texts in which the antecedent of the rule appears from the item text information as sample texts;
a query submodule 3312, which may be configured to traverse the encoded vector representations of the item text information and find, for each sample text, the texts whose encoded vector representations satisfy a similarity condition with respect to that of the sample text, as the similar target texts of the sample text;
a calculation submodule 3313, which may be configured to calculate, based on the similar target texts of each sample text, the evaluation value reflecting the causal relationship between the antecedent and the consequent of the association rule.
In a specific application scenario, as shown in FIG. 4, the calculation submodule 3313 may specifically be configured to calculate, for the similar target texts of each sample text, the probability that the consequent of the association rule appears among them, to obtain for each sample text a probability value of meeting the evaluation condition;
the calculation submodule 3313 may further be configured to obtain the evaluation value reflecting the causal relationship between the antecedent and the consequent of the association rule as a weighted average of the probability values of the sample texts.
It should be noted that, for other descriptions of the functional units of the apparatus for evaluating association rules based on machine learning provided in this embodiment, reference may be made to the corresponding descriptions of FIG. 1 and FIG. 2, which are not repeated here.
Based on the methods shown in FIG. 1 and FIG. 2, this embodiment correspondingly further provides a readable storage medium, which may be non-volatile or volatile and on which computer-readable instructions are stored. When executed by a processor, the computer-readable instructions implement the method for evaluating association rules based on machine learning shown in FIG. 1 and FIG. 2.
Based on this understanding, the technical solution of the present application can be embodied in the form of a software product. The software product can be stored in a non-volatile storage medium (such as a CD-ROM, a USB flash drive, or a removable hard disk) and includes several instructions for causing a computer device (such as a personal computer, a server, or a network device) to execute the methods described in the implementation scenarios of the present application.
基于上述如图1、图2所示的方法,以及图3、图4所示的虚拟装置实施例,为了实现上述目的,本申请实施例还提供了一种计算机设备,具体可以为个人计算机、服务器、网络设备等,该实体设备包括可读存储介质和处理器;可读存储介质,用于存储计算机可读指令;处理器,用于执行计算机可读指令以实现上述如图1、图2所示的基于机器学习对关联规则进行评估的方法Based on the method shown in Figure 1 and Figure 2 above, and the virtual device embodiment shown in Figure 3 and Figure 4, in order to achieve the above purpose, the embodiment of this application also provides a computer device, which can be a personal computer, Servers, network devices, etc., the physical device includes a readable storage medium and a processor; the readable storage medium is used to store computer-readable instructions; the processor is used to execute computer-readable instructions to achieve the above as shown in Figure 1 and Figure 2 The method for evaluating association rules based on machine learning shown in
Optionally, the computer device may further include a user interface, a network interface, a camera, a radio frequency (RF) circuit, a sensor, an audio circuit, a Wi-Fi module, and the like. The user interface may include a display and an input unit such as a keyboard; optionally, the user interface may further include a USB interface, a card reader interface, and the like. The network interface may optionally include a standard wired interface, a wireless interface (such as a Bluetooth interface or a Wi-Fi interface), and the like.
Those skilled in the art will understand that the physical device structure of the apparatus for evaluating association rules based on machine learning provided in this embodiment does not limit the physical device, which may include more or fewer components, combine certain components, or use a different arrangement of components.
The readable storage medium may further include an operating system and a network communication module. The operating system is a program that manages the hardware and software resources of the computer device and supports the running of information processing programs and other software and/or programs. The network communication module is used to implement communication among the components inside the readable storage medium, as well as communication with other hardware and software in the physical device.
From the above description of the embodiments, those skilled in the art will clearly understand that the present application may be implemented by software together with a necessary general-purpose hardware platform, or by hardware. Compared with the prior art, the technical solution of the present application introduces causal correction to evaluate the causal relationship of the mined association rules: features related only to the antecedent or only to the consequent of an association rule are removed, yielding a causal explanation of the consequent in terms of the antecedent. This increases the interpretability of the association rules, thereby reducing false positives among them and avoiding the influence of subjective factors on the screening of association rules.
Those skilled in the art will understand that the accompanying drawings are only schematic diagrams of a preferred implementation scenario, and that the modules or processes in the drawings are not necessarily required to implement the present application. Those skilled in the art will also understand that the modules of the apparatus in an implementation scenario may be distributed in the apparatus as described for that scenario, or may be changed accordingly and located in one or more apparatuses different from that scenario. The modules of the above implementation scenarios may be combined into one module or further split into multiple sub-modules.
The serial numbers used in the present application are for description only and do not indicate the relative merits of the implementation scenarios. The above discloses only several specific implementation scenarios of the present application; however, the present application is not limited thereto, and any variation conceivable by those skilled in the art shall fall within the protection scope of the present application.

Claims (20)

  1. A method for evaluating association rules based on machine learning, wherein the method comprises:
    mining association rules from an item set using an item co-occurrence condition, wherein each association rule comprises an antecedent and a consequent, and the item co-occurrence condition is that the items in the antecedent and the items in the consequent appear together;
    performing feature extraction on collected item text information using a pre-trained text information encoder and a pre-trained antecedent predictor to obtain an encoded vector representation of the item text information, wherein the text information encoder is used to predict whether the consequent of the association rule appears, and the antecedent predictor is used to predict whether the antecedent of the association rule appears;
    in response to an evaluation instruction for the association rules, evaluating each association rule according to the encoded vector representation of the item text information to obtain an evaluation result reflecting the causal relationship between the antecedent and the consequent of the association rule.
  2. The method according to claim 1, wherein the item co-occurrence condition is that the antecedent and the consequent of an association rule appear together, and the mining of association rules from the item set using the item co-occurrence condition specifically comprises:
    fully permuting the frequent item subsets contained in the item set;
    generating candidate association rules for the frequent item subsets, and filtering the candidate association rules using preset parameter indicators to obtain candidate rules that satisfy a preset condition, wherein the parameter indicators comprise at least support and confidence, the support is the co-occurrence frequency of the antecedent and the consequent, and the confidence is the ratio of the support to the probability of the antecedent.
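The support and confidence filter recited in claim 2 can be illustrated with a minimal sketch. This is not the applicant's implementation: the function name `mine_rules`, the brute-force enumeration of itemsets, and the thresholds `min_support` and `min_confidence` are assumptions made for illustration only.

```python
from itertools import combinations


def mine_rules(transactions, min_support, min_confidence):
    """Return (antecedent, consequent, support, confidence) tuples.

    Hypothetical helper: brute-force enumeration, suitable only for toy data.
    """
    sets = [frozenset(t) for t in transactions]
    n = len(sets)

    def freq(itemset):
        # Fraction of transactions that contain every item of `itemset`.
        return sum(1 for t in sets if itemset <= t) / n

    items = sorted(set().union(*sets))
    rules = []
    for size in range(2, len(items) + 1):
        for combo in combinations(items, size):
            itemset = frozenset(combo)
            # Support: co-occurrence frequency of antecedent and consequent.
            support = freq(itemset)
            if support == 0 or support < min_support:
                continue
            # Split the frequent itemset into every antecedent/consequent pair.
            for k in range(1, size):
                for ante in combinations(combo, k):
                    antecedent = frozenset(ante)
                    consequent = itemset - antecedent
                    # Confidence: support divided by the antecedent probability.
                    confidence = support / freq(antecedent)
                    if confidence >= min_confidence:
                        rules.append((antecedent, consequent, support, confidence))
    return rules
```

For example, with the transactions `{a, b}`, `{a, b, c}`, `{a}`, `{b, c}` and thresholds of 0.5 (support) and 0.9 (confidence), only the rule `{c} -> {b}` survives: its support is 2/4 and its confidence is (2/4) / (2/4) = 1.0.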
  3. The method according to claim 1, wherein, before the performing of feature extraction on the collected item text information using the pre-trained text information encoder and antecedent predictor to obtain the encoded vector representation of the item text information, the method further comprises:
    for each association rule, using a predetermined indication of whether the antecedent and the consequent appear in the item text information as label data;
    inputting the item text information carrying the label data into a first network model for training to construct the text information encoder, wherein the optimization objective of the text information encoder is to maximize the prediction of whether the consequent of the association rule appears in the item text information;
    inputting the encoded vector representation of the item text information output by the first network model, together with the predicted value of whether the consequent of the association rule appears in the item text information, into a second network model for training to construct the antecedent predictor, wherein the optimization objective of the antecedent predictor is to maximize the prediction of whether the antecedent of the association rule appears in the item text information.
  4. The method according to claim 3, wherein the text information encoder and the antecedent predictor undergo adversarial learning during training, so that information related to the antecedent of the association rule is removed from the item text information, while information relating the antecedent to the consequent of the association rule is retained.
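The claims do not state a loss function for the adversarial learning of claims 3 and 4. One common way to write such a two-player objective is sketched below; every symbol is an assumption introduced here for illustration: $E_\theta$ is the text information encoder, $c_\phi$ its consequent prediction head, $a_\psi$ the antecedent predictor, $x$ a text, $y$ and $t$ the consequent and antecedent labels, $\ell$ a cross-entropy loss, and $\lambda > 0$ a trade-off weight.

```latex
\begin{aligned}
z &= E_\theta(x), \qquad \hat{y} = c_\phi(z), \\
\mathcal{L}_{\mathrm{cons}}(\theta,\phi) &= \mathbb{E}_{(x,y)}\bigl[\ell(\hat{y},\, y)\bigr], \\
\mathcal{L}_{\mathrm{ante}}(\theta,\phi,\psi) &= \mathbb{E}_{(x,t)}\bigl[\ell\bigl(a_\psi(z, \hat{y}),\, t\bigr)\bigr], \\
\psi^{*} &= \arg\min_{\psi}\ \mathcal{L}_{\mathrm{ante}}(\theta,\phi,\psi), \\
(\theta^{*},\phi^{*}) &= \arg\min_{\theta,\phi}\ \mathcal{L}_{\mathrm{cons}}(\theta,\phi) - \lambda\, \mathcal{L}_{\mathrm{ante}}(\theta,\phi,\psi).
\end{aligned}
```

Under this reading, the antecedent predictor tries to recover the antecedent label from the encoding, while the encoder is rewarded for predicting the consequent and penalized when the antecedent remains recoverable. The patent may use a different formulation.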
  5. The method according to any one of claims 1-4, wherein the evaluating of each association rule according to the encoded vector representation of the item text information to obtain the evaluation result reflecting the causal relationship between the antecedent and the consequent of the association rule specifically comprises:
    for each association rule, calculating, according to the encoded vector representation of the item text information, an evaluation value reflecting the causal relationship between the antecedent and the consequent of the association rule;
    if the evaluation value is greater than a preset threshold, determining that a causal relationship exists between the antecedent and the consequent of the association rule.
  6. The method according to claim 5, wherein the item text information contains a plurality of texts, and the calculating, for each association rule and according to the encoded vector representation of the item text information, of the evaluation value reflecting the causal relationship between the antecedent and the consequent of the association rule specifically comprises:
    for each association rule, selecting, from the item text information, the texts in which the antecedent of the association rule appears as sample texts;
    traversing the encoded vector representations of the item text information, and querying for texts whose encoded vector representations satisfy a similarity condition with the encoded vector representation of each sample text, as the similar target texts of that sample text;
    for the similar target texts of each sample text, calculating the evaluation value reflecting the causal relationship between the antecedent and the consequent of the association rule.
  7. The method according to claim 6, wherein the calculating, for the similar target texts of each sample text, of the evaluation value reflecting the causal relationship between the antecedent and the consequent of the association rule specifically comprises:
    for the similar target texts of each sample text, calculating the probability that the consequent of the association rule appears in the similar target texts, to obtain, for each sample text, a probability value of satisfying the evaluation condition;
    obtaining the evaluation value reflecting the causal relationship between the antecedent and the consequent of the association rule by taking a weighted average of the probability values of the sample texts satisfying the evaluation condition.
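The evaluation procedure of claims 5 to 7 can be sketched as follows. The claims do not specify the similarity measure, the similarity condition, or the weights of the weighted average, so the cosine similarity, the `sim_threshold` cutoff, and the use of neighbour counts as weights are all assumptions. The names `cosine` and `causal_score` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def causal_score(vectors, has_antecedent, has_consequent, sim_threshold=0.9):
    """Weighted average, over sample texts, of the consequent rate among
    each sample text's similar target texts.

    vectors[i] is the encoded vector of text i; has_antecedent[i] and
    has_consequent[i] say whether the rule's antecedent/consequent appears.
    """
    probs, weights = [], []
    for i, vi in enumerate(vectors):
        if not has_antecedent[i]:
            continue  # sample texts are those in which the antecedent appears
        # Similar target texts: other texts passing the assumed similarity cutoff.
        neighbours = [j for j, vj in enumerate(vectors)
                      if j != i and cosine(vi, vj) >= sim_threshold]
        if not neighbours:
            continue
        # Probability that the consequent appears among the similar texts.
        p = sum(1 for j in neighbours if has_consequent[j]) / len(neighbours)
        probs.append(p)
        weights.append(len(neighbours))  # assumed weighting scheme
    if not weights:
        return 0.0
    return sum(p * w for p, w in zip(probs, weights)) / sum(weights)
```

A rule would then be judged causal when `causal_score(...)` exceeds a preset threshold, as in claim 5. The threshold value itself is not given in the source.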
  8. An apparatus for evaluating association rules based on machine learning, wherein the apparatus comprises:
    a mining unit, configured to mine association rules from an item set using an item co-occurrence condition, wherein each association rule comprises an antecedent and a consequent, and the item co-occurrence condition is that the items in the antecedent and the items in the consequent appear together;
    an extraction unit, configured to perform feature extraction on collected item text information using a pre-trained text information encoder and a pre-trained antecedent predictor to obtain an encoded vector representation of the item text information, wherein the text information encoder is used to predict whether the consequent of the association rule appears, and the antecedent predictor is used to predict whether the antecedent of the association rule appears;
    an evaluation unit, configured to, in response to an evaluation instruction for the association rules, evaluate each association rule according to the encoded vector representation of the item text information to obtain an evaluation result reflecting the causal relationship between the antecedent and the consequent of the association rule.
  9. A computer device, comprising a memory and a processor, the memory storing computer-readable instructions, wherein the processor, when executing the computer-readable instructions, implements the steps of a method for evaluating association rules based on machine learning, comprising:
    mining association rules from an item set using an item co-occurrence condition, wherein each association rule comprises an antecedent and a consequent, and the item co-occurrence condition is that the items in the antecedent and the items in the consequent appear together;
    performing feature extraction on collected item text information using a pre-trained text information encoder and a pre-trained antecedent predictor to obtain an encoded vector representation of the item text information, wherein the text information encoder is used to predict whether the consequent of the association rule appears, and the antecedent predictor is used to predict whether the antecedent of the association rule appears;
    in response to an evaluation instruction for the association rules, evaluating each association rule according to the encoded vector representation of the item text information to obtain an evaluation result reflecting the causal relationship between the antecedent and the consequent of the association rule.
  10. The computer device according to claim 9, wherein the item co-occurrence condition is that the antecedent and the consequent of an association rule appear together, and the mining of association rules from the item set using the item co-occurrence condition specifically comprises:
    fully permuting the frequent item subsets contained in the item set;
    generating candidate association rules for the frequent item subsets, and filtering the candidate association rules using preset parameter indicators to obtain candidate rules that satisfy a preset condition, wherein the parameter indicators comprise at least support and confidence, the support is the co-occurrence frequency of the antecedent and the consequent, and the confidence is the ratio of the support to the probability of the antecedent.
  11. The computer device according to claim 9, wherein, before the performing of feature extraction on the collected item text information using the pre-trained text information encoder and antecedent predictor to obtain the encoded vector representation of the item text information, the method further comprises:
    for each association rule, using a predetermined indication of whether the antecedent and the consequent appear in the item text information as label data;
    inputting the item text information carrying the label data into a first network model for training to construct the text information encoder, wherein the optimization objective of the text information encoder is to maximize the prediction of whether the consequent of the association rule appears in the item text information;
    inputting the encoded vector representation of the item text information output by the first network model, together with the predicted value of whether the consequent of the association rule appears in the item text information, into a second network model for training to construct the antecedent predictor, wherein the optimization objective of the antecedent predictor is to maximize the prediction of whether the antecedent of the association rule appears in the item text information.
  12. The computer device according to claim 11, wherein the text information encoder and the antecedent predictor undergo adversarial learning during training, so that information related to the antecedent of the association rule is removed from the item text information, while information relating the antecedent to the consequent of the association rule is retained.
  13. The computer device according to any one of claims 9-12, wherein the evaluating of each association rule according to the encoded vector representation of the item text information to obtain the evaluation result reflecting the causal relationship between the antecedent and the consequent of the association rule specifically comprises:
    for each association rule, calculating, according to the encoded vector representation of the item text information, an evaluation value reflecting the causal relationship between the antecedent and the consequent of the association rule;
    if the evaluation value is greater than a preset threshold, determining that a causal relationship exists between the antecedent and the consequent of the association rule.
  14. The computer device according to claim 13, wherein the item text information contains a plurality of texts, and the calculating, for each association rule and according to the encoded vector representation of the item text information, of the evaluation value reflecting the causal relationship between the antecedent and the consequent of the association rule specifically comprises:
    for each association rule, selecting, from the item text information, the texts in which the antecedent of the association rule appears as sample texts;
    traversing the encoded vector representations of the item text information, and querying for texts whose encoded vector representations satisfy a similarity condition with the encoded vector representation of each sample text, as the similar target texts of that sample text;
    for the similar target texts of each sample text, calculating the evaluation value reflecting the causal relationship between the antecedent and the consequent of the association rule.
  15. A readable storage medium having computer-readable instructions stored thereon, wherein the computer-readable instructions, when executed by a processor, implement the steps of a method for evaluating association rules based on machine learning, comprising:
    mining association rules from an item set using an item co-occurrence condition, wherein each association rule comprises an antecedent and a consequent, and the item co-occurrence condition is that the items in the antecedent and the items in the consequent appear together;
    performing feature extraction on collected item text information using a pre-trained text information encoder and a pre-trained antecedent predictor to obtain an encoded vector representation of the item text information, wherein the text information encoder is used to predict whether the consequent of the association rule appears, and the antecedent predictor is used to predict whether the antecedent of the association rule appears;
    in response to an evaluation instruction for the association rules, evaluating each association rule according to the encoded vector representation of the item text information to obtain an evaluation result reflecting the causal relationship between the antecedent and the consequent of the association rule.
  16. The readable storage medium according to claim 15, wherein the item co-occurrence condition is that the antecedent and the consequent of an association rule appear together, and the mining of association rules from the item set using the item co-occurrence condition specifically comprises:
    fully permuting the frequent item subsets contained in the item set;
    generating candidate association rules for the frequent item subsets, and filtering the candidate association rules using preset parameter indicators to obtain candidate rules that satisfy a preset condition, wherein the parameter indicators comprise at least support and confidence, the support is the co-occurrence frequency of the antecedent and the consequent, and the confidence is the ratio of the support to the probability of the antecedent.
  17. The readable storage medium according to claim 15, wherein, before the performing of feature extraction on the collected item text information using the pre-trained text information encoder and antecedent predictor to obtain the encoded vector representation of the item text information, the method further comprises:
    for each association rule, using a predetermined indication of whether the antecedent and the consequent appear in the item text information as label data;
    inputting the item text information carrying the label data into a first network model for training to construct the text information encoder, wherein the optimization objective of the text information encoder is to maximize the prediction of whether the consequent of the association rule appears in the item text information;
    inputting the encoded vector representation of the item text information output by the first network model, together with the predicted value of whether the consequent of the association rule appears in the item text information, into a second network model for training to construct the antecedent predictor, wherein the optimization objective of the antecedent predictor is to maximize the prediction of whether the antecedent of the association rule appears in the item text information.
  18. The readable storage medium according to claim 17, wherein the text information encoder and the antecedent predictor undergo adversarial learning during training, so that information related to the antecedent of the association rule is removed from the item text information, while information relating the antecedent to the consequent of the association rule is retained.
  19. The readable storage medium according to any one of claims 15-18, wherein the evaluating of each association rule according to the encoded vector representation of the item text information to obtain the evaluation result reflecting the causal relationship between the antecedent and the consequent of the association rule specifically comprises:
    for each association rule, calculating, according to the encoded vector representation of the item text information, an evaluation value reflecting the causal relationship between the antecedent and the consequent of the association rule;
    if the evaluation value is greater than a preset threshold, determining that a causal relationship exists between the antecedent and the consequent of the association rule.
  20. The readable storage medium according to claim 19, wherein the item text information contains a plurality of texts, and the calculating, for each association rule and according to the encoded vector representation of the item text information, of the evaluation value reflecting the causal relationship between the antecedent and the consequent of the association rule specifically comprises:
    for each association rule, selecting, from the item text information, the texts in which the antecedent of the association rule appears as sample texts;
    traversing the encoded vector representations of the item text information, and querying for texts whose encoded vector representations satisfy a similarity condition with the encoded vector representation of each sample text, as the similar target texts of that sample text;
    for the similar target texts of each sample text, calculating the evaluation value reflecting the causal relationship between the antecedent and the consequent of the association rule.
PCT/CN2022/071425 2021-08-25 2022-01-11 Association rule assessment method and apparatus based on machine learning WO2023024411A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110980623.XA CN113656558B (en) 2021-08-25 2021-08-25 Method and device for evaluating association rule based on machine learning
CN202110980623.X 2021-08-25

Publications (1)

Publication Number Publication Date
WO2023024411A1 true WO2023024411A1 (en) 2023-03-02

Family

ID=78492802

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/071425 WO2023024411A1 (en) 2021-08-25 2022-01-11 Association rule assessment method and apparatus based on machine learning

Country Status (2)

Country Link
CN (1) CN113656558B (en)
WO (1) WO2023024411A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116069674A (en) * 2023-04-06 2023-05-05 江苏国保信息系统测评中心有限公司 Security assessment method and system for grade assessment
CN116805039A (en) * 2023-08-21 2023-09-26 腾讯科技(深圳)有限公司 Feature screening method, device, computer equipment and data disturbance method
CN117933831A (en) * 2024-03-25 2024-04-26 山东山科数字经济研究院有限公司 Machine learning trainable project performance evaluation method and system

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113656558B (en) * 2021-08-25 2023-07-21 平安科技(深圳)有限公司 Method and device for evaluating association rule based on machine learning
CN114550885A (en) * 2021-12-28 2022-05-27 杭州火树科技有限公司 Main diagnosis and main operation matching detection method and system based on federal association rule mining

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105989095A (en) * 2015-02-12 2016-10-05 香港理工大学深圳研究院 Association rule significance test method and device capable of considering data uncertainty
EP3168791A1 (en) * 2015-11-10 2017-05-17 Fujitsu Limited Method and system for data validation in knowledge extraction apparatus
WO2018167826A1 (en) * 2017-03-13 2018-09-20 三菱電機株式会社 Causal relationship evaluation device, causal relationship evaluation system and causal relationship evaluation method
CN110297141A (en) * 2019-07-01 2019-10-01 武汉大学 Fault Locating Method and system based on multilayer assessment models
CN113656558A (en) * 2021-08-25 2021-11-16 平安科技(深圳)有限公司 Method and device for evaluating association rule based on machine learning

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI634441B (en) * 2016-11-29 2018-09-01 財團法人工業技術研究院 Method to enhance association rules, apparatus using the same and computer readable medium
CN108364106A (en) * 2018-02-27 2018-08-03 平安科技(深圳)有限公司 A kind of expense report Risk Forecast Method, device, terminal device and storage medium
US11556838B2 (en) * 2019-01-09 2023-01-17 Sap Se Efficient data relationship mining using machine learning

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116069674A (en) * 2023-04-06 2023-05-05 江苏国保信息系统测评中心有限公司 Security assessment method and system for grade assessment
CN116069674B (en) * 2023-04-06 2023-06-20 江苏国保信息系统测评中心有限公司 Security assessment method and system for grade assessment
CN116805039A (en) * 2023-08-21 2023-09-26 腾讯科技(深圳)有限公司 Feature screening method, device, computer equipment and data disturbance method
CN116805039B (en) * 2023-08-21 2023-12-05 腾讯科技(深圳)有限公司 Feature screening method, device, computer equipment and data disturbance method
CN117933831A (en) * 2024-03-25 2024-04-26 山东山科数字经济研究院有限公司 Machine learning trainable project performance evaluation method and system
CN117933831B (en) * 2024-03-25 2024-06-11 山东山科数字经济研究院有限公司 Machine learning trainable project performance evaluation method and system

Also Published As

Publication number Publication date
CN113656558A (en) 2021-11-16
CN113656558B (en) 2023-07-21

Similar Documents

Publication Publication Date Title
WO2023024411A1 (en) Association rule assessment method and apparatus based on machine learning
Pathan et al. Analyzing the impact of feature selection on the accuracy of heart disease prediction
Kamaleswaran et al. Applying artificial intelligence to identify physiomarkers predicting severe sepsis in the PICU
Biswas et al. An XAI based autism detection: the context behind the detection
Corradi et al. Prediction of incident delirium using a random forest classifier
Krittanawong et al. Machine learning and deep learning to predict mortality in patients with spontaneous coronary artery dissection
US11599800B2 (en) Systems and methods for enhanced user specific predictions using machine learning techniques
Miller The medical AI insurgency: what physicians must know about data to practice with intelligent machines
Tharsanee et al. Deep convolutional neural network–based image classification for COVID-19 diagnosis
CN109635044A (en) Method, device, and equipment for detecting abnormal hospitalization data, and readable storage medium
WO2013192593A2 (en) Clinical predictive analytics system
Keltch et al. Comparison of AI techniques for prediction of liver fibrosis in hepatitis patients
Hong et al. Prediction of cardiac arrest in the emergency department based on machine learning and sequential characteristics: model development and retrospective clinical validation study
US20210192365A1 (en) Computer device, system, readable storage medium and medical data analysis method
Kwon et al. Deep learning algorithm to predict need for critical care in pediatric emergency departments
CN112465231A (en) Method, apparatus and readable storage medium for predicting regional population health status
Shrestha et al. Supervised machine learning for early predicting the sepsis patient: modified mean imputation and modified chi-square feature selection
Alturki et al. Predictors of readmissions and length of stay for diabetes related patients
AlZubi RETRACTED ARTICLE: Big data analytic diabetics using map reduce and classification techniques
Archana et al. Automated cardioailment identification and prevention by hybrid machine learning models
WO2022057057A1 (en) Method for detecting medicare fraud, and system and storage medium
CN116884612A (en) Intelligent analysis method, device, equipment and storage medium for disease risk level
Samadi et al. A hybrid modeling framework for generalizable and interpretable predictions of ICU mortality across multiple hospitals
Naemi et al. Prediction of length of stay using vital signs at the admission time in emergency departments
Naresh Patel et al. Disease categorization with clinical data using optimized bat algorithm and fuzzy value

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application. Ref document number: 22859772; Country of ref document: EP; Kind code of ref document: A1
NENP Non-entry into the national phase. Ref country code: DE