CN114756656A

CN114756656A - Mining method of association rules for hydraulic engineering safety hazard description based on improved Apriori algorithm

Info

Publication number: CN114756656A
Application number: CN202210471547.4A
Authority: CN
Inventors: 余晨晨; 陈卓越; 李子轩
Original assignee: Hohai University HHU
Current assignee: Hohai University HHU
Priority date: 2022-04-29
Filing date: 2022-04-29
Publication date: 2022-07-15

Abstract

The invention discloses a method for mining association rules from descriptions of safety hazards in water conservancy projects, based on an improved Apriori algorithm. First, jieba word segmentation is used to preprocess a large number of hazard description texts of different categories, and the TextRank algorithm is used to extract keywords and filter the segmentation results by part of speech, yielding the keywords of each hazard description. Next, the keyword itemsets are converted into a Boolean matrix by one-hot encoding, the improved Apriori algorithm is applied to mine the frequent itemsets among the various hazards, and the association rules are analyzed. Finally, with confidence as the evaluation measure, strong associations between hazard descriptions are mined and hazard trends are predicted. The invention preprocesses large volumes of unstructured hazard descriptions, improves the computational efficiency of the traditional Apriori algorithm, allows hazard descriptions to be analyzed promptly, and supports the prediction, investigation and management of hazards in water conservancy projects.

Description

Mining method of association rules for hydraulic engineering safety hazard description based on improved Apriori algorithm

Technical field

The invention relates to the technical field of safety hazard investigation in water conservancy projects, and in particular to a method for mining association rules from descriptions of safety hazards in water conservancy projects based on an improved Apriori algorithm.

Background

Water conservancy projects are numerous and widely distributed and carry many safety hazards. These hazards are often correlated, so mining the associations between the various types of hazard is of great value. Apriori, the most classic association rule mining algorithm, has already been applied in the field of hydraulic engineering. However, the traditional Apriori algorithm is inefficient and slow on large data sets, and it has difficulty analyzing unstructured text such as safety hazard descriptions accurately.

To address these problems, Huang Liming et al., in "Data Mining and Analysis of Water Conservancy Project Construction Quality and Safety Supervision Based on Association Rules", divided the problems found in projects into 50 categories according to their descriptions and used the Apriori algorithm to mine association rules between project attributes and problem descriptions. That method preprocesses the raw hazard data by merging and classifying it, which gives the hazard descriptions a structured form. However, the preprocessing is manual: it takes a great deal of time on large data sets and is somewhat subjective, which affects the accuracy of the results.

Chen Shu et al., in "Association Rule Mining of Hidden Safety Hazards in Hydropower Engineering Construction", proposed an association rule optimization algorithm based on phrase extraction. Key phrases are extracted from the hazard text, and a phrase measure is used as an evaluation index to refine the support in the Apriori algorithm, so as to mine association rules between hazard attributes. Phrase extraction discovers new words in the text and improves segmentation quality, and the redefined support makes the rule mining more principled. However, building and maintaining the required rule base is labor-intensive; the new words discovered deviate somewhat from the proper nouns of the water conservancy field; some segmentation results have no clear meaning, such as "易造成" ("easily causes") and "未设置" ("not set"); and the iteration of the Apriori algorithm itself is not improved, so computational efficiency remains low.

Therefore, an efficient method for mining association rules from descriptions of safety hazards in water conservancy projects is needed. Compared with existing methods, the method of the present invention, based on an improved Apriori algorithm, has two distinguishing features: (1) A custom lexicon for the water conservancy field is built to segment the hazard text; the TextRank algorithm is used for part-of-speech filtering and keyword extraction; and the segmentation results are converted into a Boolean matrix by one-hot encoding, so that the preprocessed hazard text retains only the important keywords. (2) The Apriori algorithm is optimized: before the level-by-level iteration, one pass of frequent itemset screening is performed on the Boolean matrix; during the iteration, pruning is optimized and the data set is dynamically reduced, which markedly improves the efficiency of association rule mining.

Summary of the invention

The purpose of the invention is to provide a method for mining association rules from descriptions of safety hazards in water conservancy projects based on an improved Apriori algorithm. The method addresses two problems in the safe construction of water conservancy projects: hazard descriptions are hard to express in structured form, and association rule mining algorithms are computationally inefficient. It mines the association rules between safety hazards in construction in a timely manner and supports the prediction, investigation and management of safety hazards in water conservancy projects.

The method of the invention for mining association rules from descriptions of safety hazards in water conservancy projects, based on an improved Apriori algorithm, comprises the following steps:

Step S1: build a custom lexicon for the water conservancy field, segment the hazard data set with jieba, remove stop words, filter by part of speech with the TextRank algorithm, and use the extracted keywords as the input data set;

Step S2: convert the input data set obtained in step S1 into a Boolean matrix using one-hot encoding;

Step S3: compute the support of each itemset with the improved Apriori algorithm and select the itemsets whose support is not less than the minimum support; these are the frequent itemsets;

Step S4: compute the confidence of the frequent itemsets obtained and select the association rules whose confidence is not less than the minimum confidence; these are the strong association rules A ⇒ B, meaning that when itemset A appears, itemset B is very likely to appear as well.

Preferably, step S1 specifically comprises:

Step S11: collect terms from the water conservancy field, build a custom lexicon for the field, segment the hazard data set with jieba, and tag each word with its part of speech;

Step S12: load the stop-word list and remove from the segmentation results the words that usually carry no clear meaning of their own, such as modal particles, adverbs, prepositions and conjunctions;

Step S13: rank the segmented words by importance with the TextRank algorithm, then extract the keywords whose part of speech is noun, verb, gerund or place name as the final input data set.

Steps S11 to S13 preprocess the hazard description texts in the data set, which are otherwise hard to express in structured form.
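Steps S12 and S13 can be sketched in plain Python as follows. This is only an illustration under stated assumptions, not the patented implementation: the input is assumed to be already segmented and tagged (the kind of word and part-of-speech pairs a segmenter such as jieba produces), the function name `textrank_keywords`, the window size, the damping factor and the sample sentence are all hypothetical, and the ranking is a minimal co-occurrence version of TextRank.

```python
from collections import defaultdict

# Parts of speech kept in step S13: noun, verb, gerund, place name.
ALLOWED_POS = {"n", "v", "vn", "ns"}


def textrank_keywords(tagged, stop_words, window=5, damping=0.85, iters=50):
    """Rank the words of one segmented, tagged sentence; keep allowed POS only.

    tagged: list of (word, pos) pairs (assumed output of a segmenter).
    """
    # Step S12: drop stop words.
    tokens = [(w, p) for w, p in tagged if w not in stop_words]
    words = [w for w, _ in tokens]
    pos_of = dict(tokens)

    # Undirected co-occurrence graph: words inside the same sliding
    # window of `window` tokens are linked.
    weight = defaultdict(float)
    for i, w in enumerate(words):
        for j in range(i + 1, min(i + window, len(words))):
            if words[j] != w:
                weight[(w, words[j])] += 1.0
                weight[(words[j], w)] += 1.0
    neighbours = defaultdict(dict)
    for (a, b), wt in weight.items():
        neighbours[a][b] = wt

    # PageRank-style score propagation over the graph.
    score = {w: 1.0 for w in set(words)}
    for _ in range(iters):
        new_score = {}
        for w in score:
            incoming = sum(
                wt / sum(neighbours[u].values()) * score[u]
                for u, wt in neighbours[w].items()
            )
            new_score[w] = (1 - damping) + damping * incoming
        score = new_score

    # Step S13: sort by importance, then filter by part of speech.
    ranked = sorted(score, key=lambda w: (-score[w], w))
    return [w for w in ranked if pos_of[w] in ALLOWED_POS]


# Hypothetical tagged sentence; "的" is treated as a stop word and
# "未" (adverb, tag "d") is removed by the part-of-speech filter.
sample = [("脚手架", "n"), ("密目网", "n"), ("脱落", "v"),
          ("的", "uj"), ("未", "d"), ("设置", "v")]
keywords = textrank_keywords(sample, stop_words={"的"})
```

Ranking before filtering follows the order given in step S13; a different implementation could equally filter by part of speech while building the graph.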

Preferably, step S2 specifically comprises:

Step S21: first convert the input data set D = {D₁, D₂, …, D_n} into a matrix T in the form of a DataFrame. The matrix T has the form:

$$
T = \begin{bmatrix}
T_{11} & T_{12} & \cdots & T_{1m} \\
T_{21} & T_{22} & \cdots & T_{2m} \\
\vdots & \vdots & \ddots & \vdots \\
T_{n1} & T_{n2} & \cdots & T_{nm}
\end{bmatrix}
$$

where n is the number of hazard description records, m is the largest number of keywords in any of the n records, and T_ij is the j-th keyword of the i-th record, or Null if it is empty; D_i is the i-th record in the data set, i.e. the set {T_i1, T_i2, …, T_im};

Step S22: one-hot encode the data frame matrix T to convert it into a Boolean matrix M. Let I = {I₁, I₂, …, I_t} be the set of all distinct items in the data set D. The Boolean matrix M has the form:

$$
M = \begin{bmatrix}
M_{11} & M_{12} & \cdots & M_{1t} \\
M_{21} & M_{22} & \cdots & M_{2t} \\
\vdots & \vdots & \ddots & \vdots \\
M_{n1} & M_{n2} & \cdots & M_{nt}
\end{bmatrix}
$$

where t is the number of distinct keywords in the data set D and M_ij is the Boolean value of the i-th record for the j-th keyword: if D_i contains I_j, i.e. the i-th record contains the j-th keyword, then M_ij is 1 (True); otherwise it is 0 (False).

Step S2 thus converts the Chinese text preprocessed in step S1 into a Boolean matrix that is easy to process.
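The conversion in step S22 can be illustrated with a short, library-free sketch. The function name `to_boolean_matrix` and the three sample records are hypothetical; the source works with a DataFrame, whereas this sketch uses plain lists to stay self-contained.

```python
def to_boolean_matrix(dataset):
    """One-hot encode keyword records.

    Returns (items, matrix) where items is the sorted list of distinct
    keywords (the set I) and matrix[i][j] is True exactly when record i
    contains items[j] (the entry M_ij).
    """
    items = sorted({kw for record in dataset for kw in record})
    column = {kw: j for j, kw in enumerate(items)}
    matrix = [[False] * len(items) for _ in dataset]
    for i, record in enumerate(dataset):
        for kw in record:
            matrix[i][column[kw]] = True
    return items, matrix


# Hypothetical keyword records (English stand-ins for segmented keywords).
records = [
    ["scaffold", "net", "loose"],
    ["cable", "insulation", "aging"],
    ["scaffold", "cable"],
]
items, matrix = to_boolean_matrix(records)
```

Because records have different lengths, the Null padding of matrix T is not needed here: each record is simply a list of its own keywords.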

Preferably, step S3 specifically comprises:

Step S31: before iterating, traverse the Boolean matrix obtained in step S22 once, summing each column to obtain its count, so as to quickly mine the frequent 1-itemsets and delete the non-frequent itemsets;

Step S32: during iteration, the set of frequent (k-1)-itemsets L_{k-1} is joined with itself to form the set of candidate k-itemsets, denoted C_k. Let I₁ and I₂ be itemsets in L_{k-1}; if they have k-2 items in common, I₁ and I₂ can be joined, and the resulting itemset is a member of the candidate set C_k;

Step S33: prune the candidate set C_k, which is a superset of L_k. For each itemset T in L_{k-1}, traverse all candidate itemsets in C_k in turn and count each candidate on the basis of the Boolean matrix. When the traversal is complete, trim C_k according to the counts: if a candidate's count is less than k, delete the candidate and also delete from the data set all itemsets that contain it; otherwise keep it and proceed to the next step;

Step S34: compute the support of each candidate itemset that survives pruning. The support formula is:

$$
\mathrm{support}(A \Rightarrow B) = P(A \cup B)
$$

where A and B are itemsets composed of items in I, and P(A ∪ B) is the probability that a record contains every item of both A and B;

If the support of a candidate itemset is not less than the minimum support, it is a frequent k-itemset and is added to L_k; otherwise the itemset is deleted, together with all itemsets in the data set that contain it. After screening, return to step S32, until no frequent k-itemset can be found.

With steps S2 and S31, the support of each keyword does not need to be computed separately before the iteration: each column of the Boolean matrix is simply summed, which quickly yields the frequent 1-itemsets and removes the non-frequent ones. Screening a large data set quickly on the Boolean matrix greatly reduces its size and improves efficiency in the early iterations, when the data set is largest. During iteration, L_{k-1} is joined to form the candidate set C_k; for each itemset T in L_{k-1}, all candidates in C_k are traversed in turn and each is counted on the basis of the Boolean matrix. When the traversal is complete, C_k is trimmed by the counts: a candidate whose count is less than k is deleted, along with all itemsets in the data set that contain it. Optimizing the pruning stage through step S32 reduces its time complexity from O(n²) to O(n) and improves the overall efficiency of the iteration.

In the traditional Apriori algorithm, L_{k-1} must be searched again for every (k-1)-subset of every candidate itemset, which increases run time and reduces efficiency. The improved algorithm instead mines frequent itemsets quickly before iterating, optimizes pruning during the iteration, and dynamically shrinks the data set, which significantly improves computational efficiency.
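Steps S31 to S34 can be sketched as below. This is one possible reading of the text, not a definitive implementation, and it rests on two labeled assumptions: the pruning count in step S33 is taken to be the number of itemsets in L_{k-1} contained in a candidate (a valid candidate k-itemset has exactly k of them), and the dynamic reduction of the data set is taken to mean dropping records that contain no frequent k-itemset. The function name `improved_apriori` and the sample data are hypothetical.

```python
from itertools import combinations


def improved_apriori(matrix, items, min_support):
    """Return {frozenset(itemset): support} for all frequent itemsets."""
    n = len(matrix)
    min_count = min_support * n

    # S31: a single pass of column sums gives the frequent 1-itemsets.
    col_sums = [sum(row[j] for row in matrix) for j in range(len(items))]
    keep = [j for j, c in enumerate(col_sums) if c >= min_count]
    frequent = {frozenset([items[j]]): col_sums[j] / n for j in keep}

    # Non-frequent columns are dropped; each record becomes a set of items.
    rows = [frozenset(items[j] for j in keep if row[j]) for row in matrix]

    level = list(frequent)  # L_1
    k = 2
    while level:
        # S32: join two (k-1)-itemsets that share k-2 items.
        candidates = {a | b for a, b in combinations(level, 2)
                      if len(a | b) == k}

        # S33 (assumed reading): one pass over L_{k-1}, counting for each
        # candidate how many members of L_{k-1} it contains.
        hits = dict.fromkeys(candidates, 0)
        for t in level:
            for c in candidates:
                if t <= c:
                    hits[c] += 1
        candidates = [c for c in hits if hits[c] >= k]

        # S34: support counting against the (reduced) records.
        next_level = []
        for c in candidates:
            count = sum(1 for r in rows if c <= r)
            if count >= min_count:
                frequent[c] = count / n
                next_level.append(c)

        # Assumed reduction: a record with no frequent k-itemset cannot
        # contain a frequent (k+1)-itemset, so it is dropped.
        rows = [r for r in rows if any(c <= r for c in next_level)]
        level = next_level
        k += 1
    return frequent


# Hypothetical 4-record example over items a, b, c, d.
example_items = ["a", "b", "c", "d"]
example_matrix = [
    [True, True, True, False],
    [True, True, False, False],
    [True, True, False, True],
    [False, False, True, False],
]
result = improved_apriori(example_matrix, example_items, min_support=0.5)
```

In this example, item d is removed by the column sums in S31, and {a, c} and {b, c} fail the support check, so the only frequent 2-itemset is {a, b}. Support is always divided by the original record count n, even after records have been dropped.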

Compared with the prior art, the beneficial effects of the invention are:

(1) By building a custom lexicon for the water conservancy field, using the TextRank algorithm for keyword extraction and part-of-speech filtering, and segmenting text automatically by machine, the invention preprocesses hazard description texts that are hard to express in structured form. This overcomes the high labor cost of the prior art and its inability to recognize the specialist terms of the water conservancy field.

(2) Before the iteration, the invention screens a large data set quickly on the basis of the Boolean matrix, which greatly reduces its size and improves efficiency in the early iterations, when the data set is largest. During the iteration, optimizing the pruning stage reduces its time complexity from O(n²) to O(n) and improves the overall efficiency of the iteration.

(3) By improving the algorithm both before and during the iteration, the invention greatly increases the computational efficiency of the Apriori algorithm and overcomes the low efficiency of the prior art on very large data sets.

Description of drawings

Fig. 1 is the overall workflow diagram of the invention.

Detailed description

The technical solution of the invention is described in detail below with reference to an embodiment, but the scope of protection of the invention is not limited to that embodiment.

Example 1

As shown in Fig. 1, the method of the invention specifically comprises the following steps:

Step S1: build a custom lexicon for the water conservancy field, segment the hazard data set with jieba, remove stop words, filter by part of speech with the TextRank algorithm, and use the extracted keywords as the input data set. The specific steps are:

Step S11: collect 1917 hazard description sentences from water conservancy projects as the sample data set (part of the sample data is shown in Table 1); collect 7480 terms from the water conservancy field to build the custom lexicon; and use jieba to segment the sample data and tag parts of speech.

Example lexicon entries include: 暗灯 (concealed lamp), 按地电阻值 (ground resistance value), 暗敷 (concealed laying), 暗管 (concealed pipe), 暗涵 (covered culvert), 暗合页 (concealed hinge), 暗间 (inner room), 暗梁 (concealed beam), 暗炉片 (concealed radiator), 安全出口 (emergency exit), 安全出口灯 (emergency exit light), 安全出口灯出线口 (emergency exit light outlet), 安全岛 (safety island), 安全阀门调试记录 (safety valve commissioning record), 安全防范设施 (security facilities), 安全管理 (safety management), 安全接地 (safety grounding) and 安全警报系统 (safety alarm system).

Table 1. Partial sample data

Step S12: load the stop-word list and remove from the segmentation results the words that usually carry no clear meaning of their own, such as modal particles, adverbs, prepositions and conjunctions, for example "@", "阿" (an interjection), "而且" ("moreover") and "与" ("and").

Step S13: rank the segmented words by importance with the TextRank algorithm and extract the keywords whose part of speech is noun (n), verb (v), gerund (vn) or place name (ns).

Steps S11 to S13 yield the final segmentation results shown in Table 2, which serve as the input data set.

Table 2. Word segmentation results

Step S2: encode the input data set obtained in step S1 as a Boolean matrix; that is, for each record, mark an entry 1 (True) if the record contains the corresponding item and 0 (False) otherwise. The specific steps are:

Step S21: first convert the input data set into a 1917 × 20 matrix in the form of a DataFrame. The 1917 rows correspond to the hazard description records, and the 20 columns correspond to the largest number of keywords in any of the 1917 records. The word in row i, column j is the j-th keyword of the i-th record, or Null if it is empty. The DataFrame is shown in Table 3.

Table 3. DataFrame

Step S22: one-hot encode the DataFrame to convert it into a 1917 × 1139 Boolean matrix. The 1917 rows correspond to the hazard description records, and the 1139 columns correspond to the distinct keywords in the data set. If the i-th record contains the j-th keyword, the value in row i, column j is 1 (True); otherwise it is 0 (False). The encoded Boolean matrix is shown in Table 4.

Table 4. Boolean matrix

Step S3: compute the support of each itemset with the improved Apriori algorithm and select the itemsets whose support is not less than the minimum support; these are the frequent itemsets. Because the data set contains as many as 1917 records, the minimum support threshold in this embodiment is set to 0.01. Frequent itemsets of length 3 or more are then selected; the results are shown in Table 5.

Table 5. Frequent itemsets

The frequent itemsets show that, during the construction of the water conservancy project corresponding to this data set, safety hazards often arise at locations such as the maintenance workshop of the mixing station and the power distribution room of the sand and gravel processing plant, and that hazards such as scaffolding dense-mesh nets coming loose, aging cable insulation, and wearing-related issues among construction site personnel occur relatively frequently.
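As a quick check of what the threshold in step S3 means in absolute terms, the arithmetic below converts the minimum support of 0.01 into a record count for the 1917-record data set. The variable names are illustrative only.

```python
import math

n_records = 1917      # hazard description records in this embodiment
min_support = 0.01    # minimum support threshold from step S3

# support = count / n_records >= min_support, so the count must reach
# the next whole number at or above 19.17.
min_count = math.ceil(min_support * n_records)
print(min_count)  # 20
```

In other words, an itemset must appear in at least 20 of the 1917 records to be kept as frequent.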

Step S4: compute the confidence of the frequent itemsets obtained and select the association rules whose confidence is not less than the minimum confidence; these are the strong association rules. The association_rules() function is used to mine association rules from the frequent itemsets obtained in step S3, with confidence as the evaluation measure and a minimum confidence of 0.8. The resulting strong association rules are shown in Table 6. A rule with a confidence of 1 is an absolute association.

Table 6. Strong association rules

The strong association rules show that, during the construction of the water conservancy project corresponding to this data set, scaffolding dense-mesh nets frequently come loose, wearing-related problems frequently occur among personnel at the construction site, and the maintenance workshop of the mixing station is a location where safety hazards occur relatively often.
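The confidence filter of step S4 can be sketched without any library as follows. The source does not state the confidence formula, so this sketch assumes the standard definition confidence(A ⇒ B) = support(A ∪ B) / support(A). The function name `strong_rules` and the support values are hypothetical and are not taken from Table 6.

```python
from itertools import combinations


def strong_rules(frequent, min_confidence):
    """Derive rules A => B from a {frozenset: support} mapping.

    Returns (A, B, confidence) triples with confidence >= min_confidence.
    Assumes every subset of a frequent itemset is also in `frequent`.
    """
    rules = []
    for itemset, support in frequent.items():
        if len(itemset) < 2:
            continue
        # Every non-empty proper subset can serve as the antecedent A.
        for size in range(1, len(itemset)):
            for antecedent in combinations(sorted(itemset), size):
                a = frozenset(antecedent)
                confidence = support / frequent[a]
                if confidence >= min_confidence:
                    rules.append((a, itemset - a, confidence))
    return rules


# Hypothetical supports chosen to be exact in binary floating point.
supports = {
    frozenset({"A"}): 0.4375,
    frozenset({"B"}): 0.75,
    frozenset({"A", "B"}): 0.375,
}
rules = strong_rules(supports, min_confidence=0.8)
```

Here A ⇒ B has confidence 0.375 / 0.4375, about 0.857, and is kept, while B ⇒ A has confidence 0.5 and is discarded, which shows that a rule and its reverse are evaluated separately.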

In this embodiment, the beneficial effects of the invention are shown as follows. (1) For the 1917 records used in this embodiment, segmentation is performed automatically by program code, with the results shown in Table 2. By contrast, the phrase extraction technique of the prior art described above does not use a custom lexicon for the water conservancy field and does not filter by part of speech, so it produces words with no clear meaning or part of speech, such as "易造成" ("easily causes"), "未要求" ("not required") and "未设置" ("not set"), which lowers segmentation quality and affects the accuracy of association rule mining. (2) In a code test run in PyCharm on the 1917 records used in this embodiment, the traditional Apriori algorithm took 0.1587965 seconds, whereas the improved Apriori algorithm proposed in the invention took only 0.0328939 seconds, roughly 4.8 times faster.

Claims

1. The hydraulic engineering potential safety hazard description association rule mining method based on the improved Apriori algorithm is characterized by comprising the following steps of:

step S1, building a self-defined word bank in the hydraulic engineering field, carrying out word segmentation on the hidden danger data set by adopting jieba word segmentation, removing stop words, screening out the part of speech by a TextRank algorithm, and taking the extracted keywords as an input data set;

step S2, performing Boolean matrix conversion on the input data set obtained in the step S1 based on the one-hot code;

step S3, calculating the support degree of each item set by improving Apriori algorithm, and screening out frequent item sets not less than the minimum support degree;

and step S4, performing confidence calculation on the obtained frequent item set, and screening out a strong association rule not less than the minimum confidence.

2. The hydraulic engineering potential safety hazard description association rule mining method based on the improved Apriori algorithm as claimed in claim 1, wherein the step S1 specifically includes:

step S11, collecting entries in the hydraulic engineering field, constructing a self-defined word bank in the hydraulic engineering field, adopting jieba word segmentation, segmenting the hidden danger data set, and labeling the part of speech;

step S12, loading a stop word bank, and eliminating words which usually have no definite meaning per se, such as mood auxiliary words, adverbs, prepositions, conjunctions and the like in the word segmentation result;

and step S13, ranking the segmented words by importance by adopting a TextRank algorithm, and extracting the keywords whose parts of speech are nouns, verbs, gerunds and place names from the segmented words to serve as a final input data set.

3. The hydraulic engineering potential safety hazard description association rule mining method based on the improved Apriori algorithm as claimed in claim 2, wherein the step S2 specifically includes:

step S21, first converting the input data set D = {D₁, D₂, …, D_n} into a matrix T in the form of a DataFrame, the matrix T having the form:

$$
T = \begin{bmatrix}
T_{11} & T_{12} & \cdots & T_{1m} \\
T_{21} & T_{22} & \cdots & T_{2m} \\
\vdots & \vdots & \ddots & \vdots \\
T_{n1} & T_{n2} & \cdots & T_{nm}
\end{bmatrix}
$$

wherein n is the number of hazard description records, m is the largest number of keywords in any of the n records, and T_ij is the j-th keyword of the i-th record, or Null if it is empty; D_i is the i-th record in the data set, i.e. the set {T_i1, T_i2, …, T_im};

step S22, one-hot encoding the data frame matrix T and converting it into a Boolean matrix M; letting I = {I₁, I₂, …, I_t} be the set of all distinct items in the data set D, the Boolean matrix M having the form:

$$
M = \begin{bmatrix}
M_{11} & M_{12} & \cdots & M_{1t} \\
M_{21} & M_{22} & \cdots & M_{2t} \\
\vdots & \vdots & \ddots & \vdots \\
M_{n1} & M_{n2} & \cdots & M_{nt}
\end{bmatrix}
$$

wherein t is the number of distinct keywords in the data set D, and M_ij is the Boolean value of the i-th record for the j-th keyword: if D_i contains I_j, i.e. the i-th record contains the j-th keyword, then M_ij is 1 (True), and otherwise it is 0 (False).

4. The hydraulic engineering potential safety hazard description association rule mining method based on the improved Apriori algorithm as claimed in claim 3, wherein the step S3 specifically comprises:

step S31, before iteration, performing one-time traversal on the Boolean matrix obtained by conversion in the step S22, counting and summing each column in the Boolean matrix respectively, quickly mining a 1-dimensional frequent item set, and deleting a non-frequent item set;

step S32, during iteration, joining the set of frequent (k-1)-itemsets L_{k-1} to form the set of candidate k-itemsets, denoted C_k; letting I₁ and I₂ be itemsets in L_{k-1}, if they have k-2 items in common, I₁ and I₂ can be joined to produce a resulting itemset, which is a member of the candidate set C_k;

step S33, pruning the candidate set C_k, C_k being a superset of L_k; for each itemset T in L_{k-1}, traversing all candidate itemsets in C_k in turn and counting each candidate itemset on the basis of the Boolean matrix; after the traversal is complete, trimming each candidate itemset in C_k according to the counts: if the count is less than k, deleting the candidate itemset and also deleting from the data set all itemsets containing it; otherwise keeping it and proceeding to the next step;

step S34, calculating the support of each candidate itemset that survives pruning, the support formula being:

$$
\mathrm{support}(A \Rightarrow B) = P(A \cup B)
$$

wherein A and B are itemsets composed of items in I, and P(A ∪ B) is the probability that a record contains every item of both A and B;

if the support of a candidate itemset is not less than the minimum support, the candidate itemset is a frequent k-itemset and is added to L_k; otherwise the itemset is deleted, together with all itemsets in the data set that contain it. After screening, the method returns to step S32, until no frequent k-itemset can be found.