CN114579720A - Hydropower project progress intelligent assessment method based on text mining

Info

Publication number: CN114579720A
Application number: CN202210147742.1A
Authority: CN
Inventors: 沈扬; 李明超; 李文伟; 吕沅庚; 田丹; 刘奉霞; 张栋梁
Original assignee: Tianjin University; China Three Gorges Corp
Current assignee: Tianjin University; China Three Gorges Corp
Priority date: 2022-02-17
Filing date: 2022-02-17
Publication date: 2022-06-03

Abstract

The invention discloses an intelligent assessment method for hydropower project progress based on text mining, comprising the following steps: S1: collecting construction progress management texts, extracting the text content related to progress management from the construction data, and consolidating it into a data file; S2: preprocessing the document set, segmenting sentences into words, and removing stop words and non-text characters; S3: using the BTM topic model as the analysis method to extract words that share a topic, and sorting them to form the main and auxiliary processes of the project; S4: searching the text for quantitative values of the progress evaluation indexes related to those main and auxiliary processes; S5: developing a construction progress evaluation program that analyzes process construction progress with the earned value method. The invention improves construction management efficiency and the utilization of unstructured construction management texts, realizes intelligent management of construction texts, and promotes the intelligent development of hydropower engineering construction management.

Description

Hydropower project progress intelligent assessment method based on text mining

Technical Field

The invention relates to the technical field of construction safety management for large-scale infrastructure projects such as hydraulic engineering and building engineering, and in particular to an intelligent assessment method for hydropower project progress based on text mining.

Background

During the construction of an engineering project, the constructor coordinates construction speed, quality and cost through progress management at different stages, so that the work is completed within the specified construction period. Progress management plays an important role in allocating resources reasonably: when progress is out of control, resources are wasted, construction quality cannot be guaranteed, and the other control targets of the project are affected. For hydropower engineering, progress management has to take a wide range of complex engineering information into account, formulate a detailed construction plan and control the construction process as a whole. Hydropower construction is characterized by long construction periods, harsh construction environments and high organizational difficulty, so it places very high demands on progress management. The construction process produces a large amount of text recording construction details. These texts are mainly unstructured or semi-structured and are difficult to read and analyze; when Gantt chart or earned value analyses are made from them, a large number of texts often have to be browsed manually, which is time-consuming and labor-intensive. Hydropower project management is now entering a stage of intelligent management, in which fine-grained management usually requires that large volumes of construction text be fed back to managers as efficient, visible information, so an intelligent method for automatically extracting and analyzing progress data is urgently needed.

The engineering progress management methods in common use today are mainly the Gantt chart method, the network diagram method and the earned value method. The Gantt chart method originated early and is widely applied. It displays activities through charts and tables and is simple and easy to read, but it has limitations: only the three constraints of project management (time, cost and scope) can be shown in the chart, and other constraints have no visual expression. Faced with the large and complicated construction links of hydropower engineering, a Gantt chart can hardly reflect the mutual constraints among the individual tasks, and the chart itself becomes difficult to read. The network diagram method supports understanding and judging project progress from an overall perspective, but for large projects with complex procedures, such as hydropower projects, network diagrams are difficult to analyze and the development trend of project progress is hard to evaluate. The earned value method makes up for these shortcomings of the network diagram in progress evaluation: it can measure progress at any moment of the project and thus evaluate the current construction progress.

Most existing progress data consists of unstructured or semi-structured records. Because of the sheer volume of content, it is difficult to read and analyze, and a large number of texts have to be browsed manually. When progress texts are analyzed with the earned value method, most of the information is retrieved from the database by hand, which wastes time and labor. It is therefore necessary to design an intelligent method that extracts and analyzes the data automatically.

Disclosure of Invention

The invention aims to overcome the above defects and provides an intelligent assessment method for hydropower project progress based on text mining. The method brings natural language processing and computer program development into hydropower project progress management and data mining, extracts and classifies the large volume of construction keywords in the texts, improves construction management efficiency and the utilization of unstructured construction management texts, realizes intelligent management of construction texts, and promotes the intelligent development of hydropower engineering construction management.

To solve the technical problems above, the invention adopts the following technical solution: an intelligent assessment method for hydropower project progress based on text mining, comprising the following steps:

S1: collecting construction progress management texts, extracting the text content related to progress management from the construction data, and consolidating it into a data file that serves as the document set for subsequent topic model sampling;

S2: preprocessing the document set, segmenting sentences into words, and removing stop words and non-text characters in preparation for subsequent text sampling;

S3: processing the preprocessed text with the BTM (Biterm Topic Model) topic model as the analysis method, extracting words that share a topic, and sorting them to form the main and auxiliary processes of the project;

S4: searching the text for quantitative values of the progress evaluation indexes related to the main and auxiliary processes formed by this sorting;

S5: developing a construction progress evaluation program based on the extracted main and auxiliary processes and the quantitative values of the progress evaluation indexes, and analyzing process construction progress in the program with the earned value method.

Further, the step S1 specifically includes the following steps:

S11: obtaining the electronic files in which the construction unit records construction progress, including supervision weekly reports, supervision monthly reports and construction organization design documents;

S12: extracting the text related to construction progress from these files; because the construction texts follow uniform templates, the relevant text can be extracted with regular expressions or a search method and recorded in a data file.
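
Because the reports follow a uniform template, the extraction in step S12 can be sketched with a regular expression. The following is a minimal illustration only: the heading text "Construction progress", the numbered-heading template, the sample report and the function name `extract_progress_section` are all assumptions made for the example and are not taken from the actual supervision reports.

```python
import re

# Hypothetical template: numbered headings such as "2. Construction progress".
# The section runs until the next numbered heading or the end of the report.
SECTION_RE = re.compile(
    r"^\d+\.\s*Construction progress\s*\n(.*?)(?=^\d+\.\s|\Z)",
    re.MULTILINE | re.DOTALL,
)

def extract_progress_section(report_text):
    """Return the text of the progress section, or an empty string if absent."""
    match = SECTION_RE.search(report_text)
    return match.group(1).strip() if match else ""

# Made-up sample report used only to exercise the pattern.
sample_report = """1. Overview
Weekly supervision report for the dam section.
2. Construction progress
Joint grouting: 120 m completed this week.
Concrete pouring: 3500 m3 completed this week.
3. Quality
No defects found.
"""

progress_text = extract_progress_section(sample_report)
```

In practice the extracted sections from all reports would be appended to one data file, which then serves as the document set for the later steps.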

Further, the step S2 specifically includes the following steps:

S21: segmenting the text generated in step S1 with the jieba library in Python; the jieba segmentation dictionary is first supplemented with the process feature words that occur in the text, to obtain higher segmentation accuracy;

S22: cleaning the text by removing stop words and non-text content; a stop word list is adopted and supplemented with non-process words that appear in the supervision weekly reports, and the words and characters contained in the stop word list are deleted from the text.
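
The cleaning in step S22 can be sketched as a filter over the segmented words. The snippet below is a minimal illustration that assumes segmentation (for example with jieba) has already produced a token list; the tiny stop word set, the sample tokens and the decision to drop purely numeric tokens are placeholders chosen for the example, not the actual stop word list or rules used in the method.

```python
import re

# Placeholder stop words; the method supplements a published stop word list
# with non-process words found in the supervision weekly reports.
STOP_WORDS = {"the", "of", "this week", "cumulative"}

# A token counts as text if it contains at least one Latin letter or CJK character.
TEXT_TOKEN_RE = re.compile(r"[A-Za-z\u4e00-\u9fff]")

def clean_tokens(tokens, stop_words=STOP_WORDS):
    """Drop stop words, whitespace, punctuation and purely numeric tokens."""
    cleaned = []
    for token in tokens:
        token = token.strip()
        if not token or token.lower() in stop_words:
            continue
        if not TEXT_TOKEN_RE.search(token):
            continue  # assumption: tokens without any letter are non-text
        cleaned.append(token)
    return cleaned

# Made-up token list standing in for the output of the segmentation step.
tokens = ["this week", "joint grouting", ",", "cumulative", "120", " ", "dam"]
cleaned = clean_tokens(tokens)
```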

Further, the step S3 specifically includes the following steps:

S31: assigning an index to each word in the word set produced by segmentation to generate a dictionary, dividing the text into natural paragraphs, and combining the words in pairs to generate word pairs;

S32: determining the parameters of the sampling model; the hyperparameters of the topic distribution and the word distribution are set empirically, with α = 50/K (K being the number of topics) and β = 0.01 by default; the number of topics is determined using perplexity as the measure; perplexity is an index of topic extraction accuracy in topic classification and is used to estimate the optimal number of topics in the text; for corpus D, the calculation formula is as follows:

where p(b) is the probability of each word pair b in the corpus, computed in the model as p(b) = Σ_z p(z|d) × p(w_i|z) × p(w_j|z); z is a trained topic; d is each document in the test set; w_i is the i-th word in the text; w_j is the j-th word in the text; and B is the number of word pairs in the corpus;
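
A commonly used form of the perplexity, written with the symbols defined above and offered as an assumption rather than as the exact expression of the original filing, is:

```latex
\mathrm{Perplexity}(D) = \exp\left\{ -\frac{\sum_{b \in D} \log p(b)}{B} \right\},
\qquad
p(b) = \sum_{z} p(z \mid d)\, p(w_i \mid z)\, p(w_j \mid z)
```

A lower perplexity indicates a better fit, so the number of topics is chosen where the value is smallest.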

S33: the model parameters and word pairs obtained above are analyzed by Gibbs sampling, and the Gibbs sampling algorithm is used to solve for the topic distribution parameter θ and the word distribution parameter φ.

The estimated values of the parameters θ and φ are:

where θ_k is the generation probability of the k-th topic, B is the number of word pairs in the corpus, α_k is the α hyperparameter of the k-th topic, and n_k is the number of word pairs in the k-th topic;

φ_kn is the generation probability of the n-th word under the k-th topic, n_kn is the count of the n-th word in the k-th topic, and β_n is the β hyperparameter of the n-th word; a certain number of topic words are displayed to obtain the process feature words that share a topic;
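
The standard BTM point estimates, written with the symbols defined above, are given here as a hedged reconstruction; the exact normalisation used in the original filing is an assumption:

```latex
\theta_k = \frac{n_k + \alpha_k}{B + \sum_{k'=1}^{K} \alpha_{k'}},
\qquad
\varphi_{kn} = \frac{n_{kn} + \beta_n}{\sum_{n'} \left( n_{kn'} + \beta_{n'} \right)}
```

With symmetric hyperparameters these reduce to θ_k = (n_k + α)/(B + Kα) and φ_kn = (n_kn + β)/(2 n_k + Mβ), where M (a symbol introduced here) is the vocabulary size and the factor 2 reflects that each word pair contributes two words to its topic.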

S34: to make the results of the BTM topic model easier to analyze, LDAvis, a tool for visualizing topic model results, is adopted.
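
Steps S31 to S33 can be sketched in plain Python as follows. This is a minimal illustration of collapsed Gibbs sampling for a biterm topic model using symmetric hyperparameters; the function names `make_biterms` and `train_btm`, the toy paragraphs and the fixed random seed are assumptions for the example, and the code is not the implementation used in the method.

```python
import random
from itertools import combinations

def make_biterms(paragraphs):
    """Index words and pair every two words within each paragraph (S31)."""
    vocab, biterms = {}, []
    for words in paragraphs:
        ids = [vocab.setdefault(w, len(vocab)) for w in words]
        biterms.extend(combinations(ids, 2))
    return vocab, biterms

def train_btm(biterms, num_words, K, alpha, beta, iterations=100, seed=0):
    """Collapsed Gibbs sampling for BTM (S33); returns (theta, phi)."""
    rng = random.Random(seed)
    n_k = [0] * K                                 # word pairs per topic
    n_kw = [[0] * num_words for _ in range(K)]    # word counts per topic
    z = []
    for wi, wj in biterms:                        # random initial topics
        k = rng.randrange(K)
        z.append(k)
        n_k[k] += 1
        n_kw[k][wi] += 1
        n_kw[k][wj] += 1
    for _ in range(iterations):
        for b, (wi, wj) in enumerate(biterms):
            k = z[b]                              # remove current assignment
            n_k[k] -= 1
            n_kw[k][wi] -= 1
            n_kw[k][wj] -= 1
            weights = []
            for t in range(K):
                # Each pair in topic t contributes two words to its word total.
                denom = 2 * n_k[t] + num_words * beta
                weights.append((n_k[t] + alpha)
                               * (n_kw[t][wi] + beta) * (n_kw[t][wj] + beta)
                               / (denom * (denom + 1)))
            k = rng.choices(range(K), weights=weights)[0]
            z[b] = k                              # record the new assignment
            n_k[k] += 1
            n_kw[k][wi] += 1
            n_kw[k][wj] += 1
    B = len(biterms)
    theta = [(n_k[k] + alpha) / (B + K * alpha) for k in range(K)]
    phi = [[(n_kw[k][w] + beta) / (2 * n_k[k] + num_words * beta)
            for w in range(num_words)] for k in range(K)]
    return theta, phi

# Toy paragraphs standing in for segmented, cleaned report text.
paragraphs = [
    ["joint", "grouting", "dam"],
    ["concrete", "pouring", "dam"],
    ["joint", "grouting", "curtain"],
]
vocab, biterms = make_biterms(paragraphs)
K = 2
theta, phi = train_btm(biterms, len(vocab), K, alpha=50 / K, beta=0.01)
```

Sorting each row of `phi` in descending order and mapping the indices back through `vocab` would give the topic words to display for each topic.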

Further, the step S4 specifically includes the following steps:

S41: transferring the quantitative values of progress evaluation indexes that are already stored in structured tables directly into an Excel table; for quantitative values embedded in the running text, extracting the content by search or by mutual information;

S42: sorting the information and storing it in a data table.
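
The keyword search in step S41 can be sketched as follows. This is a minimal illustration that assumes each quantity appears in the same sentence as the process name and is written as a number followed by a unit; the unit list, the sentence-splitting rule, the sample text and the function name `find_quantities` are hypothetical.

```python
import re

# Hypothetical units; a real report would need its own unit list.
QUANTITY_RE = re.compile(r"(\d+(?:\.\d+)?)\s*(m3|m2|m|t)\b")

def find_quantities(text, process_name):
    """Return (value, unit) pairs from sentences that mention the process."""
    results = []
    # Split on semicolons, line breaks, CJK full stops, or a period followed
    # by whitespace (so decimal points are not treated as sentence ends).
    for sentence in re.split(r"[;。；\n]|\.\s", text):
        if process_name in sentence:
            for value, unit in QUANTITY_RE.findall(sentence):
                results.append((float(value), unit))
    return results

# Made-up progress text used only to exercise the search.
text = ("Joint grouting completed 120.5 m this week; "
        "concrete pouring completed 3500 m3 this week. "
        "Cumulative joint grouting reached 860 m.").lower()

grouting = find_quantities(text, "joint grouting")
pouring = find_quantities(text, "concrete pouring")
```

The resulting pairs can then be written to the data table of step S42, one row per process and reporting period.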

Further, the step S5 specifically includes the following steps:

S51: developing an intelligent assessment system for hydropower project progress: a WinForm program is written in the C# programming language, and the quantitative values of the progress evaluation indexes are packaged into the system for construction managers to search and use;

S52: adding a construction data analysis function to the system, in which the earned value method is used to analyze the construction data, so that construction managers can find progress problems in time, guide progress management on site and improve engineering management efficiency.

The invention has the beneficial effects that:

1. The invention provides an intelligent construction progress assessment method that addresses the difficulty of using massive construction progress management texts efficiently and effectively.

2. The method brings natural language processing and computer program development into hydraulic engineering progress management and data mining, extracts and classifies the large volume of construction keywords in the texts, improves construction management efficiency and the utilization of unstructured construction management texts, realizes intelligent management of construction texts, and promotes the intelligent development of hydraulic engineering construction management.

3. The method uses the BTM (Biterm Topic Model) topic model to calculate the topic distribution and word distribution in the text, extracts the process feature words from the construction progress management texts, and searches for the construction progress indexes and quantitative values related to those words; integrating text mining into the schedule management of hydropower projects accelerates intelligent management.

4. A construction progress evaluation system based on the earned value method is developed: building on the above research, the construction process words are combined with the quantitative values to develop an intelligent construction progress evaluation system, which avoids time-consuming and labor-intensive manual operation and improves the efficiency of text extraction and analysis.

Drawings

FIG. 1 is a flow chart of a hydropower project progress intelligent assessment method based on text mining;

FIG. 2 is a diagram of a BTM topic model architecture;

FIG. 3 is a diagram illustrating LDAvis calculation results;

FIG. 4 is an earned value cost versus time evaluation graph;

FIG. 5 is a schematic diagram of a data source;

FIG. 6 is a construction progress evaluation program main interface diagram;

FIG. 7 is a chart of a construction progress evaluation program weekly progress query interface.

Detailed Description

The invention is described in further detail below with reference to the figures and specific embodiments.

As shown in FIG. 1, the intelligent hydropower project progress evaluation method based on text mining comprises the following steps:

S11: obtaining the electronic files in which the construction unit records construction progress, including supervision weekly reports, supervision monthly reports and construction organization design documents;

S22: cleaning the text by removing stop words and non-text content; the Harbin Institute of Technology stop word list is adopted and supplemented with non-process words that appear in the supervision weekly reports, such as "this week" and "cumulative", and the words and characters contained in the stop word list are deleted from the text.

S32: determining the parameters of the sampling model. The hyperparameters of the topic distribution and the word distribution are set empirically, with α = 50/K and β = 0.01 by default; α and β have no great influence on the experimental result and mainly serve to smooth the data. The number of topics is usually determined using perplexity as the measure; perplexity is an index of topic extraction accuracy in topic classification and is used to estimate the optimal number of topics in the text. For corpus D, the calculation formula is as follows:

The estimated values of the parameters θ and φ are:

where θ_k is the generation probability of the k-th topic, B is the number of word pairs in the corpus, α_k is the α hyperparameter of the k-th topic, and n_k is the number of word pairs in the k-th topic;

φ_kn is the generation probability of the n-th word under the k-th topic, n_kn is the count of the n-th word in the k-th topic, and β_n is the β hyperparameter of the n-th word; a certain number of topic words are displayed to obtain the process feature words that share a topic;

S34: to make the results of the BTM topic model easier to analyze, LDAvis, a tool for visualizing topic model results, can be adopted. LDAvis is a web-based interactive visualization system with which the BTM topic model results can be understood more clearly (as shown in FIG. 3).

LDAvis has two main functions. First, selecting a topic number displays the topic words related to that topic; compared with the plain list of topic words produced by a traditional model, LDAvis shows the word frequency of each topic word intuitively. The light bar is the frequency of the topic word in the corpus, and the dark bar is the probability of the topic word within the topic. The left side of the page shows the topic distribution: the size of a topic bubble reflects the share of the corpus covered by that topic, and overlapping bubbles indicate topics whose content overlaps. Second, hovering the cursor over a word shows the distribution of that word across the topics, indicated by the size of the bubbles.

S42: sorting the information and storing it in a data table.

S52: adding a construction data analysis function to the system, in which the earned value method is used to analyze the construction data, so that construction managers can find progress problems in time, guide progress management on site and improve project management efficiency.

The earned value method does not take engineering quantity as the only standard for measuring progress; by observing how funds are converted into completed work, it uses engineering quantity and engineering cost together as the progress measure. The basic parameters of the earned value method are the budgeted cost of work performed (BCWP), the budgeted cost of work scheduled (BCWS) and the actual cost of work performed (ACWP); the evaluation indexes are the cost variance (CV), the schedule variance (SV), the schedule performance index (SPI) and the cost performance index (CPI). When the BCWP, ACWP and BCWS curves are plotted, the magnitudes of CV and SV can be read directly from the graph (as shown in FIG. 4).
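
The four evaluation indexes follow from the three basic parameters by the standard definitions CV = BCWP - ACWP, SV = BCWP - BCWS, CPI = BCWP / ACWP and SPI = BCWP / BCWS. The short sketch below illustrates the arithmetic; the function name `earned_value_indicators` and the sample figures are made up for the example.

```python
def earned_value_indicators(bcwp, bcws, acwp):
    """Compute the four evaluation indexes from the three basic parameters."""
    return {
        "CV": bcwp - acwp,    # cost variance: negative means over budget
        "SV": bcwp - bcws,    # schedule variance: negative means behind schedule
        "CPI": bcwp / acwp,   # cost performance index: below 1 means over budget
        "SPI": bcwp / bcws,   # schedule performance index: below 1 means behind
    }

# Made-up figures in arbitrary cost units, not data from the project.
indicators = earned_value_indicators(bcwp=900.0, bcws=1000.0, acwp=950.0)
```

With these sample figures both variances are negative and both indexes are below 1, which corresponds to work that is behind schedule and over budget.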

Examples

The data used in this example are the construction supervision reports of a hydropower station, and the construction text is processed in Python. The construction period of the project was 1603 days, during which 221 supervision reports were produced. Each report exceeds 10000 characters and describes management elements such as progress, quality and safety; it records in detail the completion status of each unit project and describes the current construction content and engineering quantities in a combination of text and tables. The text content related to progress management is extracted from the supervision reports and other construction data and stored centrally in a data file, as illustrated in FIG. 5.

The progress management texts are then preprocessed, mainly by word segmentation, stop word removal and removal of non-text content. Segmentation requires a specific algorithm; here the text is segmented with jieba in full mode. Because the text is highly specialized, a user-defined dictionary is added before segmentation and filled with the process feature words that may appear in the text, so that segmentation produces the expected result. When the text is cleaned and screened, non-process words, spaces, punctuation and other content that would degrade the calculation are discarded to safeguard topic extraction. The Harbin Institute of Technology stop word list is adopted and supplemented with non-process words that appear in the supervision weekly reports, such as "this week" and "cumulative", and the words contained in the stop word list are removed from the text.

Each word in the word set produced by segmentation is assigned an index to generate a dictionary. The text is divided into natural paragraphs and the words are combined in pairs to generate word pairs; 167780 word pairs are constructed. Before the calculation, the topic distribution hyperparameter α is set to 0.5 and the word distribution hyperparameter β to 0.05. The optimal number of topics also has to be estimated beforehand. In this study, considering the likely number of categories of same-topic process words in the text, model perplexity was calculated for topic numbers from 10 to 25. Perplexity is lowest at 12 topics, so 12 is initially selected as the number of topics for the calculation.
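
The selection of the number of topics by perplexity can be sketched as follows. This is a minimal illustration that assumes a trained model is available as a topic distribution `theta` and a word distribution `phi`, and that the corpus-level `theta` stands in for p(z|d); the function names and all numeric values in the example are placeholders, not results from the study.

```python
import math

def perplexity(biterms, theta, phi):
    """exp(-sum(log p(b)) / B), with p(b) = sum_k theta[k]*phi[k][wi]*phi[k][wj]."""
    log_sum = 0.0
    for wi, wj in biterms:
        p_b = sum(t * row[wi] * row[wj] for t, row in zip(theta, phi))
        log_sum += math.log(p_b)
    return math.exp(-log_sum / len(biterms))

def best_topic_number(perplexity_by_k):
    """Pick the candidate number of topics with the lowest perplexity."""
    return min(perplexity_by_k, key=perplexity_by_k.get)

# Toy model with two topics over a two-word vocabulary.
toy_theta = [0.5, 0.5]
toy_phi = [[0.5, 0.5], [0.5, 0.5]]
toy_value = perplexity([(0, 1), (0, 0)], toy_theta, toy_phi)

# Made-up perplexity values keyed by topic number, not results from the study.
chosen = best_topic_number({10: 310.0, 12: 295.0, 25: 330.0})
```

In a full run, one model would be trained for each candidate topic number and its perplexity stored under that number before the minimum is taken.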

With the dictionary, the word pairs and the optimal number of topics determined as above, the model is trained. The parameters θ and φ of the topic distribution and the word distribution are calculated with the formulas given above. The number of iterations is set to 100; too many iterations can lead to overfitting, while too few can leave the classification quality below standard. Ten topic words are displayed for each topic, the topic names are derived by analyzing the topic words, and the sampling results are shown in Table 1.

Table 1 Sampling results

After the topic information has been extracted from the construction progress text, the quantitative progress values are retrieved from the progress text by keyword. Joint grouting is taken as an example to show the retrieval result and how it is presented in the program. Table 2 shows part of the joint grouting completion record.

Table 2 Weekly completion of joint grouting (excerpt)

A construction progress evaluation program is developed as a WinForm application written in the C# programming language; the program interface is shown in FIG. 6. The data are packaged into the application and are analyzed and calculated within the program using the earned value method.

FIG. 7 shows the weekly progress query interface of the construction progress evaluation program. The process whose progress is to be queried is selected in the "process" drop-down list, and the time to be queried is selected in the "time (year/week)" drop-down list. After the process and time are selected, clicking the query button on the right outputs the result below the button: the project contract quantity, the cumulative quantity completed since the start of work, the percentage of the total quantity completed, the planned quantity for the week, the actual quantity completed in the week, and the weekly progress performance index.
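
The quantities returned by such a weekly query can be sketched in Python as follows (the program itself is described as a C# WinForm application). This is a minimal illustration only: the record layout `WeeklyRecord`, the function name `weekly_query`, the sample figures, and the definition of the weekly performance index as actual quantity divided by planned quantity are all assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass
class WeeklyRecord:
    process: str
    year: int
    week: int
    planned: float   # quantity planned for the week
    actual: float    # quantity actually completed in the week

def weekly_query(records, contract_quantity, process, year, week):
    """Summarise one process up to and including the selected week."""
    rows = [r for r in records if r.process == process]
    current = next(r for r in rows if (r.year, r.week) == (year, week))
    cumulative = sum(r.actual for r in rows if (r.year, r.week) <= (year, week))
    return {
        "contract_quantity": contract_quantity,
        "cumulative_completed": cumulative,
        "percent_complete": 100.0 * cumulative / contract_quantity,
        "weekly_planned": current.planned,
        "weekly_actual": current.actual,
        # Assumed definition: actual quantity divided by planned quantity.
        "weekly_performance_index": current.actual / current.planned,
    }

# Made-up records, not data from the project.
records = [
    WeeklyRecord("joint grouting", 2020, 1, planned=100.0, actual=90.0),
    WeeklyRecord("joint grouting", 2020, 2, planned=100.0, actual=110.0),
    WeeklyRecord("concrete pouring", 2020, 2, planned=500.0, actual=450.0),
]
result = weekly_query(records, 1000.0, "joint grouting", 2020, 2)
```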

With the construction progress evaluation program, staff can query the progress of any past week and year after construction. Construction managers can analyze the construction status of a given week and year, find progress problems in time, guide progress management on site and improve engineering management efficiency.

The above-described embodiments are merely preferred technical solutions of the present invention, and should not be construed as limiting the present invention, and the embodiments and features in the embodiments in the present application may be arbitrarily combined with each other without conflict. The protection scope of the present invention is defined by the claims, and includes equivalents of technical features of the claims. I.e., equivalent alterations and modifications within the scope hereof, are also intended to be within the scope of the invention.

Claims

1. An intelligent assessment method for hydropower project progress based on text mining, characterized by comprising the following steps:

S2: preprocessing the document set, segmenting sentences into words, and removing stop words and non-text characters in preparation for subsequent text sampling;

2. The intelligent assessment method for the progress of the hydropower project based on text mining as claimed in claim 1, characterized in that: the step S1 specifically includes the following steps:

3. The intelligent hydropower project progress evaluation method based on text mining as claimed in claim 1, characterized in that: the step S2 specifically includes the following steps:

4. The intelligent assessment method for the progress of the hydropower project based on text mining as claimed in claim 1, characterized in that: the step S3 specifically includes the following steps:

S32: determining the parameters of the sampling model; the hyperparameters of the topic distribution and the word distribution are set empirically, with α = 50/K and β = 0.01 by default; the number of topics is determined using perplexity as the measure; perplexity is an index of topic extraction accuracy in topic classification and is used to estimate the optimal number of topics in the text; for corpus D, the calculation formula is as follows:

S33: the model parameters and word pairs obtained above are analyzed by Gibbs sampling, and the Gibbs sampling algorithm is used to solve for the topic distribution parameter θ and the word distribution parameter φ.

The estimated values of the parameters θ and φ are:

S34: to make the results of the BTM topic model easier to analyze, LDAvis, a tool for visualizing topic model results, is adopted.

5. The intelligent assessment method for the progress of the hydropower project based on text mining as claimed in claim 1, characterized in that: the step S4 specifically includes the following steps:

s42: and sorting and storing the information into a data table.

6. The intelligent assessment method for the progress of the hydropower project based on text mining as claimed in claim 1, characterized in that: the step S5 specifically includes the following steps: