CN112949304A

CN112949304A - Construction case knowledge reuse query method and device

Info

Publication number: CN112949304A
Application number: CN202110313320.2A
Authority: CN
Inventors: 邓逸川; 邓晖; 苏成; 王煜; 宋建炜
Original assignee: Sino Singapore International Joint Research Institute
Current assignee: Sino Singapore International Joint Research Institute
Priority date: 2021-03-24
Filing date: 2021-03-24
Publication date: 2021-06-11

Abstract

The invention discloses a construction case knowledge reuse query method and device. The method includes the following steps: collecting construction safety specification documents and construction safety accident reports, and digitizing these materials to establish a case database for the construction safety field; performing text segmentation and stop-word removal on the construction safety specification documents and accident reports using natural language processing technology, and then calculating feature words with the word frequency-inverse text frequency (TF-IDF) algorithm; performing a synonym expansion query of the feature words; calculating the similarity of construction safety cases based on a vector space model and an improved cosine function; and integrating the entire database and query system into a local server or smart device. By reusing knowledge from past construction safety cases, the invention provides decision support for new risks, which can greatly improve the level of construction safety management while saving query time and improving query efficiency.

Description

Construction case knowledge reuse query method and device

Technical Field

The invention relates to the technical field of construction case knowledge management, in particular to a construction case knowledge reuse query method and a construction case knowledge reuse query device based on a natural language processing technology.

Background

Since the reform and opening-up, engineering construction in China has developed rapidly. The construction of an engineering project is a comprehensive production activity spanning multiple disciplines; the construction period is long, and many uncertain factors arise during construction.

In recent years, although the construction safety situation in China has improved, safety accidents of various kinds still occur, construction safety problems cannot be ignored, and the level of construction safety management needs further improvement. The civil engineering industry produces a large amount of fragmented information and involves many variable factors. Although construction safety accident reports keep accumulating, traditional construction safety management cannot make full use of them, because it lacks a means of converting this information into reusable knowledge. If a construction case knowledge reuse query system is established, past construction case knowledge can be reused to provide decision support for new risks, which can greatly improve the level of construction safety management.

At present, construction cases are mainly found in accident safety reports and news reports on the websites of construction authorities, and searching unstructured text for similar construction safety cases is inefficient, so a large gap remains in the reuse of construction safety cases.

Disclosure of Invention

The invention aims to overcome the defects in the prior art and provides a construction case knowledge reuse query method and a construction case knowledge reuse query device.

In order to achieve the purpose, the invention provides a construction case knowledge reuse query method, which comprises the following steps:

step S1, collecting construction safety standard documents and construction safety accident reports, digitizing them, and establishing a case library for the construction safety field;

step S2, performing text word segmentation and stop-word removal on the construction safety standard documents and the construction safety accident reports based on natural language processing technology, and calculating feature words through the word frequency-inverse text frequency (TF-IDF) algorithm;

step S3, performing a synonym expansion query of the feature words through a self-built lexicon of common construction-industry terms and a continuous bag-of-words (CBOW) model;

step S4, calculating the similarity of construction safety cases based on a vector space model and an improved cosine function;

step S5, integrating the whole database and query system into a local server or a smart device.

Preferably, the step S2 includes the following steps:

step S21, performing word segmentation on the construction safety accident cases with jieba, which uses a prefix tree to organize words and thereby improve retrieval efficiency;

step S22, removing the function words in the construction safety accident case texts using a self-built stop-word library; stop words are extremely common words that contribute little to calculating text similarity, so deleting them greatly reduces the size of the library and improves retrieval efficiency;

and step S23, selecting, after comparing candidate algorithms, the word frequency-inverse text frequency (TF-IDF) algorithm, calculating the weights of the words, and extracting the feature words of the construction safety accident cases.

Preferably, the step S23 includes the following steps:

step S231, calculating, on the basis of the word frequency, a weight that reflects the importance of each word; this weight is called the "inverse text frequency", and it is inversely related to how common the word is;

step S232, assigning different weights to different words: less common words receive larger weights, more common words receive smaller weights, and the most common words receive the smallest weights; the word frequency is multiplied by the inverse text frequency to obtain the TF-IDF value of each word;

and step S233, since a word that is more important to a text has a larger TF-IDF value, completing the feature extraction of the text by sorting the words in descending order of TF-IDF value.

Preferably, the word frequency, the inverse text frequency, and the word frequency-inverse text frequency are calculated as follows:

word frequency TF: the number of times a feature item appears in a text, i.e. if the feature item t_{i,k} appears n_{i,k} times in the text d_i, then

$$TF_{i,k} = n_{i,k}$$

In practical applications, to avoid statistical bias caused by overly long texts, normalization is generally required, where \sum_m n_{m,k} is the total number of words in the text:

$$TF_{i,k} = \frac{n_{i,k}}{\sum_m n_{m,k}}$$

inverse text frequency IDF: reflects how often a feature item appears across the total text set D, i.e. if the total text set contains M texts and the feature item t_{i,k} appears in m_{i,k} of them, then

$$IDF_{i,k} = \log\frac{M}{m_{i,k} + \alpha}$$

where \alpha is an empirical constant, generally 0.01; the more common a word is, the larger the denominator and the smaller the inverse text frequency; \alpha is added to the denominator to prevent it from being 0, which would occur if no text contained the word;

word frequency-inverse text frequency TF-IDF: the TF-IDF value is the word frequency multiplied by the inverse text frequency

$$w_{i,k} = TF_{i,k} \cdot IDF_{i,k}$$

The TF-IDF value decreases as a word becomes more frequent across the whole text library and increases with the word's frequency in a specific text. The TF-IDF value of each word is therefore calculated, and the feature words are extracted in descending order of that value.
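The calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: it assumes documents that are already segmented into word lists, a natural logarithm, and an inverse text frequency of the form log(M / (m + alpha)). The function names `tf_idf` and `top_features` and the sample documents are hypothetical.

```python
import math
from collections import Counter

ALPHA = 0.01  # empirical smoothing constant mentioned in the text


def tf_idf(docs, alpha=ALPHA, normalize=True):
    """Return one {word: weight} dict per tokenized document."""
    total_docs = len(docs)
    # doc_freq[word] = number of documents that contain the word (m in the text)
    doc_freq = Counter(word for doc in docs for word in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        length = len(doc) if normalize else 1  # normalization by text length
        weights.append({
            word: (n / length) * math.log(total_docs / (doc_freq[word] + alpha))
            for word, n in counts.items()
        })
    return weights


def top_features(weight, k=3):
    """Words sorted by descending TF-IDF value, truncated to k entries."""
    ranked = sorted(weight.items(), key=lambda kv: kv[1], reverse=True)
    return [word for word, _ in ranked[:k]]


# Hypothetical, already-segmented documents (assumption for illustration only).
docs = [
    ["worker", "fall", "scaffold", "unit"],
    ["unit", "collapse", "formwork", "unit"],
    ["unit", "inspection", "report"],
]
weights = tf_idf(docs)
print(top_features(weights[0], k=3))
```

In this toy corpus "unit" occurs in every document, so its weight is the lowest in each document and it drops out of the extracted feature words.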

Preferably, the step S3 includes the following steps:

step S31, given the training text, namely the construction safety accident case library and Chinese Wikipedia, using one-hot codes as the input of the CBOW model, with the word vector dimension set to 100, the window set to 5, the minimum occurrence count set to 5, and the number of threads used for training set to 9; embedding the words through the CBOW model, accumulating the input word vectors, and finally completing the vectorized representation of the words through a binary classifier;

and step S32, reading the feature words extracted in step S2, obtaining their word vectors from the trained word vectors, finding the 5 words most similar to each feature word by cosine distance, and using them for synonym expansion.

Preferably, the CBOW model is a three-layer neural network model;

the first layer of the CBOW model is an input layer, and word vectors with known contexts are input;

the middle layer of the CBOW model is called a linear hidden layer and accumulates all input word vectors;

the third layer of the CBOW model is a binary classifier (softmax), through which the synonym expansion of the corresponding word is obtained after training.
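The three layers can be illustrated with a single forward pass. The sketch below is an assumption-laden toy, not the trained model: the weights are random, the vocabulary and the names `cbow_forward`, `w_in` and `w_out` are hypothetical, and the output layer is a plain softmax over the whole vocabulary. Practical word2vec implementations usually replace that layer with hierarchical softmax or negative sampling, both of which reduce training to a series of binary decisions.

```python
import math
import random


def cbow_forward(context_ids, w_in, w_out):
    """One CBOW forward pass: context word ids -> probability of each vocabulary word."""
    dim = len(w_in[0])
    # Input layer: multiplying a one-hot vector by w_in selects one row per context word.
    # Hidden layer: linear, it accumulates (sums) the selected word vectors.
    hidden = [sum(w_in[i][d] for i in context_ids) for d in range(dim)]
    # Output layer: one score per vocabulary word, converted to probabilities by softmax.
    scores = [sum(h * w for h, w in zip(hidden, out_vec)) for out_vec in w_out]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


vocab = ["worker", "fall", "drop", "scaffold", "collapse"]  # hypothetical vocabulary
rng = random.Random(0)
dim = 4  # the text uses 100 dimensions; kept tiny here
w_in = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in vocab]
w_out = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in vocab]

probs = cbow_forward([0, 3], w_in, w_out)  # context words: "worker", "scaffold"
print(len(probs), round(sum(probs), 6))
```

After training, the rows of `w_in` serve as the word vectors used for synonym expansion.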

Preferably, the step S4 includes: after the feature words and the synonyms are obtained, calculating the similarity between building construction safety cases using a vector space model (VSM) with an improved cosine function, the cosine coefficient being accurate and the most commonly used measure in VSM; calculating the similarity between the input case and each text using the similarity model of the third-party Python tool gensim; sorting the texts in descending order of similarity; and finally outputting the top 10 texts;

where Sim(t_1, t_0) is the similarity for the original query and Sim(t_1, t_k) is the similarity for the expanded query; the coefficient λ takes a value in 0 < λ < 1 and, after repeated verification, is set to 0.7.
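The combining formula itself is not reproduced in this text, so the sketch below rests on an explicit assumption: that the improved similarity is the weighted sum λ·Sim(t_1, t_0) + (1 − λ)·Sim(t_1, t_k). The function names and the sample vectors are hypothetical; only λ = 0.7 comes from the description.

```python
import math

LAMBDA = 0.7  # value reported in the text


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as {term: weight} dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def combined_similarity(case, original_query, expanded_query, lam=LAMBDA):
    """ASSUMED form: lam * Sim(case, original) + (1 - lam) * Sim(case, expanded)."""
    return lam * cosine(case, original_query) + (1 - lam) * cosine(case, expanded_query)


# Hypothetical term-weight vectors for illustration.
case = {"fall": 1.0, "scaffold": 1.0}
original = {"drop": 1.0, "scaffold": 1.0}
expanded = {"drop": 1.0, "fall": 1.0, "scaffold": 1.0}  # original query plus a synonym
print(combined_similarity(case, original, expanded))
```

Under this assumed form, a case that uses "fall" where the query says "drop" scores higher than it would on the original query alone, while λ = 0.7 keeps the original wording dominant.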

The invention also provides a construction case knowledge reuse query device, comprising:

a construction safety case acquisition and processing module, used for collecting construction safety standard documents and construction safety accident reports, digitizing them, and establishing a case library for the construction safety field, and for performing text word segmentation and stop-word removal on the construction safety standard documents and the construction safety accident reports based on natural language processing technology and calculating feature words through the word frequency-inverse text frequency (TF-IDF) algorithm;

a synonym expansion query module, connected with the construction safety case acquisition and processing module and used for performing a synonym expansion query of the feature words through a self-built lexicon of common construction-industry terms and a continuous bag-of-words (CBOW) model;

a similar case retrieval module, connected with the synonym expansion query module and used for calculating the similarity of construction safety cases based on a vector space model and an improved cosine function;

wherein the construction safety case acquisition and processing module comprises a crawler algorithm, word segmentation and stop-word removal; the synonym expansion query module comprises text vectorization and the continuous bag-of-words model; and the similar case retrieval module comprises text similarity calculation based on the vector space model.

Preferably, the system further comprises a local server or an intelligent device, and the whole database and the query system are stored in the local server or the intelligent device.

Compared with the prior art, the invention has the beneficial effects that:

1. The query method and device allow past construction safety accident cases to be queried at any time. Built on natural language processing technology, they reuse knowledge from past construction safety cases to provide decision support for new risks, which can greatly improve the safety management of site managers and construction workers, effectively reduce the safety accident rate, and help raise the construction safety management level of the whole construction industry.

2. Queries can be made from a mobile phone or tablet. The user enters a description of the accident problem or a daily report, and the invention directly outputs similar construction safety accident cases, which avoids the inefficiency and complexity of web search and improves the reuse efficiency of construction safety accident knowledge.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.

FIG. 1 is a schematic diagram illustrating steps of a construction case knowledge reuse query method provided by the present invention;

FIG. 2 is an exemplary analysis schematic diagram of a construction case knowledge reuse query method provided by the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are one embodiment of the present invention, and not all embodiments of the present invention. All other embodiments obtained by a person skilled in the art based on the embodiments of the present invention without any creative work belong to the protection scope of the present invention.

Example one

Referring to fig. 1 and fig. 2, an embodiment of the present invention provides a construction case knowledge reuse query method, including the following steps:

Step S1, collecting construction safety accident reports and construction safety failure cases: risk cases are gathered by web search, from the construction safety accident reports published on the websites of administrative departments, and from the literature; the data are then digitized to establish a construction safety accident case database.

Step S2, performing text word segmentation and stop-word removal on the construction safety standard documents and the construction safety accident reports based on natural language processing technology, and calculating feature words through the word frequency-inverse text frequency (TF-IDF) algorithm.

Specifically, the step S2 includes the following steps:

Step S21, performing word segmentation on the construction safety accident cases with jieba, which uses a prefix tree (also called a dictionary tree) to organize words and thereby improve search efficiency.

Suppose a computer searches for the word "building equipment". A naive search scans every character string in the text, which is inefficient. A prefix tree is instead searched from top to bottom, fixing one Chinese character at each level; if no child of the current node matches the next character, the search stops immediately, which greatly improves efficiency. In addition, the prefix tree can be combined with a directed acyclic graph to resolve ambiguous segmentations efficiently.
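The early-termination behaviour of a prefix tree can be shown with a small sketch. The classes and the four-word dictionary below are hypothetical and are not jieba's internal data structures; "建筑设备" is used as the Chinese for "building equipment".

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # next character -> child node
        self.is_word = False  # True if the path from the root spells a dictionary word


class Trie:
    def __init__(self, words=()):
        self.root = TrieNode()
        for word in words:
            self.insert(word)

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def contains(self, word):
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:  # no matching child: stop searching immediately
                return False
        return node.is_word

    def prefixes_of(self, text, start=0):
        """All dictionary words that begin at text[start]."""
        node, found = self.root, []
        for end in range(start, len(text)):
            node = node.children.get(text[end])
            if node is None:
                break
            if node.is_word:
                found.append(text[start:end + 1])
        return found


trie = Trie(["建筑", "建筑设备", "设备", "安全"])
print(trie.contains("建筑设备"))         # True
print(trie.prefixes_of("建筑设备安全"))  # ['建筑', '建筑设备']
```

The list returned by `prefixes_of` for each start position gives the candidate words from which a directed acyclic graph of possible segmentations could be built.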

Step S22, removing the function words in the construction safety accident case texts using a self-built stop-word library; stop words are extremely common words that contribute little to calculating text similarity, so deleting them greatly reduces the size of the library and improves retrieval efficiency.

Because current NLP technology still has limitations, word segmentation produces some meaningless tokens, such as underscores. Deleting the most frequent of these meaningless tokens effectively reduces the data volume. Stop-word removal can be implemented by importing a stop-word list and then removing every word that appears in it.
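A minimal sketch of this filtering step follows. It assumes the text has already been segmented (for instance by `jieba.lcut`); the token list is written by hand here so that the example has no third-party dependency, and the stop-word set is a hypothetical stand-in for the self-built library.

```python
def remove_stop_words(tokens, stop_words):
    """Drop stop words and tokens that contain no letter, digit or CJK character."""
    kept = []
    for token in tokens:
        token = token.strip()
        if not token or token in stop_words:
            continue
        if not any(ch.isalnum() for ch in token):  # pure punctuation such as "_" or "，"
            continue
        kept.append(token)
    return kept


# Hypothetical stop-word list; a real one would be loaded from a file, one word per line.
stop_words = {"的", "了", "在", "是"}
# Tokens as a segmenter might return them (hand-written for illustration).
tokens = ["工人", "在", "脚手架", "_", "上", "坠落", "了", "，"]
print(remove_stop_words(tokens, stop_words))  # ['工人', '脚手架', '上', '坠落']
```

Storing the stop words in a set keeps each membership test at constant time, which matters when the case library is large.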

Step S23, selecting, after comparing candidate algorithms, the word frequency-inverse text frequency (TF-IDF) algorithm, calculating the weights of the words, and extracting the feature words of the construction safety accident cases.

For example, in an accident report, the three words "unit", "fall" and "collapse" may occur the same number of times (the same word frequency), yet their importance differs. "Fall" and "collapse" are more representative of the text than "unit"; that is, "fall" and "collapse" should be ranked ahead of "unit" when the keywords are ranked.

One way to solve this problem is to use TF-IDF (word frequency-inverse text frequency): on the basis of the word frequency, a weight is calculated according to the importance of the word. This weight is called the "inverse text frequency", and it is inversely related to how common the word is. Less common words (e.g., "fall", "collapse") are given larger weights, more common words (e.g., "unit") are given smaller weights, and the most common words (e.g., "yes") are given the smallest weights. Multiplying the word frequency (TF) by the inverse text frequency (IDF) gives the TF-IDF value of the word. The more important a word is to a text, the larger its TF-IDF value. The feature extraction of the text can therefore be completed by sorting the words in descending order of TF-IDF value.
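As a worked illustration of this weighting, consider the following figures, which are hypothetical and not taken from the source: a library of M = 100 reports and a 150-word report in which "unit" and "collapse" each occur 3 times, with "unit" appearing in 90 reports and "collapse" in 5. Assuming a word frequency normalized by text length, a natural logarithm, and an inverse text frequency of the form log(M / (m + α)) with α = 0.01:

$$
\begin{aligned}
TF &= \frac{3}{150} = 0.02 \\
IDF_{\text{unit}} &= \ln\frac{100}{90 + 0.01} \approx 0.105, & w_{\text{unit}} &\approx 0.02 \times 0.105 \approx 0.0021 \\
IDF_{\text{collapse}} &= \ln\frac{100}{5 + 0.01} \approx 2.994, & w_{\text{collapse}} &\approx 0.02 \times 2.994 \approx 0.0599
\end{aligned}
$$

Although the two words have the same word frequency, "collapse" receives a weight roughly 28 times larger under these assumptions and is therefore ranked ahead of "unit".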

The calculation method of the word frequency, the inverse text frequency and the word frequency-inverse text frequency is as follows:

word frequency (TF): the number of times a feature item appears in a text, i.e. if the feature item t_{i,k} appears n_{i,k} times in the text d_i, then

$$TF_{i,k} = n_{i,k}$$

inverse text frequency (IDF): reflects how often a feature item appears across the total text set D, i.e. if the total text set contains M texts and the feature item t_{i,k} appears in m_{i,k} of them, then

$$IDF_{i,k} = \log\frac{M}{m_{i,k} + \alpha}$$

word frequency-inverse text frequency (TF-IDF): the TF-IDF value is the word frequency multiplied by the inverse text frequency

$$w_{i,k} = TF_{i,k} \cdot IDF_{i,k}$$

Step S3, performing a synonym expansion query of the feature words through a self-built lexicon of common construction-industry terms and a continuous bag-of-words (CBOW) model.

The CBOW model is a three-layer neural network model;

the third layer of the CBOW model is a binary classifier (softmax), through which the synonym expansion of the corresponding word is obtained after training. For example, "fall" and "drop" are words of similar meaning.

Specifically, the step S3 includes the following steps:

Step S31, given the training text, namely the construction safety accident case library and Chinese Wikipedia, using one-hot codes as the input of the CBOW model, with the word vector dimension set to 100, the window set to 5, the minimum occurrence count set to 5, and the number of threads used for training set to 9; embedding the words through the CBOW model, accumulating the input word vectors, and finally completing the vectorized representation of the words through a binary classifier.

Step S32, reading the feature words extracted in step S2, obtaining their word vectors from the trained word vectors, finding the 5 words most similar to each feature word by cosine distance, and using them for synonym expansion. For example, "fall" and "drop" are words of similar meaning.
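Step S32 amounts to a nearest-neighbour lookup over the trained word vectors. The sketch below uses hand-made three-dimensional vectors rather than trained ones, and the helper names are hypothetical. If the vectors were trained with gensim 4.x, the settings in step S31 would presumably correspond to `Word2Vec(sentences, vector_size=100, window=5, min_count=5, workers=9, sg=0)`, and the lookup to `model.wv.most_similar(word, topn=5)`; the source does not state which library was used for training.

```python
import math


def cosine(u, v):
    """Cosine similarity of two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand(word, vectors, topn=5):
    """Return the topn words whose vectors are closest to `word` by cosine similarity."""
    target = vectors[word]
    scored = [(other, cosine(target, vec)) for other, vec in vectors.items() if other != word]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [other for other, _ in scored[:topn]]


# Hand-made 3-dimensional vectors (assumption); trained vectors would have 100 dimensions.
vectors = {
    "fall": [0.9, 0.1, 0.0],
    "drop": [0.8, 0.2, 0.1],
    "tumble": [0.85, 0.15, 0.05],
    "collapse": [0.3, 0.9, 0.1],
    "scaffold": [0.1, 0.2, 0.9],
    "unit": [0.0, 0.1, 0.2],
    "report": [0.1, 0.0, 0.3],
}
print(expand("fall", vectors))
```

With these toy vectors the two nearest neighbours of "fall" are "tumble" and "drop", which is the kind of expansion the step is meant to produce.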

Step S4, calculating the similarity of construction safety cases based on the vector space model and an improved cosine function.

Specifically, after the feature words and the synonyms are obtained, the similarity between construction safety cases is calculated using a vector space model (VSM) with an improved cosine function. The cosine coefficient gives accurate results and is the most commonly used measure in VSM, so it is used to calculate the similarity. The similarity between the input case and each text is calculated using the similarity model of the third-party Python tool gensim, the texts are sorted in descending order of similarity, and the top 10 texts are output as the result.
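The retrieval step can be sketched without third-party libraries as follows. This is an illustration under stated assumptions, not the gensim-based implementation the description refers to: cases are represented by raw term counts rather than TF-IDF weights, and the case identifiers, token lists and the name `rank_cases` are hypothetical.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity of two sparse term-count vectors."""
    dot = sum(w * v.get(t, 0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0


def rank_cases(query_tokens, cases, topn=10):
    """Return (case_id, similarity) pairs sorted from most to least similar."""
    query_vec = Counter(query_tokens)
    scored = [(case_id, cosine(query_vec, Counter(tokens))) for case_id, tokens in cases.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Hypothetical, already-segmented case library.
cases = {
    "case-001": ["worker", "fall", "scaffold"],
    "case-002": ["formwork", "collapse", "concrete"],
    "case-003": ["worker", "electric", "shock"],
}
for case_id, score in rank_cases(["worker", "fall", "scaffold", "height"], cases):
    print(case_id, round(score, 3))
```

The `topn=10` default mirrors the ten results returned by the described system; with a library of three cases, all three are listed in descending order of similarity.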

For example, queries can be made on a construction site from a mobile phone or tablet. Because the database and the query system are stored locally, queries work with or without a network connection, so the required safety knowledge can be retrieved in real time even on projects in remote mountainous areas.

More specifically, the user enters a description of the accident problem or a daily report on a mobile phone or tablet, and similar construction safety accident cases are output directly, which avoids the inefficiency and complexity of web search and improves the reuse efficiency of construction safety accident knowledge.

Example two

The second embodiment of the invention provides a construction case knowledge reuse query device, which comprises:

The system also comprises a local server or intelligent equipment, wherein the whole database and the query system are stored in the local server or the intelligent equipment.

The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.

Claims

1. A construction case knowledge reuse query method is characterized by comprising the following steps:

2. The construction case knowledge reuse query method according to claim 1, wherein the step S2 includes the following steps:

3. The construction case knowledge reuse query method according to claim 2, wherein the step S23 includes the following steps:

4. The construction case knowledge reuse query method according to claim 3, wherein the word frequency, the inverse text frequency, and the word frequency-inverse text frequency are calculated as follows:

$$TF_{i,k} = n_{i,k}$$

In practical applications, to avoid statistical bias caused by overly long texts, normalization is generally required, where \sum_m n_{m,k} is the total number of words in the text:

$$TF_{i,k} = \frac{n_{i,k}}{\sum_m n_{m,k}}$$

$$w_{i,k} = TF_{i,k} \cdot IDF_{i,k}$$

The word frequency-inverse text frequency is inversely proportional to the occurrence number of a word in the whole total text library and is directly proportional to the occurrence number of the word in a specific text, so that the word frequency-inverse text frequency of the word is calculated, and the characteristic value is extracted by descending order.

5. The construction case knowledge reuse query method according to claim 1, wherein the step S3 includes the following steps:

6. The construction case knowledge reuse query method according to claim 5, wherein the CBOW model is a three-layer neural network model;

7. The construction case knowledge reuse query method according to claim 1, wherein the step S4 includes: after the feature words and the synonyms are obtained, calculating the similarity between building construction safety cases using a vector space model (VSM) with an improved cosine function, the cosine coefficient being accurate and the most commonly used measure in VSM; calculating the similarity between the input case and each text using the similarity model of the third-party Python tool gensim; sorting the texts in descending order of similarity; and finally outputting the top 10 texts;

where Sim(t_1, t_0) is the similarity for the original query and Sim(t_1, t_k) is the similarity for the expanded query; the coefficient λ takes a value in 0 < λ < 1 and, after repeated verification, is set to 0.7.

8. A construction case knowledge reuse inquiry device is characterized by comprising:

9. The construction case knowledge reuse query device according to claim 8, further comprising a local server or an intelligent device, wherein the entire database and the query system are stored in the local server or the intelligent device.