CN112749278B - Classification method for building engineering change instructions

Info

Publication number: CN112749278B
Application number: CN202011629638.3A
Authority: CN
Inventors: 刘发贵; 吴怡
Original assignee: Guangdong Zhuwuzhilian Technology Co ltd; South China University of Technology SCUT
Current assignee: Guangdong Zhuwuzhilian Technology Co ltd; South China University of Technology SCUT
Priority date: 2020-12-30
Filing date: 2020-12-30
Publication date: 2022-10-18
Anticipated expiration: 2040-12-30
Also published as: CN112749278A

Abstract

The invention discloses a method for classifying construction engineering change instructions. The method comprises the following steps: performing document conversion tailored to the characteristics of construction engineering documents, extracting the text content of each document, labeling the category of each document, and establishing a construction engineering corpus; constructing a building domain dictionary, and preprocessing the texts in the construction engineering corpus by word segmentation, using a word segmentation tool combined with the building domain dictionary, to obtain text word sequences; establishing a word vector training corpus, and training distributed word vectors with a word vector model; representing the global word features and local syntactic features of each text based on a distributed representation method, and constructing a fused semantic representation of the text; constructing a classification model from the fused semantic representation with a supervised machine learning algorithm; and predicting the category of the documents to be classified so that instructions related to engineering changes are obtained quickly. By applying natural language processing technology, the invention addresses the low efficiency of manually classifying engineering documents in the building field and benefits the construction management of building projects.

Description

Method for classifying building engineering change instructions

Technical Field

The invention belongs to the field of natural language processing and constructional engineering, and particularly relates to a method for classifying constructional engineering change instructions.

Background

China's urbanization is proceeding at high speed: the number of new municipal, housing, and road projects has increased markedly, the scale of individual new projects keeps growing, and building structure forms are becoming more diverse. Construction engineering is complex, involves many factors such as management, technology, and finance, and requires the cooperation of multiple parties. The work contact sheet serves as a communication document and a construction basis among the units and departments involved, and records information such as scheme confirmation, progress coordination, construction feedback, and government documents produced during construction. For notification contact sheets that relate to engineering changes, the content involves changes to the construction object and increases or decreases in construction materials and personnel, so the constructor must respond to the change instruction promptly to avoid waste. Screening engineering change instructions quickly and accurately out of a large number of work contact sheets is therefore of great significance for completing a construction project successfully. At present, however, the construction industry still relies on manual review to find the engineering change instructions among the work contact sheets, which is time-consuming and laborious and prone to omissions and even mistakes, wasting considerable manpower and material resources.

Natural language processing combines computer science and linguistics to study the interaction between computers and human natural language. Identifying engineering change instructions in a construction project is a text classification task in natural language processing: through classification, the work contact sheets of a project can be labeled automatically, engineering change information can be obtained quickly and accurately, and the level of construction project information management can be improved.

Currently, the application of natural language processing technology in the building field is still in its early stages. Bell billows et al (2018) invented a method and system for classifying building quality complaint texts (application publication No. CN 1085631A), in which a classification model is built on a convolutional neural network. However, given the number of documents in an actual project, that method is not suited to the problem addressed by the present invention. At present, text classification research in the building field mainly adopts shallow machine learning. Text representation is the key to machine learning: the classical bag-of-words model does not consider the semantic and syntactic relatedness of words, and its semantic representation of text suffers from high dimensionality and sparsity.

Disclosure of Invention

The invention aims to solve the low efficiency of manually classifying construction engineering documents by using natural language processing technology. It provides a method for classifying construction engineering change instructions in which distributed word vectors represent the global word features of a text to obtain a global semantic representation; syntactic features of the text are extracted by dependency syntactic analysis, and features strongly associated with change trigger words are enhanced by an attention mechanism to obtain a local semantic representation; the two parts are concatenated to obtain the fused semantic representation of the text; and, in view of the number of documents in an actual project, a classification model is built with a supervised machine learning algorithm to classify construction engineering change instructions.

The purpose of the invention is realized by at least one of the following technical solutions.

A classification method of a construction engineering change instruction comprises the following steps:

S1, performing document conversion tailored to the characteristics of construction engineering documents, extracting the text content of each document, labeling the category of each document, and establishing a construction engineering corpus;

S2, constructing a building domain dictionary, and preprocessing the texts in the construction engineering corpus by word segmentation, using a word segmentation tool combined with the building domain dictionary, to obtain text word sequences;

S3, establishing a word vector training corpus, and training distributed word vectors with a word vector model;

S4, representing the global word features and local syntactic features of each text based on a distributed representation method, and constructing a fused semantic representation of the text;

S5, constructing a classification model with a supervised machine learning algorithm based on the fused semantic representation obtained in step S4, and predicting the category of the documents to be classified so that instructions related to engineering changes are obtained quickly.

Further, step S1 comprises the steps of:

S1.1, recognizing the characters in each construction engineering document, which is generally an image PDF generated by scanning, using optical character recognition (OCR) technology, and saving them in text format (.txt);

S1.2, tidying and completing the recognized text of each document, removing the fixed content of the table header and footer, and correcting the text content of the table body;

S1.3, manually labeling the category of each text according to whether its content relates to an engineering change instruction;

S1.4, establishing the construction engineering corpus with samples of the form 'category text content'.
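Steps S1.2 and S1.4 can be illustrated with a minimal Python sketch. The header and footer patterns, the tab separator between category and text, and the function names are all assumptions made for illustration; the source only states that fixed header and footer content is removed and that samples take the form 'category text content'.

```python
import re

# Hypothetical patterns for the fixed header/footer lines of a scanned form;
# real documents would need their own patterns.
BOILERPLATE_PATTERNS = [
    re.compile(r"^\s*Work Contact Sheet\s*$"),
    re.compile(r"^\s*Page \d+ of \d+\s*$"),
]


def clean_ocr_text(raw_text):
    """Drop empty lines and fixed header/footer lines, then join the body."""
    kept = []
    for line in raw_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if any(p.match(line) for p in BOILERPLATE_PATTERNS):
            continue
        kept.append(line)
    return " ".join(kept)


def make_sample(label, raw_text):
    """Build one corpus sample in the assumed 'category<TAB>text' form."""
    return f"{label}\t{clean_ocr_text(raw_text)}"
```

The manual labeling of step S1.3 is not automated here; the label is simply passed in by the caller.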

Further, step S2 includes the steps of:

S2.1, segmenting the texts in the construction engineering corpus with the general dictionary of the word segmentation tool to obtain the word sequence of each segmented text;

S2.2, combining adjacent words in the word sequence obtained in step S2.1 following the idea of the forward maximum matching algorithm, with the maximum length set to 4, and adding a combined word to the domain candidate word set if it appears in several texts of the construction engineering corpus;

S2.3, manually checking the domain candidate word set, removing words that do not conform to language logic and combined words that refer to specific people or objects, and generating the building domain dictionary;

S2.4, searching for an online building lexicon, and adding its words to the building domain dictionary generated from the corpus in step S2.3;

S2.5, combining the general dictionary with the building domain dictionary constructed in step S2.4, and segmenting the texts in the construction engineering corpus again with the word segmentation tool to obtain the final word sequence of each text for the preprocessing stage.
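The candidate generation of step S2.2 can be sketched as follows. This is an illustration under stated assumptions, not the patented procedure: the maximum length of 4 is interpreted as the number of adjacent tokens joined, "several texts" is taken to mean at least two, and the function name `candidate_words` is hypothetical. Tokens are joined without a separator, as they would be for Chinese text.

```python
from collections import defaultdict


def candidate_words(segmented_texts, max_len=4, min_texts=2):
    """Join runs of 2..max_len adjacent tokens and keep the combinations
    that occur in at least `min_texts` different texts."""
    seen_in = defaultdict(set)  # combined word -> indices of texts containing it
    for idx, tokens in enumerate(segmented_texts):
        for start in range(len(tokens)):
            for n in range(2, max_len + 1):
                if start + n > len(tokens):
                    break
                seen_in["".join(tokens[start:start + n])].add(idx)
    return {word for word, texts in seen_in.items() if len(texts) >= min_texts}
```

The resulting set still needs the manual check of step S2.3, since frequent combinations are not necessarily valid domain terms.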

Further, step S3 includes the steps of:

S3.1, establishing a word vector training corpus that comprises the construction engineering corpus established in step S1, the Chinese Wikipedia corpus, and the relevant current national standards of the building industry, and preprocessing it by word segmentation based on the building domain dictionary constructed in step S2;

S3.2, training a word vector model on this corpus to obtain distributed word vectors.

Further, step S4 includes the steps of:

S4.1, using the distributed word vectors to represent the global word features of the text to obtain the global semantic representation $C_g$ of the text;

S4.2, using the distributed word vectors to represent the local syntactic features of the text to obtain the local semantic representation $C_l$ of the text;

S4.3, concatenating the global semantic representation of step S4.1 and the local semantic representation of step S4.2 to obtain the fused semantic representation of the text, i.e. the concatenation $[C_g; C_l]$.

Further, the specific steps of step S4.1 are as follows:

S4.1.1, representing the text word sequence with the distributed word vectors of step S3.2 to obtain the text word vector matrix $X = \{x_1, x_2, \ldots, x_m\}$, where $m$ is the number of words in the text word sequence and $x_i$ is the word vector of the $i$-th word, with $i$ ranging from 1 to $m$;

S4.1.2, averaging the word vectors over the word sequence, dimension by dimension, to obtain the global semantic representation of the text: $C_g = \frac{1}{m}\sum_{i=1}^{m} x_i$.
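The averaging of step S4.1.2 can be illustrated with a short Python sketch. Word vectors are represented here as plain lists of floats, and the function name `global_representation` is hypothetical.

```python
def global_representation(word_vectors):
    """Average the word vectors of one text, dimension by dimension (C_g)."""
    if not word_vectors:
        raise ValueError("text has no word vectors")
    m = len(word_vectors)          # number of words in the text
    dim = len(word_vectors[0])     # dimensionality of each word vector
    return [sum(vec[d] for vec in word_vectors) / m for d in range(dim)]
```

The output has the same dimensionality as a single word vector regardless of the text length, which is what allows texts of different lengths to share one classifier input size.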

Further, the specific steps of step S4.2 are as follows:

S4.2.1, identifying the dependency relationships among the words in the word sequence of each text with a dependency analysis tool, forming a dependency structure in which the predicate verb is the root node and the other words are child nodes that depend on the root node directly or indirectly;

S4.2.2, using the dependency structure of the sentence, extracting the root node and the governed collocated words on its subject-verb relationship (SBV) and verb-object relationship (VOB) dependency arcs; then taking each verb in a coordinate relationship (COO) with the root node as a parent node, and extracting that parent node and the governed collocated words on its subject-verb and verb-object dependency arcs; and finally arranging all extracted words in their original order as the local syntactic features of the text;

S4.2.3, representing the syntactic features extracted in step S4.2.2 with the distributed word vectors of step S3.2 to obtain the word vector matrix of the syntactic features of each text, where $n$ is the number of extracted syntactic feature words and the $j$-th element of the matrix is the word vector of the $j$-th syntactic feature word, with $j$ ranging from 1 to $n$;

S4.2.4, verbs occupy the dominant position in a sentence and establish the connections among the other words; semantically, a verb that describes an action changing the state of the word on its verb-object dependency arc relates to a construction engineering change instruction and is called a change trigger word. Attention is constructed based on a change trigger word dictionary: the association degree $\alpha_j$ between each syntactic feature word and the change trigger words is calculated, with $j$ ranging from 1 to $n$, and the local semantic representation of the text is then obtained from the syntactic feature word vectors weighted by $\alpha_j$.
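The extraction rule of step S4.2.2 can be sketched in Python as follows. The sketch assumes the dependency parse is already available as three parallel lists (tokens, head indices, relation labels) and marks the root with a head index of -1; that input layout and the function name are illustrative conventions, not the output format of any particular parser. Only the SBV, VOB, and COO labels come from the text.

```python
def extract_syntactic_features(tokens, heads, relations):
    """Return the syntactic feature words of one sentence in original order.

    tokens[i] depends on tokens[heads[i]] via relations[i];
    heads[i] == -1 marks the root (the predicate verb).
    """
    root = heads.index(-1)
    # The root, plus every verb directly coordinated (COO) with the root,
    # acts as a parent node.
    parents = {root} | {
        i for i, (h, r) in enumerate(zip(heads, relations))
        if h == root and r == "COO"
    }
    keep = set(parents)
    # Keep the words attached to a parent node by an SBV or VOB arc.
    for i, (h, r) in enumerate(zip(heads, relations)):
        if h in parents and r in ("SBV", "VOB"):
            keep.add(i)
    return [tokens[i] for i in sorted(keep)]
```

Sorting the kept indices restores the original word order, matching the requirement that the extracted words be arranged in their original sequence.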

Further, in step S4.2.4, the association degree weight between a syntactic feature word and the change trigger words is calculated as follows:

S4.2.4.1, collecting the verbs that describe engineering changes in the texts of the construction engineering corpus, and establishing a change trigger word dictionary;

S4.2.4.2, based on a Chinese synonym library, finding the 5 words most similar to each verb in the change trigger word dictionary of step S4.2.4.1, and using them to expand the change trigger word dictionary;

S4.2.4.3, using the distributed word vectors of step S3.2, representing each word in the change trigger word dictionary as a word vector that serves as a key vector, and taking each word vector in the syntactic feature word vector matrix of the text as a query vector; calculating the association degree between the syntactic feature word and the trigger words, so that words whose semantics are closer to expressing a state change receive a higher weight. The attention score is calculated as the dot product of the query vector and a key vector, where the key vector is the word vector of the $k$-th word in the change trigger word dictionary.

The maximum of the attention scores between a syntactic feature word and all words in the change trigger word dictionary is taken as that word's attention based on the change trigger word dictionary.

The attention weight of each syntactic feature word based on the change trigger word dictionary is then obtained from these maximum scores.
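The attention computation of steps S4.2.4 and S4.2.4.3, together with the concatenation of step S4.3, can be sketched as follows. The dot-product score and the maximum over trigger words follow the text; the softmax normalisation of the scores into weights and the weighted sum used for the local representation are assumptions, because the corresponding formulas are not reproduced in the text. All function names are hypothetical.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def local_representation(feature_vectors, trigger_vectors):
    """Return (attention weights, C_l) for one text."""
    # Score of each syntactic feature word (query): the maximum dot product
    # with any change trigger word vector (key).
    scores = [max(dot(q, k) for k in trigger_vectors) for q in feature_vectors]
    # ASSUMPTION: scores are normalised with a softmax to give the weights.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # ASSUMPTION: C_l is the attention-weighted sum of the feature vectors.
    dim = len(feature_vectors[0])
    c_l = [sum(w * vec[d] for w, vec in zip(weights, feature_vectors))
           for d in range(dim)]
    return weights, c_l


def fuse(c_g, c_l):
    """Concatenate global and local representations (step S4.3)."""
    return list(c_g) + list(c_l)
```

With word vectors of dimension d, the fused vector has dimension 2d, since the global and local parts each keep the word vector dimensionality.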

Further, step S5 comprises the following steps:

S5.1, dividing the text fusion semantic vectors obtained in step S4.3 into a training set and a test set according to a set ratio;

S5.2, based on supervised machine learning algorithms, inputting the fusion semantic vector and the class label of each text in the training set into different machine learning classifiers for model training, with random search and cross validation used during training to obtain the optimal hyper-parameters of each classification model;

S5.3, testing the classification models obtained in step S5.2 on the text fusion semantic vectors in the test set, evaluating them by three indexes (accuracy, recall, and F1 measure), and selecting the best-performing model as the final classification model for construction engineering change instructions;

S5.4, preprocessing the document to be classified according to steps S1 to S4, inputting it into the classification model obtained in step S5.3 for prediction, and quickly obtaining the construction engineering change instructions.

Further, in step S5.2, the supervised machine learning algorithm includes support vector machine, naive bayes and K nearest neighbor.
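The data split of step S5.1 and the evaluation of step S5.3 can be sketched in plain Python. The classifier training itself (support vector machine, naive Bayes, K nearest neighbor with random search and cross validation) is left out. The 0.8 default ratio, the "OTHER" label used in the usage example, and the function names are assumptions; precision is reported alongside accuracy because the translated term "accuracy" may refer to either.

```python
import random


def split_train_test(samples, train_ratio=0.8, seed=0):
    """Shuffle the samples reproducibly and split them by `train_ratio`."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


def evaluate(y_true, y_pred, positive="ECN"):
    """Accuracy, precision, recall and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": correct / len(pairs) if pairs else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```

Each candidate classifier would be scored with `evaluate` on the same test set, and the one with the best scores kept as the final model.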

Compared with the prior art, the invention has the following advantages and effects:

1. The invention generates words with expressions specific to construction engineering from the construction corpus and, combined with an online building lexicon, constructs a building domain dictionary that contains both the special words of the construction engineering corpus and the terms of the building industry, which solves the timeliness problem of dictionary construction. Segmenting texts with the building domain dictionary effectively improves the accuracy of the word segmentation results.

2. By training word vectors, the invention embeds the word features of a text into a low-dimensional vector space with a distributed representation method, which reduces the dimensionality of the feature space and makes use of the semantics of words. Compared with the classical bag-of-words model, the contextual semantic relatedness of words is taken into account.

3. The dependency structure is obtained by dependency syntactic analysis of the text, so that the words on the dependency arcs are extracted as local syntactic features and fused with the global features of the text, which improves the semantic representation of the text.

Drawings

FIG. 1 is a flow chart of the classification of construction engineering change instructions according to the present invention;

FIG. 2 is a diagram of the dependency syntax structure of a text in an embodiment of the method of the present invention;

FIG. 3 is a schematic diagram of obtaining the text fusion semantic vector in an embodiment of the method of the present invention.

Detailed Description

In order to make the technical solutions and advantages of the present invention more apparent, the following detailed description is made with reference to the accompanying drawings, but the present invention is not limited thereto.

Embodiment:

A method for classifying construction engineering change instructions, as shown in FIG. 1, comprises the following steps:

S1, performing document conversion tailored to the characteristics of construction engineering documents, extracting the text content of each document, labeling the category of each document, and establishing a construction engineering corpus, comprising the following steps:

S1.1, recognizing the characters in each construction engineering document using optical character recognition (OCR) technology and saving them in text format (.txt), the construction engineering document generally being an image PDF generated by scanning;

S1.2, tidying and completing the recognized text of each document, removing the fixed content of the table header and footer, and correcting the text content of the table body;

S1.4, establishing the construction engineering corpus with samples of the form 'category text content'.

In this embodiment, the text content finally obtained after processing one construction document is: "ECN At the coordination meeting on May 8, 2018, the architect decided to modify five second-floor fire escape stairs; see the attachment for details, and our company will subsequently carry out construction according to the document." Here "ECN" is the category label, indicating that the text is an engineering change instruction.

S2, constructing a building domain dictionary, and preprocessing the texts in the construction engineering corpus by word segmentation, using a word segmentation tool combined with the building domain dictionary, to obtain text word sequences, comprising the following steps:

S2.3, manually checking the domain candidate word set, removing words that do not conform to language logic and combined words that refer to specific people or objects, such as combinations of the form 'architect + personal name', and generating the building domain dictionary;

S2.4, searching for an online building lexicon, which in this embodiment is the building dictionary in the Sogou cell lexicon, and adding its words to the building domain dictionary generated from the corpus in step S2.3;

In this embodiment, the text is first segmented without the building domain dictionary, and the obtained result is:

The term for 'architect' used in the Hong Kong and Macao region and the term 'fire escape stair', which refers to an escape route in case of fire, are both domain words that were not segmented correctly. Therefore, following the idea of the forward maximum matching algorithm, adjacent words in the preliminary word sequence are spliced together: the two fragments of the 'architect' term are spliced into the whole word, and the whole word is added to the candidate words of the building domain dictionary if it also appears in other texts.

In this embodiment, the text is then segmented again in combination with the building domain dictionary, and the obtained result is:

S3, establishing a word vector training corpus, and pre-training word vectors with a word vector training model, comprising the following steps:

S3.2, training a word vector model on the corpus, which in this embodiment is the word2vec model, to obtain distributed word vectors.

S4, representing the global word features and local syntactic features of the text based on a distributed representation method, and constructing the fused semantic representation of the text, comprising the following steps:

S4.1, representing the text word sequence with the distributed word vectors of step S3.2 to obtain the text word vector matrix $X = \{x_1, x_2, \ldots, x_m\}$, where $m$ is the number of words in the text word sequence and $x_i$ is the word vector of the $i$-th word, with $i$ ranging from 1 to $m$;

S4.2, averaging the word vectors over the word sequence, dimension by dimension, to obtain the global semantic representation of the text: $C_g = \frac{1}{m}\sum_{i=1}^{m} x_i$;

S4.3, performing syntactic analysis on the text word sequence with a dependency analysis tool, extracting the words on the relevant dependency relations as the syntactic features of the text, and finally representing the local syntactic features with distributed word vectors to obtain the local semantic representation of the text. FIG. 2 shows the dependency structure in this embodiment; it can be seen that the predicate verb is the center of a natural language sentence: it governs the other components of the sentence and is not itself governed by any component. In construction project texts, an engineering change instruction is mainly expressed by a subject-verb-object structure, in which the verb represents the action carried out, the subject represents the party carrying it out, and the object represents the target of the action; these correspond to the subject-verb relationship (SBV) and the verb-object relationship (VOB) among the dependency relations. The syntactic feature extraction of a text therefore comprises the following steps:

S4.3.1, identifying the dependency relationships among the words in the word sequence of each text with a dependency analysis tool, forming a dependency structure in which the predicate verb is the root node and the other words are child nodes that depend on the root node directly or indirectly;

S4.3.2, using the dependency structure of the sentence, extracting the root node and the governed collocated words on its subject-verb relationship (SBV) and verb-object relationship (VOB) dependency arcs; then taking each verb in a coordinate relationship (COO) with the root node as a parent node, and extracting that parent node and the governed collocated words on its subject-verb and verb-object dependency arcs; and finally arranging all extracted words in their original order to obtain the syntactic features of the text;

In this embodiment, FIG. 2 shows the dependency syntax structure, and the final syntactic features of the text are:

{architect, determine, modify, fire escape stair, see, attachment, our company, subsequently, document, construct};

S4.3.3, representing the syntactic features extracted in step S4.3.2 with the distributed word vectors of step S3.2 to obtain the word vector matrix of the syntactic features of each text, where $n$ is the number of extracted syntactic feature words and the $j$-th element of the matrix is the word vector of the $j$-th syntactic feature word, with $j$ ranging from 1 to $n$;

S4.4, collecting the verbs that describe engineering changes in the texts of the construction engineering corpus, and establishing a change trigger word dictionary;

S4.5, in this embodiment, based on the Chinese synonym toolkit Synonyms, finding the 5 words most similar to each verb in the change trigger word dictionary of step S4.4, and using them to expand the change trigger word dictionary;

S4.6, using the distributed word vectors of step S3.2, representing each word in the change trigger word dictionary as a word vector that serves as a key vector, and taking each word vector in the syntactic feature word vector matrix of the text as a query vector; calculating the association degree between the syntactic feature word and the trigger words, so that words whose semantics are closer to expressing a state change receive a higher weight. The attention score is calculated as the dot product of the query vector and a key vector, where the key vector is the word vector of the $k$-th word in the change trigger word dictionary.

The maximum of the attention scores between a syntactic feature word and all words in the change trigger word dictionary is taken as that word's attention based on the change trigger word dictionary.

The attention weight of each syntactic feature word based on the change trigger word dictionary is then obtained from these maximum scores, and the local semantic representation of the text is obtained from the syntactic feature word vectors weighted accordingly;

S4.7, concatenating the global semantic representation of step S4.2 and the local semantic representation of step S4.6 to obtain the fused semantic representation of the text;

S5, constructing a classification model with a supervised machine learning algorithm based on the text fusion semantic vectors obtained in step S4, and predicting the category of the documents to be classified so that instructions related to engineering changes are obtained quickly, comprising the following steps:

S5.1, dividing the text fusion semantic vectors obtained in step S4.7 into a training set and a test set, in this embodiment at a ratio of 8:2;

S5.2, in this embodiment, adopting three algorithms (support vector machine, naive Bayes, and K nearest neighbor), and inputting the fusion semantic vector and the class label of each text in the training set into the different machine learning classifiers for model training, with random search and cross validation used during training to obtain the optimal hyper-parameters of each model;

S5.4, preprocessing the document to be classified according to steps S1 to S4, inputting it into the classification model obtained in step S5.3 for prediction, and quickly obtaining the construction engineering change instructions.

The above-mentioned procedures are preferred embodiments of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention shall be covered by the scope of the present invention. Therefore, the protection scope of the present invention should be subject to the protection scope of the claims.

Claims

1. A classification method for a construction engineering change instruction is characterized by comprising the following steps:

S3, establishing a word vector training corpus, and training distributed word vectors with a word vector model; specifically comprising the following steps:

S3.2, training a word vector model on the corpus to obtain distributed word vectors;

S4, representing the global word features and local syntactic features of the text based on a distributed representation method, and constructing the fused semantic representation of the text; comprising the following steps:

S4.1, using the distributed word vectors to represent the global word features of the text to obtain the global semantic representation $C_g$ of the text; the specific steps of step S4.1 are as follows:

S4.1.2, averaging the word vectors over the word sequence, dimension by dimension, to obtain the global semantic representation of the text: $C_g = \frac{1}{m}\sum_{i=1}^{m} x_i$;

S4.2, using the distributed word vectors to represent the local syntactic features of the text to obtain the local semantic representation $C_l$ of the text; the specific steps of step S4.2 are as follows:

S4.2.2, using the dependency structure of the sentence, extracting the root node and the governed collocated words on its subject-verb and verb-object dependency arcs; then taking each verb in a coordinate relationship with the root node as a parent node, and extracting that parent node and the governed collocated words on its subject-verb and verb-object dependency arcs; and finally arranging all extracted words in their original order as the syntactic features of the text;

Where n is the number of extracted syntactic feature words,

the method comprises the following specific steps:

S4.2.4.1, collecting the verbs that describe engineering changes in the texts of the construction engineering corpus, and establishing a change trigger word dictionary;

S4.2.4.2, based on a Chinese synonym library, finding the 5 words most similar to each verb in the change trigger word dictionary of step S4.2.4.1, and using them to expand the change trigger word dictionary;

S4.2.4.3, using the distributed word vectors of step S3.2, representing each word in the change trigger word dictionary as a word vector that serves as a key vector, and taking each word vector in the syntactic feature word vector matrix of the text as a query vector; calculating the association degree between the syntactic feature word and the trigger words, so that words whose semantics are closer to expressing a state change receive a higher weight; the attention score is calculated as the dot product of the query vector and a key vector, where the key vector is the word vector of the $k$-th word in the change trigger word dictionary;

2. The method for classifying construction engineering change instructions according to claim 1, wherein the step S1 comprises the following steps:

S1.1, recognizing the characters in the construction engineering document using optical character recognition technology and saving them in text format;

3. The method for classifying building engineering change instructions according to claim 2, wherein the step S2 comprises the steps of:

S2.2, combining adjacent words in the word sequence obtained in step S2.1 following the idea of the forward maximum matching algorithm, with the maximum length set to 4, and adding a combined word to the domain candidate word set if it appears in several texts of the construction engineering corpus;

4. The method for classifying construction engineering change instructions according to claim 1, wherein the step S5 comprises the steps of:

S5.2, based on supervised machine learning algorithms, inputting the fusion semantic vector and the class label of each text in the training set into different machine learning classifiers for model training, with random search and cross validation used during training to obtain the optimal hyper-parameters of each classification model;

5. The method for classifying construction engineering change instructions according to any one of claims 1 to 4, wherein in step S5.2, the supervised machine learning algorithm comprises a support vector machine, naive Bayes and K nearest neighbors.