CN112949304A - Construction case knowledge reuse query method and device - Google Patents

Publication number
CN112949304A
Authority
CN
China
Prior art keywords
word
words
text
construction safety
frequency
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110313320.2A
Other languages
Chinese (zh)
Inventor
邓逸川
邓晖
苏成
王煜
宋建炜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sino Singapore International Joint Research Institute
Original Assignee
Sino Singapore International Joint Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sino Singapore International Joint Research Institute
Priority to CN202110313320.2A
Publication of CN112949304A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/10 Office automation; Time management
    • G06Q10/103 Workflow collaboration or project management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Strategic Management (AREA)
  • Human Resources & Organizations (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a construction case knowledge reuse query method and device. The method comprises the following steps: collecting construction safety standard documents and construction safety accident reports, digitizing them, and establishing a case library for the construction safety field; performing word segmentation and stop-word removal on the construction safety standard documents and construction safety accident reports based on natural language processing technology, and extracting feature words with the word frequency-inverse text frequency (TF-IDF) algorithm; performing synonym expansion queries on the feature words using a self-built lexicon of common construction-industry terms and a continuous bag-of-words (CBOW) model; calculating the similarity of similar construction safety cases based on a vector space model and an improved cosine function; and integrating the whole database and query system into a local server or smart device. The method reuses knowledge from past construction safety cases to support decisions on new risks, which can raise the level of construction safety management, save query time, and improve query efficiency.

Description

Construction case knowledge reuse query method and device
Technical Field
The invention relates to the technical field of construction case knowledge management, in particular to a construction case knowledge reuse query method and a construction case knowledge reuse query device based on a natural language processing technology.
Background
Since the reform and opening-up, China has stepped up the development of engineering projects. Engineering construction is a comprehensive production activity spanning many trades; construction periods are long, and the construction process involves many uncertain factors.
In recent years, although the construction safety situation in China has improved, safety accidents of various kinds still occur. Construction safety cannot be ignored, and the level of construction safety management needs further improvement. The civil engineering industry produces a large amount of fragmented information and involves many variable factors, so although construction safety accident reports keep accumulating, traditional construction safety management cannot make full use of them, because it lacks a means of converting this information into reusable knowledge. A construction case knowledge reuse query system that supports decisions on new risks by reusing knowledge from past construction cases could greatly improve the level of construction safety management.
At present, construction cases are found mainly in accident safety reports and news reports on the websites of construction authorities, and searching unstructured text for similar construction safety cases is inefficient. A large gap therefore remains in the reuse of construction safety cases.
Disclosure of Invention
The invention aims to overcome the above defects in the prior art by providing a construction case knowledge reuse query method and device.
To achieve this purpose, the invention provides a construction case knowledge reuse query method comprising the following steps:
step S1, collecting construction safety standard documents and construction safety accident reports, digitizing them, and establishing a case library for the construction safety field;
step S2, performing word segmentation and stop-word removal on the construction safety standard documents and construction safety accident reports based on natural language processing technology, and extracting feature words with the word frequency-inverse text frequency (TF-IDF) algorithm;
step S3, performing synonym expansion queries on the feature words using a self-built lexicon of common construction-industry terms and a continuous bag-of-words (CBOW) model;
step S4, calculating the similarity of similar construction safety cases based on a vector space model and an improved cosine function;
step S5, integrating the whole database and query system into a local server or smart device.
Preferably, step S2 includes the following steps:
step S21, performing word segmentation on the construction safety accident cases with jieba, which organizes its dictionary in a prefix tree to improve retrieval efficiency;
step S22, removing meaningless function words from the construction safety accident case text using a self-built stop-word library; stop words are extremely common words that contribute little to text similarity calculation, and deleting them greatly reduces the size of the library and improves retrieval efficiency;
step S23, selecting, after comparing candidate algorithms, the word frequency-inverse text frequency (TF-IDF) algorithm to calculate the weights of words and extract the feature words of the construction safety accident cases.
Preferably, step S23 includes the following steps:
step S231, on the basis of the word frequency, calculating a weight that reflects the importance of each word; this weight is called the "inverse text frequency", and its size is inversely related to how common the word is;
step S232, assigning different weights to different words, with larger weights for less common words, smaller weights for more common words, and the smallest weights for the most common words, and multiplying the word frequency by the inverse text frequency to obtain the TF-IDF value of each word;
step S233, since a word that is more important to a text has a larger TF-IDF value, extracting the feature words of the text by sorting the TF-IDF values in descending order.
Preferably, the word frequency, the inverse text frequency, and the word frequency-inverse text frequency are calculated as follows:
word frequency (TF): the number of times a feature term appears in a text, i.e. if the term t_{i,k} appears n_{i,k} times in the text d_i, then
$$TF_{i,k} = n_{i,k}$$
In practical applications, to avoid statistical bias caused by overly long texts, the count is generally normalized by \sum_m n_{m,k}, the total number of words in the text:
$$TF_{i,k} = \frac{n_{i,k}}{\sum_m n_{m,k}}$$
inverse text frequency (IDF): measures how widely a feature term appears in the total text set D, i.e. if the total text set contains M texts and the feature term t_{i,k} appears in m_{i,k} of them, then
$$IDF_{i,k} = \log\frac{M}{m_{i,k} + \alpha}$$
where \alpha is an empirical constant, generally 0.01; the more common the word, the larger the denominator and the smaller the inverse text frequency; \alpha is added to the denominator to prevent it from being 0 when no text contains the word;
word frequency-inverse text frequency (TF-IDF): the TF-IDF value is the word frequency multiplied by the inverse text frequency
$$w_{i,k} = TF_{i,k} \cdot IDF_{i,k}$$
The TF-IDF value of a word decreases as the word becomes more common across the whole text library and increases with its frequency in a specific text, so the TF-IDF value of each word is calculated and the feature words are extracted in descending order of that value.
Preferably, step S3 includes the following steps:
step S31, providing the training texts, namely the construction safety accident case library and Chinese Wikipedia; using one-hot codes as the input of the CBOW model; setting the word-vector dimension to 100, the window size to 5, the minimum occurrence count to 5, and the number of training threads to 9; embedding the words with the CBOW model, which accumulates the input word vectors; and finally completing the vectorized representation of the words through a binary classifier;
step S32, reading the feature words extracted in step S2, looking up their word vectors in the trained model, calculating the 5 words most similar to each feature word by cosine distance, and using them for synonym expansion.
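The hyperparameters listed in step S31 can be written down as a small configuration sketch. It assumes the word vectors are trained with gensim's `Word2Vec` class and uses gensim 4.x keyword names; the description does not state which library performs the training, and the names `CBOW_PARAMS` and `train_cbow` are hypothetical.

```python
# Hyperparameters from step S31, expressed with gensim 4.x keyword names.
# Assumption: the description does not name the training library.
CBOW_PARAMS = {
    "vector_size": 100,  # word-vector dimension
    "window": 5,         # context window size
    "min_count": 5,      # minimum occurrence count for a word to be kept
    "workers": 9,        # number of training threads
    "sg": 0,             # 0 selects CBOW; 1 would select skip-gram
}


def train_cbow(sentences):
    """Train a CBOW model on an iterable of token lists (requires gensim)."""
    from gensim.models import Word2Vec  # third-party; imported only when called
    return Word2Vec(sentences=sentences, **CBOW_PARAMS)
```

Because `min_count` is 5, the training corpus has to be large enough for words to pass the frequency threshold; a toy corpus of a few sentences would leave the vocabulary empty.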
Preferably, the CBOW model is a three-layer neural network model;
the first layer of the CBOW model is the input layer, which receives the word vectors of the known context words;
the middle layer of the CBOW model is a linear hidden layer, which accumulates all the input word vectors;
the third layer of the CBOW model is a binary-classifier softmax layer, and the near-synonym expansion of each word is obtained through training.
Preferably, step S4 includes: after the feature words and their synonyms are obtained, calculating the similarity between construction safety cases with a vector space model (VSM) and an improved cosine function; the cosine coefficient gives accurate results and is the most commonly used similarity measure in VSM; the similarity between the input case and each text is calculated with the similarity model of gensim, a third-party Python tool, the texts are sorted by similarity in descending order, and the top 10 texts are returned as the output;
Figure BDA0002990822360000051
where Sim(t_1, t_0) is the similarity for the original query and Sim(t_1, t_k) is the similarity for the expanded query, so \lambda takes a value in 0 < \lambda < 1; after repeated verification, \lambda is set to 0.7.
The invention also provides a construction case knowledge reuse query device, comprising:
a construction safety case acquisition and processing module, used for collecting construction safety standard documents and construction safety accident reports, digitizing them, and establishing a case library for the construction safety field, and for performing word segmentation and stop-word removal on the construction safety standard documents and construction safety accident reports based on natural language processing technology and extracting feature words with the word frequency-inverse text frequency (TF-IDF) algorithm;
a synonym expansion query module, connected with the construction safety case acquisition and processing module and used for performing synonym expansion queries on the feature words using a self-built lexicon of common construction-industry terms and a continuous bag-of-words (CBOW) model;
a similar case retrieval module, connected with the synonym expansion query module and used for calculating the similarity of similar construction safety cases based on a vector space model and an improved cosine function;
wherein the construction safety case acquisition and processing module comprises a crawler algorithm, word segmentation, and stop-word removal; the synonym expansion query module comprises text vectorization and the continuous bag-of-words model; and the similar case retrieval module comprises text similarity calculation based on the vector space model.
Preferably, the device further comprises a local server or smart device, in which the whole database and query system are stored.
Compared with the prior art, the invention has the beneficial effects that:
1. The query method and device make past construction safety accident cases available for query at any time. Built on natural language processing technology, they reuse knowledge from past construction safety cases to support decisions on new risks, which can improve the safety management ability of site managers and construction workers, reduce the accident rate, and help raise the level of construction safety management across the construction industry.
2. Queries can be made from a mobile phone or tablet: the user enters a description of an accident problem or a daily report, and the invention directly outputs similar construction safety accident cases, avoiding the inefficiency and complexity of web search and improving the efficiency of reusing construction safety accident knowledge.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is a schematic diagram of the steps of the construction case knowledge reuse query method provided by the invention;
FIG. 2 is a schematic diagram of an example analysis of the construction case knowledge reuse query method provided by the invention.
Detailed Description
The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by a person skilled in the art on the basis of these embodiments without creative effort fall within the protection scope of the invention.
Example one
Referring to FIG. 1 and FIG. 2, an embodiment of the invention provides a construction case knowledge reuse query method, including the following steps:
Step S1, collecting construction safety accident reports and construction safety failure cases: risk cases are gathered by web search, from the construction safety accident bulletins on the websites of administrative departments, and from the literature; the data are digitized, and a construction safety accident case database for the construction safety field is established.
Step S2, performing word segmentation and stop-word removal on the construction safety standard documents and construction safety accident reports based on natural language processing technology, and extracting feature words with the word frequency-inverse text frequency (TF-IDF) algorithm.
Specifically, step S2 includes the following steps:
Step S21, performing word segmentation on the construction safety accident cases with jieba, which organizes its dictionary in a prefix tree (also called a dictionary tree, or trie) to improve search efficiency.
Suppose a computer searches for the word "building equipment". A naive search scans every Chinese character string in the text, which is inefficient. A prefix tree is instead searched from top to bottom, one Chinese character per level; if no child of the current node matches the next character, the search stops immediately, which greatly improves efficiency. In addition, the prefix tree can be combined with a directed acyclic graph to resolve ambiguous segmentations efficiently.
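The early-stopping behaviour of a prefix tree can be illustrated with a minimal sketch. It is not jieba's internal implementation; the class names `TrieNode` and `Trie` and the sample dictionary entries are assumptions chosen for illustration.

```python
class TrieNode:
    """One node of the prefix tree; children are keyed by a single character."""

    def __init__(self):
        self.children = {}
        self.is_word = False


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        """Add a word, creating one node per character as needed."""
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def contains(self, word):
        """Walk down one character per level and stop at the first miss."""
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return False  # no child matches, so the search ends early
        return node.is_word


# Hypothetical dictionary entries (illustrative only).
trie = Trie()
for entry in ["建筑", "建筑设备", "建设"]:
    trie.insert(entry)
```

A lookup of "建筑设备" ("building equipment") visits four nodes, while a lookup of a word whose first character is absent from the dictionary returns after a single step.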
Step S22, removing meaningless function words from the construction safety accident case text using a self-built stop-word library. Stop words are extremely common words that contribute little to text similarity calculation, and deleting them greatly reduces the size of the library and improves retrieval efficiency.
Because current NLP technology still has limitations, word segmentation produces some meaningless tokens, such as underscores. Deleting the most frequent of these meaningless tokens effectively reduces the amount of data. Stop-word removal can be implemented by importing a stop-word list and then removing every word that appears in it.
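Stop-word removal as described above can be sketched in a few lines. The stop-word set and the pre-segmented sentence below are hypothetical; in practice the tokens would come from the word segmentation step and the list from the self-built stop-word library.

```python
def remove_stop_words(tokens, stop_words):
    """Return the tokens that are not stop words, preserving their order."""
    return [t for t in tokens if t.strip() and t not in stop_words]


# Hypothetical stop-word list and pre-segmented sentence (illustrative only).
STOP_WORDS = {"的", "是", "了", "在", "_", "，", "。"}
tokens = ["工人", "在", "脚手架", "上", "作业", "时", "坠落", "，", "_"]
filtered = remove_stop_words(tokens, STOP_WORDS)
```

Storing the stop words in a set keeps each membership test constant-time, which matters when the case library is large.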
Step S23, selecting, after comparing candidate algorithms, the TF-IDF algorithm to calculate word weights and extract the feature words of the construction safety accident cases.
For example, suppose the three words "unit", "fall", and "collapse" occur the same number of times (the same word frequency) in an accident report. Their importance nevertheless differs: "fall" and "collapse" represent the text better than "unit", so they should be ranked ahead of "unit" when the keywords are sorted.
One way to solve this problem is TF-IDF (word frequency-inverse text frequency): on the basis of the word frequency, a weight is calculated that reflects the importance of each word. This weight is called the "inverse text frequency", and its size is inversely related to how common the word is. Less common words (e.g. "fall", "collapse") receive larger weights, more common words (e.g. "unit") receive smaller weights, and the most common words (e.g. "is") receive the smallest weights. Multiplying the word frequency (TF) by the inverse text frequency (IDF) gives the TF-IDF value of a word. The more important a word is to a text, the larger its TF-IDF value, so the feature words of the text can be extracted by sorting the TF-IDF values in descending order.
The word frequency, the inverse text frequency, and the word frequency-inverse text frequency are calculated as follows:
word frequency (TF): the number of times a feature term appears in a text, i.e. if the term t_{i,k} appears n_{i,k} times in the text d_i, then
$$TF_{i,k} = n_{i,k}$$
In practical applications, to avoid statistical bias caused by overly long texts, the count is generally normalized by \sum_m n_{m,k}, the total number of words in the text:
$$TF_{i,k} = \frac{n_{i,k}}{\sum_m n_{m,k}}$$
inverse text frequency (IDF): measures how widely a feature term appears in the total text set D, i.e. if the total text set contains M texts and the feature term t_{i,k} appears in m_{i,k} of them, then
$$IDF_{i,k} = \log\frac{M}{m_{i,k} + \alpha}$$
where \alpha is an empirical constant, generally 0.01; the more common the word, the larger the denominator and the smaller the inverse text frequency; \alpha is added to the denominator to prevent it from being 0 when no text contains the word;
word frequency-inverse text frequency (TF-IDF): the TF-IDF value is the word frequency multiplied by the inverse text frequency
$$w_{i,k} = TF_{i,k} \cdot IDF_{i,k}$$
The TF-IDF value of a word decreases as the word becomes more common across the whole text library and increases with its frequency in a specific text, so the TF-IDF value of each word is calculated and the feature words are extracted in descending order of that value.
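The calculation can be sketched in plain Python as follows. The sketch assumes that the inverse text frequency takes the form log(M / (m + α)) with a natural logarithm, which matches the textual description (α = 0.01 added to the denominator) but is not confirmed by a surviving formula; the function names and the three-text corpus are hypothetical.

```python
import math
from collections import Counter

ALPHA = 0.01  # empirical constant given in the description


def tf_idf(corpus):
    """corpus: list of token lists. Returns one {word: weight} dict per text."""
    total_texts = len(corpus)
    # doc_freq[word] = number of texts that contain the word (m in the text)
    doc_freq = Counter()
    for tokens in corpus:
        doc_freq.update(set(tokens))
    weights = []
    for tokens in corpus:
        counts = Counter(tokens)
        length = len(tokens)
        text_weights = {}
        for word, n in counts.items():
            tf = n / length  # normalized word frequency
            # ASSUMED form of the inverse text frequency: log(M / (m + alpha))
            idf = math.log(total_texts / (doc_freq[word] + ALPHA))
            text_weights[word] = tf * idf
        weights.append(text_weights)
    return weights


def top_features(text_weights, k=3):
    """Return the k words with the largest TF-IDF values, largest first."""
    return sorted(text_weights, key=text_weights.get, reverse=True)[:k]


# Hypothetical pre-segmented corpus: "单位" (unit) occurs in every text.
corpus = [
    ["单位", "坠落", "坠落", "脚手架"],
    ["单位", "坍塌", "基坑"],
    ["单位", "培训"],
]
weights = tf_idf(corpus)
features = top_features(weights[0], k=2)
```

In this toy corpus the word "单位" appears in every text, so its weight is close to zero and it ranks below "坠落" (fall) and "脚手架" (scaffold), which mirrors the "unit" versus "fall" example in the description.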
Step S3, performing synonym expansion queries on the feature words using a self-built lexicon of common construction-industry terms and a continuous bag-of-words (CBOW) model.
The CBOW model is a three-layer neural network model:
the first layer of the CBOW model is the input layer, which receives the word vectors of the known context words;
the middle layer of the CBOW model is a linear hidden layer, which accumulates all the input word vectors;
the third layer of the CBOW model is a binary-classifier softmax layer, and the near-synonym expansion of each word is obtained through training. For example, "fall" and "drop" are near-synonyms.
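The three layers can be illustrated with a toy forward pass. This sketch uses a full softmax over a three-word vocabulary for readability, whereas practical CBOW training usually replaces it with a cheaper approximation built from binary decisions; the function name `cbow_forward` and all weight values are hypothetical.

```python
import math


def cbow_forward(context_ids, w_in, w_out):
    """Forward pass of a toy CBOW model.

    context_ids: vocabulary indices of the context words (one-hot positions)
    w_in:  V x D input embedding matrix, one row per vocabulary word
    w_out: D x V output matrix
    Returns a probability distribution over the V vocabulary words.
    """
    dim = len(w_in[0])
    # Hidden layer: accumulate (sum) the context word vectors.
    hidden = [sum(w_in[i][d] for i in context_ids) for d in range(dim)]
    # Output layer: one score per vocabulary word, followed by a softmax.
    vocab = len(w_out[0])
    scores = [sum(hidden[d] * w_out[d][v] for d in range(dim)) for v in range(vocab)]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Toy weights (hypothetical): 3-word vocabulary, 2-dimensional vectors.
w_in = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_out = [[0.5, -0.5, 0.0], [0.0, 0.5, 1.0]]
probs = cbow_forward([0, 1], w_in, w_out)
```

After training, the rows of the input matrix serve as the word vectors used for synonym expansion.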
Specifically, step S3 includes the following steps:
Step S31, providing the training texts, namely the construction safety accident case library and Chinese Wikipedia; using one-hot codes as the input of the CBOW model; setting the word-vector dimension to 100, the window size to 5, the minimum occurrence count to 5, and the number of training threads to 9; embedding the words with the CBOW model, which accumulates the input word vectors; and finally completing the vectorized representation of the words through a binary classifier.
Step S32, reading the feature words extracted in step S2, looking up their word vectors in the trained model, calculating the 5 words most similar to each feature word by cosine distance, and using them for synonym expansion. For example, "fall" and "drop" are near-synonyms.
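The nearest-neighbour lookup in step S32 can be sketched without any library. The two-dimensional vectors below are hypothetical stand-ins for the trained 100-dimensional word vectors, and the function name `most_similar` is chosen for illustration only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(word, vectors, topn=5):
    """Rank all other words by cosine similarity to `word`, best first."""
    target = vectors[word]
    scored = [(w, cosine(target, v)) for w, v in vectors.items() if w != word]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [w for w, _ in scored[:topn]]


# Hypothetical 2-dimensional word vectors (illustrative only).
vectors = {
    "坠落": [1.0, 0.1],
    "跌落": [0.9, 0.2],
    "掉落": [0.8, 0.1],
    "培训": [0.0, 1.0],
}
synonyms = most_similar("坠落", vectors, topn=2)
```

With `topn=5`, as in step S32, each feature word contributes up to five expansion terms to the query.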
Step S4, calculating the similarity of similar construction safety cases based on a vector space model and an improved cosine function.
Specifically, after the feature words and their synonyms are obtained, the similarity between construction safety cases is calculated with a vector space model (VSM) and an improved cosine function. The cosine coefficient gives accurate results and is the most commonly used similarity measure in VSM, so it is used here to calculate the similarity. The similarity between the input case and each text is calculated with the similarity model of gensim, a third-party Python tool; the texts are sorted by similarity in descending order, and the top 10 texts are returned as the output;
Figure BDA0002990822360000101
where Sim(t_1, t_0) is the similarity for the original query and Sim(t_1, t_k) is the similarity for the expanded query, so \lambda takes a value in 0 < \lambda < 1; after repeated verification, \lambda is set to 0.7.
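The ranking step can be sketched as follows. The exact improved cosine formula is not recoverable from the text here, so the combination below, λ times the original-query similarity plus (1 − λ) times the mean expanded-query similarity, is only an assumed form consistent with the description of λ; the function names and the toy cases are hypothetical, and plain term counts stand in for the gensim similarity model.

```python
import math
from collections import Counter

LAMBDA = 0.7  # value reported in the description


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(x * x for x in a.values())) * math.sqrt(sum(x * x for x in b.values()))
    return dot / norm if norm else 0.0


def combined_similarity(case_tokens, query_tokens, expanded_queries, lam=LAMBDA):
    """ASSUMED form: lam * Sim(original) + (1 - lam) * mean Sim(expanded)."""
    case = Counter(case_tokens)
    original = cosine(case, Counter(query_tokens))
    if not expanded_queries:
        return original
    expanded = sum(cosine(case, Counter(q)) for q in expanded_queries) / len(expanded_queries)
    return lam * original + (1 - lam) * expanded


def rank_cases(cases, query_tokens, expanded_queries, topn=10):
    """Return the indices of the topn most similar cases, best first."""
    scores = [combined_similarity(c, query_tokens, expanded_queries) for c in cases]
    return sorted(range(len(cases)), key=lambda i: scores[i], reverse=True)[:topn]


# Hypothetical pre-segmented cases and queries (illustrative only).
cases = [["脚手架", "坠落"], ["基坑", "坍塌"], ["脚手架", "跌落"]]
query = ["脚手架", "坠落"]
expanded = [["脚手架", "跌落"]]
ranking = rank_cases(cases, query, expanded)
```

Giving the original query the larger weight keeps exact matches ahead of cases that match only through an expanded synonym, while still letting the latter outrank unrelated cases.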
Step S5, integrating the whole database and query system into a local server or smart device.
For example, queries can be made on the construction site from a mobile phone or tablet. Because the database and query system are stored locally, queries work with or without a network connection, so the required safety knowledge can be retrieved in real time even on projects in remote mountainous areas.
More specifically, the user enters a description of an accident problem or a daily report on a mobile phone or tablet, and similar construction safety accident cases are output directly, which avoids the inefficiency and complexity of web search and improves the efficiency of reusing construction safety accident knowledge.
Example two
The second embodiment of the invention provides a construction case knowledge reuse query device, which comprises:
a construction safety case acquisition and processing module, used for collecting construction safety standard documents and construction safety accident reports, digitizing them, and establishing a case library for the construction safety field, and for performing word segmentation and stop-word removal on the construction safety standard documents and construction safety accident reports based on natural language processing technology and extracting feature words with the word frequency-inverse text frequency (TF-IDF) algorithm;
a synonym expansion query module, connected with the construction safety case acquisition and processing module and used for performing synonym expansion queries on the feature words using a self-built lexicon of common construction-industry terms and a continuous bag-of-words (CBOW) model;
a similar case retrieval module, connected with the synonym expansion query module and used for calculating the similarity of similar construction safety cases based on a vector space model and an improved cosine function;
wherein the construction safety case acquisition and processing module comprises a crawler algorithm, word segmentation, and stop-word removal; the synonym expansion query module comprises text vectorization and the continuous bag-of-words model; and the similar case retrieval module comprises text similarity calculation based on the vector space model.
The device further comprises a local server or smart device, in which the whole database and query system are stored.
For example, queries can be made on the construction site from a mobile phone or tablet. Because the database and query system are stored locally, queries work with or without a network connection, so the required safety knowledge can be retrieved in real time even on projects in remote mountainous areas.
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.

Claims (9)

1. A construction case knowledge reuse query method is characterized by comprising the following steps:
step S1, collecting construction safety standard documents and construction safety accident reports, electronizing the documents, and establishing a case library in the construction safety field;
step S2, performing text word segmentation and word removal processing on the construction safety standard document and the construction safety accident report based on a natural language processing technology, and calculating characteristic words through a word frequency inverse text algorithm;
step S3, performing synonym expansion query of the feature words through a self-built common term lexicon and a continuous bag-of-words model in the construction industry;
step S4, similarity calculation of similar construction safety cases is carried out based on a vector space model and a cosine function improvement method;
step S5, the whole database and query system is integrated into a local server or an intelligent device.
2. The construction case knowledge reuse query method according to claim 1, wherein the step S2 includes the following steps:
step S21, performing word segmentation on the construction safety accident case with jieba, wherein jieba uses a prefix tree to organize words so as to improve retrieval efficiency;
step S22, removing the function words present in the construction safety accident case text by means of a self-built stop-word library, wherein stop words are extremely common words that contribute little to calculating text similarity, so that deleting these meaningless words greatly reduces the size of the library and improves retrieval efficiency;
and step S23, selecting, through comparison of algorithms, the word frequency-inverse text frequency (TF-IDF) algorithm to extract the feature words, calculating the weights of the feature words, and extracting the feature words of the construction safety accident case.
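As an illustration of step S22, the following is a minimal sketch of stop-word filtering applied to an already segmented text. The stop-word set and the token list are hypothetical (the contents of the self-built stop-word library are not given), and an actual segmenter such as jieba may split the sentence differently.

```python
def remove_stop_words(tokens, stop_words):
    """Drop stop words and whitespace-only tokens, keeping the original order."""
    return [t for t in tokens if t.strip() and t not in stop_words]


# Hypothetical stop-word library; the real self-built library is not disclosed.
STOP_WORDS = {"的", "了", "在", "和", "，", "。"}

# Illustrative tokens, of the kind a segmenter might return for
# "工人在脚手架上作业时坠落" (a worker fell while working on a scaffold).
tokens = ["工人", "在", "脚手架", "上", "作业", "时", "坠落"]

filtered = remove_stop_words(tokens, STOP_WORDS)
print(filtered)
```

Using a set for the stop-word library keeps each membership check constant-time, which matters when the case library is large.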
3. The construction case knowledge reuse query method according to claim 2, wherein the step S23 includes the following steps:
step S231, on the basis of the word frequency, calculating a weight that reflects the importance of each word, wherein the weight is called the 'inverse text frequency' and is inversely related to how common the word is;
step S232, assigning different weights to different words, with larger weights for less common words, smaller weights for more common words, and the smallest weights for the most common words, and multiplying the word frequency by the inverse text frequency to obtain the TF-IDF value of each word;
and step S233, since the more important a word is to the text, the larger its TF-IDF value, completing the feature extraction of the text by sorting the words in descending order of TF-IDF value.
4. The construction case knowledge reuse query method according to claim 3, wherein the word frequency, the inverse text frequency, and the word frequency-inverse text frequency are calculated as follows:
word frequency TF: the number of times a feature item appears in the text, i.e., if $t_{i,k}$ appears $n_{i,k}$ times in the text $d_i$, then
$$TF_{i,k} = n_{i,k}$$
In practical applications, to avoid statistical bias caused by overly long texts, normalization is generally required, where $\sum_m n_{m,k}$ is the total number of words in the text:
$$TF_{i,k} = \frac{n_{i,k}}{\sum_m n_{m,k}}$$
inverse text frequency IDF: reflects how frequently the feature item appears in the total text set $D$, i.e., if the total text set contains $M$ texts and the feature item $t_{i,k}$ appears in $m_{i,k}$ of them, then
$$IDF_{i,k} = \log\frac{M}{m_{i,k} + \alpha}$$
where $\alpha$ is an empirical constant, generally 0.01; the more common a word is, the larger the denominator and the smaller the inverse text frequency; $\alpha$ is added to the denominator to keep it from being 0 when no text contains the word;
word frequency-inverse text frequency TF-IDF: the TF-IDF value is calculated by multiplying the word frequency by the inverse text frequency
$$w_{i,k} = TF_{i,k} \cdot IDF_{i,k}$$
The word frequency-inverse text frequency of a word decreases with the number of texts in the whole text library that contain the word and increases with the number of times the word occurs in a specific text; the word frequency-inverse text frequency of each word is therefore calculated, and the feature items are extracted by sorting in descending order.
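The calculation in claim 4 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the IDF is taken in the usual logarithmic form with α added to the denominator, the function names `tf_idf` and `top_features` are hypothetical, and the three toy documents are invented for the example.

```python
import math
from collections import Counter

ALPHA = 0.01  # empirical smoothing constant named in claim 4


def tf_idf(docs, alpha=ALPHA):
    """Return one {word: weight} dict per tokenized document.

    TF is the word count normalized by document length; IDF is assumed to be
    log(M / (m + alpha)), where m is the number of documents containing the word.
    """
    total_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each word once per document

    weights = []
    for doc in docs:
        counts = Counter(doc)
        length = len(doc)
        weights.append({
            word: (n / length) * math.log(total_docs / (doc_freq[word] + alpha))
            for word, n in counts.items()
        })
    return weights


def top_features(weight, k):
    """Return the k words with the largest TF-IDF weight, in descending order."""
    return sorted(weight, key=weight.get, reverse=True)[:k]


# Toy, already segmented documents (hypothetical data).
docs = [
    ["scaffold", "collapse", "worker", "worker"],
    ["crane", "collapse", "worker"],
    ["fire", "worker", "welding"],
]
weights = tf_idf(docs)
print(top_features(weights[0], 2))
```

In the first document, "worker" is the most frequent word but appears in every document, so its weight falls below that of "scaffold", which occurs in only one document.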
5. The construction case knowledge reuse query method according to claim 1, wherein the step S3 includes the following steps:
step S31, taking the construction safety accident case library and Chinese Wikipedia as the training text, using one-hot codes as the input of the CBOW model, setting the word-vector dimension to 100, the window to 5, the minimum occurrence count to 5, and the number of threads used for training the word vectors to 9, embedding the words through the CBOW model, summing the input word vectors, and finally completing the vectorized representation of the words through a binary classifier;
and step S32, reading the feature words extracted in step S2, obtaining the word vectors of the feature words from the trained word vectors, finding the 5 words most similar to each feature word by cosine distance, and performing synonym expansion with them.
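Step S32 amounts to a nearest-neighbour lookup over trained word vectors. The sketch below uses invented 3-dimensional vectors in place of the 100-dimensional trained ones, and the name `expand` is hypothetical. If gensim 4.x were used for training, the settings in step S31 would roughly correspond to `Word2Vec(sentences, vector_size=100, window=5, min_count=5, workers=9, sg=0)`, although the claim does not state which library performs the training.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length dense vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand(word, vectors, top_n=5):
    """Return the top_n vocabulary words closest to `word` by cosine similarity."""
    target = vectors[word]
    scored = [
        (other, cosine(target, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [other for other, _ in scored[:top_n]]


# Toy vectors (hypothetical); real vectors would come from the trained CBOW model.
VECTORS = {
    "scaffold": [0.9, 0.1, 0.0],
    "formwork": [0.8, 0.2, 0.1],
    "crane": [0.1, 0.9, 0.0],
    "hoist": [0.2, 0.8, 0.1],
    "fire": [0.0, 0.1, 0.9],
}

neighbours = expand("scaffold", VECTORS, top_n=2)
print(neighbours)
```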
6. The construction case knowledge reuse query method according to claim 5, wherein the CBOW model is a three-layer neural network model;
the first layer of the CBOW model is an input layer, which takes the word vectors of the known context words as input;
the middle layer of the CBOW model is a linear hidden layer, which sums all the input word vectors;
the third layer of the CBOW model is a softmax binary classifier, and the near-synonym expansion of the corresponding word is obtained through training.
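The three layers in claim 6 can be illustrated with a forward pass written in plain Python. This is a sketch under assumptions, not the claimed implementation: it uses a full softmax over a five-word toy vocabulary as a simple stand-in for the output classifier, the weights are random rather than trained, and all names (`cbow_forward`, `VOCAB`, `DIM`) are hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cbow_forward(context_ids, embed, out):
    """Predict a distribution over the vocabulary from the context word ids.

    Input layer: looking up row i of `embed` equals multiplying a one-hot
    vector for word i by the embedding matrix.
    Hidden layer: the context word vectors are summed (no non-linearity).
    Output layer: one score per vocabulary word, normalized by softmax.
    """
    dim = len(embed[0])
    hidden = [sum(embed[i][d] for i in context_ids) for d in range(dim)]
    scores = [sum(h * w for h, w in zip(hidden, row)) for row in out]
    return softmax(scores)


random.seed(0)
VOCAB = ["worker", "fell", "from", "scaffold", "platform"]
DIM = 4  # toy dimension; claim 5 uses 100

# Untrained random weights, for shape illustration only.
embed = [[random.uniform(-0.5, 0.5) for _ in range(DIM)] for _ in VOCAB]
out = [[random.uniform(-0.5, 0.5) for _ in range(DIM)] for _ in VOCAB]

# Context words around the centre word "from".
probs = cbow_forward([0, 1, 3, 4], embed, out)
print(len(probs), round(sum(probs), 6))
```

Training would adjust `embed` and `out` so that the probability of the true centre word increases; the rows of `embed` are then used as the word vectors.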
7. The construction case knowledge reuse query method according to claim 1, wherein the step S4 includes: after the feature words and their synonyms are obtained, calculating the similarity between building construction safety cases using a vector space model (VSM) with an improved cosine function, the cosine coefficient algorithm being accurate and the most commonly used calculation method in VSM; calculating the similarity between the input case and each text using the similarity model of gensim, a third-party Python tool; sorting the texts in descending order of similarity value; and finally outputting the top 10 texts as the result;
Figure FDA0002990822350000041
where Sim(t_1, t_0) corresponds to the original query and Sim(t_1, t_k) corresponds to the expanded query; λ takes a value in the range 0 < λ < 1 and, after multiple verifications, is set to 0.7.
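The ranking in claim 7 might be sketched as follows. Because the formula image is not available in the extracted text, the way λ combines the two similarities is an assumption: the sketch takes a convex combination, λ times the similarity to the original query plus (1 − λ) times the similarity to the expanded query. The sparse weight dictionaries, case names, and function names are all hypothetical.

```python
import math

LAMBDA = 0.7  # value reported in claim 7


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as {term: weight} dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(
        sum(w * w for w in v.values())
    )
    return dot / norm if norm else 0.0


def blended_similarity(query, expanded, case, lam=LAMBDA):
    """ASSUMED combination: lam * Sim(original) + (1 - lam) * Sim(expanded)."""
    return lam * cosine(query, case) + (1 - lam) * cosine(expanded, case)


def rank_cases(query, expanded, cases, top_n=10):
    """Return the top_n (case name, score) pairs in descending order of score."""
    scored = [
        (name, blended_similarity(query, expanded, vec))
        for name, vec in cases.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Toy case library with hypothetical term weights.
cases = {
    "case_a": {"scaffold": 1.0, "fall": 1.0},
    "case_b": {"formwork": 1.0, "collapse": 1.0},
    "case_c": {"fire": 1.0, "welding": 1.0},
}
query = {"scaffold": 1.0}
expanded = {"scaffold": 1.0, "formwork": 1.0}  # query plus one synonym

ranked = rank_cases(query, expanded, cases, top_n=2)
print([name for name, _ in ranked])
```

In this toy example "case_b" shares no term with the original query, but the expanded query still gives it a non-zero score, which is the effect synonym expansion is meant to have.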
8. A construction case knowledge reuse inquiry device is characterized by comprising:
the construction safety case acquisition and processing module is used for collecting construction safety standard documents and construction safety accident reports, electronizing the documents, and establishing a case library in the construction safety field; and for performing text word segmentation and stop-word removal on the construction safety standard documents and the construction safety accident reports based on natural language processing technology, and extracting feature words with the word frequency-inverse text frequency (TF-IDF) algorithm;
the synonym expansion query module is connected with the construction safety case acquisition and processing module and is used for performing synonym expansion query of the feature words through a self-built lexicon of common construction-industry terms and a continuous bag-of-words (CBOW) model;
the similar case retrieval module is connected with the synonym expansion query module and is used for calculating the similarity of similar construction safety cases based on a vector space model and an improved cosine function;
the construction safety case acquisition and processing module comprises a crawler algorithm, word segmentation, and stop-word removal; the synonym expansion query module comprises text vectorization and a continuous bag-of-words model; the similar case retrieval module comprises text similarity calculation based on a vector space model.
9. The construction case knowledge reuse query device according to claim 8, further comprising a local server or an intelligent device, wherein the entire database and the query system are stored in the local server or the intelligent device.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110313320.2A CN112949304A (en) 2021-03-24 2021-03-24 Construction case knowledge reuse query method and device

Publications (1)

Publication Number Publication Date
CN112949304A true CN112949304A (en) 2021-06-11

Family

ID=76228430

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110313320.2A Pending CN112949304A (en) 2021-03-24 2021-03-24 Construction case knowledge reuse query method and device

Country Status (1)

Country Link
CN (1) CN112949304A (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106095737A (en) * 2016-06-07 2016-11-09 杭州凡闻科技有限公司 Documents Similarity computational methods and similar document the whole network retrieval tracking
US20170103441A1 (en) * 2015-10-07 2017-04-13 Gastown Data Sciences Comparing Business Documents to Recommend Organizations
CN108491462A (en) * 2018-03-05 2018-09-04 昆明理工大学 A kind of semantic query expansion method and device based on word2vec
CN108536677A (en) * 2018-04-09 2018-09-14 北京信息科技大学 A kind of patent text similarity calculating method
CN109255021A (en) * 2018-11-01 2019-01-22 北京京航计算通讯研究所 Data query method based on quality text similarity
CN110597949A (en) * 2019-08-01 2019-12-20 湖北工业大学 Court similar case recommendation model based on word vectors and word frequency
US20200050638A1 (en) * 2018-08-12 2020-02-13 Parker Douglas Hancock Systems and methods for analyzing the validity or infringment of patent claims

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113610112A (en) * 2021-07-09 2021-11-05 中国商用飞机有限责任公司上海飞机设计研究院 Auxiliary decision-making method for airplane assembly quality defects
CN113610112B (en) * 2021-07-09 2024-04-16 中国商用飞机有限责任公司上海飞机设计研究院 Auxiliary decision-making method for aircraft assembly quality defects

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination