CN110348470B - Semantic retrieval method for industrial fault information rapid matching - Google Patents

Semantic retrieval method for industrial fault information rapid matching

Publication number
CN110348470B
Authority
CN
China
Prior art keywords
word
words
data
matrix
industrial
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910428519.2A
Other languages
Chinese (zh)
Other versions
CN110348470A (en)
Inventor
李肯立
闫安民
阳王东
刘楚波
陈岑
周旭
吴帆
唐卓
李克勤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hunan University
Original Assignee
Hunan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hunan University
Priority to CN201910428519.2A
Publication of CN110348470A
Application granted
Publication of CN110348470B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Mathematical Physics (AREA)
  • Human Computer Interaction (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a semantic retrieval method for rapid matching of industrial fault information, comprising the following steps: step one, performing word segmentation indexing and word frequency statistics on the original documents; step two, training with a bag-of-words model and a local-difference training algorithm for words; step three, classifying and archiving the industrial parts; step four, inputting a document and finding the closest document with a matrix distance algorithm; and step five, screening and re-sorting the selected result documents using the industrial fault information, and returning the solution documents according to the index. The method aims to improve both the accuracy and the speed of industrial fault matching.

Description

Semantic retrieval method for industrial fault information rapid matching
[ technical field ]
The invention belongs to the technical field of text similarity matching in natural language processing, and relates to a semantic retrieval method for quickly matching industrial fault information.
[ background of the invention ]
With the advent of the data era, industrial enterprises have accumulated large amounts of data and hope to use it to address long-standing problems in the industry. One prominent problem is that industrial skills usually take workers a long time to accumulate, with experience passed on through a master-apprentice system. When a fault occurs, an apprentice often does not know how to resolve it for lack of experience, and the master may not have time to deal with it promptly because of staffing constraints or special circumstances. The result is wasted labor and financial loss caused by the fault.
Existing search engine technology is mostly based on an inverted index from words to the documents that contain them, and answers a query by intersecting the document sets of the input words. This approach is simple and coarse, and it suits searching massive data on the internet. Within an enterprise, however, there are only tens or hundreds of thousands of documents to match, and the business is organized around a specific subject domain, so the enterprise cares more about improving matching accuracy and matching speed.
Compared with the prior art, the method builds a data warehouse of word segments and word frequencies, trains a bag-of-words model with a local-difference training algorithm for words, and then performs matching with a matrix distance algorithm. It is therefore better suited to search within industrial enterprises and to the requirements of high-precision, fast retrieval.
[ summary of the invention ]
A semantic retrieval method for industrial fault information fast matching comprises the following steps:
step one, performing word segmentation indexing and word frequency statistics on the original documents;
step two, training with a bag-of-words model and a local-difference training algorithm for words;
step three, classifying and archiving the industrial parts;
step four, inputting a document to be checked, and finding the closest document with a matrix distance algorithm;
and step five, screening the ten closest documents using the industrial fault information, and returning the solution documents according to the index.
Further, the word segmentation indexing of step one segments all original documents into individual words, builds an index over the segmented words for all documents and stores it in the data warehouse, and at the same time counts the word frequencies and stores them in the data warehouse.
Further, training with the bag-of-words model and the local-difference training algorithm for words specifically comprises: training on all word segments and word frequencies in the data warehouse and constructing a matrix of word occurrence frequencies. Given n documents containing k distinct words, an n × k matrix is constructed, in which the entry in row i and column j is the number of times the j-th word occurs in the i-th document plus a value, obtained by the local-difference training algorithm, that represents the word's position.
Furthermore, the bag-of-words model and local-difference training algorithm for words also includes the calculation of position data. The positions of each word in the text are extracted, the sum and the difference of the word's positions are calculated and multiplied together, so that the word's several positions are condensed into a single number, and this number is added to the word frequency. The local difference of a word thus compresses its position vector within the sentence into one number that approximately uniquely represents its position. The calculation formula is as follows:
Given positions a, b, and c, where a < c, b < c, and a < b, then $(a/c + b/c)(b/c - a/c) = (b^2 - a^2)/c^2 < 1$;
The value given by the formula represents the word position; when the word frequency does not exceed 2, the value is always less than 1. Each text is divided into ten positions, and words are distributed only over these ten positions. When a word has n positions, the formula is applied iteratively to reduce them to n-1 positions, then again to n-2 positions, and so on, until the word's positions have been reduced to 2 positions and the distribution of its positions is finally represented by a single value. Absolute values are used in the calculation.
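As a worked check of the formula above, take the illustrative values a = 2, b = 5, c = 10. These numbers are assumed here purely for illustration and do not come from the patent text.

```latex
\[
\left(\frac{a}{c}+\frac{b}{c}\right)\left(\frac{b}{c}-\frac{a}{c}\right)
= (0.2+0.5)(0.5-0.2) = 0.7 \times 0.3 = 0.21
= \frac{5^{2}-2^{2}}{10^{2}} = \frac{21}{100} < 1 .
\]
```

In this two-position case the value lies strictly between 0 and 1 whenever a < b < c, so adding it to an integer word frequency leaves the count readable in the integer part while the fractional part carries the position information.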
Further, the classified archive of industrial parts comprises the industrial parts, the fault phenomena, and the corresponding solution documents, and the archive data are stored in the data warehouse.
Further, the matrix distance algorithm is specifically as follows: word segmentation, word frequency statistics, and position data calculation are performed on the input document to be checked to obtain a matrix $X_{ak}$. The matrix distance is then calculated between $X_{ak}$ and the data matrix $X_{nk}$ in the data warehouse, with entries for the same word compared against each other, that is, $X_{a1}$ and $X_{n1}$ correspond to the same word. The calculation yields a comparison value $d_{an}$; the smaller $d_{an}$ is, the closer the two documents are. The calculation formula is:
Figure GDA0003760974890000031
Further, the industrial fault information comprises industrial part data, fault characteristic data, and manually defined feedback data.
Compared with the prior art, the method builds a data warehouse of word segments and word frequencies, trains a bag-of-words model with a local-difference training algorithm for words, and then performs matching with a matrix distance algorithm. It is therefore better suited to search within industrial enterprises and to the requirements of high-precision, fast retrieval.
[ description of the drawings ]
FIG. 1 is a flow chart of a semantic retrieval method for rapid matching of industrial fault information provided by the present invention;
FIG. 2 shows example sentences with the same word frequencies but different word distributions, compared experimentally using the method.
[ detailed description ]
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
A semantic retrieval method for industrial fault information fast matching comprises the following steps:
step one, performing word segmentation indexing and word frequency statistics on the original documents: all original documents are segmented into words and an index is built;
specifically, Table 1 gives an example of constructing matrix data from the word segmentation index. For example, the sentence "I like you, do you like me" is segmented and counted as in Table 1, constructing the matrix data [2,2,2,1], and the segmented and indexed data are stored in the data warehouse;
Figure GDA0003760974890000032
Figure GDA0003760974890000041
TABLE 1
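Step one can be sketched roughly as follows. This is a minimal illustration, not the patent's implementation: the `tokenize` helper is a hypothetical whitespace splitter standing in for a real word segmenter (the original documents are Chinese and would need one), and the names `build_data_bin`, `inverted_index`, and `word_freqs` are assumptions introduced here.

```python
from collections import Counter, defaultdict


def tokenize(text):
    # Hypothetical stand-in for a word segmenter: lowercase, drop commas, split on spaces.
    return text.lower().replace(",", " ").split()


def build_data_bin(documents):
    """Return (inverted_index, word_freqs) for a list of document strings."""
    inverted_index = defaultdict(set)  # word -> ids of the documents containing it
    word_freqs = []                    # one Counter of word counts per document
    for doc_id, text in enumerate(documents):
        words = tokenize(text)
        word_freqs.append(Counter(words))
        for word in words:
            inverted_index[word].add(doc_id)
    return dict(inverted_index), word_freqs


docs = ["I like you, do you like me"]
index, freqs = build_data_bin(docs)
print(freqs[0])
```

Because the English rendering splits into different words than the Chinese original, the counts printed here need not match the [2,2,2,1] example in the text.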
step two, training with the bag-of-words model and the local-difference training algorithm for words: training is carried out on all word segments and word frequencies in the data warehouse, and a matrix of word occurrence frequencies is constructed. Given n documents containing k distinct words, an n × k matrix is constructed, in which the entry in row i and column j is the number of times the j-th word occurs in the i-th document plus a value, obtained by the local-difference training algorithm, that represents the word's position.
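The n × k matrix described above might be assembled along the following lines. This is a sketch under stated assumptions: `build_matrix` and `two_position_value` are hypothetical names, the position value covers only the two-occurrence case (b² − a²)/c², c is assumed to be the document's token count, and words with any other number of occurrences get a placeholder position value of 0.0.

```python
def two_position_value(positions, length):
    # Only the two-occurrence case (b^2 - a^2) / c^2 is implemented here.
    # Assumption: c is the number of tokens in the document.
    if len(positions) != 2:
        return 0.0  # placeholder; the iterative multi-position case is not covered
    a, b = sorted(positions)
    return (b * b - a * a) / (length * length)


def build_matrix(doc_tokens, position_value=two_position_value):
    """doc_tokens: list of token lists. Returns (vocab, matrix) with matrix n x k."""
    vocab = sorted({w for tokens in doc_tokens for w in tokens})
    col = {w: j for j, w in enumerate(vocab)}
    matrix = [[0.0] * len(vocab) for _ in doc_tokens]
    for i, tokens in enumerate(doc_tokens):
        positions = {}
        for pos, w in enumerate(tokens, start=1):  # 1-based positions
            positions.setdefault(w, []).append(pos)
        for w, pos_list in positions.items():
            # entry = word frequency + position value
            matrix[i][col[w]] = len(pos_list) + position_value(pos_list, len(tokens))
    return vocab, matrix


vocab, matrix = build_matrix([["i", "like", "you", "you", "like", "me"]])
print(vocab)
print(matrix[0])
```

Passing the position function as a parameter keeps the matrix construction separate from whichever position-compression rule is chosen.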
Furthermore, the bag-of-words model and local-difference training algorithm for words also includes the calculation of position data. The positions of each word in the text are extracted, the sum and the difference of the word's positions are calculated and multiplied together, so that the word's several positions are condensed into a single number, and this number is added to the word frequency. The word's position vector within the sentence is thus compressed into one number that approximately represents its position. The formula for the product of the two terms is as follows:
Given positions a, b, and c, where a < c, b < c, and a < b, then $(a/c + b/c)(b/c - a/c) = (b^2 - a^2)/c^2 < 1$;
The value given by the formula represents the word position; when the word frequency does not exceed 2, the value is always less than 1. Each text is divided into ten positions, and words are distributed only over these ten positions. When a word occurs more than twice, its positions are sorted in the order in which they appear in the document, and each pair of adjacent positions is summed and differenced, step by step, until the positions are converted into 2 local vectors, after which the calculation is performed again. Absolute values are used in the calculation.
When a word has n positions, the formula is applied iteratively to reduce them to n-1 positions, then again to n-2 positions, and so on, until the word's positions have been reduced to 2 positions and the distribution of its positions is finally represented by a single value. The specific steps are as follows:
Assuming the word has three positions [a, b, c], first calculate $w_1 = (b - a)(b + a)$, then $w_2 = (c - b)(b + c)$, and finally $|w_1 - w_2|\,(w_1 + w_2)$; that is, the distribution of the word's several positions is calculated by successive aggregation.
When the word frequency exceeds 2, the calculation iterates by clustering word blocks pairwise, each pair of adjacent word positions forming one word block. If there are 4 positions [a, b, c, d], they are divided into three word blocks; the 3 word blocks are then clustered in the same way into 2 word blocks, and the result is calculated from those.
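The iterative reduction described above can be sketched as repeated pairwise merging of adjacent values. This is one reading of the text rather than a confirmed implementation: `pair_value` and `compress_positions` are hypothetical names, the division by c² is left out (as in the three-position example), and returning a lone position unchanged is an assumption, since the text does not cover single occurrences.

```python
def pair_value(x, y):
    # Product of the absolute difference and the sum of two adjacent values.
    return abs(y - x) * (x + y)


def compress_positions(positions):
    """Reduce a list of word positions to one number by repeated pairwise merging."""
    values = sorted(positions)  # positions in order of appearance
    if not values:
        return 0.0
    # Each pass merges adjacent pairs, turning m values into m - 1 values.
    while len(values) > 1:
        values = [pair_value(values[k], values[k + 1]) for k in range(len(values) - 1)]
    return float(values[0])


# Three positions [1, 2, 3]: w1 = (2-1)*(2+1) = 3, w2 = (3-2)*(2+3) = 5,
# result = |3 - 5| * (3 + 5) = 16
print(compress_positions([1, 2, 3]))
```

For four positions the loop passes through three blocks and then two blocks before producing the final value, mirroring the [a, b, c, d] description.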
FIG. 2 shows example sentences with the same word frequencies but different word distributions, compared using the method:
Sample 1: I like you, do you like me
Sample 2: how much you like me, I like you
Sample 3: Do I I you like
From the given texts it can be seen that the word distributions of samples 1 and 2 are relatively similar, while the word frequencies of all 3 sentences are identical.
The words are numbered as follows: 1.0: "you", 2.0: "Dome", 3.0: "like", 4.0: "I"
FIG. 2 shows clearly which words are distributed more similarly across the sentences. For example, sample 1 and sample 3 are closer in the distribution of the word "you", while sample 2 is closer to sample 1 in the distribution of the word "like". This is consistent with what can be observed in the sentences themselves, since "like" occurs at positions 2 and 5 in sample 1, at positions 2 and 6 in sample 2, and at positions 5 and 6 in sample 3.
step three, classifying and archiving the industrial parts: the classified archive comprises the industrial parts, the fault phenomena, and the corresponding solution documents, and the archive data are stored in the data warehouse.
Specifically, Table 2 shows an example of industrial part data storage:
Figure GDA0003760974890000051
TABLE 2
step four, inputting a document to be checked and finding the closest document with the matrix distance algorithm. In the matrix distance algorithm, word segmentation, word frequency statistics, and position data calculation are performed on the input document to obtain a matrix $X_{ak}$. The matrix distance is then calculated between $X_{ak}$ and the data matrix $X_{nk}$ in the data warehouse, with entries for the same word compared against each other, that is, $X_{a1}$ and $X_{n1}$ correspond to the same word. The calculation yields a comparison value $d_{an}$; the smaller $d_{an}$ is, the closer the two documents are. The calculation formula is:
Figure GDA0003760974890000061
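The distance formula itself survives only as an image reference, so the following sketch substitutes a plain Euclidean distance as a stand-in. That choice is an assumption made here, not the patent's formula; it merely agrees with the stated property that a smaller value means a closer document. The names `matrix_distance` and `closest_documents` are hypothetical.

```python
import math


def matrix_distance(query_row, doc_row):
    # Assumed Euclidean distance; both rows must use the same word for each column.
    return math.sqrt(sum((q - d) ** 2 for q, d in zip(query_row, doc_row)))


def closest_documents(query_row, doc_matrix, top_k=10):
    """Return up to top_k (distance, document_index) pairs, closest first."""
    scored = [(matrix_distance(query_row, row), n) for n, row in enumerate(doc_matrix)]
    scored.sort()
    return scored[:top_k]


stored = [[2.0, 1.0], [0.0, 1.0], [2.0, 2.0]]
print(closest_documents([2.0, 1.0], stored, top_k=2))
```

The query row has to be built over the same vocabulary columns as the stored matrix, which is what the requirement that $X_{a1}$ and $X_{n1}$ correspond to the same word amounts to.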
step five, screening the ten closest documents using the industrial fault information, and returning the solution documents according to the index. The industrial fault information comprises industrial part data, fault characteristic data, and manually defined feedback data.
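One possible shape for this screening step is sketched below. The record fields, the matching rules (exact match on part, substring match on fault), and the sample values are all assumptions made for illustration; the patent does not specify them, and the manually defined feedback data is not modeled.

```python
from dataclasses import dataclass


@dataclass
class ArchiveRecord:
    # Hypothetical archive entry: part, fault phenomenon, and solution document.
    doc_id: int
    part: str
    fault: str
    solution: str


def screen_candidates(candidates, archive, part=None, fault=None):
    """candidates: (distance, doc_id) pairs sorted by distance; archive: doc_id -> record."""
    results = []
    for distance, doc_id in candidates[:10]:  # only the ten closest documents
        record = archive[doc_id]
        if part is not None and record.part != part:
            continue  # different industrial part
        if fault is not None and fault not in record.fault:
            continue  # fault description does not mention the given feature
        results.append((distance, record.solution))
    return results


archive = {
    0: ArchiveRecord(0, "pump", "bearing overheating", "solution-0"),
    1: ArchiveRecord(1, "valve", "leakage at the seal", "solution-1"),
}
print(screen_candidates([(0.2, 1), (0.5, 0)], archive, part="pump"))
```

Keeping the distance alongside each returned solution lets the caller preserve the original ranking after filtering.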
Compared with the prior art, the semantic retrieval method provided by the invention builds a data warehouse of word segments and word frequencies, trains a bag-of-words model with a local-difference training algorithm for words, and then performs matching with a matrix distance algorithm. It is therefore better suited to retrieval within industrial enterprises and to the requirements of high-precision, fast retrieval.
While the foregoing is directed to embodiments of the present invention, other and further embodiments of the invention may be devised without departing from the basic scope thereof, and the scope thereof is determined by the claims that follow.

Claims (5)

1. A semantic retrieval method for rapid matching of industrial fault information, characterized by comprising the following steps:
step one, performing word segmentation indexing and word frequency statistics on the original documents;
step two, training with a bag-of-words model and a local-difference training algorithm for words;
step three, classifying and archiving the industrial parts;
step four, inputting a document to be checked, and finding the closest document with a matrix distance algorithm;
step five, screening the ten closest documents using the industrial fault information, and returning the solution documents according to the index;
wherein training with the bag-of-words model and the local-difference training algorithm for words specifically comprises: training on all word segments and word frequencies in the data warehouse and constructing a matrix of word occurrence frequencies; given n documents containing k distinct words, an n × k matrix is constructed, in which the entry in row i and column j is the number of times the j-th word occurs in the i-th document plus a value, calculated by the local-difference training algorithm, that represents the word's position;
the bag-of-words model and local-difference training algorithm for words further comprises the calculation of position data: the positions of each word in the text are extracted, the sum and the difference of the word's positions are calculated and multiplied together, so that the word's several positions are condensed into a single number, and this number is added to the word frequency; the local difference of a word compresses its position vector within the sentence into one number that approximately uniquely represents the position of the word; the calculation formula is as follows:
given positions a, b, and c, where a < c, b < c, and a < b, then $(a/c + b/c)(b/c - a/c) = (b^2 - a^2)/c^2 < 1$;
the value given by the formula represents the word position; when the word frequency does not exceed 2, the value is always less than 1; each text is divided into ten positions, and words are distributed only over these ten positions; when a word has n positions, the formula is applied iteratively to reduce them to n-1 positions, then again to n-2 positions, and so on, until the word's positions have been reduced to 2 positions and the distribution of its positions is finally represented by a single value; absolute values are used in the calculation.
2. The semantic retrieval method according to claim 1, wherein in step one the word segmentation indexing is specifically: all original documents are segmented into individual words, an index is built over the segmented words for all documents and stored in the data warehouse, and at the same time the word frequencies are counted and stored in the data warehouse.
3. The semantic retrieval method of claim 1, wherein the classification archive of industrial parts includes a type of industrial part, a fault phenomenon, and a corresponding solution document, and the classification archive data of industrial parts is stored in the data warehouse.
4. The semantic retrieval method according to claim 1, wherein the matrix distance algorithm is specifically: word segmentation, word frequency statistics, and position data calculation are performed on the input document to be checked to obtain a matrix $X_{ak}$; the matrix distance is calculated between $X_{ak}$ and the data matrix $X_{nk}$ in the data warehouse, with entries for the same word compared against each other, that is, $X_{a1}$ and $X_{n1}$ correspond to the same word; the calculation yields a comparison value $d_{an}$, and the smaller $d_{an}$ is, the closer the two documents are; the calculation formula is:
Figure FDA0003804564750000021
5. The semantic retrieval method of claim 1, wherein the industrial fault information comprises industrial part data, fault characteristic data, and manually defined feedback data.
CN201910428519.2A 2019-05-21 2019-05-21 Semantic retrieval method for industrial fault information rapid matching Active CN110348470B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910428519.2A CN110348470B (en) 2019-05-21 2019-05-21 Semantic retrieval method for industrial fault information rapid matching

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910428519.2A CN110348470B (en) 2019-05-21 2019-05-21 Semantic retrieval method for industrial fault information rapid matching

Publications (2)

Publication Number Publication Date
CN110348470A CN110348470A (en) 2019-10-18
CN110348470B CN110348470B (en) 2022-11-22

Family

ID=68173908

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910428519.2A Active CN110348470B (en) 2019-05-21 2019-05-21 Semantic retrieval method for industrial fault information rapid matching

Country Status (1)

Country Link
CN (1) CN110348470B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2009104475A (en) * 2007-10-24 2009-05-14 Nippon Telegr & Teleph Corp <Ntt> Similar document retrieval device, and similar document retrieval method and program
US8880540B1 (en) * 2012-03-28 2014-11-04 Emc Corporation Method and system for using location transformations to identify objects
US9069768B1 (en) * 2012-03-28 2015-06-30 Emc Corporation Method and system for creating subgroups of documents using optical character recognition data
CN102955848A (en) * 2012-10-29 2013-03-06 北京工商大学 Semantic-based three-dimensional model retrieval system and method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
高效可扩展的对称密文检索架构;吴志强等;《通信学报》;20170825(第08期);全文 *

Also Published As

Publication number Publication date
CN110348470A (en) 2019-10-18

Similar Documents

Publication Publication Date Title
CN111104794B (en) Text similarity matching method based on subject term
CN110209823B (en) Multi-label text classification method and system
CN109101477B (en) Enterprise field classification and enterprise keyword screening method
CN110851598B (en) Text classification method and device, terminal equipment and storage medium
CN102799647B (en) Method and device for webpage reduplication deletion
CN109165294B (en) Short text classification method based on Bayesian classification
CN110825877A (en) Semantic similarity analysis method based on text clustering
CN107391772B (en) Text classification method based on naive Bayes
US20170330054A1 (en) Method And Apparatus Of Establishing Image Search Relevance Prediction Model, And Image Search Method And Apparatus
CN108073568A (en) keyword extracting method and device
CN111797210A (en) Information recommendation method, device and equipment based on user portrait and storage medium
CN104834651B (en) Method and device for providing high-frequency question answers
CN105183833A (en) User model based microblogging text recommendation method and recommendation apparatus thereof
CN106096066A (en) The Text Clustering Method embedded based on random neighbor
CN108027814B (en) Stop word recognition method and device
CN104239553A (en) Entity recognition method based on Map-Reduce framework
US8090720B2 (en) Method for merging document clusters
CN107329954B (en) Topic detection method based on document content and mutual relation
CN113515629A (en) Document classification method and device, computer equipment and storage medium
Farhoodi et al. Applying machine learning algorithms for automatic Persian text classification
CN106528768A (en) Consultation hotspot analysis method and device
CN112818121A (en) Text classification method and device, computer equipment and storage medium
CN107862051A (en) A kind of file classifying method, system and a kind of document classification equipment
CN112527948A (en) Data real-time duplicate removal method and system based on sentence-level index
Reddy et al. Prediction of star ratings from online reviews

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant