CN110348470B

CN110348470B - Semantic retrieval method for industrial fault information rapid matching

Info

Publication number: CN110348470B
Application number: CN201910428519.2A
Authority: CN
Inventors: 李肯立; 闫安民; 阳王东; 刘楚波; 陈岑; 周旭; 吴帆; 唐卓; 李克勤
Original assignee: Hunan University
Current assignee: Hunan University
Priority date: 2019-05-21
Filing date: 2019-05-21
Publication date: 2022-11-22
Anticipated expiration: 2039-05-21
Also published as: CN110348470A

Abstract

The invention discloses a semantic retrieval method for rapid matching of industrial fault information, comprising the following steps: step one, performing word segmentation indexing and word frequency statistics on the original documents; step two, training by using a bag-of-words model and a local difference and training algorithm of words; step three, classifying and archiving the industrial parts; step four, inputting a document and calculating the closest documents through a matrix distance algorithm; and step five, screening and re-sorting the selected result documents in combination with the industrial fault information, and returning the solution documents according to the index. The invention aims to improve both the accuracy and the speed of industrial fault matching.

Description

Semantic retrieval method for industrial fault information rapid matching

[ technical field ]

The invention belongs to the technical field of text similarity matching in natural language processing, and relates to a semantic retrieval method for quickly matching industrial fault information.

[ background of the invention ]

With the advent of the data era, industrial enterprises have accumulated large amounts of data and hope to use it to solve long-standing problems in the industry. One prominent problem is that industrial skills are usually built up by workers over a long time and passed on through a master-apprentice system. An apprentice often lacks the experience to resolve a fault when it occurs, and the master, because of limited manpower or special circumstances, may not have time to deal with the problem promptly. The result is wasted human resources and financial losses caused by the fault.

Most existing search engine technology is based on an inverted index that maps words to the documents containing them, and answers a query by intersecting the document sets of the input words. This approach is simple and coarse, and suits searching massive data on the internet. Within an enterprise, however, only tens or hundreds of thousands of documents need to be matched, and the business is confined to a single subject domain, so the enterprise is more concerned with improving matching accuracy and matching speed.

Compared with the prior art, the method builds a data bin of word segments and word frequencies, establishes a bag-of-words model with a local difference and training algorithm of words, and then performs matching through a matrix distance algorithm, so that it is better suited to the search needs of industrial enterprises and to the requirements of high-precision, fast retrieval.

[ summary of the invention ]

A semantic retrieval method for industrial fault information fast matching comprises the following steps:

step one, performing word segmentation indexing and word frequency statistics on an original document;

step two, training by using a bag-of-words model and a local difference and training algorithm of words;

step three, classifying and filing the industrial parts;

step four, inputting a document to be detected, and calculating the closest document through a matrix distance algorithm;

and step five, screening the top ten documents which are closest by combining the industrial fault information, and returning solution documents according to the index.

Further, the word segmentation indexing of step one specifically comprises: segmenting all original documents into individual words, building an index over the resulting word segments for all documents and storing it in a data bin, and at the same time counting the word frequencies and storing them in the data bin.
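The indexing and counting of step one can be sketched as follows. This is a minimal illustration, not the patented implementation: `tokenize` is a hypothetical whitespace splitter standing in for a real word segmenter, `build_data_bin` and the sample documents are invented for the example, and the "data bin" is modeled as two in-memory structures.

```python
from collections import Counter, defaultdict


def tokenize(text):
    """Hypothetical stand-in segmenter: lowercase and split on whitespace."""
    return text.lower().split()


def build_data_bin(documents):
    """Return (index, word_freq) for a list of document strings.

    index maps each word to the sorted list of ids of documents containing it;
    word_freq[i] is a Counter of word counts for document i.
    """
    index = defaultdict(set)
    word_freq = []
    for doc_id, text in enumerate(documents):
        counts = Counter(tokenize(text))
        word_freq.append(counts)
        for word in counts:
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}, word_freq


# Invented sample documents, for illustration only.
docs = ["pump bearing overheats", "bearing noise in pump motor"]
index, word_freq = build_data_bin(docs)
```

A real deployment on Chinese text would need a proper segmenter in place of `tokenize`, and persistent storage in place of the in-memory dictionaries.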

Further, training with the bag-of-words model and the local difference and training algorithm of words specifically comprises: training on all word segments and word frequencies in the data bin and constructing a matrix of word occurrence frequencies. Given n documents containing k distinct words, an n × k matrix is constructed, in which the entry in row i and column j is the number of times the j-th word occurs in the i-th document, plus a value representing the word position that is calculated by the local difference and training algorithm.
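One way to picture this matrix is the sketch below, which builds the n × k rows from tokenized documents. The function name `build_matrix`, the alphabetical column order and the optional `position_value` callable are assumptions made for illustration; when no callable is supplied, the entries are plain word counts.

```python
def build_matrix(token_lists, position_value=None):
    """Build an n x k matrix (list of rows) from n tokenized documents.

    Entry [i][j] is the count of word j in document i, plus
    position_value(positions) when a callable is supplied.
    """
    vocab = sorted({word for tokens in token_lists for word in tokens})
    column = {word: j for j, word in enumerate(vocab)}
    matrix = [[0.0] * len(vocab) for _ in token_lists]
    for i, tokens in enumerate(token_lists):
        positions = {}
        for pos, word in enumerate(tokens, start=1):
            matrix[i][column[word]] += 1.0  # word frequency term
            positions.setdefault(word, []).append(pos)
        if position_value is not None:
            for word, where in positions.items():
                # hypothetical hook for the position term
                matrix[i][column[word]] += position_value(where)
    return vocab, matrix


# Invented token lists, for illustration only.
token_lists = [["valve", "leak", "valve"], ["leak", "seal"]]
vocab, matrix = build_matrix(token_lists)
```

Words absent from a document keep an entry of zero, so every row has the same k columns and rows can be compared column by column.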

Furthermore, the bag-of-words model and the local difference and training algorithm of words also comprise the calculation of position data. The positions of each word in the text are extracted, the sum and the difference of the word's positions are calculated and multiplied together, and the product is added to the word frequency, so that the several positions of a word are aggregated into a single number. The local difference of a word thus compresses its position vector within the sentence into one number that approximately uniquely represents those positions. The calculation formula is as follows:

Given a position a, a position b and a position c, where a < c, b < c and a < b, then $\left(\frac{a}{c} + \frac{b}{c}\right)\left(\frac{b}{c} - \frac{a}{c}\right) = \frac{b^2 - a^2}{c^2} < 1$;

The value calculated by the formula represents the word position; when the word frequency does not exceed 2, the value is always less than 1. Each text is divided into ten positions, and words are distributed only among these ten positions. When a word has n positions, the formula is applied iteratively to reduce them to n-1 positions, then to n-2 positions, and so on, until the several positions of the word have been reduced to 2 positions and, finally, the position distribution of the word is represented by a single value. Absolute values are used in the calculation.
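For the two-position case, the formula can be checked with a short sketch. The function name `pair_position_value` is hypothetical, exact fractions are used to avoid rounding, and the sample values a = 2, b = 5, c = 10 are chosen only for illustration.

```python
from fractions import Fraction


def pair_position_value(a, b, c):
    """Evaluate (a/c + b/c) * (b/c - a/c) for positions with 0 <= a < b < c."""
    if not 0 <= a < b < c:
        raise ValueError("positions must satisfy 0 <= a < b < c")
    a_n, b_n = Fraction(a, c), Fraction(b, c)
    return (a_n + b_n) * (b_n - a_n)


# Illustrative values only: two occurrences at positions 2 and 5, with c = 10.
value = pair_position_value(2, 5, 10)
```

Here the sum is 7/10 and the difference is 3/10, so the product is 21/100, which equals (5² − 2²)/10² and stays below 1 as the formula states.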

Further, the industrial part classification archive comprises industrial parts, fault phenomena and the corresponding solution documents, and the industrial part classification archive data are stored in the data bin.
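A possible record layout for such an archive is sketched below. The class `FaultRecord`, its field names, the helper `archive_by_part` and the sample rows are all hypothetical; the source only states that parts, fault phenomena and solution documents are stored together.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FaultRecord:
    """One archive row: a part, an observed fault, and its solution document."""
    part: str
    phenomenon: str
    solution_doc_id: int


def archive_by_part(records):
    """Group archive rows by industrial part for later screening."""
    archive = {}
    for record in records:
        archive.setdefault(record.part, []).append(record)
    return archive


# Invented sample rows, for illustration only.
records = [
    FaultRecord("pump", "bearing overheats", 0),
    FaultRecord("valve", "seal leaks", 1),
    FaultRecord("pump", "abnormal noise", 2),
]
archive = archive_by_part(records)
```

Grouping by part makes it cheap to restrict later matching results to the part named in a fault report.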

Further, the matrix distance algorithm specifically comprises: performing the word segmentation, word frequency statistics and position data calculation on the input document to be detected to obtain a matrix $X_{ak}$, and performing the matrix distance calculation between the matrix $X_{ak}$ and the data matrix $X_{nk}$ in the data bin. During the calculation, entries for the same word are compared, that is, $X_{a1}$ and $X_{n1}$ correspond to the same word. The calculation yields a comparison result value $d_{an}$; the smaller the value $d_{an}$, the closer the two documents. The calculation formula is:

Further, the industrial fault information comprises industrial part data, fault characteristic data and manually defined feedback data.

Compared with the prior art, the method builds a data bin of word segments and word frequencies, establishes the bag-of-words model with the local difference and training algorithm of words, and then performs matching through the matrix distance algorithm, so that it is better suited to the search needs of industrial enterprises and to the requirements of high-precision, fast retrieval.

[ description of the drawings ]

FIG. 1 is a flow chart of a semantic retrieval method for rapid matching of industrial fault information provided by the present invention;

FIG. 2 shows example sentences with the same word frequencies but different word distributions, compared experimentally using the method.

[ detailed description ]

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Step one, performing word segmentation indexing and word frequency statistics on the original documents: all original documents are segmented into words and an index is established;

Specifically, Table 1 gives an example of constructing matrix data from the word segmentation index. For example, the sentence "I like you, do you like me" is segmented and counted in Table 1, giving the matrix entries [2,2,2,1]; the segmented and indexed data are stored in the data bin;

TABLE 1

Step two, training by using the bag-of-words model and the local difference and training algorithm of words: training is performed on all word segments and word frequencies in the data bin, and a matrix of word occurrence frequencies is constructed. Given n documents containing k distinct words, an n × k matrix is constructed, in which the entry in row i and column j is the number of times the j-th word occurs in the i-th document, plus a value representing the word position that is calculated by the local difference and training algorithm.

Furthermore, the bag-of-words model and the local difference and training algorithm of words also comprise the calculation of position data. The positions of each word in the text are extracted, the sum and the difference of the word's positions are calculated and multiplied together, and the product is added to the word frequency, so that the several positions of a word are aggregated into a single number and the position vector within the sentence is compressed into one number that approximately represents those positions. The formula for the product of the two terms is as follows:

Given a position a, a position b and a position c, where a < c, b < c and a < b, then $\left(\frac{a}{c} + \frac{b}{c}\right)\left(\frac{b}{c} - \frac{a}{c}\right) = \frac{b^2 - a^2}{c^2} < 1$;

The value calculated by the formula represents the word position; when the word frequency does not exceed 2, the value is always less than 1. Each text is divided into ten positions, and words are distributed only among these ten positions. When a word occurs more than twice, its positions are sorted in the order in which they appear in the document, each pair of adjacent positions is summed and differenced in turn until the positions are reduced to 2 local vectors, and the calculation is then performed again. Absolute values are used in the calculation.

When a word has n positions, the formula is applied iteratively to reduce them to n-1 positions, then to n-2 positions, and so on, until the several positions of the word have been reduced to 2 positions and, finally, the position distribution of the word is represented by a single value. The specific steps are as follows:

Assuming a word has three positions [a, b, c], first calculate $w_1 = (b - a) \times (b + a)$, then calculate $w_2 = (c - b) \times (b + c)$, and finally calculate $|w_1 - w_2| \times (w_1 + w_2)$; that is, the distribution of the word's several positions is calculated by aggregating them successively in this manner.

When the word frequency exceeds 2, the iterative calculation clusters the positions pairwise into word blocks, each pair of adjacent occurrences forming one word block. For example, 4 positions [a, b, c, d] form three word blocks; the 3 word blocks are then clustered in the same way into 2 word blocks, from which the result is calculated.
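The pairwise reduction described above can be read as repeatedly replacing each adjacent pair (x, y) by |x − y| × (x + y) until one value remains. The sketch below follows that reading; it is one interpretation of the text rather than a confirmed implementation, and the names `combine`, `position_value` and the `scale` normalizer are assumptions.

```python
from fractions import Fraction


def combine(x, y):
    """Absolute difference times sum of two adjacent values."""
    return abs(x - y) * (x + y)


def position_value(positions, scale=10):
    """Collapse a word's positions into a single number.

    Positions are sorted, divided by `scale` (a hypothetical normalizer),
    and adjacent pairs are combined until one value is left.
    """
    values = sorted(Fraction(p, scale) for p in positions)
    if len(values) < 2:
        raise ValueError("need at least two positions")
    while len(values) > 1:
        # n values become n - 1 values on each pass
        values = [combine(x, y) for x, y in zip(values, values[1:])]
    return values[0]


# Unscaled check with three positions: w1 = 3, w2 = 5, result = 2 * 8.
three = position_value([1, 2, 3], scale=1)
# Four positions: [3, 5, 7] -> [16, 24] -> 8 * 40.
four = position_value([1, 2, 3, 4], scale=1)
```

With two positions the loop runs once and reproduces the two-position formula; how a word with a single occurrence should be handled is not stated in the text, so the sketch rejects that case.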

Referring to FIG. 2, the following example sentences, which have the same word frequencies but different word distributions, are compared by the method:

Sample 1: I like you, do you like me

Sample 2: how much you like me, I like you

Sample 3: do I i you like

From the given text it can be seen that the word distributions in samples 1 and 2 are relatively similar, while the word frequencies of all 3 sentences are exactly the same.

Word labels in FIG. 2: 1.0: "you", 2.0: "do", 3.0: "like", 4.0: "I"

FIG. 2 shows clearly which words are distributed more similarly between sentences. For example, sample 1 and sample 3 are closer in the distribution of the word "you", while sample 2 is closer to sample 1 in the distribution of the word "like". This agrees with what can be observed from the sentences themselves, since "like" occurs at positions 2 and 5 in sample 1, at positions 2 and 6 in sample 2, and at positions 5 and 6 in sample 3.

Step three, classifying and archiving the industrial parts. The industrial part classification archive comprises industrial parts, fault phenomena and the corresponding solution documents, and the industrial part classification archive data are stored in the data bin.

Specifically, Table 2 shows an example of industrial part data storage.

TABLE 2

Step four, inputting a document to be detected, and calculating the closest document through the matrix distance algorithm. The matrix distance algorithm performs the word segmentation, word frequency statistics and position data calculation on the input document to be detected to obtain a matrix $X_{ak}$, and performs the matrix distance calculation between the matrix $X_{ak}$ and the data matrix $X_{nk}$ in the data bin. During the calculation, entries for the same word are compared, that is, $X_{a1}$ and $X_{n1}$ correspond to the same word. The calculation yields a comparison result value $d_{an}$; the smaller the value $d_{an}$, the closer the two documents. The calculation formula is:

Step five, screening the top ten closest documents in combination with the industrial fault information, and returning the solution documents according to the index. The industrial fault information comprises industrial part data, fault characteristic data and manually defined feedback data.
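Steps four and five can be sketched together as follows. The distance formula itself is not reproduced in this text, so the sketch substitutes plain Euclidean distance as a stand-in; the function names, the `doc_parts` mapping and the sample rows are likewise hypothetical.

```python
import math


def distance(u, v):
    """Euclidean distance between two equal-length rows (a stand-in choice)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def closest_documents(query_row, matrix, limit=10):
    """Return up to `limit` (doc_id, distance) pairs, smallest distance first."""
    scored = sorted((distance(query_row, row), doc_id)
                    for doc_id, row in enumerate(matrix))
    return [(doc_id, dist) for dist, doc_id in scored[:limit]]


def screen_by_part(candidates, doc_parts, part):
    """Keep only candidates whose archived part matches the faulty part."""
    return [(doc_id, dist) for doc_id, dist in candidates
            if doc_parts.get(doc_id) == part]


# Invented data: three archived rows over the same three word columns.
matrix = [[1.0, 0.0, 2.0], [1.0, 1.0, 0.0], [0.0, 0.0, 2.0]]
doc_parts = {0: "pump", 1: "valve", 2: "pump"}
query_row = [1.0, 0.0, 2.0]
candidates = closest_documents(query_row, matrix)
solutions = screen_by_part(candidates, doc_parts, "pump")
```

Because the query row and the archived rows share the same column order, each column compares values for the same word, as the description requires. Screening here uses only the part name; fault characteristic data and feedback data could be applied as further filters in the same way.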

Compared with the prior art, the semantic retrieval method provided by the invention builds a data bin of word segments and word frequencies, establishes the bag-of-words model with the local difference and training algorithm of words, and then performs matching through the matrix distance algorithm, so that it is better suited to the retrieval needs of industrial enterprises and to the requirements of high-precision, fast retrieval.

While the foregoing is directed to embodiments of the present invention, other and further embodiments of the invention may be devised without departing from the basic scope thereof, and the scope thereof is determined by the claims that follow.

Claims

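1. A semantic retrieval method for rapid matching of industrial fault information, characterized by comprising the following steps:

step one, performing word segmentation indexing and word frequency statistics on the original documents;

step two, training by using a bag-of-words model and a local difference and training algorithm of words;

step three, classifying and archiving the industrial parts;

step four, inputting a document to be detected, and calculating the closest document through a matrix distance algorithm;

step five, screening the top ten closest documents in combination with the industrial fault information, and returning the solution documents according to the index;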

the training by using the bag-of-words model and the local difference and training algorithm of words specifically comprises: training on all word segments and word frequencies in the data bin, and constructing a matrix of word occurrence frequencies; given n documents containing k distinct words, an n × k matrix is constructed, in which the entry in row i and column j is the number of times the j-th word occurs in the i-th document, plus a value representing the word position that is calculated by the local difference and training algorithm;

the bag-of-words model and the local difference and training algorithm of words further comprise the calculation of position data: the positions of each word in the text are extracted, the sum and the difference of the word's positions are calculated and multiplied together, and the product is added to the word frequency, so that the several positions of a word are aggregated into a single number; the local difference of a word compresses its position vector within the sentence into one number that approximately uniquely represents the positions of the word, and the calculation formula is as follows:

given a position a, a position b and a position c, where a < c, b < c and a < b, then $\left(\frac{a}{c} + \frac{b}{c}\right)\left(\frac{b}{c} - \frac{a}{c}\right) = \frac{b^2 - a^2}{c^2} < 1$;

2. The semantic retrieval method according to claim 1, wherein in step one the word segmentation indexing specifically comprises: segmenting all original documents into individual words, building an index over the word segments for all documents and storing it in a data bin, and at the same time counting the word frequencies and storing them in the data bin.

3. The semantic retrieval method according to claim 1, wherein the industrial part classification archive comprises the type of industrial part, the fault phenomenon and the corresponding solution document, and the industrial part classification archive data are stored in the data bin.

4. The semantic retrieval method according to claim 1, wherein the matrix distance algorithm specifically comprises: performing the word segmentation, word frequency statistics and position data calculation on the input document to be detected to obtain a matrix $X_{ak}$, and performing the matrix distance calculation between the matrix $X_{ak}$ and the data matrix $X_{nk}$ in the data bin, wherein entries for the same word are compared during the calculation, that is, $X_{a1}$ and $X_{n1}$ correspond to the same word; the calculation yields a comparison result value $d_{an}$, and the smaller the value $d_{an}$, the closer the two documents; the calculation formula is:

5. The semantic retrieval method according to claim 1, wherein the industrial fault information comprises industrial part data, fault characteristic data and manually defined feedback data.