CN110348470A

CN110348470A - Semantic retrieval method for rapid matching of industrial fault information

Info

Publication number: CN110348470A
Application number: CN201910428519.2A
Authority: CN
Inventors: 李肯立; 闫安民; 阳王东; 刘楚波; 陈岑; 周旭; 吴帆; 唐卓; 李克勤
Original assignee: Hunan University
Current assignee: Hunan University
Priority date: 2019-05-21
Filing date: 2019-05-21
Publication date: 2019-10-18
Anticipated expiration: 2039-05-21
Also published as: CN110348470B

Abstract

The invention discloses a semantic retrieval method for rapid matching of industrial fault information, comprising the following steps. Step 1: perform word segmentation and indexing on the original documents and count word frequencies. Step 2: train using a bag-of-words model and a word local difference-sum training algorithm. Step 3: create archive records for industrial parts. Step 4: take an input document and calculate the closest documents by a matrix distance algorithm. Step 5: filter and rank the selected result documents in combination with industrial fault information, and return the solution document according to the index. As a method for industrial fault matching, the invention improves both matching precision and matching speed.

Description

Semantic retrieval method for rapid matching of industrial fault information

[Technical field]

The invention belongs to the technical field of text similarity matching in natural language processing, and relates to a semantic retrieval method for rapid matching of industrial fault information.

[Background art]

With the arrival of the data age, industrial enterprises have accumulated large amounts of data, which are expected to help identify and solve the routine difficulties that industry faces. One prominent problem is that industrial skills generally take workers a long time to acquire, with experience passed on through a master-apprentice system. An apprentice who lacks experience often does not know how to resolve a failure when one occurs, and because of staffing constraints, special circumstances and the like, the master may not have time to deal with the problem promptly. This wastes human resources and causes economic losses from the failure.

Current search engine technology is mostly based on an inverted index that maps each word to the documents containing it; the documents corresponding to the input words are intersected to produce the result. This approach is simple and crude. It suits searches over the massive data of the internet, but within an enterprise the set of documents to be matched is often only tens of thousands to somewhat over a hundred thousand items, and the business is usually confined to a particular subject area. What the enterprise cares about more is how to improve matching precision and matching speed.

In contrast, establishing a data warehouse of word segments and word frequencies, building a bag-of-words model with a word local difference-sum training algorithm, and then matching by a matrix distance algorithm is better suited to the search needs of industrial enterprises and to the requirement of fast, high-precision retrieval.

[Summary of the invention]

A semantic retrieval method for rapid matching of industrial fault information, comprising the following steps:

Step 1: perform word segmentation and indexing on the original documents and count word frequencies;

Step 2: train using a bag-of-words model and a word local difference-sum training algorithm;

Step 3: create archive records for industrial parts;

Step 4: input the document to be detected and calculate the closest documents by a matrix distance algorithm;

Step 5: filter out the ten closest documents in combination with industrial fault information, and return the solution document according to the index.

Further, in step 1, the word segmentation index is built as follows: all original documents are segmented into words, an index is built on the resulting segments, the index over all documents is stored in a data warehouse, and the word frequencies are counted and stored in the data warehouse as well.

Further, the bag-of-words model and the word local difference-sum training algorithm are applied as follows: training is performed on all word segments and word frequencies in the data warehouse to build a matrix of word occurrence values. Given n documents containing k distinct words in total, an n×k matrix is built, in which the entry in row i, column j is the number of times the j-th word occurs in the i-th document plus the value representing the word's positions, obtained by the local difference-sum training algorithm.

Further, the local difference-sum training algorithm of the bag-of-words model also includes the calculation of position data. The position data are obtained by extracting the positions of a word in the text, calculating the sum and the difference of those positions, multiplying the two, and adding the product to the word frequency, so that the set of positions of a word is represented by a single number. The local difference-sum of a word thus compresses the word's position vector within the sentence into one number that approximately uniquely represents its positions. It is calculated as follows:

Given position a, position b and position c, where a < c, b < c and a < b, then (a/c + b/c) × (b/c - a/c) = (b² - a²)/c² < 1;

The value calculated by this formula represents the word's positions. When the word frequency is no more than 2, this value is necessarily less than 1. Each text is assumed to be divided into ten positions, and words are distributed only over these ten positions. When a word has n positions, the formula is applied iteratively to reduce them to n-1 positions, then again to n-2 positions, and so on; by iterating in this way the multiple positions of the word are converted to 2 positions, so that the distribution of the word's positions is finally represented by a single value. Differences are taken as absolute values in this calculation.
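As a quick numerical check of the two-position formula above, the following sketch evaluates (a/c + b/c) × (b/c - a/c) for every a < b < c drawn from ten positions and confirms that the result stays strictly between 0 and 1. The function name `pair_value` and the use of the integers 1 to 10 as positions are illustrative assumptions, not details given in the text.

```python
from itertools import combinations


def pair_value(a, b, c):
    """Product of the normalised sum and difference of two positions.

    Algebraically equal to (b**2 - a**2) / c**2 for a < b < c.
    """
    return (a / c + b / c) * (b / c - a / c)


# Assumption: the "ten positions" are numbered 1..10.
# combinations() yields sorted triples, so a < b < c always holds.
values = [pair_value(a, b, c) for a, b, c in combinations(range(1, 11), 3)]
all_below_one = all(0 < v < 1 for v in values)
```

Because the value is below 1, adding it to an integer word count leaves the count readable in the integer part and the positional information in the fractional part.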

Further, the archive records created for industrial parts include the industrial part, the fault phenomenon and the corresponding solution document, and the archive data are likewise stored in the data warehouse.

Further, the matrix distance algorithm is as follows: word segmentation, word frequency statistics and the position data calculation are performed on the input document to be detected, giving a matrix X_ak; the matrix distance is then calculated between X_ak and the data matrix X_nk in the data warehouse. The calculation is carried out over identical words in the two matrices, that is, X_a1 and X_n1 correspond to the same word. The calculation yields a comparison value d_an, where a smaller d_an indicates a closer match. Its calculation formula is:

Further, the industrial fault information includes industrial part data, fault feature data and manually defined feedback data.

Compared with the prior art, the present invention establishes a data warehouse of word segments and word frequencies, builds a bag-of-words model with a word local difference-sum training algorithm, and then matches by a matrix distance algorithm, which is better suited to the search needs of industrial enterprises and to the requirement of fast, high-precision retrieval.

[Brief description of the drawings]

Fig. 1 is a flow chart of the semantic retrieval method for rapid matching of industrial fault information provided by the present invention;

Fig. 2 shows an example, obtained experimentally with this method, comparing how different words with identical word frequencies are distributed in sentences.

[Detailed description of the embodiments]

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. The described embodiments are evidently only some of the embodiments of the present invention, not all of them. All other embodiments obtained by those of ordinary skill in the art on the basis of the embodiments of the present invention without creative effort fall within the protection scope of the present invention.

Step 1: perform word segmentation and indexing on the original documents and count word frequencies; all original documents are segmented and an index is built;

Specifically, refer to Table 1 for an example of building matrix data from the word segmentation index. For the sentence "I like you, you like me", the segment statistics are compiled as in Table 1 and the matrix values [2, 2, 2, 1] are constructed; the segmented and indexed data are stored in the data warehouse;

Table 1
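The segmentation, indexing and word counting of step 1 can be sketched as follows. This is a minimal illustration under stated assumptions: documents are assumed to be already segmented into token lists (the text does not name a segmenter), and `build_index` and the identifier `doc1` are hypothetical names. The pronoun is written as "I" in both places because the original Chinese sentence uses the same word for "I" and "me".

```python
from collections import Counter, defaultdict


def build_index(docs):
    """Build an inverted index and per-document word frequencies.

    docs maps a document id to its list of tokens (already segmented).
    """
    index = defaultdict(set)   # word -> ids of documents containing it
    frequencies = {}           # document id -> Counter of word counts
    for doc_id, tokens in docs.items():
        frequencies[doc_id] = Counter(tokens)
        for token in tokens:
            index[token].add(doc_id)
    return dict(index), frequencies


# Example sentence from the text, pre-segmented by hand (assumption).
docs = {"doc1": ["I", "like", "you", ",", "you", "like", "I"]}
index, frequencies = build_index(docs)
counts = list(frequencies["doc1"].values())  # in order of first appearance
```

With the comma counted as a token, the counts in order of first appearance reproduce the values [2, 2, 2, 1] quoted above.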

Step 2: train using the bag-of-words model and the word local difference-sum training algorithm. Training is performed on all word segments and word frequencies in the data warehouse to build a matrix of word occurrence values. Given n documents containing k distinct words in total, an n×k matrix is built, in which the entry in row i, column j is the number of times the j-th word occurs in the i-th document plus the value representing the word's positions, obtained by the local difference-sum training algorithm.
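The construction of the n×k matrix can be sketched as below. The sketch assumes pre-segmented documents and takes the positional value as a pluggable function; the default `position_term` simply returns 0.0 and is a placeholder, not the local difference-sum calculation itself. The names `build_matrix` and `position_term`, and the choice of 0.0 for words absent from a document, are assumptions made for illustration.

```python
def build_matrix(docs, position_term=lambda positions: 0.0):
    """Return (vocabulary, matrix) for a list of token lists.

    matrix[i][j] = occurrences of word j in document i
                   + position_term(positions of word j in document i).
    Words absent from a document get 0.0 (assumption).
    """
    # Distinct words in order of first appearance across all documents.
    vocabulary = list(dict.fromkeys(tok for tokens in docs for tok in tokens))
    matrix = []
    for tokens in docs:
        row = []
        for word in vocabulary:
            # 1-based positions at which the word occurs in this document.
            positions = [i + 1 for i, tok in enumerate(tokens) if tok == word]
            if positions:
                row.append(len(positions) + position_term(positions))
            else:
                row.append(0.0)
        matrix.append(row)
    return vocabulary, matrix


docs = [
    ["I", "like", "you", ",", "you", "like", "I"],
    ["you", "like", "I"],
]
vocabulary, matrix = build_matrix(docs)
```

Passing a real positional function in place of the placeholder adds the positional value to each count without changing the shape of the matrix.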

The value calculated by the formula represents the word's positions. When the word frequency is no more than 2, this value is necessarily less than 1. Each text is assumed to be divided into ten positions, and words are distributed only over these ten positions. When a word occurs more than 2 times, its positions are sorted in the order in which they appear in the document, and the sum and the difference of each pair of adjacent positions are taken step by step until the positions are reduced to 2 local vectors, which are then calculated once more; differences are taken as absolute values.

When a word has n positions, the formula is applied iteratively to reduce them to n-1 positions, then again to n-2 positions, and so on; by iterating in this way the multiple positions of the word are converted to 2 positions, so that the distribution of the word's positions is finally represented by a single value. The details are as follows:

Suppose a word has three positions [a, b, c]. First calculate w1 = (b - a) × (b + a), then w2 = (c - b) × (b + c), and finally |w1 - w2| × (w1 + w2). Because w1 < w2 is possible, the difference w1 - w2 is taken as an absolute value to guarantee that the word vector is non-negative. This introduces a possible error, since |a² - b²| = |c² - b²| may occur; the calculated probability of this is about 1/90. The error shrinks as the word frequency increases: for a word occurring 3 times there are 10 × 9 × 8 = 720 ways to choose 3 positions, of which only 8 can produce an error, and as the number of occurrences rises the probabilities multiply and become smaller still. The method is therefore extended to an iterative calculation when a word occurs more than 3 times, aggregating repeatedly in the following way:

When the word frequency is more than 2, word blocks are aggregated pairwise: every two adjacent occurrences form one word block for the iterative calculation. For example, 4 positions [a, b, c, d] are divided into three word blocks; the 3 word blocks are then aggregated in the same way into 2 word blocks, from which the result is calculated.
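The pairwise aggregation can be sketched as a loop that repeatedly replaces each pair of adjacent values with |difference| × sum until a single value remains. This is one reading of the description, not a confirmed implementation: the name `position_value`, returning a lone position unchanged, and working on unnormalised positions are all assumptions.

```python
def position_value(positions):
    """Collapse a word's positions into one number by pairwise aggregation.

    Each round maps adjacent values (lo, hi) to |hi - lo| * (hi + lo),
    shrinking n values to n - 1, until one value is left.
    """
    if not positions:
        raise ValueError("a word needs at least one position")
    values = sorted(positions)  # order of appearance in the document
    while len(values) > 1:
        values = [abs(hi - lo) * (hi + lo) for lo, hi in zip(values, values[1:])]
    return values[0]


# Three positions [1, 2, 3]: w1 = (2-1)*(2+1) = 3, w2 = (3-2)*(3+2) = 5,
# result = |3 - 5| * (3 + 5) = 16.
three = position_value([1, 2, 3])
# Four positions: three blocks [3, 5, 7], then two blocks [16, 24], then 8 * 40.
four = position_value([1, 2, 3, 4])
```

To combine this with a word count as a fractional part, the positions would first need to be normalised so that the result stays below 1; the text states that bound only for words occurring at most twice.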

Refer to Fig. 2, which shows an example, obtained experimentally with this method, comparing how different words with identical word frequencies are distributed in sentences.

Sample 1 is the sentence: I like you, you like me

Sample 2 is the sentence: you like me, I like you

Sample 3 is the sentence: I I you you, like like

From the texts given, it can be seen that the word distributions of sentences 1 and 2 are more alike, while the word frequencies of all three sentences are identical.

The words are numbered in Fig. 2 as follows: 1.0: "you", 2.0: " ", 3.0: "like", 4.0: "I"

Fig. 2 makes it clear in which sentences the distribution of a given word is closer. For example, sample 1 and sample 3 are closer in the distribution of the word "you". In the distribution of the word "like", sample 2 is closer to sample 1, which agrees with what we see in the sentences: "like" occupies positions 2 and 5 in sample 1, positions 2 and 6 in sample 2, and positions 5 and 6 in sample 3.

Step 3: create archive records for industrial parts. The archive records include the industrial part, the fault phenomenon and the corresponding solution document, and the archive data are likewise stored in the data warehouse.

Specifically, refer to Table 2 for an example of how industrial part data are stored.

Table 2
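One possible shape for an archive record is sketched below. The three fields follow the items named in step 3; the class name `FaultRecord`, the field names, the grouping by part and the sample values are hypothetical and are not taken from Table 2.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FaultRecord:
    part: str               # industrial part the record belongs to
    fault_phenomenon: str   # observed fault description
    solution_doc_id: str    # index key of the solution document


def archive_by_part(records):
    """Group archive records by industrial part."""
    archive = {}
    for record in records:
        archive.setdefault(record.part, []).append(record)
    return archive


# Made-up sample data for illustration only.
records = [
    FaultRecord("pump", "no pressure at outlet", "sol-001"),
    FaultRecord("pump", "abnormal vibration", "sol-002"),
    FaultRecord("valve", "fails to close", "sol-003"),
]
archive = archive_by_part(records)
```

Keeping the solution document as an index key rather than inline text matches the description in step 5, where the solution document is returned according to the index.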

Step 4: input the document to be detected and calculate the closest documents by the matrix distance algorithm. In the matrix distance algorithm, word segmentation, word frequency statistics and the position data calculation are performed on the input document to be detected, giving a matrix X_ak; the matrix distance is then calculated between X_ak and the data matrix X_nk in the data warehouse. The calculation is carried out over identical words in the two matrices, that is, X_a1 and X_n1 correspond to the same word. The calculation yields a comparison value d_an, where a smaller d_an indicates a closer match. Its calculation formula is:
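The distance formula itself is not reproduced in this text. Purely as an illustration of comparing a query row with a stored row column by column over the same words, the sketch below substitutes the ordinary Euclidean distance; this is an assumed stand-in, not the formula of the method. The name `row_distance` and the dictionary representation of rows are also assumptions.

```python
import math


def row_distance(query_row, doc_row, vocabulary):
    """Euclidean distance between two rows aligned on the same words.

    Rows map word -> matrix value; missing words count as 0.0.
    Assumption: Euclidean distance stands in for the unspecified formula.
    """
    return math.sqrt(
        sum((query_row.get(w, 0.0) - doc_row.get(w, 0.0)) ** 2 for w in vocabulary)
    )


vocabulary = ["motor", "overheat"]
query = {"motor": 4.0, "overheat": 6.0}
stored = {"motor": 1.0, "overheat": 2.0}
d = row_distance(query, stored, vocabulary)  # sqrt(3**2 + 4**2)
```

Whatever formula is actually used, the property relied on later is only that a smaller value means a closer match.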

Step 5: filter out the ten closest documents in combination with industrial fault information, and return the solution document according to the index. The industrial fault information includes industrial part data, fault feature data and manually defined feedback data.
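Selecting the closest documents and returning their solutions can be sketched as follows. The sketch assumes distances have already been computed per document and that filtering "in combination with industrial fault information" means restricting candidates to a given part; that interpretation, the function name `top_matches` and the sample data are assumptions.

```python
import heapq


def top_matches(distances, archive, part=None, limit=10):
    """Return solution ids of the closest documents, nearest first.

    distances maps doc id -> distance (smaller is closer).
    archive maps doc id -> {"part": ..., "solution": ...}.
    If part is given, only documents for that part are considered.
    """
    candidates = [
        (distance, doc_id)
        for doc_id, distance in distances.items()
        if part is None or archive[doc_id]["part"] == part
    ]
    best = heapq.nsmallest(limit, candidates)
    return [archive[doc_id]["solution"] for _, doc_id in best]


# Made-up sample data for illustration only.
distances = {"f1": 0.5, "f2": 0.1, "f3": 0.3}
archive = {
    "f1": {"part": "pump", "solution": "sol-1"},
    "f2": {"part": "pump", "solution": "sol-2"},
    "f3": {"part": "valve", "solution": "sol-3"},
}
pump_solutions = top_matches(distances, archive, part="pump")
closest_two = top_matches(distances, archive, limit=2)
```

The default `limit=10` mirrors the ten documents named in step 5; fault feature data and feedback data could be applied as further filters in the same place.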

Compared with the prior art, the semantic retrieval method provided by the present invention establishes a data warehouse of word segments and word frequencies, builds a bag-of-words model with a word local difference-sum training algorithm, and then matches by a matrix distance algorithm, which is better suited to the search needs of industrial enterprises and to the requirement of fast, high-precision retrieval.

The above are only embodiments of the present invention. It should be noted that those of ordinary skill in the art may make improvements without departing from the inventive concept, and such improvements fall within the protection scope of the present invention.

Claims

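1. A semantic retrieval method for rapid matching of industrial fault information, characterized by comprising the following steps:

Step 1: perform word segmentation and indexing on the original documents and count word frequencies;

Step 2: train using a bag-of-words model and a word local difference-sum training algorithm;

Step 3: create archive records for industrial parts;

Step 4: input the document to be detected and calculate the closest documents by a matrix distance algorithm;

Step 5: filter out the ten closest documents in combination with industrial fault information, and return the solution document according to the index.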

2. The semantic retrieval method according to claim 1, characterized in that, in step 1, the word segmentation index is built as follows: all original documents are segmented into words, an index is built on the resulting segments, the index over all documents is stored in a data warehouse, and the word frequencies are counted and stored in the data warehouse as well.

3. The semantic retrieval method according to claim 2, characterized in that the bag-of-words model and the word local difference-sum training algorithm are applied as follows: training is performed on all word segments and word frequencies in the data warehouse to build a matrix of word occurrence values; given n documents containing k distinct words in total, an n×k matrix is built, in which the entry in row i, column j is the number of times the j-th word occurs in the i-th document plus the value representing the word's positions, obtained by the local difference-sum training algorithm.

4. The semantic retrieval method according to claim 3, characterized in that the local difference-sum training algorithm of the bag-of-words model also includes the calculation of position data; the position data are obtained by extracting the positions of a word in the text, calculating the sum and the difference of those positions, multiplying the two, and adding the product to the word frequency, so that the set of positions of a word is represented by a single number; the local difference-sum of a word compresses the word's position vector within the sentence into one number that approximately uniquely represents its positions, and is calculated as follows:

Given position a, position b and position c, where a < c, b < c and a < b, then (a/c + b/c) × (b/c - a/c) = (b² - a²)/c² < 1;

The value calculated by this formula represents the word's positions; when the word frequency is no more than 2, this value is necessarily less than 1; each text is assumed to be divided into ten positions, and words are distributed only over these ten positions; when a word has n positions, the formula is applied iteratively to reduce them to n-1 positions, then again to n-2 positions, and so on, so that by iterating in this way the multiple positions of the word are converted to 2 positions and the distribution of the word's positions is finally represented by a single value; differences are taken as absolute values in this calculation.

5. The semantic retrieval method according to claim 1, characterized in that the archive records created for industrial parts include the type of industrial part, the fault phenomenon and the corresponding solution document, and the archive data are stored in the data warehouse.

6. The semantic retrieval method according to claim 1, characterized in that the matrix distance algorithm is as follows: word segmentation, word frequency statistics and the position data calculation are performed on the input document to be detected, giving a matrix X_ak; the matrix distance is then calculated between X_ak and the data matrix X_nk in the data warehouse; the calculation is carried out over identical words in the two matrices, that is, X_a1 and X_n1 correspond to the same word; the calculation yields a comparison value d_an, where a smaller d_an indicates a closer match; its calculation formula is:

7. The semantic retrieval method according to claim 1, characterized in that the industrial fault information includes industrial part data, fault feature data and manually defined feedback data.