CN110348470A - Semantic retrieval method for rapid matching of industrial fault information - Google Patents
- Publication number
- CN110348470A
- Authority
- CN
- China
- Prior art keywords
- word
- data
- matrix
- calculating
- industrial
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/332—Query formulation
- G06F16/3329—Natural language query formulation or dialogue systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Artificial Intelligence (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Evolutionary Computation (AREA)
- Evolutionary Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Computational Biology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Computational Linguistics (AREA)
- Databases & Information Systems (AREA)
- Mathematical Physics (AREA)
- Human Computer Interaction (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a semantic retrieval method for rapid matching of industrial fault information, comprising the following steps: Step 1: perform word segmentation and indexing on the original documents and count word frequencies; Step 2: train using a bag-of-words model and a local sum-and-difference training algorithm for words; Step 3: archive the industrial parts; Step 4: input a document and compute the closest documents using a matrix distance algorithm; Step 5: screen and rank the selected result documents again in combination with the industrial fault information, and return the solution documents according to the index. As a method for industrial fault matching, the invention improves matching precision and matching speed.
Description
[Technical Field]
The invention belongs to the technical field of text similarity matching in natural language processing, and relates to a semantic retrieval method for rapid matching of industrial fault information.
[Background]
With the arrival of the data age, industrial enterprises have accumulated large amounts of data, which are expected to help flag and solve the routine difficulties that industry faces. One salient problem is that industrial skills generally take workers a long time to build up, and experience is accumulated through a master-apprentice system. An apprentice who lacks experience often does not know how to resolve a fault when one occurs, and, because of staffing constraints, special circumstances and the like, the master may not have time to deal with the problem first. This wastes human resources and causes economic losses from the fault.
Current search engine technology is mostly based on an inverted index that maps words to documents: the document sets corresponding to the input words are intersected. This approach is simple and crude. It suits searching the massive data of the internet, but within an enterprise the set of documents to match is often only tens of thousands to somewhat over a hundred thousand items, and the business is usually confined to a particular subject area, so what the enterprise cares about more is how to improve matching precision and matching speed.
In contrast, building a data warehouse of segmented words and word frequencies, establishing a bag-of-words model and a local sum-and-difference training algorithm for words, and then matching by a matrix distance algorithm is better suited to the industrial enterprise search domain and its requirement for high-precision, fast retrieval.
[Summary of the Invention]
A semantic retrieval method for rapid matching of industrial fault information, comprising the following steps:
Step 1: perform word segmentation and indexing on the original documents and count word frequencies;
Step 2: train using a bag-of-words model and a local sum-and-difference training algorithm for words;
Step 3: archive the industrial parts;
Step 4: input the document to be checked and compute the closest documents using a matrix distance algorithm;
Step 5: in combination with the industrial fault information, filter out the ten closest documents and return the solution documents according to the index.
Further, the word segmentation and indexing of step 1 segments all the original documents into words, builds an index based on the segmented words for all the documents and stores it in a data warehouse, and at the same time counts the word frequencies and stores them in the data warehouse.
Further, training with the bag-of-words model and the local sum-and-difference training algorithm for words specifically means: training on all the segmented words and word frequencies in the data warehouse to build a matrix of word occurrence values. Given n documents containing k distinct words in total, an n*k matrix is constructed in which the entry in row i, column j is the number of times the j-th word occurs in the i-th document plus the value representing the word's positions obtained by the local sum-and-difference training algorithm.
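The count part of the n*k matrix described above can be sketched as follows. This is a minimal illustration only: the function name `build_count_matrix` and the sample tokens are hypothetical, the documents are assumed to be already segmented into word lists, and the position value that the method adds to each entry is left out here.

```python
def build_count_matrix(docs):
    """Build an n*k count matrix from n pre-segmented documents.

    docs is a list of token lists. Returns (vocab, matrix), where
    matrix[i][j] is the number of times vocab[j] occurs in docs[i].
    """
    vocab = []    # the k distinct words, in order of first appearance
    column = {}   # word -> column index j
    for tokens in docs:
        for tok in tokens:
            if tok not in column:
                column[tok] = len(vocab)
                vocab.append(tok)

    matrix = [[0] * len(vocab) for _ in docs]
    for i, tokens in enumerate(docs):
        for tok in tokens:
            matrix[i][column[tok]] += 1
    return vocab, matrix


# Hypothetical sample documents (n = 2, k = 4).
docs = [["pump", "leak", "seal"], ["pump", "noise", "pump"]]
vocab, matrix = build_count_matrix(docs)
print(vocab)
print(matrix)
```

In the method as described, each entry would additionally carry the position value of the word, so the matrix holds real numbers rather than plain counts.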
Further, the bag-of-words model and local sum-and-difference training algorithm for words also includes the calculation of position data. The position data are obtained by extracting the positions of a word in the text, computing the sum and the difference of the word's positions, multiplying the two, and adding the product to the word frequency, so that the multiple positions of a word are represented by a single number. The local sum-and-difference of a word compresses the word's position vector within the sentence into one number that approximately uniquely represents its positions. The calculation formula is:
Given positions a, b and c, where a < c, b < c and a < b, then $(a/c + b/c)(b/c - a/c) = (b^2 - a^2)/c^2 < 1$;
The value computed by this formula represents the word's position. When the word frequency does not exceed 2, this value is necessarily less than 1. Each text is divided into ten positions, and words are distributed only over these ten positions. When a word has n positions, the formula is applied iteratively to reduce them to n-1 positions, and the calculation is repeated to obtain n-2 positions; iterating in this way converts the multiple positions of a word into 2 positions, so that the distribution of the word's multiple positions is finally represented by a single value. The calculation uses absolute values.
Further, archiving the industrial parts means archiving the industrial parts, the fault phenomena and the corresponding solution documents; the archived industrial part data are likewise stored in the data warehouse.
Further, the matrix distance algorithm is specifically: the word segmentation, the word frequency statistics and the position data calculation are performed on the input document to be checked to obtain the matrix $X_{ak}$, and the matrix distance is then calculated between $X_{ak}$ and the data matrix $X_{nk}$ in the data warehouse. The calculation is carried out over the same words in both matrices, that is, $X_{a1}$ and $X_{n1}$ correspond to the same word. The calculation yields a comparison value $d_{an}$; the smaller $d_{an}$ is, the closer the documents are. The calculation formula is:
Further, the industrial fault information includes industrial part data, fault feature data and manually defined feedback data.
Compared with the prior art, the present invention builds a data warehouse of segmented words and word frequencies, establishes a bag-of-words model and a local sum-and-difference training algorithm for words, and then matches by a matrix distance algorithm, which is better suited to the industrial enterprise search domain and its requirement for high-precision, fast retrieval.
[Description of the Drawings]
Fig. 1 is a flow chart of the semantic retrieval method for rapid matching of industrial fault information provided by the present invention;
Fig. 2 is an example, obtained experimentally with this method, comparing how different words with the same word frequency are distributed within sentences.
[Detailed Description of the Embodiments]
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. The described embodiments are obviously only some, and not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.
A semantic retrieval method for rapid matching of industrial fault information, comprising the following steps:
Step 1: perform word segmentation and indexing on the original documents and count word frequencies; all the original documents are segmented and an index is built;
Specifically, refer to Table 1 for an example of building matrix data from the segmentation index. For example, for the sentence "I like you, you like me", the segmentation statistics in Table 1 yield the matrix [2, 2, 2, 1], and the segmented and indexed data are stored in the data warehouse;
Table 1
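As a rough illustration of step 1, the sketch below indexes one pre-segmented sentence and counts its word frequencies. It assumes the example sentence is the Chinese original of "I like you, you like me" (in which "I" and "me" are the same word 我), segmented by hand; the function name `index_document`, the document id `doc1` and the use of an in-memory dictionary in place of the data warehouse are all hypothetical.

```python
from collections import Counter, defaultdict


def index_document(doc_id, tokens, inverted_index):
    """Add one segmented document to the index and return its word frequencies.

    inverted_index maps each word to a list of (doc_id, position) pairs,
    with positions counted from 1.
    """
    for position, token in enumerate(tokens, start=1):
        inverted_index[token].append((doc_id, position))
    return Counter(tokens)


# Hand-segmented tokens, assumed for illustration:
# 我 = I/me, 喜欢 = like, 你 = you, ， = comma
tokens = ["我", "喜欢", "你", "，", "你", "喜欢", "我"]

inverted_index = defaultdict(list)  # stands in for the data warehouse
freq = index_document("doc1", tokens, inverted_index)

# Frequencies in order of first appearance: 我, 喜欢, 你, ，
print(list(freq.values()))
```

A real deployment would need a word segmenter for the raw text; which segmenter the method uses is not stated, so none is assumed here.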
Step 2: train using a bag-of-words model and a local sum-and-difference training algorithm for words. Training is carried out on all the segmented words and word frequencies in the data warehouse to build a matrix of word occurrence values. Given n documents containing k distinct words in total, an n*k matrix is constructed in which the entry in row i, column j is the number of times the j-th word occurs in the i-th document plus the value representing the word's positions obtained by the local sum-and-difference training algorithm.
Further, the bag-of-words model and local sum-and-difference training algorithm for words also includes the calculation of position data. The position data are obtained by extracting the positions of a word in the text, computing the sum and the difference of the word's positions, multiplying the two, and adding the product to the word frequency, so that the multiple positions of a word are represented by a single number. The local sum-and-difference of a word compresses the word's position vector within the sentence into one number that approximately uniquely represents its positions. The calculation formula is:
Given positions a, b and c, where a < c, b < c and a < b, then $(a/c + b/c)(b/c - a/c) = (b^2 - a^2)/c^2 < 1$;
The value computed by this formula represents the word's position. When the word frequency does not exceed 2, this value is necessarily less than 1. Each text is divided into ten positions, and words are distributed only over these ten positions. When a word occurs more than twice, its positions are sorted in the order in which they appear in the document, and adjacent positions are summed pairwise and differenced pairwise, step by step, until they are reduced to 2 local vectors, which are then calculated again; the calculation uses absolute values.
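For a word that occurs twice, the formula above can be written out directly. The sketch below is illustrative only: the function name `pair_position_value` is hypothetical, c is treated as a normalising constant larger than both positions (as the conditions a < c and b < c require), and the sample positions 2 and 5 with c = 10 are chosen for the example.

```python
def pair_position_value(a, b, c):
    """Compress two word positions a < b into one number below 1.

    Implements (a/c + b/c) * (b/c - a/c), which equals (b^2 - a^2) / c^2.
    """
    if not (a < b < c):
        raise ValueError("expected a < b < c")
    return (a / c + b / c) * (b / c - a / c)


# A word at positions 2 and 5, normalised by c = 10.
value = pair_position_value(2, 5, 10)

# Matrix entry as described in the text: word frequency plus position value.
word_frequency = 2
entry = word_frequency + value
print(value, entry)
```

Because the position value stays below 1 for a word that occurs at most twice, the integer part of the entry still carries the word frequency and the fractional part carries the positions.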
When a word has n positions, the formula is applied iteratively to reduce them to n-1 positions, and the calculation is repeated to obtain n-2 positions; iterating in this way converts the multiple positions of a word into 2 positions, so that the distribution of the word's multiple positions is finally represented by a single value. Specifically:
Suppose a word has three positions [a, b, c]. First compute $w_1 = (b-a)(b+a)$, then $w_2 = (c-b)(b+c)$, and finally $|w_1 - w_2|(w_1 + w_2)$. Because $w_1 < w_2$ is possible, we take the absolute value of $w_1 - w_2$ to guarantee that the word vector is non-negative. This introduces an error, however, since cases such as $|d^2 - b^2| = |c^2 - b^2|$ may occur. The probability of this is about 1/90, and the error shrinks as the word frequency increases: for a word with 3 occurrences there are 10*9*8 = 720 ways of choosing 3 positions, and only 8 of them can give rise to an error; moreover, as the number of words rises the probabilities multiply and become smaller. The method is therefore extended to an iterative calculation when a word occurs more than 3 times, that is, the calculation proceeds by repeated aggregation as follows:
When the word frequency exceeds 2, word blocks are aggregated pairwise, that is, every two adjacent occurrences form one word block, and the calculation is iterated. For example, 4 positions [a, b, c, d] can be divided into three word blocks; the 3 word blocks are then aggregated in the same way into 2 word blocks, from which the result is calculated.
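One possible reading of the pairwise aggregation, matching the three-position and four-position cases described above, is sketched below. The function name `reduce_positions` is hypothetical, positions are left unnormalised as in the three-position example, and the requirement of at least two positions is an assumption of this sketch.

```python
def reduce_positions(positions):
    """Collapse the positions of one word into a single non-negative value.

    Each pass replaces every adjacent pair (lo, hi) with |hi - lo| * (hi + lo),
    so n values become n - 1 values, until only one value remains.
    """
    values = sorted(positions)  # order of appearance in the document
    if len(values) < 2:
        raise ValueError("need at least two positions")
    while len(values) > 1:
        values = [abs(hi - lo) * (hi + lo) for lo, hi in zip(values, values[1:])]
    return values[0]


two = reduce_positions([2, 5])         # (5 - 2) * (5 + 2)
three = reduce_positions([1, 2, 4])    # w1 = 3, w2 = 12, then |w1 - w2| * (w1 + w2)
four = reduce_positions([1, 2, 3, 4])  # 3 blocks, then 2 blocks, then 1 value
print(two, three, four)

# The absolute value can make distinct position sets collide:
# for [1, 5, 7] both blocks equal 24, so the result collapses to 0.
collision = reduce_positions([1, 5, 7])
print(collision)
```

The last case shows the kind of error the text attributes to the absolute value: information about the positions is lost whenever two adjacent blocks happen to be equal.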
Refer to Fig. 2, which shows an example, obtained experimentally with this method, comparing how different words with the same word frequency are distributed within sentences.
Sample 1 is the sentence: I like you, you like me
Sample 2 is the sentence: you like me, I like you
Sample 3 is the sentence: I I you you, like like
From the texts given, it can be seen that the word distributions in sentences 1 and 2 are more alike, while the word frequencies of all 3 sentences are identical.
1.0: "you", 2.0: " ", 3.0: "like", 4.0: "I"
Fig. 2 shows clearly which words are distributed more closely across the sentences. For example, sample 1 and sample 3 are closer in the distribution of the word "you", whereas in the distribution of the word "like", sample 2 is closer to sample 1. This is consistent with what we see from the sentences, because "like" is distributed at positions 2 and 5 in sample 1, at positions 2 and 6 in sample 2, and at positions 5 and 6 in sample 3.
Step 3: archive the industrial parts. Archiving the industrial parts means archiving the industrial parts, the fault phenomena and the corresponding solution documents; the archived industrial part data are likewise stored in the data warehouse.
Specifically, refer to Table 2 for an example of how industrial part data are stored.
Table 2
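The archived record could be modelled along the following lines. This is only a sketch: the class name, the field names, the sample values and the in-memory dictionary standing in for the data warehouse are all hypothetical and are not taken from Table 2.

```python
from dataclasses import dataclass


@dataclass
class PartArchiveRecord:
    """One archived entry: a part, an observed fault, and its solution document."""
    part: str
    fault_phenomenon: str
    solution_doc_id: str


# Stand-in for the data warehouse: index key -> archived record.
archive = {}


def file_record(index_key, record):
    """Store a record under the index key used later to return the solution."""
    archive[index_key] = record


# Hypothetical sample entry.
file_record("doc-0001", PartArchiveRecord(
    part="bearing",
    fault_phenomenon="abnormal vibration",
    solution_doc_id="SOL-0001",
))

found = archive["doc-0001"]
print(found.solution_doc_id)
```

Keeping the solution document reachable from the same index key is what lets step 5 return a solution once the closest fault documents have been found.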
Step 4: input the document to be checked and compute the closest documents using a matrix distance algorithm. In the matrix distance algorithm, the word segmentation, the word frequency statistics and the position data calculation are performed on the input document to be checked to obtain the matrix $X_{ak}$, and the matrix distance is then calculated between $X_{ak}$ and the data matrix $X_{nk}$ in the data warehouse. The calculation is carried out over the same words in both matrices, that is, $X_{a1}$ and $X_{n1}$ correspond to the same word. The calculation yields a comparison value $d_{an}$; the smaller $d_{an}$ is, the closer the documents are. The calculation formula is:
Step 5: in combination with the industrial fault information, filter out the ten closest documents and return the solution documents according to the index. The industrial fault information includes industrial part data, fault feature data and manually defined feedback data.
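Steps 4 and 5 can be sketched together as a nearest-document search. The distance formula of the method is not reproduced in this text, so the Euclidean distance used below is purely an assumption for illustration; the function names, document ids and sample rows are likewise hypothetical, and the additional screening against industrial fault information is omitted.

```python
import math


def row_distance(query_row, doc_row):
    """Assumed distance: Euclidean distance between two aligned matrix rows.

    Column j of both rows must refer to the same word.
    """
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(query_row, doc_row)))


def closest_documents(query_row, doc_rows, k=10):
    """Return the ids of the k stored documents with the smallest distance."""
    scored = sorted((row_distance(query_row, row), doc_id)
                    for doc_id, row in doc_rows.items())
    return [doc_id for _, doc_id in scored[:k]]


# Hypothetical rows: word frequency plus position value per word.
doc_rows = {
    "d1": [2.21, 1.0, 0.0],
    "d2": [0.0, 1.0, 2.0],
    "d3": [2.0, 1.0, 0.0],
}
query = [2.21, 1.0, 0.0]

top = closest_documents(query, doc_rows, k=2)
print(top)
```

With the default k = 10 the function returns the ten closest documents, whose index keys would then be used to look up the archived solution documents.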
Compared with the prior art, the semantic retrieval method provided by the present invention builds a data warehouse of segmented words and word frequencies, establishes a bag-of-words model and a local sum-and-difference training algorithm for words, and then matches by a matrix distance algorithm, which is better suited to the industrial enterprise search domain and its requirement for high-precision, fast retrieval.
The above are only embodiments of the present invention. It should be noted that a person of ordinary skill in the art may make improvements without departing from the inventive concept, and such improvements all fall within the scope of protection of the present invention.
Claims (7)
1. A semantic retrieval method for rapid matching of industrial fault information, characterized by comprising the following steps:
Step 1: perform word segmentation and indexing on the original documents and count word frequencies;
Step 2: train using a bag-of-words model and a local sum-and-difference training algorithm for words;
Step 3: archive the industrial parts;
Step 4: input the document to be checked and compute the closest documents using a matrix distance algorithm;
Step 5: in combination with the industrial fault information, filter out the ten closest documents and return the solution documents according to the index.
2. The semantic retrieval method according to claim 1, characterized in that, in step 1, the word segmentation and indexing is specifically: segmenting all the original documents into words, building an index based on the segmented words for all the documents and storing it in a data warehouse, and at the same time counting the word frequencies and storing them in the data warehouse.
3. The semantic retrieval method according to claim 2, characterized in that training with the bag-of-words model and the local sum-and-difference training algorithm for words is specifically: training on all the segmented words and word frequencies in the data warehouse to build a matrix of word occurrence values; given n documents containing k distinct words in total, an n*k matrix is constructed in which the entry in row i, column j is the number of times the j-th word occurs in the i-th document plus the value representing the word's positions obtained by the local sum-and-difference training algorithm.
4. The semantic retrieval method according to claim 3, characterized in that the bag-of-words model and local sum-and-difference training algorithm for words further includes the calculation of position data; the position data are obtained by extracting the positions of a word in the text, computing the sum and the difference of the word's positions, multiplying the two, and adding the product to the word frequency, so that the multiple positions of a word are represented by a single number; the local sum-and-difference of a word compresses the word's position vector within the sentence into one number that approximately uniquely represents its positions; the calculation formula is:
Given positions a, b and c, where a < c, b < c and a < b, then $(a/c + b/c)(b/c - a/c) = (b^2 - a^2)/c^2 < 1$;
The value computed by this formula represents the word's position; when the word frequency does not exceed 2, this value is necessarily less than 1; each text is divided into ten positions, and words are distributed only over these ten positions; when a word has n positions, the formula is applied iteratively to reduce them to n-1 positions, and the calculation is repeated to obtain n-2 positions; iterating in this way converts the multiple positions of a word into 2 positions, so that the distribution of the word's multiple positions is finally represented by a single value; the calculation uses absolute values.
5. The semantic retrieval method according to claim 1, characterized in that archiving the industrial parts includes archiving the type of the industrial part, the fault phenomenon and the corresponding solution document, and storing the archived industrial part data in the data warehouse.
6. The semantic retrieval method according to claim 1, characterized in that the matrix distance algorithm is specifically: performing the word segmentation, the word frequency statistics and the position data calculation on the input document to be checked to obtain the matrix $X_{ak}$, and calculating the matrix distance between $X_{ak}$ and the data matrix $X_{nk}$ in the data warehouse; the calculation is carried out over the same words in both matrices, that is, $X_{a1}$ and $X_{n1}$ correspond to the same word; the calculation yields a comparison value $d_{an}$, and the smaller $d_{an}$ is, the closer the documents are; the calculation formula is:
7. The semantic retrieval method according to claim 1, characterized in that the industrial fault information includes industrial part data, fault feature data and manually defined feedback data.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910428519.2A CN110348470B (en) | 2019-05-21 | 2019-05-21 | Semantic retrieval method for industrial fault information rapid matching |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910428519.2A CN110348470B (en) | 2019-05-21 | 2019-05-21 | Semantic retrieval method for industrial fault information rapid matching |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110348470A | 2019-10-18 |
CN110348470B | 2022-11-22 |
Family
ID=68173908
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910428519.2A Active CN110348470B (en) | 2019-05-21 | 2019-05-21 | Semantic retrieval method for industrial fault information rapid matching |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110348470B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2009104475A (en) * | 2007-10-24 | 2009-05-14 | Nippon Telegr & Teleph Corp <Ntt> | Similar document retrieval device, and similar document retrieval method and program |
CN102955848A (en) * | 2012-10-29 | 2013-03-06 | 北京工商大学 | Semantic-based three-dimensional model retrieval system and method |
US8880540B1 (en) * | 2012-03-28 | 2014-11-04 | Emc Corporation | Method and system for using location transformations to identify objects |
US9069768B1 (en) * | 2012-03-28 | 2015-06-30 | Emc Corporation | Method and system for creating subgroups of documents using optical character recognition data |
Non-Patent Citations (1)
Title |
---|
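Wu Zhiqiang et al.: "Efficient and scalable symmetric ciphertext retrieval architecture" (高效可扩展的对称密文检索架构), Journal on Communications (《通信学报》) * |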
Also Published As
Publication number | Publication date |
---|---|
CN110348470B (en) | 2022-11-22 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108573411B (en) | Mixed recommendation method based on deep emotion analysis and multi-source recommendation view fusion of user comments | |
Hassan et al. | Twitter sentiment analysis: A bootstrap ensemble framework | |
CN104102626B (en) | A kind of method for short text Semantic Similarity Measurement | |
KR101700585B1 (en) | On-line product search method and system | |
CN103064970B (en) | Optimize the search method of interpreter | |
CN103823838B (en) | A kind of method of multi-format document typing and comparison | |
CN108073568A (en) | keyword extracting method and device | |
CN111737535B (en) | Network characterization learning method based on element structure and graph neural network | |
CN105022754A (en) | Social network based object classification method and apparatus | |
CN107239512B (en) | A kind of microblogging comment spam recognition methods of combination comment relational network figure | |
CA2720842A1 (en) | System and method for value significance evaluation of ontological subjects of network and the applications thereof | |
CN103885937A (en) | Method for judging repetition of enterprise Chinese names on basis of core word similarity | |
CN104408033A (en) | Text message extracting method and system | |
CN111199474A (en) | Risk prediction method and device based on network diagram data of two parties and electronic equipment | |
CN105894183A (en) | Project evaluation method and apparatus | |
CN109033132A (en) | The method and device of text and the main body degree of correlation are calculated using knowledge mapping | |
CN110287329A (en) | A kind of electric business classification attribute excavation method based on commodity text classification | |
CN106339486A (en) | Image retrieval method based on incremental learning of large vocabulary tree | |
CN106528768A (en) | Consultation hotspot analysis method and device | |
CN106202038A (en) | Synonym method for digging based on iteration and device | |
Cao et al. | Towards automatic numerical cross-checking: Extracting formulas from text | |
CN109408643B (en) | Fund similarity calculation method, system, computer equipment and storage medium | |
CN112905906B (en) | Recommendation method and system fusing local collaboration and feature intersection | |
CN109033428A (en) | A kind of intelligent customer service method and system | |
Meena et al. | A survey on community detection algorithm and its applications |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||