CN110348470A - Semantic retrieval method for rapid matching of industrial fault information - Google Patents


Info

Publication number
CN110348470A
Authority
CN
China
Prior art keywords
word
data
matrix
calculating
industrial
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910428519.2A
Other languages
Chinese (zh)
Other versions
CN110348470B (en)
Inventor
李肯立
闫安民
阳王东
刘楚波
陈岑
周旭
吴帆
唐卓
李克勤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hunan University
Original Assignee
Hunan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hunan University
Priority to CN201910428519.2A
Publication of CN110348470A
Application granted
Publication of CN110348470B
Legal status: Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Mathematical Physics (AREA)
  • Human Computer Interaction (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a semantic retrieval method for rapid matching of industrial fault information, comprising the following steps. Step 1: perform word segmentation and indexing on the original documents and count word frequencies. Step 2: train using a bag-of-words model and the local sum-and-difference training algorithm for words. Step 3: archive the industrial parts. Step 4: input a document and compute the closest documents with a matrix distance algorithm. Step 5: screen and rank the selected result documents in combination with the industrial fault information, and return the solution documents according to the index. The method improves both matching precision and matching speed for industrial fault matching.

Description

Semantic retrieval method for rapid matching of industrial fault information
[Technical field]
The invention belongs to the technical field of text similarity matching in natural language processing, and relates to a semantic retrieval method for rapid matching of industrial fault information.
[Background art]
With the arrival of the data age, industrial enterprises have accumulated large amounts of data that are expected to help identify and solve the routine difficulties the industry faces. One prominent problem is that industrial skills generally take workers a long time to accumulate and are passed on through a master-apprentice system. An apprentice who lacks experience often does not know how to resolve a fault when one occurs, and because of staffing constraints, special circumstances and the like, the master may not have time to deal with the problem promptly. This results in a loss of human resources and in economic losses caused by the fault.
Current search engine technology is mostly based on an inverted index that maps words to documents: the documents corresponding to each input word are looked up and intersected. This approach is simple and crude. It suits searching massive data on the Internet, but in an enterprise the set of documents to match usually numbers only tens of thousands to somewhat over a hundred thousand, and the business is usually confined to a specific subject area. What an enterprise cares about more is how to improve matching precision and matching speed.
In contrast, establishing a data warehouse of segmented words and word frequencies, building a bag-of-words model with the local sum-and-difference training algorithm for words, and then matching with a matrix distance algorithm is better suited to the industrial enterprise search domain and its requirement for fast, high-precision retrieval.
[Summary of the invention]
A semantic retrieval method for rapid matching of industrial fault information, comprising the following steps:
Step 1: perform word segmentation and indexing on the original documents, and count word frequencies;
Step 2: train using a bag-of-words model and the local sum-and-difference training algorithm for words;
Step 3: archive the industrial parts;
Step 4: input the document to be checked, and compute the closest documents with a matrix distance algorithm;
Step 5: in combination with the industrial fault information, filter out the ten closest documents and return the solution documents according to the index.
Further, the word segmentation and indexing of Step 1 segments all original documents into words, builds an index based on the segmented words, indexes all documents and stores the index in the data warehouse, and at the same time counts word frequencies and stores them in the data warehouse.
Further, training with the bag-of-words model and the local sum-and-difference training algorithm is specifically: training is performed on all segmented words and word frequencies in the data warehouse to build a matrix of word occurrence frequencies. Assuming n documents contain k distinct words in total, an n*k matrix is built, in which the entry in row i, column j is the number of times the j-th word occurs in the i-th document plus the value representing the word's position obtained by the local sum-and-difference training algorithm.
Further, the local sum-and-difference training algorithm also includes the calculation of position data. The position data is obtained by extracting the positions of a word in the text, computing the sum and the difference of the word's positions, multiplying the two, and adding the product to the word frequency, so that the multiple positions of a word are aggregated into a single number. The local sum-and-difference of a word compresses its position vector within the sentence into one number that approximately uniquely represents its positions. The calculation formula is:
Given positions a, b and c with $a < c$, $b < c$ and $a < b$, then $(a/c + b/c)\,(b/c - a/c) = (b^2 - a^2)/c^2 < 1$;
The value calculated by this formula represents the word's position. When the word frequency does not exceed 2, this value is necessarily less than 1. Each text is divided into ten positions, and words are distributed only over these ten positions. When a word has n positions, the formula is applied iteratively to reduce them to n-1 positions, then again to n-2 positions, and so on until the multiple positions of the word are reduced to 2 positions, so that finally the distribution of the word's positions is represented by a single value. The calculation uses absolute values.
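A short derivation may make the stated bound easier to check. It assumes the positions are positive, $0 < a < b < c$ (positivity is an assumption; the text states only the ordering), and the numbers in the worked instance are illustrative:

```latex
\[
\left(\frac{a}{c}+\frac{b}{c}\right)\left(\frac{b}{c}-\frac{a}{c}\right)
= \frac{(b+a)(b-a)}{c^{2}}
= \frac{b^{2}-a^{2}}{c^{2}}
< \frac{b^{2}}{c^{2}}
< 1 .
\]
% Illustrative instance with a = 2, b = 5, c = 10:
\[
\left(\frac{2}{10}+\frac{5}{10}\right)\left(\frac{5}{10}-\frac{2}{10}\right)
= \frac{7}{10}\cdot\frac{3}{10}
= \frac{21}{100}
= \frac{5^{2}-2^{2}}{10^{2}} .
\]
```

Because the position term stays below 1 for a word with at most two occurrences, adding it to the integer word frequency leaves the integer part of the matrix entry equal to the count.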
Further, archiving the industrial parts covers the industrial part, the fault phenomenon and the corresponding solution document, and the industrial part archive data is likewise stored in the data warehouse.
Further, the matrix distance algorithm is specifically: the input document to be checked undergoes the word segmentation, word frequency statistics and position data calculation to obtain a matrix $X_{ak}$. The matrix distance is then computed between $X_{ak}$ and the data matrix $X_{nk}$ in the data warehouse, using the entries for identical words, i.e. $X_{a1}$ and $X_{n1}$ correspond to the same word. The calculation yields a comparison value $d_{an}$; the smaller $d_{an}$, the closer the documents. The calculation formula is:
Further, the industrial fault information includes industrial part data, fault feature data and manually defined feedback data.
Compared with the prior art, the invention establishes a data warehouse of segmented words and word frequencies, builds a bag-of-words model with the local sum-and-difference training algorithm for words, and then matches with a matrix distance algorithm, which is better suited to the industrial enterprise search domain and its requirement for fast, high-precision retrieval.
[Description of the drawings]
Fig. 1 is a flow chart of the semantic retrieval method for rapid matching of industrial fault information provided by the invention;
Fig. 2 shows an example, obtained experimentally with this method, comparing the distribution of words in sentences that have the same word frequencies but different word orders.
[Detailed description of embodiments]
The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the drawings. The described embodiments are obviously only some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the invention.
A semantic retrieval method for rapid matching of industrial fault information, comprising the following steps:
Step 1: perform word segmentation and indexing on the original documents and count word frequencies; all original documents are segmented and an index is built;
Specifically, Table 1 gives an example of building matrix data from the segmentation index. For example, the sentence "I like you, you like me" is segmented and counted as in Table 1, giving the matrix row [2, 2, 2, 1], and the segmentation and index data are stored in the data warehouse;
Table 1
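The segmentation, indexing and counting of Step 1 can be sketched roughly as follows. This is a minimal illustration rather than the patented implementation: the function name `build_index`, the document id `doc1`, and the choice to pass in already segmented tokens are assumptions, and the English "me" is written as "I" on the assumption that both render the same source-language word, which is what makes the counts come out as [2, 2, 2, 1].

```python
from collections import Counter, defaultdict


def build_index(docs):
    """Build an inverted index and per-document word frequencies.

    docs maps a document id to its list of tokens. Segmentation itself
    is assumed to have been done already by some word segmenter.
    """
    inverted = defaultdict(set)   # word -> ids of documents containing it
    freqs = {}                    # document id -> Counter of word frequencies
    for doc_id, tokens in docs.items():
        freqs[doc_id] = Counter(tokens)
        for tok in tokens:
            inverted[tok].add(doc_id)
    return dict(inverted), freqs


# "I like you, you like me", with "me" written as "I" (see the note above).
docs = {"doc1": ["I", "like", "you", ",", "you", "like", "I"]}
inverted, freqs = build_index(docs)

vocab = ["I", "like", "you", ","]
row = [freqs["doc1"][w] for w in vocab]
print(row)  # [2, 2, 2, 1]
```

Both structures would then be written to the data warehouse; how that store is implemented is not described in the text.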
Step 2: train using a bag-of-words model and the local sum-and-difference training algorithm for words. Training is performed on all segmented words and word frequencies in the data warehouse to build a matrix of word occurrence frequencies. Assuming n documents contain k distinct words in total, an n*k matrix is built, in which the entry in row i, column j is the number of times the j-th word occurs in the i-th document plus the value representing the word's position obtained by the local sum-and-difference training algorithm.
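As a rough sketch of the n*k matrix described here, the following builds one row per document and one column per distinct word, with each entry equal to the word count plus a position term. The function name `build_matrix`, the `position_value` hook and the sample words are hypothetical; when no hook is supplied the position term is taken as 0, so the matrix reduces to plain counts.

```python
from collections import Counter


def build_matrix(docs, position_value=None):
    """Return (vocab, matrix) for a list of token lists.

    matrix[i][j] = count of vocab[j] in docs[i] + position term.
    position_value is a hypothetical hook: it receives the 1-based
    positions of a word in a document and returns a number.
    """
    vocab = sorted({tok for tokens in docs for tok in tokens})
    column = {word: j for j, word in enumerate(vocab)}
    matrix = []
    for tokens in docs:
        counts = Counter(tokens)
        positions = {}
        for idx, tok in enumerate(tokens, start=1):
            positions.setdefault(tok, []).append(idx)
        row = [0.0] * len(vocab)
        for word, count in counts.items():
            extra = position_value(positions[word]) if position_value else 0.0
            row[column[word]] = count + extra
        matrix.append(row)
    return vocab, matrix


# Illustrative documents (n = 2) with k = 3 distinct words.
docs = [["pump", "leak", "pump"], ["valve", "leak"]]
vocab, matrix = build_matrix(docs)
print(vocab)   # ['leak', 'pump', 'valve']
print(matrix)  # [[1.0, 2.0, 0.0], [1.0, 0.0, 1.0]]
```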
Further, the local sum-and-difference training algorithm also includes the calculation of position data. The position data is obtained by extracting the positions of a word in the text, computing the sum and the difference of the word's positions, multiplying the two, and adding the product to the word frequency, so that the multiple positions of a word are aggregated into a single number. The local sum-and-difference of a word compresses its position vector within the sentence into one number that approximately uniquely represents its positions. The calculation formula is:
Given positions a, b and c with $a < c$, $b < c$ and $a < b$, then $(a/c + b/c)\,(b/c - a/c) = (b^2 - a^2)/c^2 < 1$;
The value calculated by this formula represents the word's position. When the word frequency does not exceed 2, this value is necessarily less than 1. Each text is divided into ten positions, and words are distributed only over these ten positions. When a word occurs more than twice, its positions are sorted in the order in which they appear in the document, and each pair of adjacent positions is combined step by step by taking their sum and their difference, reducing them to 2 local vectors, after which the calculation is applied again. The calculation uses absolute values.
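The two-position formula can be checked numerically with a small sketch. The function name `two_position_value` is hypothetical, and taking c = 10 as the number of position slots is an assumption drawn from the "ten positions" statement; exact fractions are used so that equal values compare equal.

```python
from fractions import Fraction
from itertools import combinations


def two_position_value(a, b, c=10):
    """Return (a/c + b/c) * (b/c - a/c), which equals (b^2 - a^2) / c^2."""
    if not 0 < a < b < c:
        raise ValueError("expected 0 < a < b < c")
    return (Fraction(a, c) + Fraction(b, c)) * (Fraction(b, c) - Fraction(a, c))


# Every pair of positions 1..9 gives a value strictly below 1.
values = {pair: two_position_value(*pair) for pair in combinations(range(1, 10), 2)}
assert all(v < 1 for v in values.values())

print(two_position_value(2, 5))            # 21/100
# The value is only approximately unique: two different pairs can coincide.
print(values[(1, 4)] == values[(7, 8)])    # True, both equal 15/100
```

The coincidence of (1, 4) and (7, 8) is consistent with the text's wording that the number "approximately" uniquely represents the positions.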
When a word has n positions, the formula is applied iteratively to reduce them to n-1 positions, then again to n-2 positions, and so on until the multiple positions of the word are reduced to 2 positions, so that finally the distribution of the word's positions is represented by a single value. The details are as follows:
Suppose a word has three positions [a, b, c]. First compute $w_1 = (b - a)(b + a)$, then $w_2 = (c - b)(b + c)$, and finally $|w_1 - w_2|\,(w_1 + w_2)$. Because $w_1 < w_2$ is possible, $w_1 - w_2$ is taken in absolute value to keep the word vector non-negative. This introduces a possible error, since $|d^2 - b^2| = |c^2 - b^2|$ may occur; the computed probability is about 1/90, and the error decreases as the word frequency grows. For a word with 3 occurrences there are 10*9*8 = 720 ways to choose the 3 positions, of which only 8 can produce the error, and because the probabilities multiply as the number of occurrences rises, the error probability becomes smaller. The method is therefore extended iteratively to words with more than 3 occurrences, i.e. by repeated aggregation as follows:
When the word frequency exceeds 2, word blocks are aggregated pairwise: every two adjacent occurrences form one word block for the iterative calculation. For example, 4 positions [a, b, c, d] are divided into three word blocks, the 3 word blocks are aggregated in the same way into 2 word blocks, and the result is then calculated.
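The pairwise aggregation can be sketched as below. This is one reading of the description, not a confirmed implementation: it works on raw positions without dividing by c, as the three-position example does, and it rejects words with fewer than two positions because the text does not say how a single occurrence is handled. The function names are hypothetical.

```python
def pairwise_reduce(values):
    """Combine each adjacent pair (x, y) into |y - x| * (y + x)."""
    return [abs(y - x) * (y + x) for x, y in zip(values, values[1:])]


def compress_positions(positions):
    """Reduce the positions of one word to a single number.

    Each pass turns n values into n - 1 values; once 2 values remain,
    the same sum-times-difference step yields the final number.
    """
    values = sorted(positions)
    if len(values) < 2:
        raise ValueError("at least two positions are required in this sketch")
    while len(values) > 2:
        values = pairwise_reduce(values)
    x, y = values
    return abs(y - x) * (x + y)


print(compress_positions([2, 5]))        # 21  = 5^2 - 2^2
print(compress_positions([1, 2, 3]))     # 16  = |3 - 5| * (3 + 5)
print(compress_positions([1, 2, 3, 4]))  # 320 = |16 - 24| * (16 + 24)
```

For [1, 2, 3, 4] the first pass gives the three blocks [3, 5, 7], the second pass gives the two blocks [16, 24], and the final step gives 320, matching the four-position walk-through above.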
Referring to Fig. 2, which shows an example, obtained experimentally with this method, comparing the distribution of words in sentences with the same word frequencies but different word orders:
Sample 1 is the sentence: I like you, you like me
Sample 2 is the sentence: you like me, I like you
Sample 3 is the sentence: I I you you, like
From the text it can be seen that the word distributions of sentences 1 and 2 are more similar, while sentence 3 has the same word frequencies.
Word labels in Fig. 2: 1.0: "you", 2.0: " ", 3.0: "like", 4.0: "I"
Fig. 2 shows clearly which sentences have closer distributions for each word. For example, Sample 1 and Sample 3 are closer in the distribution of the word "you". In the distribution of the word "like", Sample 2 is closer to Sample 1, which is consistent with what we observe from the sentences, because "like" occupies positions 2 and 5 in Sample 1, positions 2 and 6 in Sample 2, and positions 5 and 6 in Sample 3.
Step 3: archive the industrial parts. The archive covers the industrial part, the fault phenomenon and the corresponding solution document, and the industrial part archive data is likewise stored in the data warehouse.
Specifically, Table 2 gives an example of how industrial part data is stored,
Table 2
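One possible in-memory shape for an archive record is sketched below. The three fields follow the items named in Step 3; the class names, the identifiers and the sample values are all hypothetical, and the actual layout of Table 2 is not reproduced in this text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PartRecord:
    part: str               # industrial part
    fault_phenomenon: str   # observed fault
    solution_doc_id: str    # index key of the corresponding solution document


class PartArchive:
    """Toy stand-in for the part archive kept in the data warehouse."""

    def __init__(self):
        self._records = {}

    def add(self, record_id, record):
        self._records[record_id] = record

    def solution_for(self, record_id):
        return self._records[record_id].solution_doc_id


archive = PartArchive()
archive.add("r1", PartRecord("hydraulic pump", "oil leak at seal", "sol-001"))
print(archive.solution_for("r1"))  # sol-001
```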
Step 4: input the document to be checked and compute the closest documents with the matrix distance algorithm. In the matrix distance algorithm, the input document to be checked undergoes the word segmentation, word frequency statistics and position data calculation to obtain a matrix $X_{ak}$. The matrix distance is then computed between $X_{ak}$ and the data matrix $X_{nk}$ in the data warehouse, using the entries for identical words, i.e. $X_{a1}$ and $X_{n1}$ correspond to the same word. The calculation yields a comparison value $d_{an}$; the smaller $d_{an}$, the closer the documents. The calculation formula is:
Step 5: in combination with the industrial fault information, filter out the ten closest documents and return the solution documents according to the index. The industrial fault information includes industrial part data, fault feature data and manually defined feedback data.
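Steps 4 and 5 can be illustrated with the sketch below. The distance formula itself does not survive in this text, so Euclidean distance over aligned word columns is used purely as a stand-in assumption; the patented formula may differ. The function names, the `keep` filter (standing in for screening by industrial fault information) and the sample rows are hypothetical.

```python
import math


def distance(query_row, doc_row):
    """Stand-in distance: Euclidean over columns that refer to the same word."""
    return math.sqrt(sum((q - d) ** 2 for q, d in zip(query_row, doc_row)))


def closest(query_row, matrix, top=10, keep=None):
    """Return the row indices of the `top` closest documents.

    keep is an optional predicate on the row index, used here as a
    placeholder for screening by part, fault feature or feedback data.
    """
    scored = sorted(
        (distance(query_row, row), i)
        for i, row in enumerate(matrix)
        if keep is None or keep(i)
    )
    return [i for _, i in scored[:top]]


matrix = [[1, 2, 0], [1, 0, 1], [0, 0, 3]]
query = [1, 2, 0]
print(closest(query, matrix, top=2))                          # [0, 1]
print(closest(query, matrix, top=2, keep=lambda i: i != 0))   # [1, 2]
```

The returned indices would then be mapped through the index to the corresponding solution documents.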
Compared with the prior art, the semantic retrieval method provided by the invention establishes a data warehouse of segmented words and word frequencies, builds a bag-of-words model with the local sum-and-difference training algorithm for words, and then matches with a matrix distance algorithm, which is better suited to the industrial enterprise search domain and its requirement for fast, high-precision retrieval.
The above are only embodiments of the invention. It should be noted that a person of ordinary skill in the art may make improvements without departing from the inventive concept, and these improvements all fall within the scope of protection of the invention.

Claims (7)

1. A semantic retrieval method for rapid matching of industrial fault information, characterized by comprising the following steps:
Step 1: perform word segmentation and indexing on the original documents, and count word frequencies;
Step 2: use a bag-of-words model and the local sum-and-difference training algorithm for words;
Step 3: archive the industrial parts;
Step 4: input the document to be checked, and compute the closest documents with a matrix distance algorithm;
Step 5: in combination with the industrial fault information, filter out the ten closest documents and return the solution documents according to the index.
2. The semantic retrieval method of claim 1, characterized in that, in Step 1, the word segmentation and indexing is specifically: all original documents are segmented into words, an index is built based on the segmented words, all documents are indexed and the index is stored in a data warehouse, and at the same time word frequencies are counted and stored in the data warehouse.
3. The semantic retrieval method of claim 2, characterized in that using the bag-of-words model and the local sum-and-difference training algorithm is specifically: training is performed on all segmented words and word frequencies in the data warehouse to build a matrix of word occurrence frequencies; assuming n documents contain k distinct words in total, an n*k matrix is built, in which the entry in row i, column j is the number of times the j-th word occurs in the i-th document plus the value representing the word's position obtained by the local sum-and-difference training algorithm.
4. The semantic retrieval method of claim 3, characterized in that the local sum-and-difference training algorithm also includes the calculation of position data; the position data is obtained by extracting the positions of a word in the text, computing the sum and the difference of the word's positions, multiplying the two, and adding the product to the word frequency, so that the multiple positions of a word are aggregated into a single number; the local sum-and-difference of a word compresses its position vector within the sentence into one number that approximately uniquely represents its positions; the calculation formula is:
Given positions a, b and c with $a < c$, $b < c$ and $a < b$, then $(a/c + b/c)\,(b/c - a/c) = (b^2 - a^2)/c^2 < 1$;
The value calculated by this formula represents the word's position; when the word frequency does not exceed 2, this value is necessarily less than 1; each text is divided into ten positions, and words are distributed only over these ten positions; when a word has n positions, the formula is applied iteratively to reduce them to n-1 positions, then again to n-2 positions, and so on until the multiple positions of the word are reduced to 2 positions, so that finally the distribution of the word's positions is represented by a single value; the calculation uses absolute values.
5. The semantic retrieval method of claim 1, characterized in that archiving the industrial parts covers the type of the industrial part, the fault phenomenon and the corresponding solution document, and the industrial part archive data is stored in the data warehouse.
6. The semantic retrieval method of claim 1, characterized in that the matrix distance algorithm is specifically: the input document to be checked undergoes the word segmentation, word frequency statistics and position data calculation to obtain a matrix $X_{ak}$; the matrix distance is then computed between $X_{ak}$ and the data matrix $X_{nk}$ in the data warehouse, using the entries for identical words, i.e. $X_{a1}$ and $X_{n1}$ correspond to the same word; the calculation yields a comparison value $d_{an}$, and the smaller $d_{an}$, the closer the documents; the calculation formula is:
7. The semantic retrieval method of claim 1, characterized in that the industrial fault information includes industrial part data, fault feature data and manually defined feedback data.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910428519.2A CN110348470B (en) 2019-05-21 2019-05-21 Semantic retrieval method for industrial fault information rapid matching

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910428519.2A CN110348470B (en) 2019-05-21 2019-05-21 Semantic retrieval method for industrial fault information rapid matching

Publications (2)

Publication Number Publication Date
CN110348470A true CN110348470A (en) 2019-10-18
CN110348470B CN110348470B (en) 2022-11-22

Family

ID=68173908

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910428519.2A Active CN110348470B (en) 2019-05-21 2019-05-21 Semantic retrieval method for industrial fault information rapid matching

Country Status (1)

Country Link
CN (1) CN110348470B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2009104475A (en) * 2007-10-24 2009-05-14 Nippon Telegr & Teleph Corp <Ntt> Similar document retrieval device, and similar document retrieval method and program
CN102955848A (en) * 2012-10-29 2013-03-06 北京工商大学 Semantic-based three-dimensional model retrieval system and method
US8880540B1 (en) * 2012-03-28 2014-11-04 Emc Corporation Method and system for using location transformations to identify objects
US9069768B1 (en) * 2012-03-28 2015-06-30 Emc Corporation Method and system for creating subgroups of documents using optical character recognition data

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
吴志强等: "高效可扩展的对称密文检索架构", 《通信学报》 *

Also Published As

Publication number Publication date
CN110348470B (en) 2022-11-22


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant