CN114064855A - Information retrieval method and system based on transformer knowledge base

Info

Publication number
CN114064855A
Authority
CN
China
Prior art keywords
word
corpus
transformer
knowledge base
sentences
Prior art date
Legal status
Granted
Application number
CN202111329907.9A
Other languages
Chinese (zh)
Other versions
CN114064855B (en)
Inventor
孙瀚
李坤仑
王晶
张庆伟
王力
Current Assignee
Nari Technology Co Ltd
NARI Nanjing Control System Co Ltd
Original Assignee
Nari Technology Co Ltd
NARI Nanjing Control System Co Ltd
Priority date
Filing date
Publication date
Application filed by Nari Technology Co Ltd, NARI Nanjing Control System Co Ltd
Priority to CN202111329907.9A
Publication of CN114064855A
Application granted
Publication of CN114064855B
Legal status: Active
Anticipated expiration

Classifications

    • G06F16/3335 Syntactic pre-processing, e.g. stopword elimination, stemming
    • G06F16/3344 Query execution using natural language analysis
    • G06F16/3346 Query execution using probabilistic model
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F40/216 Parsing using statistical methods
    • G06F40/237 Lexical tools
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an information retrieval method and system based on a transformer knowledge base, comprising the following steps: selecting 38 standards/specifications commonly used in the field of transformer operation and maintenance to construct a sample library; structurally parsing the standard/specification documents to form a corpus; iterating a transformer knowledge base dictionary and performing Chinese word segmentation for the electric power field, extracting keywords from the sentences in the corpus, and iteratively expanding the transformer knowledge base dictionary; establishing a word-frequency feature library and a semantic feature library, and building a clause feature vector library from word-frequency relevance and semantic similarity; ensemble-learning coarse ranking: matching the query input by a user against the feature vector library to obtain normalized metric values and form a coarse ranking list; post-processing fine ranking: screening and adjusting the coarse ranking list through logical strategies. The method offers high ranking accuracy of retrieval results, good extensibility and simple dictionary and corpus expansion, can be conveniently incorporated into intelligent operation and maintenance business processes, and promotes the digitization of standards.

Description

Information retrieval method and system based on transformer knowledge base
Technical Field
The application relates to the fields of natural language processing, information retrieval, standard digitization and intelligent transformer operation and maintenance, and in particular to an information retrieval method and system based on a transformer knowledge base.
Background
From simple database lookups to complex web search engines, information retrieval concerns finding information relevant to a user's query and ranking the relevant documents according to certain rules. Ranking is a core problem of information retrieval; the main approaches at present are relevance ranking models, importance ranking models and learning-to-rank models. Relevance ranking models rank documents by the similarity between the query and the documents, e.g. the Boolean model, TF-IDF and BM25; these methods consider word-frequency relevance but not semantic information, rank retrieval results precisely through several sub-strategies, and suit the precise-retrieval setting (where the query always appears in the retrieved results). Importance ranking models ignore the query and judge document authority solely from the graph structure among documents, e.g. the PageRank and TextRank algorithms; such methods can analyze relations among documents, but their retrieval accuracy is limited. Learning-to-rank models solve the ranking problem with machine learning algorithms: they extract features of the sentences in a corpus (various degrees of relevance between query and documents, document features and importance, etc.), obtain document relevance labels manually or from a coarse ranking table, and finally learn the ranking with a model, e.g. the RankNet and LambdaRank algorithms.
Technical standards/specifications in the power industry are a necessary guarantee for guiding safe production at the grassroots level and promoting high-quality service development, and a prerequisite for equipment management work. When front-line personnel perform operation, maintenance and similar tasks, they must work strictly according to the procedures specified by the technical standards, which creates a large demand for retrieving them. However, the State Grid Corporation has thousands of technical standards for grid equipment; they are numerous and complex in content, and are managed mainly offline in the form of books and documents, so grassroots personnel cannot consult and fully use the standards in time, and the guiding role of the technical standards in operations is not fully realized.
Standard/specification retrieval over a power-industry knowledge base differs from the general retrieval task as follows: knowledge-base retrieval in the power industry is bounded relative to general retrieval, its scope covering only a limited number of industry standards and specifications; moreover, the results of a general retrieval task serve only as references, so comprehensiveness matters most, whereas power-industry knowledge-base retrieval is high-stakes, its results having to guide grassroots operations, so accuracy matters most.
In summary, new technical means are urgently needed to build a knowledge-base-backed information retrieval system that structures technical standards, turns them into knowledge, makes them intelligent, improves how grassroots personnel use them, raises retrieval efficiency, solves the problem that technical answers are hard or even impossible to find, and improves the quality and efficiency of equipment management. However, current ranking methods in information retrieval basically apply a single model to the user's query: importance ranking models suit precise retrieval but ignore semantic correlations and achieve low accuracy on associated retrieval; learning models suit associated-information retrieval, but their overall process is complex and does not improve the accuracy of clause positioning in precise retrieval.
Disclosure of Invention
The purpose of the invention is as follows: to solve the problems described in the background art, the invention provides an information retrieval method and system based on a transformer knowledge base, proposes an extraction-and-fusion algorithm for word-frequency relevance and semantic similarity features, and provides a retrieval method integrating precise retrieval with associated-information retrieval, ensuring both the clause-positioning accuracy and the comprehensiveness of knowledge-base retrieval results.
The technical scheme is as follows: in order to achieve the purpose of the invention, the transformer knowledge base-based information retrieval method specifically comprises the following steps:
step 1, selecting common technical specifications under transformer operation and maintenance and overhaul operation scenes to form an original sample library;
step 2, converting the original sample library into an available corpus file by an optical character recognition technology;
step 3, extracting keywords from the titles of all sections in the corpus to form an initial transformer knowledge base dictionary; converting character strings in the corpus into word strings with a word segmentation model, extracting keywords from the sentences, and iteratively updating the transformer knowledge base dictionary; the iterative updating of the transformer knowledge base dictionary in step 3 comprises the following steps:
step 31, manually segmenting 30% of the data in the corpus and filtering useless words from the sentences according to the stop-word list;
step 32, incrementally training a segmentation model on the results of the manual segmentation; the invention incrementally trains the LAC (Lexical Analysis of Chinese) model to perform Chinese word segmentation for the electric power field;
step 33, segmenting all sentences in the corpus with the trained segmentation model and filtering out words contained in the stop-word list;
and step 34, extracting keywords from the segmented results with the TextRank algorithm and updating the transformer knowledge base dictionary.
Step 4, quantifying the words in each document with an algorithm to extract word-frequency relevance features; meanwhile, mapping sentences of different lengths in the corpus into fixed-dimension sentence vectors, which are input into a metric neural network model to extract the sentences' semantic representations and obtain semantic information features;
the extracting of the word frequency correlation characteristics in the step 4 comprises the following steps:
step 411, taking each key term after word segmentation as a query term, and computing the word-frequency relevance feature of the query term for each sentence in the corpus with the TF-IDF, Okapi BM25, BM25+ or BM25F algorithm;
step 412, assigning different weights α and (1 - α) to the title and the body text;
and step 413, constructing a corpus word-frequency relevance feature vector library indexed by the unique retrieval number.
The semantic information features of step 4 are extracted as follows:
step 421, calculating a 256-dimensional word vector for the word segmentation terms in the corpus by a word2vec algorithm;
step 422, multiplying the word frequency correlation characteristics with the word vectors, and performing weighted average on the word vectors of each term in the sentence to obtain a sentence vector;
step 423, constructing a metric neural network with a twin (Siamese) structure, inputting the sentence vectors of two sentences into networks of identical structure and shared parameters, and optimizing the metric space through a triplet loss function so that related semantic representations are as close as possible and unrelated ones as far apart as possible;
step 424, at model test time, extracting the last-layer neurons of the metric neural network as the semantic representation features;
step 425, constructing a semantic representation feature vector library according to the unique retrieval number;
step 426, when searching the key terms, the cosine distance between the vectors is used to represent the similarity between the two sentences.
Step 5, segmenting the query according to the word segmentation model obtained in step 3 and the transformer knowledge base dictionary;
step 6, first combining the normalized results of the various word-frequency feature quantities to perform precise retrieval and coarse ranking of the sentences in the corpus; then inputting the query into the metric neural network for similarity matching against the sentence semantic representations in the corpus, and coarse-ranking the corpus's associated information by this value; the method comprises the following steps:
step 61, calculating the word-frequency relevance tf_query from the segmentation result of the query input by the user; if the segmentation model divides a query into several terms, the sum of the terms' word-frequency relevances is calculated:

tf_query(i) = norm( Σ_{n ∈ query} tf(n, i) )

where i denotes the ith sentence in the corpus, query denotes the segmentation result of the input query, n denotes a segmented word string, and norm denotes the normalized relevance value, the normalization range being [0.3, 1];
step 62, extracting the sentence vector of the query with the metric neural network model of step 4, computing the cosine distance between the query and the sentences in the corpus, and normalizing the result to the range [0.3, 1];
and step 63, combining the word-frequency relevance and semantic similarity metric rankings to form coarse ranking lists for precise retrieval and associated-information retrieval, respectively.
And step 7, setting a post-processing logic strategy and adjusting the retrieval order.
The post-processing logic strategy of step 7 comprises the following steps:
step 71, secondarily adjusting the retrieval ranking according to whether the query is contained in the title;
step 72, if the segmentation result contains several terms, fine-tuning the relevance ranking in view of the interrelation among the terms and their spacing within the sentences;
step 73, adjusting the weighting between word-frequency relevance and semantic similarity, and thereby the retrieval order, in view of factors such as the influence of the standard/specification's title levels on the retrieval result and the frequency with which the query's segmented terms occur in the sentence;
and step 74, refining retrieval through routine constraints, including: the same chapter of the same standard appears only once in the returned results; power industry standards take priority over power industry recommended standards, which take priority over State Grid enterprise standards.
Correspondingly, the invention provides an information retrieval system based on a transformer knowledge base, comprising a memory, a processor and an interactive display device; the memory stores the word segmentation model, the transformer knowledge base dictionary, and the word-frequency relevance and semantic information features obtained through algorithm training, and the processor executes the steps of the above transformer knowledge base-based information retrieval method on the query sent by the interactive display device and sends the final ranking result to the interactive display device for display.
Beneficial effects: compared with the prior art, the invention has the following notable advantages. The method and system can mine the relation between the query information and the corpus in terms of both word frequency and semantics, realizing precise retrieval and associated-information retrieval simultaneously and meeting the accuracy and comprehensiveness requirements of clause positioning in power standard/specification retrieval in actual business. By adopting several word-frequency analysis algorithms, fusing them through ensemble learning and adjusting the retrieval ranking, the clause-positioning accuracy of the algorithms is effectively improved. When building the knowledge base, different weights are set for titles of each level and for the body text according to the business requirements and document characteristics of power-industry standard/specification retrieval, highlighting the effect of titles on the body, and these weights are introduced into the word-frequency relevance computation; an associated retrieval model combining word-frequency relevance and semantic features improves the clause-positioning accuracy and comprehensiveness of associated retrieval results. A post-processing fine-ranking strategy suited to power-industry standard/specification retrieval is established according to business requirements, improving the reasonableness of the result ranking. A sample library containing 38 standards and specifications is built from actual business needs; when the sample library must be expanded, the related models can conveniently be retrained incrementally, and the corpus, segmentation dictionary and feature libraries can be updated by appending to files, so expansion is simple. The system uses a RESTful interface for interaction between user and program and presents the results as a web page, so it can be incorporated into business processes and systems such as intelligent operation and maintenance and digital teams.
Drawings
FIG. 1 shows the effect of the document saturation parameter on the score of the BM25 family of algorithms;
FIG. 2 is a schematic diagram of a semantic information representation model according to the present invention;
FIG. 3 is a schematic diagram of a transformer knowledge base-based information retrieval system retrieval interface according to the present invention;
FIG. 4 is a schematic diagram of a retrieval result interface of the transformer knowledge base-based information retrieval system according to the present invention;
FIG. 5 is a flowchart of the transformer knowledge base-based information retrieval method according to the present invention.
Detailed Description
The technical solution of the present invention will be further described with reference to the accompanying drawings and embodiments.
The transformer knowledge base-based information retrieval method, by constructing precise-retrieval and associated-information-retrieval models, can mine the semantic and word-frequency relations between the user's query and the standards/specifications, improving the accuracy, comprehensiveness and ranking reasonableness of retrieval results, meeting retrieval needs in actual business, and laying a technical foundation for the standard/specification digitization work of substations. The flow of the method is shown in FIG. 5 and specifically comprises the following steps:
and step A, constructing a sample library.
According to the actual requirements of transformer operation and maintenance, 38 common technical specifications are selected to form the sample library; the list is shown in Table 1.
Table 1. The 38 common technical standards/specifications (the table content is provided as images in the original publication)
Step B: building the corpus.
The 7742 corpus entries used in the invention are produced by OCR, structuring and other processing, yielding the csv-format corpus needed for subsequent steps. The specific steps are as follows:
(B.1) downloading the full text of the standards in Table 1 from websites such as CNKI (pdf file format) and the State Grid standard electronic library (gwbz file format);
(B.2) performing OCR directly on the PDF-format documents and saving them in Word format; printing and scanning the GWBZ-format files to save them as PDF documents, then performing OCR and saving them as Word documents. To prevent errors in the conversion process, corrections are made: the converted documents are checked and verified manually, and wrongly recognized characters, formula symbols and the like caused by low initial document quality are modified;
(B.3) parsing the Word documents according to the title structure and saving the result as csv-format files, where each row of the csv file is a sentence of the body text and the columns contain the unique index number, the standard/specification name, the lowest-level subtitle the paragraph belongs to, the set of paragraph subtitles, the body text and other key information;
the generated csv documents are then reviewed manually, mainly to fix errors from the structuring process such as missing title numbers and wrong titles, and to adjust the length of each sentence so that overly long or short sentences do not hurt the performance of subsequent algorithms.
Step C: transformer knowledge base dictionary iteration and Chinese word segmentation for the electric power field.
First, the first edition of the transformer knowledge base dictionary is built: keywords are manually extracted from the titles of all sections in the corpus to form the initial dictionary, on which word segmentation and iteration are then performed. Dictionary iteration and Chinese word segmentation for the electric power field mainly comprise incremental training of the segmentation model, extraction of sentence keywords from the segmentation results, and dictionary iteration based on those keywords. The specific steps are:
(C.1) Chinese word segmentation for the electric power field. The Chinese segmentation model is the LAC algorithm developed by Baidu (https://github.com/baidu/LAC); incremental training is adopted to better fit the algorithm to the data set.
(C.1.1) Training set preparation. 30% of the sentences in the corpus are manually segmented to form the segmentation-model training set; the general principle of manual segmentation is that proper nouns of the electric power field are not split, and segmented terms are separated by spaces;
(C.1.2) Incremental training. The model is incrementally trained on this training set for 10 epochs, and the trained model is saved to a specified path;
(C.1.3) Building the first-edition dictionary. The initial dictionary mainly comprises the titles corresponding to the sentences in the corpus; it is segmented and then checked manually, the general principle of the check being to keep the important terms;
(C.1.4) Chinese word segmentation. The sentences in the corpus are segmented with the trained model and the first-edition dictionary; since some English remains in the standards/specifications, the segmentation results are delimited with "|" and stored at a fixed path;
(C.1.5) Preprocessing the segmentation results. Preprocessing is one of the important steps in text processing; it has no fixed rules but depends on the task at hand (a minimal code sketch follows this list). The preprocessing specifically comprises:
(C.1.5.1) Filtering out stop words. Stop words are very common words that add little value to subsequent text processing (e.g. "what", "i.e.", "then"); filtering them out improves computational and storage efficiency. All segmentation results are traversed, and a word string contained in the stop-word list is not added to the final segmentation list.
(C.1.5.2) Filtering out single characters. A single character contributes little to understanding a document, so the segmentation results are traversed and word strings of length no greater than 1 are not added to the segmentation list.
(C.1.5.3) Filtering out punctuation marks. Punctuation is unnecessary for the mainly Chinese corpus adopted here, and since the documents were converted by OCR in (B.2), half-width and full-width punctuation are mixed; the filtered set therefore contains both half-width and full-width characters (e.g. ",", "\n", "-"). All marks to be filtered are stored in a variable, which is iterated to delete those marks from the corpus;
(C.1.5.4) Lowercasing English characters. Power-industry standards/specifications contain many English abbreviations, explanations, formulas, etc.; computers are sensitive to text data and treat "The" and "the" as different, although humans know the two are semantically identical, so all English characters are converted to lower case;
(C.1.5.5) English stemming. Tenses are inflected forms in English (e.g. "playing" and "played") whose semantic content is essentially the same; stemming converts words in different tenses to their stems. The Porter-Stemmer library is used to identify and delete suffixes or affixes of words.
(C.2) Extracting keywords from the segmentation results. In Natural Language Processing (NLP), keyword extraction is one of the basic tasks: for texts of any length, a few keywords convey the gist of the whole text. The invention extracts keywords with the word-graph-based TextRank algorithm, which builds and analyzes a language network graph of the document from the relations between word strings and finds the terms that play important roles in the graph. The specific steps are:
(C.2.1) Construct the directed weighted language network graph G = (V, E), where the node set V consists of the segmentation results of (C.1.5). An edge in E between two nodes is built from co-occurrence: an edge exists only when the terms of the two nodes co-occur within a window of length K, K denoting the window size, i.e. at most K words; a unidirectional edge has weight w = 0.5 and a bidirectional edge has weight 1.
(C.2.2) Algorithm iteration. According to Equation 1, the importance weight S of each node is iterated until convergence:

S(V_i) = (1 - d) + d · Σ_{V_j ∈ In(V_i)} [ w_ji / Σ_{V_k ∈ Out(V_j)} w_jk ] · S(V_j)    (1)

where d is the damping coefficient, representing the probability that a point in the graph jumps to any other point, generally taken as 0.85; In(V_i) is the set of terms with an edge pointing to term i; Out(V_j) is the set of terms that term j points to; w_ji is the edge weight between terms j and i, and w_jk the edge weight between terms j and k. The formula is iterated several times to obtain the result, S(V_j) being the importance weight of term j; the initial importance weight S of every term is set to 1, and convergence is reached when the error rate of every point in the graph falls below a given threshold.
(C.2.3) The node importance weights are sorted in descending order, and the T most important terms are taken as candidate keywords, with T = 3;
(c.2.4) the T keywords from (c.2.3) are marked in the original text, and if adjacent phrases are formed, they are combined into a keyword phrase.
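A compact sketch of the (C.2) TextRank keyword extraction, assuming a tokenized text as input. For brevity this sketch treats every co-occurrence edge as bidirectional with accumulated weight, rather than distinguishing the 0.5/1 unidirectional/bidirectional weights of (C.2.1); K, d and T follow the values given in the text.

```python
from collections import defaultdict

def textrank_keywords(tokens: list[str], K: int = 5, d: float = 0.85,
                      T: int = 3, tol: float = 1e-6) -> list[str]:
    # (C.2.1) co-occurrence graph: edge when two terms fall inside a window of length K
    weight = defaultdict(float)
    neighbors = defaultdict(set)
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + K, len(tokens))):
            a, b = tokens[i], tokens[j]
            if a == b:
                continue
            weight[(a, b)] += 1.0
            weight[(b, a)] += 1.0
            neighbors[a].add(b)
            neighbors[b].add(a)
    # (C.2.2) iterate importance weights (Eq. 1), initialized to 1
    S = {t: 1.0 for t in neighbors}
    delta = 1.0
    while delta > tol:
        delta = 0.0
        for vi in S:
            rank = sum(
                weight[(vj, vi)] / sum(weight[(vj, vk)] for vk in neighbors[vj]) * S[vj]
                for vj in neighbors[vi]
            )
            new = (1 - d) + d * rank
            delta = max(delta, abs(new - S[vi]))
            S[vi] = new
    # (C.2.3) top-T terms as candidate keywords
    return sorted(S, key=S.get, reverse=True)[:T]
```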
(C.3) Iterating the transformer knowledge base dictionary: the generated keywords are added to the dictionary, each keyword appearing only once. A dictionary containing 4620 keywords under the 38 labels of Table 1 is finally formed. With this dictionary and the segmentation model trained in C.1.2, the power-field segmentation and preprocessing flow is run again to produce the final segmentation results, which are stored into a JSON file with the unique index number as key and the segmentation result as value.
Step D: feature extraction.
The invention extracts word-frequency information and semantic representations from the sentences in the corpus as feature vectors. The word-frequency information quantifies the terms in a document with an algorithm, the quantified value representing a term's importance within the document and the corpus; this feature suits the precise-retrieval scenario. Meanwhile, sentences of different lengths in the corpus are mapped into fixed-dimension sentence vectors, which are input into the metric neural network model to extract the sentences' semantic representations; this representation suits the associated-information-retrieval scenario.
(D.1) Word-frequency features. Search engines have become indispensable in work and life, from web search to in-app search to chat-record search; the intuition about search results is that the more often the query's key terms occur, the better a document (web page) matches the query. This is the intuition behind the word-frequency relevance feature, which the invention computes with TF-IDF, Okapi BM25, BM25+ and BM25F respectively.
(D.1.1) The term frequency-inverse document frequency (TF-IDF) algorithm. The computation divides into TF and IDF, with the following steps:
(D.1.1.1) Computing the Term Frequency (TF). The word frequency measures how often a term occurs in a document and depends on document length and term commonness; since the sentences in the corpus differ in length, a long document must not be deemed more important than a short one simply for being long, so the word frequency is normalized by dividing by the total number of terms in the sentence;
(D.1.1.2) Word-frequency vectorization. The retrieval task cannot consider only the words occurring in each document, which would yield vectors of different lengths, so the user's query and the corpus documents are vectorized. All possible terms are collected into a term list from the segmentation results; then, for each sentence in the corpus, each term of the list is counted, the TF value being filled into the corresponding vector position if the term occurs and 0 otherwise:
tf(t, d) = n_t / Σ_k n_k    (2)

where t denotes the tth term of the term list, d the dth sentence (document) of the corpus, tf(t, d) the word frequency, n_t the number of times term t occurs in the current document d, and Σ_k n_k the total number of occurrences of all words in document d.
(D.1.1.3) Computing the Document Frequency (DF). The DF value measures the importance of a sentence (i.e. a document) within the corpus and is the number of sentences in the corpus containing the term; it is denoted df and, to keep df and tf(t, d) within comparable ranges, is normalized by dividing by the total number of sentences in the corpus (denoted N);
(D.1.1.4) Computing the Inverse Document Frequency (IDF), which characterizes the relation between a term's frequency in the corpus and its importance: the higher the frequency, the lower the importance. For subsequent extensibility, since a gradually growing corpus would inflate the IDF value, its logarithm is taken; meanwhile, to prevent a term absent from the term list yielding a denominator df of 0, 1 is added to the denominator to smooth the IDF value:
idf(t)=log(N/(df+1)) (3)
where idf (t) is the inverse text frequency, N represents the total number of documents, and df represents the document frequency.
(D.1.1.5) calculating the TF-IDF value. The invention obtains the TF-IDF score by taking the product value of TF and IDF:
tf-idf(t,d)=tf(t,d)*log(N/(df+1)) (4)
(D.1.1.6) Computing the weighted TF-IDF value. Given the importance of titles in power-industry standards/specifications, different weights must be set for title and body: the body weight is set to 0.4 and the title weight to 0.6. The tf-idf value is updated by checking whether a term of the term list is contained in the title: if it is, the tf-idf value is kept; otherwise it is multiplied by the weight 0.4;
(D.1.1.7) Storing the TF-IDF values. They are stored as a dictionary whose keys are tuples of (unique index number, term) and whose values are the tf-idf values computed by Equation 4; the dictionary is finally saved as a JSON-format file to a specified path.
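A minimal sketch of the (D.1.1) computation, assuming tokenized sentences and title term sets as input; the 0.6/0.4 title/body weighting follows (D.1.1.6). Since JSON keys cannot be tuples, this sketch flattens the (index, term) key into a string.

```python
import json
import math
from collections import Counter

def build_tfidf(docs: list[list[str]], titles: list[set[str]], path: str) -> None:
    """docs[i]  : preprocessed terms of sentence i
       titles[i]: terms of the title(s) that sentence i belongs to"""
    N = len(docs)
    df = Counter()                                   # (D.1.1.3) document frequency
    for terms in docs:
        df.update(set(terms))
    scores = {}
    for i, terms in enumerate(docs):
        if not terms:
            continue
        counts = Counter(terms)
        for t, n_t in counts.items():
            tf = n_t / len(terms)                    # Eq. 2
            idf = math.log(N / (df[t] + 1))          # Eq. 3
            w = tf * idf                             # Eq. 4
            if t not in titles[i]:                   # (D.1.1.6) body-only term
                w *= 0.4
            scores[f"{i}|{t}"] = w                   # flattened (index, term) key
    with open(path, "w", encoding="utf-8") as f:
        json.dump(scores, f, ensure_ascii=False)
```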
(D.1.2) The Okapi Best Matching 25 (Okapi BM25) algorithm is a ranking function used by search engines to estimate the relevance of documents to a given query; it is essentially a bag-of-words retrieval function that ranks a set of documents by the query terms occurring in each document, regardless of their proximity within the document. BM25 is a family of scoring functions with slightly different components and parameters, represented by the classic Okapi BM25 of Equation 5:

score(D, Q) = Σ_{i=1..n} IDF(q_i) · f(q_i, D) · (k_1 + 1) / [ f(q_i, D) + k_1 · (1 - b + b · |D| / avgdl) ]    (5)
the components and parameters in the formula have the following specific meanings:
variable qiRepresenting the ith query term. For example, if a user needs to search for a "cannula," the query volume contains only one term, and thus q is0Namely a sleeve; if the result of the Chinese word segmentation in the step C power field is 'sleeve insulating oil', namely containing 2 key terms q0Is a "sleeve", q1The results of the two parts of retrieval and calculation are substituted into other components of formula 5, and the results are added to obtain the final output score value of the Okapi BM25 algorithm.
IDF(q_i) is the inverse document frequency of the ith query term. Although it bears the same name as in D.1.1.4, IDF(q_i) in BM25 is computed differently from TF-IDF: the IDF in BM25 measures how often a term occurs across all documents and applies a "penalty" to common terms, as in Equation 6:

IDF(q_i) = ln( (docCount - f(q_i) + 0.5) / (f(q_i) + 0.5) + 1 )    (6)

where docCount is the total number of sentences in the corpus and f(q_i) the number of documents containing the ith query term. Equation 6 assigns relatively rare query terms a higher IDF(q_i) coefficient, letting those terms contribute more to the final BM25 score. For example, if the input query is "although the casing insulating oil ..." and the segmented key terms are "although | casing | insulating oil", then since "although" occurs throughout the corpus, it is "casing insulating oil" that should contribute more to the final output; Equation 6 gives such key terms more weight.
The component |D| / avgdl denotes the relative length of the sentence, computed by dividing the length of the current sentence by the average sentence length in the corpus. If the document is longer than average, the denominator of Equation 5 grows and score(D, Q) decreases; if shorter, the denominator shrinks and the score increases. Sentence length here means the number of effective terms after the Chinese segmentation preprocessing.
The weighting factor b multiplies the relative length and controls the influence of document length on the score: the larger b, the stronger the effect of sentence length; if b is set to 0, the sentence's relative length has no effect on the score. In the invention b is set to 0.75.
f(q_i, D) denotes the frequency of key term q_i in document D: the more often the keywords occur in a document, the higher the score, the intuition being that a document containing the key terms many times is more likely to match the query.
The variable k_1 is the document saturation parameter, which shapes the word-frequency saturation characteristic and limits how much a single query term can influence the document score. FIG. 1 illustrates the influence of k_1 on the BM25 score: k_1 characterizes where the slope of the curve changes; when tf ≤ k_1, the score rises quickly with word frequency, and when tf > k_1, the curve gradually flattens. In the invention k_1 is set to 1.5.
(D.1.3) The BM25+ algorithm is an extension of Okapi BM25. The relative-length component described in D.1.2 makes very long documents containing none of the key terms score similarly to short documents, so Equation 5 scores very long documents unfairly. The BM25+ scoring formula therefore adds an extra free parameter δ, set to 1 in the invention:

score(D, Q) = Σ_{i=1..n} IDF(q_i) · [ f(q_i, D) · (k_1 + 1) / ( f(q_i, D) + k_1 · (1 - b + b · |D| / avgdl) ) + δ ]    (7)
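A sketch of the Okapi BM25 score of Equation 5 with the parameter values given above (k_1 = 1.5, b = 0.75); passing delta = 1 adds the BM25+ correction of Equation 7. A production implementation would precompute document frequencies and avgdl rather than rescanning the corpus per query term.

```python
import math

def bm25_score(query_terms: list[str], doc_terms: list[str],
               docs: list[list[str]], k1: float = 1.5, b: float = 0.75,
               delta: float = 0.0) -> float:
    """Okapi BM25 (Eq. 5) for one document; delta=1.0 gives BM25+ (Eq. 7).
       docs is the whole corpus as term lists, used for IDF and avgdl."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    score = 0.0
    for q in query_terms:
        n_q = sum(1 for d in docs if q in d)                 # documents containing q
        idf = math.log((N - n_q + 0.5) / (n_q + 0.5) + 1)    # Eq. 6
        f = doc_terms.count(q)                               # f(q, D)
        denom = f + k1 * (1 - b + b * len(doc_terms) / avgdl)
        score += idf * (f * (k1 + 1) / denom + delta)
    return score
```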
the (d.1.4) BM25F algorithm is another modification to Okapi BM 25. Okapi BM25 considers the document as a whole when calculating relevance, but the power industry standards/specifications are usually cut into multiple parts, each part contains the fields of title, body, etc., which cannot be treated equally as they contribute to the ranking of the search results, and different weights need to be set for them, and the score calculated by BM25F is the weighted sum of the scores of the terms in each domain:
score(d, Q) = Σ_{t ∈ Q} idf(t) · f̃(t, d) / (k_1 + f̃(t, d)),  with  f̃(t, d) = Σ_c boost_c · f(t, d_c) / (1 - b_c + b_c · l_c / avl_c)    (8)

where f(t, d_c) denotes the frequency of term t in field c of document d, boost_c the weight of the corresponding field c, l_c the length of the content in field c, avl_c the average length of the content in field c, and b_c the length-regulation factor of field c. The invention sets up a title field and a body field together: the title field has weight 0.6 and, titles being short, its b_c is set to 0; the body field has weight 0.4 and its b_c is set to 0.75.
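A sketch of the field-weighted scoring of Equation 8, under the field weights given above. The standard BM25F formulation reconstructed here is an assumption insofar as the patent's equation image is not recoverable; documents are assumed to be dicts with "title" and "body" term lists.

```python
import math

FIELDS = {"title": {"boost": 0.6, "b": 0.0},
          "body":  {"boost": 0.4, "b": 0.75}}

def bm25f_score(query_terms: list[str], doc: dict, corpus: list[dict],
                k1: float = 1.5) -> float:
    """BM25F (Eq. 8); doc and each corpus entry: {"title": [...], "body": [...]}."""
    N = len(corpus)
    avl = {c: (sum(len(d[c]) for d in corpus) / N) or 1 for c in FIELDS}
    score = 0.0
    for t in query_terms:
        f_tilde = 0.0                        # weighted, length-normalized frequency
        for c, p in FIELDS.items():
            l_c = len(doc[c]) or 1
            norm = 1 - p["b"] + p["b"] * l_c / avl[c]
            f_tilde += p["boost"] * doc[c].count(t) / norm
        n_t = sum(1 for d in corpus if t in d["title"] or t in d["body"])
        idf = math.log((N - n_t + 0.5) / (n_t + 0.5) + 1)
        score += idf * f_tilde / (k1 + f_tilde)
    return score
```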
(D.2) Semantic representation: a deep neural network automatically learns the representation of sentences in a vector space, bringing semantically similar sentences as close together as possible and pushing semantically different sentences as far apart as possible; the algorithm model is sketched in FIG. 2. This addresses the limitation that word-frequency relevance features can only return documents literally containing the query terms (precise retrieval) and cannot retrieve semantic synonyms (associated-information retrieval). The specific steps are:
(D.2.1) The word2vec algorithm produces word vectors, i.e. numerical representations embedding different characters and symbols into a mathematical space of a common dimension. A Skip-Gram word2vec model is used, predicting the context from the input word. Its modeling resembles an auto-encoder: a neural network whose input and output share the same dimensionality is built, and after training the model itself is no longer needed for new tasks; only the trained hidden-layer parameters are extracted as the word-vector representations. The specific steps are:
(D.2.1.1) The input layer is the one-hot encoding of the current word: assuming the corpus contains V terms sorted by some rule, the current term is located in the sorted list and the corresponding vector position is set to 1. The invention has 22794 keywords, so the input is X ∈ R^{22794×n}, where n is the number of input keywords;
(D.2.1.2) The hidden layer yields the semantic feature representation. The word-vector dimension used in the invention is 300, so the hidden-layer weight matrix is W ∈ R^{22794×300}, and the hidden-layer output is the word vector of the corresponding term;
(D.2.1.3) The output layer has the same dimensionality as the input layer and is a softmax regression classifier: the value of each output node is a probability between 0 and 1, the probabilities of all output-layer neurons sum to 1, and the optimization goal of the model is to bring the output-layer values as close as possible to the one-hot vector of the context word to be predicted;
(D.2.1.4) The invention trains the word2vec model with the gensim toolkit, configured as follows:
(D.2.1.4.1) Window size: different tasks should use different window sizes; word-vector characterization uses smaller windows (typically 2-15), a high similarity score then indicating that the terms are interchangeable. The power-industry standard/specification corpus is split by paragraph or sentence, so a smaller window meets the requirement; the window size is set to 5, meaning the two terms before and after the input term are also included;
(D.2.1.4.2) Number of negative samples: the word2vec paper considers 5-20 negative samples a fairly ideal number, while with a large enough data set 2-5 suffice; in view of the corpus size, 9 negative samples are used;
(D.2.1.4.3) Other parameters: min_count is set to 5, removing terms that occur fewer than 5 times; corpus training iterates for 10 epochs in total.
(D.2.1.5) Word vectors are saved and loaded via gensim's model.wv.save_word2vec_format and KeyedVectors.load_word2vec_format respectively; the stored content is the word vectors, which are easily converted into array (ndarray) or tensor form for subsequent computation.
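A sketch of the (D.2.1.4) training configuration using the gensim 4.x API; the two toy sentences and the file name are illustrative stand-ins for the segmented corpus of step C.

```python
from gensim.models import Word2Vec, KeyedVectors

# toy stand-in for the segmented corpus of step C (lists of preprocessed terms)
sentences = [["变压器", "套管", "绝缘油"], ["套管", "试验", "周期"]]

model = Word2Vec(
    sentences,
    vector_size=300,  # word-vector dimension (D.2.1.2)
    window=5,         # (D.2.1.4.1)
    negative=9,       # (D.2.1.4.2)
    min_count=1,      # the invention uses 5 (D.2.1.4.3); 1 keeps this toy corpus non-empty
    sg=1,             # Skip-Gram
    epochs=10,
)
model.wv.save_word2vec_format("power_w2v.txt")           # (D.2.1.5) save
wv = KeyedVectors.load_word2vec_format("power_w2v.txt")  # reload for retrieval
vec = wv["套管"]                                          # 300-dim ndarray for a term
```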
(D.2.2) IDF-weighted average sentence vectors. A plain average sentence vector takes the word2vec vector of each segmented term and averages each component, which treats all words as equally important. The invention instead reflects each term's importance, or contribution to the current text, by weighting its word vector with the IDF value computed as in D.1.1.4.
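A sketch of the (D.2.2) weighted average, assuming the gensim vectors from the previous block and an idf dictionary produced as in D.1.1.4:

```python
import numpy as np

def sentence_vector(terms: list[str], wv, idf: dict, dim: int = 300) -> np.ndarray:
    """IDF-weighted average of word vectors (D.2.2).
       wv: gensim KeyedVectors; idf: term -> idf value."""
    vecs, weights = [], []
    for t in terms:
        if t in wv and t in idf:
            vecs.append(wv[t])
            weights.append(idf[t])
    if not vecs:
        return np.zeros(dim, dtype=np.float32)
    w = np.asarray(weights, dtype=np.float32)
    return (np.asarray(vecs) * w[:, None]).sum(axis=0) / w.sum()
```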
(D.2.3) The metric neural network adopts a twin (Siamese) structure (identical structure, shared weights): the sentence vectors of D.2.2 pass through fully connected layers to obtain semantic representations, with Triplet Loss set as the loss function to optimize the sentence-vector space so that semantically similar sentences lie as close together as possible and unrelated sentences as far apart as possible. The specific steps are:
(D.2.3.1) Automatic generation of training sample pairs for the metric network: the cosine similarity between each sentence's vector (the anchor) and every other sentence vector in the corpus is computed and sorted in descending order; the first 20 entries of the sorted list are taken as similar (positive) sample pairs and labeled 1, while the last 20 entries are considered unrelated (negative) sample pairs and labeled 0;
(D.2.3.2) Manual review of the training sample pairs: a business expert audits the automatically generated samples, correcting or removing mislabeled pairs;
the (D.2.3.3) measuring neural network structure comprises three full connection layers, and the weight matrixes are
Figure BDA0003347339210000131
Figure BDA0003347339210000132
The output semantic feature vector is 64-dimensional;
(D.2.3.4) The triplet loss function is used to train samples with small differences. A training triple comprises an anchor sample, a positive sample (related) and a negative sample (unrelated), and metric learning is realized by optimizing the anchor-positive distance to be smaller than the anchor-negative distance:

L = max( d(a, p) - d(a, n) + margin, 0 )    (9)

where margin denotes a distance threshold, set to 0.35 in the invention.
(D.2.3.5) The training parameters of the algorithm are as follows: the fully connected layers are randomly initialized; the optimizer is SGD with initial learning rate 0.01, momentum 0.949 and weight decay 5×10⁻⁴; the batch size is 32; training runs for 100 epochs in total, the learning rate being multiplied by 1/10 at the 60th and the 80th epoch.
(D.2.3.6) After training, all IDF-weighted sentence vectors of the corpus are fed through the metric network in turn, the output-layer neurons of D.2.3.3 are taken as the semantic features, and they are stored into a JSON-format file keyed by the unique retrieval number.
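A PyTorch sketch of the (D.2.3) metric network and training loop. The hidden widths (128) are assumptions, since the patent's weight-matrix images are not recoverable; the toy loader stands in for the reviewed triples of D.2.3.1/D.2.3.2. PyTorch's built-in TripletMarginLoss implements Equation 9.

```python
import torch
import torch.nn as nn

class MetricNet(nn.Module):
    """Three fully connected layers: 300-dim sentence vector -> 64-dim feature (D.2.3.3)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(300, 128), nn.ReLU(),   # hidden widths are assumed
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, 64),
        )

    def forward(self, x):
        return self.net(x)

model = MetricNet()   # one network shared by anchor/positive/negative (twin structure)
loss_fn = nn.TripletMarginLoss(margin=0.35)                      # Eq. 9
opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.949, weight_decay=5e-4)
sched = torch.optim.lr_scheduler.MultiStepLR(opt, milestones=[60, 80], gamma=0.1)

# toy batches standing in for the reviewed (anchor, positive, negative) triples
loader = [(torch.randn(32, 300), torch.randn(32, 300), torch.randn(32, 300))]

for epoch in range(100):                                          # (D.2.3.5)
    for anchor, positive, negative in loader:
        loss = loss_fn(model(anchor), model(positive), model(negative))
        opt.zero_grad()
        loss.backward()
        opt.step()
    sched.step()
```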
Step E: query segmentation.
The user's query is segmented using the incrementally trained LAC segmentation model of step C.1, the transformer knowledge base dictionary updated by the iteration of step C.3, and the segmentation preprocessing of step C.1.
Step F: integrated retrieval and ranking.
From the segmentation result of the query, the word-frequency relevance and semantic similarity values against the sentences in the corpus are computed; since the metric values of the individual algorithms span different ranges, the results of a given algorithm are normalized to the range [0.3, 1] and finally summed per unique retrieval number. The specific steps are:
(F.1) Computing the matching scores between the query's segmented terms and the corpus segmentation results; this applies to all the methods of step D.1. The JSON file stored in dictionary form in step D.1 is read, and all keys of the dictionary are traversed for each segmented term; if the term exists, the score computed by the algorithm is added into a word-frequency relevance dictionary keyed by the unique retrieval number. The scores of all related documents under the same algorithm are then normalized, finally forming each algorithm's ranking;
(F.2) Computing the cosine similarity between the query and the corpus. The matching score is effective in most cases, but its ranking can drift as the number of terms in the segmented query grows. Cosine similarity vectorizes the query and the corpus documents, specifically:
(F.2.1) generating 22794-dimensional vectors from all the keywords in the corpus, the words being sorted in a fixed order;
(F.2.2) for each corpus sentence, locating the index positions of its segmented terms in the vector generated by F.2.1 and filling in the corresponding algorithm's score;
(F.2.3) for vectorization of the query, the word-frequency relevance is computed as in step D.1: the term-frequency (TF) components of the TF-IDF and BM25 algorithms are determined by the query, while the IDF components are computed from the relevant corpus statistics; the word-frequency scores are filled into the corresponding index positions of the query vector;
(F.2.4) computing the cosine similarity according to Equation 10 and ranking:

cos_sim = (vec_q · vec_d) / (‖vec_q‖ · ‖vec_d‖)    (10)

where vec_q and vec_d denote the vectorized query and corpus document respectively.
(F.3) Integrating the word-frequency relevance results: the similarity measures cos_sim obtained with TF-IDF, Okapi BM25, BM25+ and BM25F are averaged per unique retrieval number, and the averaged values are sorted to form the coarse word-frequency relevance ranking list.
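A sketch of the fusion of step F, assuming min-max scaling as the normalization onto [0.3, 1] (the patent states the target range but not the scaling formula):

```python
def normalize(scores: dict, lo: float = 0.3, hi: float = 1.0) -> dict:
    """Min-max scale one algorithm's scores to [0.3, 1] (assumed scaling)."""
    lo_s, hi_s = min(scores.values()), max(scores.values())
    span = (hi_s - lo_s) or 1.0
    return {k: lo + (v - lo_s) / span * (hi - lo) for k, v in scores.items()}

def fuse(per_algorithm: list[dict]) -> list:
    """per_algorithm: {retrieval_id: score} dicts from TF-IDF / BM25 variants (F.3).
       Returns retrieval ids sorted by the average of the normalized scores."""
    normed = [normalize(s) for s in per_algorithm if s]
    ids = set().union(*normed)
    avg = {i: sum(s.get(i, 0.0) for s in normed) / len(normed) for i in ids}
    return sorted(avg, key=avg.get, reverse=True)
```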
(F.4) Computing the semantic similarity between the query and the corpus: the query is fed into the trained deep neural network to obtain its semantic feature vector, which is compared with the semantic feature vectors of the corpus. The specific steps are:
(F.4.1) loading the word-vector file of the corpus through the KeyedVectors loading function of the gensim toolkit;
(F.4.2) looking up the word-vector representations and IDF weights in the loaded word-vector library and the corpus according to the query segmentation result of step E, obtaining the IDF-weighted sentence vector;
(F.4.3) feeding the weighted sentence vector into the trained deep metric network and taking the output-layer neuron values as the semantic feature vector;
(F.4.4) reading the corpus sentence vectors stored in step D.2.3.6 and computing their cosine similarity with the query sentence vector in turn;
(F.4.5) weighting the cosine similarity by the frequency NR_i with which the query's segmented terms occur in the corpus sentence, giving the semantic similarity measure:

sim_i = NR_i · cos_sim_i,  with  NR_i = apear_i / N_i    (11)

where apear_i denotes the number of times the terms occur in the ith sentence of the corpus and N_i the total number of non-repeating terms contained in the ith paragraph.
(F.5) Normalizing the semantic similarity to the range [0.3, 1] and sorting to form the semantic similarity coarse ranking list.
Step G: post-processing fine ranking. A post-processing strategy is formulated from the word-frequency relevance and semantic similarity, and the coarse ranking lists are adjusted into the final fine ranking list, specifically:
(G.1) creating the fine ranking list for final output;
(G.2) word-frequency relevance post-processing: iterating over the coarse ranking list and promoting a paragraph's unique index number into the fine ranking list when any of the following is met:
(G.2.1) the query is completely contained in the body or the title;
(G.2.2) the query's segmentation result is completely contained in the title;
(G.2.3) if the query is separated by spaces, the concatenated character string is completely contained in the title or the body.
(G.3) semantic similarity post-processing: iterating over the coarse ranking list and promoting the unique index number into the fine ranking list when the query's segmentation result is completely contained in the set comprising the corpus paragraph body, the paragraph's corresponding titles and the standard's main title;
(G.4) integrating word frequency correlation and semantic similarity results, adding and sequencing the remaining paragraphs in the rough-ordered list, and adding the final sequencing result into the fine-ordered list;
(G.5) filtering the fine ordered list according to the rules of practical application, which specifically comprises:
(G.5.1) the same chapter of the same standard appears only once in the results returned by the system;
(G.5.2) when the numerical scores are equal, a power industry standard (DL) takes priority over a power industry recommended standard (DL/T), which in turn takes priority over a State Grid enterprise standard (Q/GDW), and so on.
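A minimal sketch of the G.5 filtering; the result dicts and the numeric priority encoding are illustrative assumptions:

```python
STANDARD_PRIORITY = {"DL": 0, "DL/T": 1, "Q/GDW": 2}   # lower value = higher priority

def filter_fine_list(results: list) -> list:
    """G.5: deduplicate by (standard, chapter), breaking score ties by standard priority.

    Each result is assumed to carry 'standard', 'chapter' and 'score' keys.
    """
    ordered = sorted(results, key=lambda r: (-r["score"],
                     STANDARD_PRIORITY.get(r["standard"], 99)))  # G.5.2 tie-break
    seen, filtered = set(), []
    for r in ordered:
        key = (r["standard"], r["chapter"])
        if key not in seen:                 # G.5.1: each chapter appears only once
            seen.add(key)
            filtered.append(r)
    return filtered
```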
Step H: build the retrieval system on a RESTful interface.
The retrieval system mainly comprises a front-end Web interface and a back-end retrieval module. The front-end Web interface is built with the Vue framework on a personal computer running the Windows operating system; the back-end retrieval module is implemented in Python on an Ubuntu system, and the two modules communicate through a RESTful interface. The main steps are as follows:
(H.1) install and set up the Vue debugging and development environment.
(H.1.1) download and install the Node.js environment, since Vue tooling relies on the npm package manager that ships with Node.js;
(H.1.2) set up the Vue project environment by installing the globally available vue-cli scaffolding.
(H.2) implement web-page rendering and style management through CSS modules in Vue. The page style is designed as shown in FIG. 3; the user enters the term or sentence to be queried in the text box at the center of the interface.
(H.3) after the user finishes typing, clicking the "search" button triggers the retrieval: Vue generates the required request link through the axios tool library, and the invention passes parameters to the back-end program via a GET request of the form http://<ip>:<port>/ir?query=<Q>, where ip is the address of the server running the Python program, port is the port the server opens for communication, and Q is the query entered by the user.
(H.4) the back-end retrieval module mainly comprises link parsing, initialization, and information retrieval functional modules.
(H.4.1) considering run-time efficiency: the initialization process of the retrieval program is time-consuming, and reloading the initialization parameters on every retrieval would not meet practical operating requirements. Therefore, the program is initialized before any interface communication takes place, loading the required artificial intelligence models and parameters, including the word segmentation model, the stopword list, the word2vec word vectors, the deep metric learning model, the word-frequency feature vector library, the semantic feature vector library, and so on;
(H.4.2) install the Flask toolkit for Python to provide the RESTful interface service: the URL and communication mode (GET, POST, etc.) sent by the front end are bound through app.route, and the URL parameters are parsed through request.args to obtain the required query, as sketched below;
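A minimal Flask sketch of the H.4.2 interface; the retrieve stub stands in for the full pipeline of steps E through G, and the /ir endpoint path and example payload are assumptions chosen to mirror the link form of step H.3:

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

def retrieve(query: str) -> list:
    """Stand-in for steps E-G (segmentation, coarse ranking, post-processing)."""
    return [{"rank": 1, "standard": "DL/T 572", "text": f"matched: {query}"}]

@app.route("/ir", methods=["GET"])          # bind URL and communication mode
def ir():
    query = request.args.get("query", "")   # request.args parses the url parameter
    return jsonify(retrieve(query))         # JSON string returned to the Vue front end

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)      # port number is illustrative
```

The front end then issues GET http://<ip>:<port>/ir?query=<Q> and receives the JSON string described in step H.4.3.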
(H.4.3) retrieve information according to the query: the program returns the fine-ranked list from step G.4 as a JSON-format string whose keys mainly comprise the rank number, the power industry standard/specification name, and the corresponding text content.
(H.5) Vue parses the JSON string returned by the back end and renders it in the Web interface through CSS modules. Vue reads the variables in the JSON string directly by key name; the standard/specification name is placed above each search result, the text content is placed below the name, and the positions in the text containing the key query terms are highlighted. The search result display is shown in FIG. 4.
Based on the above method, a retrieval system based on the RESTful interface is built, realizing real-time interaction between the user and the code. The system comprises a memory, a processor and an interactive display device. The memory stores the word segmentation model, the transformer knowledge base dictionary, and the word-frequency correlation and semantic information features obtained through algorithm training. According to the query sent by the interactive display device, the processor performs a precise-retrieval coarse ranking of the corpus sentences by combining the normalized word-frequency correlation and semantic similarity, then inputs the query into the metric neural network for similarity matching against the sentence semantic representations in the corpus, coarsely ranks the corpus's associated information according to that value, adjusts the retrieval ranking according to the post-processing logic strategy, and sends the final ranking result to the interactive display device for display. The system uses the RESTful interface to connect the user with the code: the user types a query on the front-end Web interface and clicks the search button, and the front end automatically sends a GET request to the back-end program; the back-end Python code parses the URL link sent by the front end, feeds the user's query into the back-end program, and returns the retrieval result to the retrieval system as a JSON string; the retrieval system parses the JSON string and finally displays the retrieval result on the system interface.

Claims (7)

1. An information retrieval method based on a transformer knowledge base is characterized by comprising the following steps:
step 1, selecting common technical specifications under transformer operation and maintenance and overhaul operation scenes to form an original sample library;
step 2, converting the original sample library into an available corpus file by an optical character recognition technology;
step 3, extracting keywords from the titles of all sections of the corpus to form the initial transformer knowledge base dictionary; segmenting the character strings in the corpus into word strings with a word segmentation model, extracting keywords from the sentences, and iteratively updating the transformer knowledge base dictionary;
step 4, quantizing the words in the documents by algorithm and extracting word-frequency correlation features; meanwhile, mapping sentences of different lengths in the corpus into sentence vectors of fixed dimension, and inputting the sentence vectors into a metric neural network model to extract the semantic representations of the sentences, obtaining semantic information features;
step 5, segmenting the query according to the word segmentation model obtained in step 3 and the transformer knowledge base dictionary;
step 6, first combining the normalized word-frequency correlation features and semantic similarity features to perform a precise-retrieval coarse ranking of the sentences in the corpus; then inputting the query into the metric neural network for similarity matching with the sentence semantic representations in the corpus, and coarsely ranking the corpus's associated information according to that value;
and step 7, adjusting the retrieval ranking according to the post-processing logic strategy.
2. The transformer knowledge base-based information retrieval method of claim 1, wherein segmenting the character strings in the corpus into word strings with the word segmentation model, extracting keywords from the sentences and iteratively updating the transformer knowledge base dictionary in step 3 comprises the following steps:
step 31, manually segmenting 30% of the data in the corpus and filtering useless words from the sentences according to the stopword list;
step 32, incrementally training a word segmentation model with the results of the manual segmentation;
step 33, segmenting all sentences in the corpus with the trained segmentation model and filtering out the useless words in the stopword list;
and step 34, extracting keywords from the segmentation results with the TextRank algorithm and updating the transformer knowledge base dictionary.
3. The transformer knowledge base-based information retrieval method according to claim 1, wherein extracting the word-frequency correlation features in step 4 comprises the following steps:
step 411, taking each segmented key term as a query term and computing the word-frequency correlation feature of that query term for each sentence in the corpus with the TF-IDF, Okapi BM25, BM25+ or BM25F algorithm;
step 412, assigning different weights α and (1-α) to the title and the body;
and step 413, constructing the corpus word-frequency correlation feature vector library according to the unique search number.
4. The transformer knowledge base-based information retrieval method according to claim 1, wherein extracting the semantic information features in step 4 comprises the following steps:
step 421, computing a 256-dimensional word vector for each segmented term in the corpus through the word2vec algorithm;
step 422, multiplying the word-frequency correlation features with the word vectors and taking the weighted average of the word vectors of the terms in a sentence to obtain the sentence vector;
step 423, constructing a metric neural network based on a Siamese (twin) structure: the sentence vectors of two sentences are input into networks of identical structure with shared parameters, and the metric space is optimized through a triplet loss function so that related semantic representations lie as close as possible and unrelated ones as far apart as possible;
step 424, at model test time, extracting the last-layer neurons of the metric neural network as the semantic representation features;
step 425, constructing the semantic representation feature vector library according to the unique search number;
and step 426, when retrieving key terms, using the cosine distance between vectors to represent the similarity between two sentences.
5. The transformer knowledge base-based information retrieval method according to claim 1, wherein the step 6 comprises:
step 61, calculating the word-frequency correlation $tf_{query}$ from the segmentation result of the query entered by the user; if the segmentation model divides a query into several terms, the term word-frequency correlations are summed:

$$tf_{query} = \sum_{n \in query} norm(tf_{n,i})$$

where $tf_{query}$ is the word-frequency correlation, $i$ denotes the $i$-th sentence in the corpus, $query$ denotes the segmentation result of the input query, $n$ denotes a segmented word string, and $norm$ denotes the normalized correlation value, with normalization range $[0.3, 1]$;
step 62, extracting the sentence vector of the query according to the metric neural network model of step 4, computing the cosine distance between the query and the sentences in the corpus through the metric neural network, and normalizing the result to the range [0.3, 1];
and step 63, combining the word-frequency correlation and semantic similarity measure rankings to form the coarse ranking lists for precise retrieval and associated-information retrieval, respectively.
6. The transformer knowledge base-based information retrieval method according to claim 1, wherein the post-processing logic strategy in step 7 comprises:
step 71, secondarily adjusting the retrieval ranking according to whether the query is contained in the title;
step 72, if the segmentation result contains several terms, fine-tuning the relevance ranking in consideration of the interrelation among the terms and their spacing within the sentences;
step 73, adjusting the weighting between the word-frequency correlation and the semantic similarity, and thereby the retrieval ranking, in consideration of factors such as the influence of the standard/specification's different title levels on the retrieval result and the frequency with which the query's segmented terms appear in a sentence;
and step 74, refining the retrieval in consideration of everyday constraints, including: the same chapter of the same standard appears only once in the results returned by the system; a power industry standard takes priority over a power industry recommended standard, which in turn takes priority over a State Grid enterprise standard.
7. An information retrieval system based on a transformer knowledge base, characterized in that: the system comprises a memory, a processor and an interactive display device, wherein the memory stores the word segmentation model, the transformer knowledge base dictionary, and the word-frequency correlation and semantic information features obtained through algorithm training; the processor executes the steps of the transformer knowledge base-based information retrieval method of claim 1 according to the query sent by the interactive display device, and sends the final ranking result to the interactive display device for display.
CN202111329907.9A 2021-11-10 2021-11-10 Information retrieval method and system based on transformer knowledge base Active CN114064855B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111329907.9A CN114064855B (en) 2021-11-10 2021-11-10 Information retrieval method and system based on transformer knowledge base


Publications (2)

Publication Number Publication Date
CN114064855A true CN114064855A (en) 2022-02-18
CN114064855B CN114064855B (en) 2024-05-17

Family

ID=80274857


Country Status (1)

Country Link
CN (1) CN114064855B (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050071150A1 (en) * 2002-05-28 2005-03-31 Nasypny Vladimir Vladimirovich Method for synthesizing a self-learning system for extraction of knowledge from textual documents for use in search
CN107066589A (en) * 2017-04-17 2017-08-18 河南工业大学 A kind of sort method and device of Entity Semantics and word frequency based on comprehensive knowledge
CN109829104A (en) * 2019-01-14 2019-05-31 华中师范大学 Pseudo-linear filter model information search method and system based on semantic similarity
CN110119765A (en) * 2019-04-18 2019-08-13 浙江工业大学 A kind of keyword extracting method based on Seq2seq frame
CN111125349A (en) * 2019-12-17 2020-05-08 辽宁大学 Graph model text abstract generation method based on word frequency and semantics

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
LI Yun; DENG Ying; WU Huarui: "Research and Design of a Vertical Search Engine for Agricultural Scientific Research Offices", Journal of Southwest China Normal University (Natural Science Edition), no. 09, 20 September 2020 (2020-09-20) *
WANG Sili; ZHU Zhongming: "Research and Experiments on the Relevance Retrieval Mechanism of Institutional Repositories", Information Science, no. 02, 1 February 2020 (2020-02-01) *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117194459A (en) * 2023-09-22 2023-12-08 天翼爱音乐文化科技有限公司 Operation and maintenance knowledge base updating method, system, device and medium based on operation and maintenance event
CN117194459B (en) * 2023-09-22 2024-05-10 天翼爱音乐文化科技有限公司 Operation and maintenance knowledge base updating method, system, device and medium based on operation and maintenance event

Also Published As

Publication number Publication date
CN114064855B (en) 2024-05-17


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant