CN112036178A - Distribution network entity related semantic search method - Google Patents


Info

Publication number
CN112036178A
Authority
CN
China
Prior art keywords
word
distribution network
network entity
text
words
Prior art date
Legal status
Pending
Application number
CN202010864615.4A
Other languages
Chinese (zh)
Inventor
王鑫
张淑娟
汪玉
赵龙
胡世骏
秦丹丹
郑高峰
刘丽
李龙跃
高博
徐斌
袁方
李金中
王潇
孙伟
李博
卞真旭
金雨楠
钱光超
仇茹嘉
Current Assignee
State Grid Corp of China SGCC
Electric Power Research Institute of State Grid Anhui Electric Power Co Ltd
State Grid Anhui Electric Power Co Ltd
Original Assignee
State Grid Corp of China SGCC
Electric Power Research Institute of State Grid Anhui Electric Power Co Ltd
State Grid Anhui Electric Power Co Ltd
Priority date
Filing date
Publication date
Application filed by State Grid Corp of China SGCC, Electric Power Research Institute of State Grid Anhui Electric Power Co Ltd, State Grid Anhui Electric Power Co Ltd filed Critical State Grid Corp of China SGCC
Priority to CN202010864615.4A
Publication of CN112036178A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/3331 Query processing
    • G06F 16/334 Query execution
    • G06F 16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/30 Semantic analysis

Abstract

The invention relates to a distribution network entity related semantic search method, characterized by comprising the following steps: segmenting the text in a distribution network entity with a word segmentation method based on word-frequency statistics to obtain a first word bank; segmenting the same text with a model to obtain a second word bank; merging the word-frequency-based word bank and the model-based word bank; labeling mis-segmented proper nouns in the merged word bank; training extensively on the distribution network entity text with the labeled words to obtain a third word bank. The three word banks are combined on the basis of the Jieba word segmentation package, to which the labeled professional words can be added. Before the extensive training, the full text is first divided into short sentences using separators such as line breaks, periods and commas; the word-frequency-based segmentation counts the frequency with which any two words occur together.

Description

Distribution network entity related semantic search method
Technical Field
The invention discloses a distribution network entity related semantic search method, and relates to graph databases and semantic search.
Background
Searching for topological data entities currently requires, on the one hand, locating specific systems and physical tables with the assistance of service personnel, and on the other hand, data management personnel writing fixed query statements for correlated retrieval in a database. This mode of data resource retrieval is time-consuming and labor-intensive, and it scales poorly: whenever entities of a different type are searched, business personnel and data managers must repeat the work, and it is difficult to cover all distribution network topology data entities in the full-service unified data center.
To solve the problems of existing search engines, namely search content limited to key fields, low search efficiency, and disordered, one-dimensional query results, this project analyzes information search characteristics and performs semantics-based intelligent search of distribution network data resources on the basis of knowledge graph and natural language processing technologies. First, a distribution network entity name recognition model is trained to accurately recognize special entity names in the power grid field; second, natural language processing techniques such as lexical analysis, syntactic analysis and semantic recognition are used to semantically analyze the natural language questions input by the user; then, an inverted index is built over the text information in the knowledge graph, providing technical support for quickly locating the search target; finally, relevance is computed over the result list, and the most relevant results are returned to the user.
Disclosure of Invention
The invention aims to solve the problem of semantic search in a distribution network entity.
The technical scheme for realizing the invention is as follows:
the invention relates to a distribution network entity related semantic search method, which comprises step 1: performing word segmentation on the text in the distribution network entity, comprising:
s1, obtaining a first word bank for the text in the distribution network entity based on a word frequency method;
s2, performing word segmentation on the text in the distribution network entity by using the distribution network entity name recognition model to obtain a second word bank;
s3, merging the first word stock and the second word stock to obtain a merged word stock;
s4, manually deleting mis-segmented words from the merged word bank, and marking the correct forms of the mis-segmented words in the distribution network entity text to obtain a labeled text;
s5, segmenting the tagged text again by using the distribution network entity name recognition model to obtain a third word bank;
s6, repeating S2-S5 until a final word stock is obtained;
preferably, the word frequency-based method in step S1 is a processing method based on Jieba word segmentation;
preferably, before the step S1, the full text is divided into short sentences by separators such as line breaks, periods, commas, etc.;
preferably, the distribution network entity name identification model in step S2 is a BiLSTM-CRF model;
preferably, in step S6, steps S2 to S5 are repeated at least once;
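Purely as an illustration of the S1 to S6 word bank pipeline, a minimal Python sketch is given below. The helper names, the co-occurrence threshold of 5, and the use of jieba.add_word to register the curated bank are assumptions based on the description above; the second word bank is assumed to come from the BiLSTM-CRF model.

```python
# A minimal sketch of the S1/S3 word-bank steps; helper names are assumptions.
import jieba
from collections import Counter

def word_freq_bank(sentences, min_count=5):
    """S1: count adjacent character pairs; pairs occurring together more
    than min_count times are treated as new words."""
    pair_counts = Counter()
    for sent in sentences:
        for a, b in zip(sent, sent[1:]):
            pair_counts[a + b] += 1
    return {w for w, c in pair_counts.items() if c > min_count}

def merge_banks(freq_bank, model_bank):
    """S3: merge the word-frequency bank with the model-produced bank."""
    return freq_bank | model_bank

def register_bank(word_bank):
    """Feed the curated bank back to Jieba as user-dictionary entries."""
    for word in word_bank:
        jieba.add_word(word)
```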
a semantic search method related to distribution network entities comprises the following steps:
step 1: segmenting the text in the distribution network entity, and decomposing the segmented words into single-character form to obtain a character table and a phrase table;
step 2: vectorizing the characters in the character table from step 1 to obtain character vectors;
step 3: training the phrases in the phrase table from step 1 with a model to obtain word vectors;
step 4: combining the character vectors and word vectors obtained in steps 2 and 3 to obtain a context information vector;
step 5: feeding the context information vector from step 4 into a bidirectional LSTM for training to obtain the semantic information features of the input text;
step 6: inputting the output of the bidirectional LSTM from step 5 into a conditional random field, computing the optimal label sequence corresponding to the input words, and taking the maximum-probability sequence as the final class label of each phrase.
Preferably, the model in step 3 is a GloVe model.
Preferably, for the word vectors in step 3, each word vector represents one phrase, and the dimension of the word vectors is adjustable.
Compared with the prior art, the invention has the beneficial effects that:
in the distribution network entity related semantic search method, a Jieba word segmentation package based on word-frequency statistics is introduced, which greatly strengthens new-word discovery compared with the prior art; BiLSTM-CRF is also introduced, improving the model's recognition accuracy on distribution network named entities; and the manual labeling method makes the contents of the word bank more targeted;
in the invention, Jieba is first used to segment and analyze the text, BiLSTM-CRF is then used for recognition, and finally manual labeling makes the word bank match actual usage more closely.
In addition, the invention builds indexes, so searching is faster; the introduction of PageRank makes the displayed results more valuable: when a web page is linked many times, its weight is higher and it is displayed earlier.
Drawings
FIG. 1 is a flow chart of the word segmentation bank establishment;
FIG. 2 shows the labeling results;
FIG. 3 is the workflow of the whole Jieba word segmentation package;
FIG. 4 is a schematic diagram of the LSTM neural network structure;
FIG. 5 shows the transition probabilities;
fig. 6 is a flowchart of the specific steps of the distribution network entity related semantic search method.
Detailed Description
The invention discloses a distribution network entity related semantic search method. At present, full-service search engines mainly realize information retrieval through keyword decomposition, matching and similar means, and lack knowledge processing and comprehension capability. A brand-new semantic search method for distribution network entities is therefore provided, improving the relevance of distribution network entity semantic search results. Compared with traditional keyword search technology, the method introduces a BiLSTM-CRF deep learning model and establishes word vectors for power entities, improving the model's recognition accuracy on distribution network named entities; it adopts Whoosh-based knowledge index construction to improve search efficiency; and it uses knowledge search engine technology over a graph database to improve the accuracy of search results and user satisfaction.
The invention relates to a distribution network entity related semantic search method. Word bank preparation specifically comprises: segmenting the text in the distribution network entity with a word segmentation method based on word-frequency statistics to obtain a word bank; segmenting the same text with a model to obtain a second word bank; merging the word-frequency-based and model-based word banks; labeling mis-segmented proper nouns in the merged bank; and re-segmenting the labeled text with a BiLSTM-CRF-based distribution network entity name recognition model to finally obtain a third word bank.
The distribution network entity refers to a collection of file texts from different regions of the power service. The same object is often named differently in different regions: for example, the same transformer may go by several different names that all refer to one device. In the power service field, grouping these different names of the same object during manual labeling is referred to as classification. S1, the text is first segmented in a word-frequency-based manner to obtain a first word bank; this method counts the number of times two words occur consecutively, and when the count exceeds 5 the pair is considered a new word, so word-frequency-based segmentation serves new-word discovery. S2, a large number of texts are segmented with the BiLSTM-CRF distribution network entity name recognition model to obtain a second word bank. S3, the first and second word banks are merged. S4, mis-segmented words in the merged bank are manually checked, analyzed and deleted, and the text is manually labeled, marking special words and useless words: special words include transformers, voltage stabilizers and the like, while useless words include modal particles, conjunctions and the like. S5, the labeled text is segmented again with the BiLSTM-CRF-based distribution network entity name recognition model, and S6, a third word bank is finally obtained.
Further, two word segmentation methods are adopted in the invention, as shown in fig. 1. To obtain the segmentation word bank, a first word bank is obtained with the word-frequency-based method; the BiLSTM-CRF distribution network entity name recognition model is then used to segment a large number of texts, yielding a second word bank. The first and second word banks are merged, mis-segmented words are manually deleted, and the correct words are marked in the original texts; when the labeled text is later trained on, a marked word is automatically treated as one segment. The text is then segmented a second time with the BiLSTM-CRF-based distribution network entity name recognition model. Optionally, before this step, a word bank of modal particles and conjunctions can be used to preprocess the text before segmentation; the third word bank obtained by this segmentation is the usable word bank.
A preferred application of the invention takes the power service distribution network as the entity. In the distribution network entity, the text is first segmented with the word-frequency-based method. This is a full-segmentation method that does not depend on a dictionary: it counts the frequency with which any two words co-occur in an article, and a higher frequency indicates that they can form a word. This belongs to a small branch of Jieba word segmentation, and word-frequency segmentation is referred to below as segmentation using Jieba. The method first segments out all possible words matching the word list, then applies a statistical language model and a decision algorithm to obtain the optimal segmentation result. On this basis, the Jieba word segmentation package is used to segment the text; after extensive training, some twenty thousand words are written into a text file named ditt. The Jieba package implements efficient word-graph scanning based on a Trie structure, building a directed acyclic graph of all possible word formations of the Chinese characters in a sentence; it then uses dynamic programming to find the maximum-probability path and the maximum segmentation combination based on word frequency.
The specific steps of using the Jieba word segmentation package are as follows:
1. the whole text is divided into short sentences by delimiters such as line break, period, comma and the like.
2. And deleting spaces before and after each short sentence.
3. And eliminating short sentences without any characters and numbers.
Then word segmentation can be performed: each short sentence is segmented into words and the result is written into a dictionary. If the dictionary already contains a word, its frequency count is incremented by 1; if not, the word is added with a frequency count of 1. Finally, the result is persisted locally. When this is finished, the words produced by the Jieba package are checked for conformance to the rules: conforming words are used directly, and non-conforming ones are manually labeled, mainly the professional terms.
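A minimal sketch of this preprocessing and frequency counting, assuming the separators listed above and an illustrative output file name:

```python
# Short-sentence splitting and word-frequency dictionary, persisted locally.
import re
import json
import jieba

def build_freq_dict(full_text, out_path="freq_dict.json"):
    # 1. Split the full text into short sentences on line breaks,
    #    periods and commas.
    clauses = re.split(r"[\n。，.,]", full_text)
    # 2. Strip spaces before and after each short sentence.
    clauses = [c.strip() for c in clauses]
    # 3. Drop clauses containing no characters or digits.
    clauses = [c for c in clauses if re.search(r"[\w\u4e00-\u9fff]", c)]
    # Segment each clause and count word frequencies.
    freq = {}
    for clause in clauses:
        for word in jieba.cut(clause):
            freq[word] = freq.get(word, 0) + 1
    # Persist the result locally.
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(freq, f, ensure_ascii=False)
    return freq
```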
Further, Jieba word segmentation mainly performs word segmentation and part-of-speech tagging through a dictionary, and both use the same dictionary. Because of this, the segmentation result depends to a great extent on the dictionary. The workflow of the whole Jieba word segmentation package is shown in fig. 3:
in Jieba segmentation, a directed acyclic graph of the sentence is first generated against the dictionary; then, depending on the selected mode, the sentence is cut after a shortest path is found according to the dictionary, or cut directly. New-word discovery relies on the HMM model from machine learning.
HMM and CRF algorithms are labeling algorithms in machine learning, in which NER is treated as a sequence labeling problem: a labeling model is learned from large-scale corpora so that each position in a sentence can be labeled. Common models for the NER task include the generative model HMM and the discriminative model CRF. The HMM is used for new-word discovery of unregistered words (words not in the dictionary): if the HMM is enabled, consecutive characters that do not appear in the dictionary are concatenated and submitted to new-word discovery. For example, when a word in the example sentence is absent from the dictionary, it is handed to the HMM model for new-word discovery. However, if the original sentence is "today's weather is really good", the dictionary-based segmentation yields "today's weather", "really" and "good"; the word "really" is in the dictionary, but because its frequency is small it is not selected on the best path, so "really" and "good" cannot be taken together into new-word discovery (even though passing them through the HMM would yield "really good"), and the final segmentation result is "today's weather", "really", "good".

The Conditional Random Field (CRF) is the current dominant model for NER. Its objective function considers not only the input state feature functions but also label transition feature functions. SGD can be used to learn the model parameters during training. Once the model is known, solving the predicted output sequence for an input sequence, i.e., the optimal sequence maximizing the objective function, is a dynamic programming problem, and the optimal tag sequence can be obtained by decoding with the Viterbi algorithm. An advantage of the CRF is that it can exploit rich internal and contextual feature information when labeling a position.

The LSTM differs from the plain RNN unit in having four components interacting in a very specific way: through three gate structures (input gate, forget gate, output gate), the LSTM selectively forgets part of the historical information, adds part of the current input information, and finally integrates the current state to produce an output state. The BiLSTM-CRF model applied to NER mainly consists of an Embedding layer (word vectors, character vectors and some additional features), a bidirectional LSTM layer, and a final CRF layer. Experimental results show that BiLSTM-CRF reaches or exceeds the CRF model based on rich features and has become the most mainstream model in current deep-learning-based NER. In terms of features, the model inherits the advantages of deep learning: no feature engineering is needed, and good results can be achieved with word vectors and character vectors alone.
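The effect of the HMM switch on new-word discovery can be seen directly with Jieba's public API; the example sentence is illustrative:

```python
# New-word discovery with and without the HMM, as described above.
import jieba

sentence = "今天天气真好"  # "The weather is really good today"
print(list(jieba.cut(sentence, HMM=False)))  # dictionary/DAG path only
print(list(jieba.cut(sentence, HMM=True)))   # unmatched runs go to the HMM
```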
As shown in fig. 6, step 1 of the invention: segment the text with an open source tool, decompose phrases into single-character form, count and number the characters, words and labels, and construct a character table and a phrase table. The text is manually labeled and the text labels are counted to construct a label table, so that unrecognized professional vocabulary can be labeled, and useless vocabulary such as modal particles and conjunctions can also be labeled. Part-of-speech tagging algorithms currently fall into two main categories: rule-based methods, such as dictionary lookup based on string matching, and data-driven methods, such as machine learning methods. The dictionary lookup algorithm based on string matching works as follows: segment the sentence, look up the part of speech of each word in the dictionary, and tag it. This method is very simple, but it cannot solve the one-word-multiple-tags problem. Part-of-speech tagging with machine learning methods such as HMM and CRF works very well. An HMM involves 3 quantities, an initial state probability vector, a state transition probability matrix and an observation probability matrix, and generally poses three types of problems:
the probability calculation problem: given A, B, pi and the hidden state sequence, compute the probability of the observation sequence; the prediction problem, also called the decoding problem: given A, B, pi and the observation sequence, find the most likely corresponding state sequence; and the learning problem: given the observation sequence, estimate the model parameters A, B, pi so that the probability of the observation sequence under the model is maximal, i.e., estimate the parameters by maximum likelihood. The HMM, the hidden Markov model, is a statistical model based on the Markov assumption; it is "hidden" because, compared with a Markov process, the HMM has unknown parameters. The purpose of using any of these models is to predict the class Y for a given input X. A generative model learns the joint probability distribution P(X, Y) and then obtains the conditional probability through Bayes' theorem:
$$P(Y\mid X)=\frac{P(X,Y)}{P(X)}$$
The HMM is a directed-graph PGM of the generative type, modeled by the joint probability:
$$P(S,O)=\prod_{t=1}^{T}P(s_t\mid s_{t-1})\,P(o_t\mid s_t)$$
where $S$ and $O$ denote the state sequence and the observation sequence, respectively.
The decoding problem of the HMM is:
$$S^{*}=\arg\max_{S}P(S\mid O)$$
Expanding this recursively gives
$$\delta_t(i)=\max_{j}\left[\delta_{t-1}(j)\,a_{ji}\right]b_i(o_t),$$
which is the recursion of the Viterbi algorithm for the HMM decoding problem.
A preferred example: suppose there is an internet friend, Xiao Hong, who posts in her social feed every day what she did that day, and suppose what she does is affected only by that day's weather, which in turn is affected only by the previous day's weather. What Xiao Hong does every day is the visible state, and the weather there is the hidden state; together they constitute an HMM model. An HMM model requires five elements: a hidden state set, an observation set, transition probabilities, observation probabilities, and initial state probabilities. We define the hidden state set as N, which includes all hidden states that might appear; in this example we take
N = {sunny, cloudy, rainy},
and define the observation set as M, which includes all observable states that may occur; in this example we assume
M = {shopping, playing games, sleeping, sports}.
Next, an observation probability matrix is defined:
$$B=\left[b_{ij}\right],\qquad b_{ij}=P(M_j\mid N_i),$$
where $b_{ij}$ denotes the probability of observing the $j$-th observable state given the $i$-th hidden state.
In this example we assume the transition probabilities shown in fig. 5. In addition, an initial state probability vector pi is needed, which gives the probability of each hidden state at the beginning of the observation, i.e., at t = 0; in this example we specify pi = {0, 0, 1}. A complete hidden Markov model has now been defined. As noted above, HMMs generally pose three types of problems: the probability calculation problem (given A, B, pi and the hidden state sequence, compute the probability of the observation sequence); the prediction or decoding problem (given A, B, pi and the observation sequence, find the most likely state sequence); and the learning problem (given the observation sequence, estimate A, B, pi by maximum likelihood so that the probability of the observation sequence under the model is maximal).
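To make the decoding problem concrete, the following sketch runs the Viterbi recursion from above on the Xiao Hong example. pi = {0, 0, 1} follows the text, while the transition and observation matrices are illustrative assumptions standing in for fig. 5:

```python
# Viterbi decoding of the weather HMM; A and B are assumed numbers.
import numpy as np

states = ["sunny", "cloudy", "rainy"]                  # hidden set N
obs_names = ["shopping", "games", "sleeping", "sports"]  # observed set M
pi = np.array([0.0, 0.0, 1.0])                         # initial distribution
A = np.array([[0.6, 0.3, 0.1],                         # assumed transitions
              [0.3, 0.4, 0.3],
              [0.2, 0.4, 0.4]])
B = np.array([[0.4, 0.3, 0.1, 0.2],                    # assumed observations
              [0.3, 0.3, 0.3, 0.1],
              [0.1, 0.2, 0.6, 0.1]])

def viterbi(obs):
    delta = pi * B[:, obs[0]]
    back = []
    for o in obs[1:]:
        trans = delta[:, None] * A            # delta_{t-1}(j) * a_{ji}
        back.append(trans.argmax(axis=0))
        delta = trans.max(axis=0) * B[:, o]   # * b_i(o_t)
    path = [int(delta.argmax())]
    for ptr in reversed(back):
        path.append(int(ptr[path[-1]]))
    return [states[i] for i in reversed(path)]

print(viterbi([0, 2, 3]))  # shopping -> sleeping -> sports
```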
In the specific part-of-speech tagging problem, the observation sequence is the segmented sentence, and the state sequence (the tag sequence and the hidden state sequence mean the same thing here) is the part-of-speech tag sequence after tagging. The 3 model quantities can be computed from a large corpus. Thus, for the tagging problem, the HMM model parameters are easy to compute, and the problem becomes a model prediction problem: with a known model and observation sequence, combined with the segmented original sentence, predict the tag sequence (or state sequence).
The specific operation of step 2 is to represent the character features from step 1 as vectors: initialize the character table C, determine the dimension d1 of each character, and obtain the character vector matrix $Q \in \mathbb{R}^{d_1 \times |C|}$. The character vector matrix is used as the input of the bidirectional LSTM neural network, which encodes the characters to yield a fixed-size output vector $w_c \in \mathbb{R}^{d_1}$. For example, for "Hefei City Feixi County transformer failure", we initialize a matrix of size W[4][6]; the first column of the matrix holds the initialized conditional probability distribution, and the state probability distribution of each subsequent word is then calculated in turn from the transition probabilities. Proceeding in sequence to the end of the sentence yields the state probability distributions of all words, after which the corresponding state sequence can be calculated by determining and comparing the boundary conditions. The HMM is a common method in Chinese word segmentation. As described above, the segmentation states depend mainly on the labeling of the corpus: a sentence to be segmented is computed from the corpus-initialized probabilities, the state transition matrix and the conditional probability matrix. In short, the model learns the historical state experience of the corresponding words from the corpus and applies it to new input. The advantage of the HMM is that the model is computationally simple and usually very efficient, with good results on words that do not appear in the dictionary. The CRF is an efficient word segmentation model. Like the HMM, it treats the segmentation task as sequence labeling, tagging the characters of a sentence with the four states mentioned above (SMEW). Compared with the HMM, the CRF can exploit more features: in essence, the HMM is a generative model describing the joint probability distribution of known and unknown quantities, whereas the CRF models the conditional probability directly. CRF features are richer, and feature information can be added through custom feature functions; in general, the information a CRF can model should include the state transition and data initialization features of the HMM. The CRF is generally superior to the HMM both in theory and in practice, and its features mainly comprise two parts: 1. simple features, related only to the current state; 2. transition features, involving pairs of states. Feature templates are a technique often used in engineering practice.
Step 3: pre-train the word features extracted in step 1 with a GloVe model to obtain word vectors $w_p \in \mathbb{R}^{d_2}$. The word vectors obtained by GloVe training contain good semantic features; each vector represents one phrase, and the dimension of the word vectors can be chosen freely, with 50, 100, 200 and so on being the usual options.
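A minimal sketch of loading such pre-trained vectors from a plain-text GloVe file; the file name is an assumption, and in practice the vectors would be trained on the distribution network corpus:

```python
# Load GloVe vectors from a text file of "word v1 v2 ... vd" lines.
import numpy as np

def load_glove(path="glove_power_100d.txt"):
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors  # one vector w_p per phrase, dimension fixed at training
```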
Step 4: combine the character vector from step 2 and the word vector from step 3 into a context information vector w = [w_c, w_p]. Contextual utterances that are relevant to the query may be useful, while irrelevant content adds noise. We therefore propose a variant that explicitly weights the context vectors by the attention score of context-query relevance. First, we compute context similarity using cosine values:
$$s_i=\cos(q,c_i)=\frac{q\cdot c_i}{\lVert q\rVert\,\lVert c_i\rVert},$$
under the condition
$$q=\sum_{w\in q}e_w,\qquad c_i=\sum_{w\in c_i}e_w,$$
that is, each sentence vector is the sum of its word vectors.
Following the attention mechanism (Bahdanau et al., 2014), we normalize these similarities with the softmax function and obtain the attention probabilities:
$$\alpha_q=\frac{\exp(s_q)}{\exp(s_q)+\sum_{j}\exp(s_j)},\qquad \alpha_i=\frac{\exp(s_i)}{\exp(s_q)+\sum_{j}\exp(s_j)},$$
where $s_q$ is calculated in the same way and, as the cosine of two identical vectors, always equals 1; the attention probabilities always sum to 1. If the context is less relevant, we should focus primarily on the query itself; if the context is relevant, we should spread attention more evenly between the context and the query. WSeq(sum), in which the weighted vectors are added together:
$$w=\alpha_q\,q+\sum_{i}\alpha_i\,c_i;$$
WSeq(concat), in which the weighted vectors are concatenated:
$$w=\left[\alpha_q\,q;\;\alpha_1 c_1;\;\ldots;\;\alpha_n c_n\right].$$
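A small numpy sketch of this weighting scheme, assuming the sentence vectors have already been formed as sums of word vectors:

```python
# Cosine similarities -> softmax attention -> WSeq(sum) / WSeq(concat).
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def wseq(query, contexts, mode="sum"):
    sims = np.array([cosine(query, query)] +       # s_q, always 1
                    [cosine(query, c) for c in contexts])
    alphas = np.exp(sims) / np.exp(sims).sum()     # softmax attention
    vecs = [query] + list(contexts)
    weighted = [a * v for a, v in zip(alphas, vecs)]
    if mode == "sum":                              # WSeq(sum)
        return np.sum(weighted, axis=0)
    return np.concatenate(weighted)                # WSeq(concat)
```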
Step 5: feed the context information vector w obtained in step 4 into a bidirectional LSTM for training to obtain the semantic information features of the input text. A schematic of the bidirectional LSTM neural network is shown in fig. 4: it has two hidden layers, one representing a forward LSTM network and the other a backward LSTM network, each with fixed-size LSTM kernels. The LSTM kernel in the bidirectional LSTM network is an improvement on the RNN (recurrent neural network): by adding forgetting and saving mechanisms, input information is selectively forgotten and retained, which effectively avoids the vanishing or exploding gradients of the RNN during differentiation. The LSTM network comprises an input layer, two hidden layers and a softmax layer, and is trained by back propagation. Its specific formulas are as follows:
$$i_t=\sigma(W_i x_t+U_i h_{t-1}+b_i)$$
$$f_t=\sigma(W_f x_t+U_f h_{t-1}+b_f)$$
$$o_t=\sigma(W_o x_t+U_o h_{t-1}+b_o)$$
$$\tilde{c}_t=\tanh(W_c x_t+U_c h_{t-1}+b_c)$$
$$c_t=f_t\odot c_{t-1}+i_t\odot\tilde{c}_t$$
$$h_t=o_t\odot\tanh(c_t)$$
where $i_t$, $f_t$ and $o_t$ are the saving (input), forgetting and output mechanisms, respectively; $b_i$, $b_f$ and $b_o$ are the corresponding bias vectors; $t$ denotes the current time and $t-1$ the previous time; $W$ and $U$ denote the corresponding weights; and $c_t$ and $h_t$ are, respectively, the output of the activation function (cell state) and the hidden output at the current time.
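A minimal PyTorch sketch of step 5, with illustrative dimensions; the per-token outputs of the bidirectional LSTM serve as the semantic features passed on to the CRF:

```python
# Bidirectional LSTM encoder: one forward and one backward hidden layer.
import torch
import torch.nn as nn

class BiLSTMEncoder(nn.Module):
    def __init__(self, input_dim=150, hidden_dim=128):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim,
                            batch_first=True, bidirectional=True)

    def forward(self, w):                  # w: (batch, seq_len, input_dim)
        features, _ = self.lstm(w)         # (batch, seq_len, 2*hidden_dim)
        return features                    # semantic features per token
```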
Step 6: input the output of the bidirectional LSTM from step 5 into the conditional random field, compute the optimal label sequence corresponding to the input words, and take the maximum-probability sequence as the final class label of each phrase.
In this step, the schematic diagram of the conditional random field structure is shown in fig. 4. The model takes an input vector $X=\{x_1,x_2,\ldots,x_n\}$ and outputs a tag sequence $Y=\{y_1,y_2,\ldots,y_n\}$. For a given input sequence $X$ taking value $x$, the conditional probability of the tag sequence $y$ is $p(y\mid X)$, with the specific formulas as follows:
$$s(X,y)=\sum_{i=1}^{n}A_{y_{i-1},y_i}+\sum_{i=1}^{n}P_{i,y_i}$$
$$p(y\mid X)=\frac{\exp\bigl(s(X,y)\bigr)}{\sum_{y'}\exp\bigl(s(X,y')\bigr)}$$
where $P_{i,y_i}$ is the BiLSTM emission score of tag $y_i$ at position $i$ and $A_{y_{i-1},y_i}$ is the tag transition score.
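A sketch of the CRF decoding in step 6, assuming emission scores P from the BiLSTM and a learned transition matrix A; this is the same Viterbi-style dynamic program expressed over s(X, y):

```python
# Pick the tag sequence maximizing s(X, y) given emissions and transitions.
import numpy as np

def crf_decode(P, A):
    """P: (seq_len, n_tags) emission scores; A: (n_tags, n_tags) transitions."""
    n, k = P.shape
    score = P[0].copy()
    back = np.zeros((n, k), dtype=int)
    for t in range(1, n):
        total = score[:, None] + A + P[t]   # all previous-tag options
        back[t] = total.argmax(axis=0)
        score = total.max(axis=0)
    tags = [int(score.argmax())]
    for t in range(n - 1, 0, -1):
        tags.append(int(back[t][tags[-1]]))
    return tags[::-1]                        # the optimal tag sequence
```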
and further, the marked phrases are marked in the original text, so that the user can conveniently check the phrases. For the sentence "which transformer in the Fei He City in October fails? "the labeling results are shown in FIG. 2: [ 'October', 'CofeI', 'Presence', 'Fault', 'of', 'Transformer', 'of', 'which' ].
After segmenting the texts of multiple different distribution network entities, an index database is established. For knowledge index construction, traditional data query requires data management personnel to write fixed query statements for correlated retrieval in a database. The present method is instead based on Whoosh database knowledge search, whose main characteristics are: an agile Pythonic API, indexing by field, a good framework, and pluggable modules such as scoring, word segmentation and storage. These characteristics allow the system to identify and accurately return the information the user wants to find.
The method comprises the following concrete steps:
1. An Index object is first required; a Schema object must be defined when the Index object is first created, which lists all the fields of the Index. A field is a piece of information for each document in the Index object, such as its title or its content. A field can be indexed (i.e., searchable) and/or stored (i.e., returned with results after indexing, which is useful for fields such as titles).
2. When creating the index object, keyword parameters are needed to map field names to types; these determine what is searchable on the index. Whoosh has many very useful predefined field types, as follows:
whoosh.fields.ID: this type simply indexes the value of the field as a single unit (meaning it is not split into separate words). This is useful for fields such as file paths, URLs, times and categories.
whoosh.fields.STORED: this type is stored with the document but is not indexed, so it is not searchable. This is useful for document information you want to show to the user in the search results.
whoosh.fields.KEYWORD: this type is designed for space- or comma-separated keywords. It is indexed and searchable (and can optionally be stored). To save space, phrase search is not supported.
whoosh.fields.TEXT: this type is for the document body. Term positions are stored to allow phrase search.
whoosh.fields.NUMERIC: this type is designed for numbers; it can store integers or floating-point numbers.
whoosh.fields.BOOLEAN: this type stores boolean values.
After creating the Schema object, an index can be created with the create_in function. When creating the index, note that not every field has to be filled: Whoosh allows unfilled fields; an indexed text field must be a unicode value, while a field that is stored but not indexed (the STORED field type) can be any serializable object.
3. Use in combination with Jieba word segmentation.
A Jieba word segmentation analysis module is added to QueryString, since jieba versions after 0.30 provide a ChineseAnalyzer word segmentation interface for Whoosh. When whoosh.fields.TEXT is created in the Whoosh schema object, the field attributes of the declared TEXT field by default include an analyzer, a class with a __call__ magic method used to analyze the TEXT field. When called, the value in the TEXT field is processed by __call__; the analyzer receives a unicode string as its parameter and returns the segmentation of that string. Finally, the return value is stored in serialized form. Knowledge search engine technology in a graph database may preferably be used.
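A minimal sketch of building a Whoosh index with the Jieba ChineseAnalyzer (provided by jieba 0.30 and later); the schema fields, index directory and sample document are illustrative assumptions:

```python
# Whoosh index whose TEXT fields are segmented by Jieba's ChineseAnalyzer.
import os
from whoosh.fields import Schema, TEXT, ID
from whoosh.index import create_in
from jieba.analyse import ChineseAnalyzer

analyzer = ChineseAnalyzer()  # its __call__ segments the TEXT field values
schema = Schema(title=TEXT(stored=True, analyzer=analyzer),
                path=ID(stored=True),
                content=TEXT(stored=True, analyzer=analyzer))

os.makedirs("indexdir", exist_ok=True)
ix = create_in("indexdir", schema)
writer = ix.writer()
writer.add_document(title="变压器台账", path="/doc/1",
                    content="合肥市肥西县变压器故障记录")
writer.commit()
```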
Preferably, unlike document retrieval on the traditional internet, intelligent search must process finer-grained structured semantic data, so natural language processing must be combined with knowledge graph technology: on the one hand, the user's search content is deeply analyzed; on the other hand, the knowledge graph is used to optimize the ranking of the results so that they better match the user's search habits and intent, realizing intelligent search. Graph databases enable graph-based search mainly because: a graph database provides a model and query language that support the natural structure of the data, allowing enterprises to structure data accurately as it is generated and to query it by its inherent structure; all contents in the graph database carry rich metadata, enabling users to search and discover quickly in real time; and the built-in model of a graph database is very flexible, letting data architects and developers easily modify the data and its structure.
The theoretical basis of graph databases is graph theory: data is represented and stored through nodes, edges and attributes. Specifically, graph databases are based on directed graphs, of which nodes, edges and attributes are the core concepts.
Node: represents objects such as entities and events; e.g., people and places can be nodes in the graph.
Edge: a directed line connecting nodes in the graph, representing the relationship between nodes; e.g., event relationships between person nodes, among others.
Attribute: describes the characteristics of a node or edge; e.g., the name of a person (node), or the start and end times of an event relationship (edge).
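As a minimal in-memory illustration of this node/edge/attribute model (a real deployment would use a graph database), consider:

```python
# Nodes with attributes, and directed edges carrying their own attributes.
nodes = {
    "n1": {"type": "person", "name": "张三"},
    "n2": {"type": "place", "name": "合肥"},
}
edges = [
    {"from": "n1", "to": "n2", "relation": "located_in",
     "attrs": {"start": "2020-01", "end": "2020-08"}},
]
```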
The knowledge graph is used as a background knowledge base, so that the search result can be obviously improved, but the search engine technology is still very important to provide complete intelligent search service.
The search engine technology used in this patent is described in detail below.
An index is a storage structure created in advance by a search engine over the target information content to speed up the information search process. Among indexing techniques, the inverted index is the most common. The inverted index arose from practical application: when a user searches for information with a search engine, often only some attributes or keywords of the relevant information are entered, which requires the search engine to find all related web pages (documents) containing those keywords and return them sorted by importance. Hence the inverted index technique based on a keyword-to-document mapping structure. Its basic principle is as follows: after numbering the N collected documents, Chinese word segmentation is first used to split the sentences in each document into words, and stop words are removed against a stop-word list, forming the set of all words; for each word in the set, the numbers of the documents in which it occurs are recorded. These two simple steps establish the inverted index structure. For an input query sentence, the search engine first performs word segmentation analysis on it, then looks up the resulting words in the inverted index file to obtain the documents containing them, and finally sorts the documents by similarity and returns them to the user.
In practical application, the inverted index structure is much more complex, and much extra information is attached to each word, such as its positions and frequency of occurrence in the document; the basic principle, however, is exactly as described above. A sketch of the basic principle alone, without the positional and frequency information of a production index, follows; the stop-word list is illustrative.
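```python
# Build a keyword -> document-number inverted index, then query it.
import jieba
from collections import defaultdict

STOP_WORDS = {"的", "了", "和"}  # illustrative stop-word list

def build_inverted_index(documents):
    index = defaultdict(set)
    for doc_id, text in enumerate(documents):
        for word in jieba.cut(text):
            if word not in STOP_WORDS:
                index[word].add(doc_id)
    return index

def search(index, query):
    words = [w for w in jieba.cut(query) if w not in STOP_WORDS]
    if not words:
        return set()
    # Documents containing every query word; a full system would then
    # rank these by similarity.
    return set.intersection(*(index.get(w, set()) for w in words))
```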
When retrieval is complete and the results are numerous, the techniques adopted in the present invention for ranking and displaying the retrieved content are PageRank and TF-IDF.
PageRank is the key technique for addressing result quality. The model was proposed by the founders of Google, and its core idea is: if a web page is linked by many other web pages, it is important. Since different web pages have different quality, their influence weights on a common target page differ when they link to it; the rank of the target page is roughly the sum of the weights of all the other pages pointing to it.
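A sketch of this idea as the usual iterative computation; the damping factor of 0.85 is the conventional choice and an assumption here, as the text does not specify one:

```python
# Iterative PageRank: a page's weight is the damped sum of the weights
# of the pages linking to it, divided by their out-degrees.
import numpy as np

def pagerank(links, n_pages, d=0.85, iters=50):
    """links: list of (source, target) pairs between page indices."""
    out_deg = np.zeros(n_pages)
    for src, _ in links:
        out_deg[src] += 1
    rank = np.full(n_pages, 1.0 / n_pages)
    for _ in range(iters):
        new_rank = np.full(n_pages, (1 - d) / n_pages)
        for src, dst in links:
            new_rank[dst] += d * rank[src] / out_deg[src]
        rank = new_rank
    return rank
```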
TF-IDF is the key technique for addressing relevance, where TF is the term frequency and IDF is the inverse document frequency.
A simple way to measure the relevance of a web page to a query is to use directly the total frequency with which each keyword appears in the document. The problem is that a professional term occurs less frequently than a common word, yet its distinguishing power for a web page is very high; measuring relevance by word frequency alone fails to reflect this. For this reason, the concept of IDF was proposed.
If a word appears in only a few web pages, the search target can be locked onto relatively easily through it, so its weight should be larger. The most common measure of this weight is the IDF, computed as follows:
$$\mathrm{IDF}(w)=\log\frac{D}{D_w}$$
where $D$ is the total number of documents and $D_w$ is the number of documents in which the keyword $w$ appears. By this calculation, a higher weight can be given to specialized vocabulary.
The formula for calculating the relevance of a query to a document is:
$$TF_1\cdot IDF_1+TF_2\cdot IDF_2+\cdots+TF_n\cdot IDF_n$$
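A sketch of this relevance formula applied to one query over a segmented document collection:

```python
# Sum of TF * IDF over the query keywords, scored per document.
import math
import jieba

def tf_idf_relevance(query, documents):
    docs = [list(jieba.cut(d)) for d in documents]
    D = len(docs)
    scores = []
    for words in docs:
        score = 0.0
        for kw in jieba.cut(query):
            tf = words.count(kw) / max(len(words), 1)   # term frequency
            d_w = sum(1 for d in docs if kw in d)       # docs containing kw
            idf = math.log(D / d_w) if d_w else 0.0     # IDF(w) = log(D/D_w)
            score += tf * idf
        scores.append(score)
    return scores  # TF1*IDF1 + ... + TFn*IDFn per document
```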
The results obtained from the inverted index still need to be arranged properly so that the content the user most wants appears first. The ranking of the search results depends on two things: the quality of the results (documents or web pages), and the relevance of the results to the query content.
The above-listed detailed description is only a specific description of a possible embodiment of the present invention, and they are not intended to limit the scope of the present invention, and equivalent embodiments or modifications made without departing from the technical spirit of the present invention should be included in the scope of the present invention.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned.
Furthermore, it should be understood that although the present description refers to embodiments, not every embodiment may contain only a single embodiment, and such description is for clarity only, and those skilled in the art should integrate the description, and the embodiments may be combined as appropriate to form other embodiments understood by those skilled in the art.

Claims (8)

1. A distribution network entity related semantic search method, characterized by comprising step 1: performing word segmentation on the text in the distribution network entity;
the word segmentation of the text in the distribution network entity comprises the following steps:
s1, obtaining a first word bank for the text in the distribution network entity based on a word frequency method;
s2, performing word segmentation on the text in the distribution network entity by using the distribution network entity name recognition model to obtain a second word bank;
s3, merging the first word stock and the second word stock to obtain a merged word stock;
s4, manually deleting mis-segmented words from the merged word bank, and marking the correct forms of the mis-segmented words in the distribution network entity text to obtain a labeled text;
s5, segmenting the tagged text again by using the distribution network entity name recognition model to obtain a third word bank;
and S6, repeating S2-S5 until a final word stock is obtained.
2. The distribution network entity related semantic search method according to claim 1, wherein the word-frequency-based method of S1 is a processing method based on the Jieba word segmentation package.
3. The method according to claim 1, wherein before step S1 is performed, the text of the distribution network entity is divided into short sentences using delimiters such as line breaks, periods and commas.
4. The method for semantic search related to distribution network entities according to claim 1, wherein the distribution network entity name identification model of step S2 is a model of BiLSTM-CRF.
5. The distribution network entity related semantic search method according to claim 1, wherein in step S6, steps S2 to S5 are repeated at least once.
6. A semantic search method related to a distribution network entity is characterized by comprising the following steps:
step 1: segmenting the text in the distribution network entity, and decomposing the segmented words into single-character form to obtain a character table and a phrase table;
step 2: vectorizing the characters in the character table of step 1 to obtain character vectors;
step 3: training the phrases in the phrase table of step 1 with a model to obtain word vectors;
step 4: combining the character vectors and word vectors obtained in steps 2 and 3 to obtain a context information vector;
step 5: feeding the context information vector obtained in step 4 into a bidirectional LSTM for training to obtain semantic information features of the text input from the distribution network entity;
step 6: inputting the output of the bidirectional LSTM of step 5 into a conditional random field, computing the optimal label sequence corresponding to the input words, and taking the maximum-probability sequence as the final class label of each phrase.
7. The distribution network entity related semantic search method according to claim 6, wherein the model in step 3 is a GloVe model.
8. The method according to claim 6, wherein the word vectors in step 3 each represent a phrase, and the dimension of the word vector can be adjusted.
CN202010864615.4A 2020-08-25 2020-08-25 Distribution network entity related semantic search method Pending CN112036178A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010864615.4A CN112036178A (en) 2020-08-25 2020-08-25 Distribution network entity related semantic search method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010864615.4A CN112036178A (en) 2020-08-25 2020-08-25 Distribution network entity related semantic search method

Publications (1)

Publication Number Publication Date
CN112036178A true CN112036178A (en) 2020-12-04

Family

ID=73581364

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010864615.4A Pending CN112036178A (en) 2020-08-25 2020-08-25 Distribution network entity related semantic search method

Country Status (1)

Country Link
CN (1) CN112036178A (en)


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107644014A (en) * 2017-09-25 2018-01-30 南京安链数据科技有限公司 A kind of name entity recognition method based on two-way LSTM and CRF
CN107885721A (en) * 2017-10-12 2018-04-06 北京知道未来信息技术有限公司 A kind of name entity recognition method based on LSTM
CN107908614A (en) * 2017-10-12 2018-04-13 北京知道未来信息技术有限公司 A kind of name entity recognition method based on Bi LSTM
CN109271631A (en) * 2018-09-12 2019-01-25 广州多益网络股份有限公司 Segmenting method, device, equipment and storage medium
CN109388803A (en) * 2018-10-12 2019-02-26 北京搜狐新动力信息技术有限公司 Chinese word cutting method and system
CN110287482A (en) * 2019-05-29 2019-09-27 西南电子技术研究所(中国电子科技集团公司第十研究所) Semi-automation participle corpus labeling training device
CN110879831A (en) * 2019-10-12 2020-03-13 杭州师范大学 Chinese medicine sentence word segmentation method based on entity recognition technology

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113065343A (en) * 2021-03-25 2021-07-02 天津大学 Enterprise research and development resource information modeling method based on semantics
CN113312891A (en) * 2021-04-22 2021-08-27 北京墨云科技有限公司 Automatic payload generation method, device and system based on generative model
CN113312891B (en) * 2021-04-22 2022-08-26 北京墨云科技有限公司 Automatic payload generation method, device and system based on generative model
CN112861540A (en) * 2021-04-25 2021-05-28 成都索贝视频云计算有限公司 Broadcast television news keyword automatic extraction method based on deep learning
CN116049267A (en) * 2022-12-26 2023-05-02 上海朗晖慧科技术有限公司 Multi-dimensional intelligent identification chemical article searching and displaying method


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination