CN112507707A

CN112507707A - Correlation degree analysis and judgment method for innovative technologies in different fields of power internet of things

Info

Publication number: CN112507707A
Application number: CN202011408521.2A
Authority: CN
Inventors: 高昇宇; 皮一晨; 朱红; 周冬旭; 张玮亚; 刘少君; 胡年超; 李存斌; 王其清
Original assignee: Nanjing Power Supply Co of State Grid Jiangsu Electric Power Co Ltd
Current assignee: Nanjing Power Supply Co of State Grid Jiangsu Electric Power Co Ltd
Priority date: 2020-12-04
Filing date: 2020-12-04
Publication date: 2021-03-16

Abstract

The invention relates to a method for analyzing and judging the degree of correlation between innovative technologies in different fields of the power Internet of Things, and belongs to the technical field of data processing methods for management. The method comprises: dividing the power Internet of Things into 8 sub-fields, retrieving documents, and extracting the title, abstract, keywords and publication year of each document as document data; extracting the sentences of each abstract that contain keywords as input to the spaCy tool, training an entity recognition model, and traversing each sentence of the abstracts with that model to obtain the key technical terms of the power Internet of Things; mapping the Chinese and English literature data to a Chinese-English bilingual word embedding matrix using a word embedding model; constructing a co-occurrence matrix of key technical terms and sub-fields; calculating the two-dimensional mutual information of any two sub-fields; and finally judging the association strength between the innovative technologies of any two sub-fields according to that mutual information. The method provides a reliable data source for judging the degree of association between innovative technologies of the power Internet of Things in different fields.

Description

Correlation degree analysis and judgment method for innovative technologies in different fields of power internet of things

Technical Field

The invention relates to a method for analyzing and judging mutual cooperative relationship among innovative technologies in different sub-fields of an electric power internet of things, and belongs to the technical field of data processing methods suitable for management.

Background

The power Internet of Things is a cyber-physical system, and its construction is also a process of innovatively applying Internet of Things technologies in the power system. Studying the technical coupling points and the collaborative innovation relationship between Internet of Things technologies and the power system helps to identify key technical breakthrough points of the power Internet of Things and to develop efficient innovation paths.

At present, research on the coupling and collaboration between the power system and innovative Internet of Things technologies focuses on the development of Internet of Things technologies. However, because the power Internet of Things is a cyber-physical system whose technical innovation covers both the construction of the power system and that of the Internet of Things, the known research cannot provide an effective and reliable basis for judging the development direction of innovative power Internet of Things technologies.

Disclosure of Invention

The technical problem to be solved by the invention is: to provide an effective and reliable data basis for judging the development direction of innovative power Internet of Things technologies.

The technical scheme provided by the invention to solve this problem is as follows: a correlation degree analysis and judgment method for innovative technologies in different fields of the power Internet of Things, comprising the following steps:

step 1, dividing the power Internet of Things into sub-fields and collecting document data, specifically as follows:

dividing the power Internet of Things into 8 sub-fields: the power source end, network end, load end and storage end, and the Internet of Things sensing layer, network layer, computing layer and application layer; constructing Chinese and English literature search formulas for the sub-fields according to their definitions, each search formula comprising a plurality of search terms; searching for each technical sub-field in CNKI and the Web of Science core database according to the search formulas to acquire Chinese documents and English documents respectively; and extracting the title, abstract, keywords and publication year of each Chinese document and each English document as Chinese document data and English document data, which together form the Chinese and English literature data;

step 2, obtaining key technical terms of the power internet of things, specifically as follows:

step 2.1, extracting, from the abstract of each document, the sentences that contain that document's keywords, and using the extracted sentences as input to the spaCy tool to train an entity recognition model;

step 2.2, traversing each sentence of all abstracts in the Chinese and English literature data with the entity recognition model to perform entity recognition; if a recognized entity is in the same sentence as a keyword of the document, taking the entity as a key technical term of the power Internet of Things; and counting the number of occurrences of all key technical terms of the power Internet of Things in the Chinese and English literature data;

step 3, performing unified vectorization of the Chinese and English literature data of the power Internet of Things, and mapping the data to a Chinese-English bilingual word embedding matrix using a word embedding model, specifically as follows:

step 3.1, defining a custom Chinese-English translation anchor file that specifies the one-to-one correspondence between common Chinese words and their English translations; obtaining additional Chinese-English correspondences by calling Baidu Translate to translate into English the search terms of the Chinese search formulas and the Chinese key technical terms obtained in step 2.2; and adding these additional correspondences to the translation anchor file;

step 3.2, segmenting the Chinese and English literature data with Chinese and English word segmentation tools respectively to obtain a Chinese word sequence and an English word sequence; training word2vec on the Chinese word sequence and on the English word sequence separately to obtain a word vector for each word, the word vectors forming a Chinese literature word embedding matrix and an English literature word embedding matrix respectively, where the size of each matrix is the number of words in the corresponding Chinese or English literature data multiplied by the common word vector dimension d;

step 3.3, constructing a bilingual word vector mapping model from Chinese to English, as shown in formula (1),

$$W^{*} = \underset{W \in M_d(\mathbb{R})}{\arg\min}\; \lVert WS - T \rVert_F \qquad (1)$$

where $d$ denotes the word vector dimension; $M_d(\mathbb{R})$ denotes the set of $d \times d$ real matrices over the real number field $\mathbb{R}$; $S$ and $T$ denote the Chinese literature word embedding matrix and the English literature word embedding matrix respectively; $W$ is a weight matrix; argmin denotes minimizing the distance $\lVert WS - T \rVert_F$ between the Chinese literature word embedding matrix $S$ and the English literature word embedding matrix $T$; $\lVert \cdot \rVert_F$ denotes the Frobenius norm; the calculation yields the optimal weight matrix $W^{*}$; and the Chinese-English bilingual word embedding matrix is obtained through the bilingual word vector mapping model.

step 4, constructing a co-occurrence matrix of the key technical terms and the sub-fields, specifically as follows:

step 4.1, dividing the search terms into 8 classes according to the 8 sub-fields, and taking the word vector of each search term in the Chinese-English bilingual word embedding matrix obtained in step 3 as the word vector $v$ of that search term;

step 4.2, selecting the word vector of each key technical term from the Chinese-English bilingual word embedding matrix obtained in step 3 as the word vector $u$ of that term; calculating the similarity $D(u, v)$ between $u$ and the word vector $v$ of a search term as their cosine similarity; and setting a similarity threshold of 0.3, such that when $D(u, v) > 0.3$ the key technical term belongs to the sub-field of the search term with word vector $v$;

step 4.3, obtaining the membership of the key technical terms in the sub-fields according to step 4.2; taking the total number of occurrences in the literature data of all key technical terms belonging to a sub-field as the number of co-occurrences of key technical terms with that sub-field; and constructing the co-occurrence matrix of key technical terms and sub-fields, divided by publication year of the documents;

step 5, calculating the mutual information of any two sub-fields, specifically as follows:

step 5.1, for any two sub-fields $x_1$ and $x_2$, calculating the one-dimensional information entropies $H(x_1)$ and $H(x_2)$ of the two sub-fields according to formula (2),

wherein $x$ is a sub-field and $c_i$ is the number of co-occurrences of key technical terms with the $i$-th ($i = 1, 2, \ldots, 8$) sub-field;

step 5.2, calculating the two-dimensional information entropy $H(x_1, x_2)$ of the two sub-fields $x_1$ and $x_2$ according to formula (3),

wherein $c_1$ and $c_2$ are the numbers of co-occurrences of key technical terms with the sub-fields $x_1$ and $x_2$ respectively;

the two-dimensional mutual information of the two sub-fields $x_1$ and $x_2$ is then calculated by formula (4),

$$H(x_1) + H(x_2) - H(x_1, x_2) \qquad (4)$$

and the degree of correlation between the innovative technologies of the two sub-fields $x_1$ and $x_2$ is judged according to this two-dimensional mutual information.

The beneficial effects of the invention are as follows: the power Internet of Things spans sub-fields of both power and the Internet of Things, yet most existing research on its innovative technologies that draws on scientific literature relies on bibliometric statistics, and therefore lacks data analysis of the key technologies described in the content of that literature and of the relationship between those technologies and the related sub-fields. The method instead analyzes the text of the power Internet of Things literature: it mines the key technical terms contained in the literature of each sub-field, establishes the membership of those terms in the sub-fields, and counts the co-occurrences of key technical terms with the sub-fields, thereby providing a more reliable data source for judging the degree of collaborative association between innovative power Internet of Things technologies in different fields.

Drawings

The method for analyzing and judging the association degree of innovative technologies in different fields of the power internet of things is further described with reference to the accompanying drawings.

Fig. 1 is a distribution diagram of the Chinese literature word embedding matrix in a two-dimensional plane.

Fig. 2 is a distribution diagram of the English literature word embedding matrix in a two-dimensional plane.

Fig. 3 is a distribution diagram of the Chinese-English bilingual word embedding matrix in a two-dimensional plane.

Fig. 4 is a diagram of the mutual information between three pairs of sub-fields: source-load, source-storage and network-storage.

Detailed Description

Examples

The relevance degree analysis and judgment method for the innovative technologies in different fields of the power internet of things comprises the following steps:

dividing the power Internet of Things into 8 sub-fields: the power source end, network end, load end and storage end, and the Internet of Things sensing layer, network layer, computing layer and application layer; constructing Chinese and English literature search formulas for the sub-fields according to their definitions, each search formula comprising a plurality of search terms; searching for each technical sub-field in CNKI and the Web of Science core database according to the search formulas to obtain Chinese documents and English documents respectively; and extracting the title, abstract, keywords and publication year of each document (Chinese and English) as document data (comprising Chinese document data and English document data); part of the Chinese and English literature search formulas of this example is shown in Table 1 below,

TABLE 1

The number of documents retrieved and acquired in this embodiment is shown in table 2.

TABLE 2

step 2.1, extracting, from the abstract of each document, the sentences that contain that document's keywords, and using the extracted sentences as input to the spaCy tool to train an entity recognition model, spaCy being an open-source NLP tool for word segmentation, entity recognition and part-of-speech tagging that supports custom training of entity recognition models;

step 2.2, traversing each sentence of all abstracts in the document data with the entity recognition model to perform entity recognition; if a recognized entity is in the same sentence as a keyword of the document, taking the entity as a key technical term of the power Internet of Things; and counting the number of occurrences of all key technical terms of the power Internet of Things in the document data. The key technical terms with the highest occurrence counts are shown in Table 3.

TABLE 3
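Steps 2.1 and 2.2 can be sketched roughly as follows. This is a minimal illustration and not the implementation of the invention: `recognize_entities` is a hypothetical callable standing in for the trained entity recognition model, the punctuation-based sentence splitter is a simplification, and the toy documents are invented for the example.

```python
import re
from collections import Counter

def split_sentences(abstract):
    """Split on Chinese and English sentence-ending punctuation (a rough heuristic)."""
    pieces = re.findall(r"[^.!?。！？]+[.!?。！？]?", abstract)
    return [p.strip() for p in pieces if p.strip()]

def extract_key_terms(documents, recognize_entities):
    """Return a Counter of key technical terms over all abstracts.

    documents: iterable of dicts with "abstract" (str) and "keywords" (list of str).
    recognize_entities: hypothetical callable, sentence -> list of entity strings,
    standing in for the trained entity recognition model.
    """
    mentions = []       # every entity mention recognized in any sentence
    key_terms = set()   # entities seen in the same sentence as a document keyword
    for doc in documents:
        for sentence in split_sentences(doc["abstract"]):
            entities = list(recognize_entities(sentence))
            mentions.extend(entities)
            if any(keyword in sentence for keyword in doc["keywords"]):
                key_terms.update(entities)
    # Count only the mentions of entities that qualified as key technical terms.
    return Counter(m for m in mentions if m in key_terms)

# Toy stand-in for the trained model: a fixed vocabulary lookup (assumption).
def toy_recognizer(sentence):
    vocabulary = ["edge computing", "smart meter", "blockchain"]
    return [term for term in vocabulary if term in sentence]

documents = [
    {"abstract": "We apply edge computing to the smart meter. The blockchain is out of scope.",
     "keywords": ["smart meter"]},
    {"abstract": "We measure edge computing latency.", "keywords": ["latency"]},
]
term_counts = extract_key_terms(documents, toy_recognizer)
```

In this toy run "blockchain" is recognized as an entity but never shares a sentence with a document keyword, so it is not counted as a key technical term.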

Step 3, performing unified vectorization of the Chinese and English literature data of the power Internet of Things, and mapping the data to a Chinese-English bilingual word embedding matrix using a word embedding model. To prevent differences between the Chinese and English literature data from affecting the assignment of key technical terms of the power Internet of Things to sub-fields, a multilingual natural language processing word embedding technique is used to vectorize the Chinese and English literature data, yielding Chinese and English literature word embedding matrices distributed in the same vector space. This makes it straightforward to establish the membership of the key technical terms in the sub-fields. The specific steps are as follows:

step 3.1, defining a custom Chinese-English translation anchor file that specifies the one-to-one correspondence between common Chinese words and their English translations; obtaining additional Chinese-English correspondences by calling Baidu Translate to translate into English the search terms of the Chinese search formulas and the Chinese key technical terms obtained in step 2.2; and adding these additional correspondences to the translation anchor file;

step 3.2, segmenting the document data (Chinese document data and English document data) with Chinese and English word segmentation tools (such as jieba and nltk) respectively to obtain the Chinese word sequence and the English word sequence of the document data; and training word2vec on the Chinese word sequence and on the English word sequence separately to obtain a word vector for each word (the word2vec model represents words as multidimensional vectors, so that text is mapped to a word embedding matrix formed by those vectors). The word vectors form a Chinese literature word embedding matrix and an English literature word embedding matrix respectively, and the size of each matrix is the number of words in the corresponding document data (Chinese or English) multiplied by the word vector dimension d. A word vector represents a word as a vector, and its dimension is the number of elements the vector contains. The word vector dimension is set to 300 in this embodiment. Fig. 1 and Fig. 2 show the distributions of the Chinese literature word embedding matrix and the English literature word embedding matrix in a two-dimensional plane, respectively.

Step 3.3, constructing a bilingual word vector mapping model from Chinese to English, as shown in formula (1),

$$W^{*} = \underset{W \in M_d(\mathbb{R})}{\arg\min}\; \lVert WS - T \rVert_F \qquad (1)$$

where $d$ denotes the word vector dimension; $M_d(\mathbb{R})$ denotes the set of $d \times d$ real matrices over the real number field $\mathbb{R}$; $S$ and $T$ denote the Chinese literature word embedding matrix and the English literature word embedding matrix respectively; $W$ is a weight matrix; argmin denotes minimizing the distance $\lVert WS - T \rVert_F$ between the Chinese literature word embedding matrix $S$ and the English literature word embedding matrix $T$; $\lVert \cdot \rVert_F$ denotes the Frobenius norm; and the calculation yields the optimal weight matrix $W^{*}$. The Chinese-English bilingual word embedding matrix is obtained through the bilingual word vector mapping model.

The optimization goal of the model is to solve for a weight matrix $W$ that minimizes the distance $\lVert WS - T \rVert_F$ between the Chinese literature word embedding matrix and the English literature word embedding matrix, thereby unifying the vector spaces in which the two matrices lie. The model can be converted into a Procrustes problem and solved iteratively by singular value decomposition and gradient descent to obtain the optimal $W^{*}$. The translation anchor file provides one-to-one correspondences between some Chinese and English reference words, so the distance between any Chinese word vector and any English word vector in the embedding matrices can be calculated indirectly from the distance between each word vector and the reference words of its own language. The weight matrix $W^{*}$ thus maps the Chinese literature word embedding matrix into the same vector space as the English literature word embedding matrix, so that word vectors from both matrices can be compared with one another and together form the Chinese-English bilingual word embedding matrix, which contains the word vectors of all word sequences of the Chinese and English literature data. The distribution of the Chinese-English bilingual word embedding matrix in a two-dimensional plane is shown in Fig. 3.
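The Procrustes step mentioned above can be illustrated with a short NumPy sketch. It makes assumptions the text does not state: $W$ is restricted to orthogonal matrices (the classical orthogonal Procrustes setting, which has a closed-form SVD solution), and $S$ and $T$ are $d \times n$ matrices whose $i$-th columns hold the vectors of the $i$-th Chinese-English anchor pair. The iterative gradient-descent refinement described in the text is not shown, and the data are random placeholders.

```python
import numpy as np

def procrustes_mapping(S, T):
    """Return the orthogonal W that minimizes ||W S - T||_F.

    S, T: d x n arrays; column i of S (Chinese) and column i of T (English)
    are assumed to be the vectors of one anchor pair.
    """
    # For orthogonal W the minimizer is U V^T, where U, V come from
    # the SVD of the d x d cross-covariance matrix T S^T.
    U, _, Vt = np.linalg.svd(T @ S.T)
    return U @ Vt

# Synthetic check: hide a known orthogonal map Q and try to recover it.
rng = np.random.default_rng(0)
d, n = 5, 40                                   # toy sizes; the embodiment uses d = 300
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))   # random orthogonal matrix
S = rng.normal(size=(d, n))                    # stand-in "Chinese" anchor vectors
T = Q @ S                                      # stand-in "English" anchor vectors
W = procrustes_mapping(S, T)
mapped_S = W @ S                               # Chinese vectors in the English space
```

Once `W` is estimated from the anchor pairs, the same matrix would be applied to every Chinese word vector so that both languages share one space.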

step 4.1, extracting the search terms from the search formulas in Table 1, dividing them into 8 classes according to the 8 sub-fields, and taking the word vector of each search term in the Chinese-English bilingual word embedding matrix obtained in step 3 as the word vector $v$ of that search term;

step 4.2, selecting the word vector of each key technical term from the Chinese-English bilingual word embedding matrix obtained in step 3 as the word vector $u$ of that term; calculating the similarity $D(u, v)$ between $u$ and the word vector $v$ of a search term as their cosine similarity; and setting a similarity threshold of 0.3, such that when $D(u, v) > 0.3$ the key technical term belongs to the sub-field of the search term with word vector $v$;

step 4.3, obtaining the membership of each key technical term in the 8 sub-fields according to the calculation in step 4.2; taking the total number of occurrences in the document data of all key technical terms belonging to a sub-field as the number of co-occurrences of key technical terms with that sub-field (term-field co-occurrences for short); and, dividing by the publication year of the document to which each key technical term belongs, constructing the co-occurrence matrix of key technical terms and the 8 sub-fields, as shown in Table 4.

TABLE 4
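Steps 4.2 and 4.3 can be sketched in plain Python as follows. This is an illustration under stated assumptions: a term is assigned to a sub-field when at least one of that sub-field's search-term vectors exceeds the 0.3 threshold (so a term may belong to several sub-fields), and the two-dimensional vectors, term names and yearly counts are invented toy values.

```python
import math
from collections import defaultdict

THRESHOLD = 0.3  # similarity threshold from step 4.2

def cosine(u, v):
    """Cosine similarity D(u, v); returns 0.0 for a zero vector."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def assign_subfields(term_vectors, search_vectors):
    """Map each key term to the sub-fields with a search term where D(u, v) > 0.3."""
    membership = defaultdict(set)
    for term, u in term_vectors.items():
        for subfield, vectors in search_vectors.items():
            if any(cosine(u, v) > THRESHOLD for v in vectors):
                membership[term].add(subfield)
    return membership

def cooccurrence(membership, yearly_counts):
    """Build matrix[year][subfield] = summed occurrences of that sub-field's terms.

    yearly_counts[year][term] is the occurrence count of a term in that year's documents.
    """
    matrix = defaultdict(lambda: defaultdict(int))
    for year, counts in yearly_counts.items():
        for term, n in counts.items():
            for subfield in membership.get(term, ()):
                matrix[year][subfield] += n
    return matrix

# Toy data (assumed values, not taken from Tables 1 to 4).
term_vectors = {"edge computing": [1.0, 0.0], "battery": [0.0, 1.0]}
search_vectors = {"computing layer": [[0.9, 0.1]], "storage end": [[0.1, 0.9]]}
yearly_counts = {2010: {"edge computing": 3, "battery": 2}, 2011: {"battery": 5}}

membership = assign_subfields(term_vectors, search_vectors)
matrix = cooccurrence(membership, yearly_counts)
```

Each row of `matrix` corresponds to one publication year, which is the form needed later when the entropies are computed per year.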

Step 5: calculating the mutual information of any two sub-fields.

Step 5.1: for any two sub-fields $x_1$ and $x_2$, calculating the one-dimensional information entropies $H(x_1)$ and $H(x_2)$ of the two sub-fields according to formula (2),

wherein $x$ is a sub-field and $c_i$ is the number of co-occurrences of key technical terms with the $i$-th ($i = 1, 2, \ldots, 8$) sub-field; for example, the one-dimensional information entropy of the power source end sub-field in 2010 is

Step 5.2: calculating the two-dimensional information entropy $H(x_1, x_2)$ of the two sub-fields $x_1$ and $x_2$ according to formula (3); the two-dimensional mutual information of the two sub-fields is then calculated by formula (4),

$$H(x_1) + H(x_2) - H(x_1, x_2) \qquad (4)$$

Fig. 4 shows the calculated mutual information between three pairs of sub-fields: source-load, source-storage and network-storage. Table 5 shows the two-dimensional average mutual information of the 8 sub-fields, obtained by averaging the two-dimensional mutual information of each pair of sub-fields over 2010-2019,

TABLE 5 (mbit)

The degree of correlation between the innovative technologies of any two sub-fields $x_1$ and $x_2$ can be judged from the two-dimensional mutual information calculated above.
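Formula (4) and the yearly averaging used for Table 5 can be sketched as below. Formulas (2) and (3) are not reproduced in this text, so the sketch assumes the entropies have already been computed and takes them as inputs; the function names and the numeric values are illustrative only and are not results from the patent.

```python
import math
from statistics import mean

def mutual_information(h1, h2, h12):
    """Formula (4): H(x1) + H(x2) - H(x1, x2)."""
    return h1 + h2 - h12

def average_mutual_information(entropies_by_year):
    """Average formula (4) over years for one pair of sub-fields.

    entropies_by_year[year] = (H(x1), H(x2), H(x1, x2)), assumed precomputed
    with formulas (2) and (3).
    """
    return mean(mutual_information(*h) for h in entropies_by_year.values())

# Made-up entropy values for one sub-field pair in two years.
example = {2010: (1.0, 0.5, 1.25), 2011: (1.0, 1.0, 1.25)}
average = average_mutual_information(example)
```

A larger averaged value for a pair of sub-fields would be read as a stronger association between their innovative technologies.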

The above description covers only the preferred embodiment of the present invention, but the present invention is not limited thereto. All equivalents and modifications of the inventive concept and its technical solutions are intended to be included within the scope of the present invention.

Claims

1. A correlation degree analysis and judgment method for innovative technologies in different fields of the power Internet of things is characterized by comprising the following steps:

step 2.2, traversing each sentence in all abstracts in the Chinese and English literature data by using the entity recognition model to perform entity recognition, and if the recognized entity is in the same sentence with the keyword of the literature, taking the entity as a key technical term of the power Internet of things;

Where $d$ denotes the word vector dimension; $M_d(\mathbb{R})$ denotes the set of $d \times d$ real matrices over the real number field $\mathbb{R}$; $S$ and $T$ denote the Chinese literature word embedding matrix and the English literature word embedding matrix respectively; $W$ is a weight matrix; argmin denotes minimizing the distance $\lVert WS - T \rVert_F$ between the Chinese literature word embedding matrix $S$ and the English literature word embedding matrix $T$; $\lVert \cdot \rVert_F$ denotes the Frobenius norm; the calculation yields the optimal weight matrix $W^{*}$; and the Chinese-English bilingual word embedding matrix is obtained through the bilingual word vector mapping model.

Wherein $x$ is a sub-field and $c_i$ is the number of co-occurrences of key technical terms with the $i$-th ($i = 1, 2, \ldots, 8$) sub-field;

$$H(x_1) + H(x_2) - H(x_1, x_2) \qquad (4)$$