CN112732889A - Scholar retrieval method and device based on cooperative network

Scholar retrieval method and device based on cooperative network

Info

Publication number
CN112732889A
CN112732889A
Authority
CN
China
Prior art keywords
scholar
vector
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011420372.1A
Other languages
Chinese (zh)
Inventor
张道枫
李微
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University filed Critical Southeast University
Priority to CN202011420372.1A priority Critical patent/CN112732889A/en
Publication of CN112732889A publication Critical patent/CN112732889A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/335Filtering based on additional data, e.g. user or group profiles
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31Indexing; Data structures therefor; Storage structures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/338Presentation of query results
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/38Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Library & Information Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a scholar retrieval method based on a cooperative network, which comprises the following steps: step one: building a scholar cooperative network; step two: optimizing the scholar cooperative network structure; step three: training a scholar node representation model; step four: establishing a scholar vector index; step five: scholar retrieval. The scheme realizes a retrieval model based on word embedding; compared with the traditional keyword retrieval mode, the word-embedding-based retrieval model can make full use of semantic information and improve the recall rate of retrieval.

Description

Scholar retrieval method and device based on cooperative network
Technical Field
The invention relates to a retrieval method and a retrieval device, in particular to a scholar retrieval method based on a cooperative network, and belongs to the technical field of information retrieval.
Background
Innovation is the source of economic growth for modern countries and regions, and a central problem of regional innovation is talent introduction. Scholars in universities undertake scientific research projects and apply high-value scientific and technological achievements to production, which can promote industrial upgrading and improve competitiveness. How to accurately find scholars in massive information has become the key to talent introduction.
Scholar retrieval plays an important role in talent introduction. Scholars are experts in one or more leading-edge fields who master the latest scientific research trends, possess extensive interpersonal relationships, and can provide guidance for the research, development, and production of enterprises. By analyzing the domain characteristics of scholars' scientific research data and constructing a scholar retrieval model, governments and enterprises can locate scholars and complete talent introduction work.
Scholar retrieval has received extensive attention and intensive research from experts in various fields, and the research results have been successfully applied to systems such as AMiner and encyclopedia-style sites. However, existing systems and models only consider scholar attributes or paper information. The cooperation relationships among scholars also carry a large amount of information, and applying them to scholar retrieval can improve retrieval accuracy.
Disclosure of Invention
Aiming at the problems in the prior art, the invention provides a scholar retrieval method based on a cooperative network. The technical scheme provides scholar retrieval based on word embedding: a word embedding model can learn semantic information, and adding network topology attributes to the model can improve the precision of the retrieval model. The scholar retrieval process includes scholar node representation and scholar vector retrieval. First, a scholar cooperative network is constructed, network topology information is added to a word embedding model, and a scholar node representation vector is obtained, consisting of a text vector and a node vector. The text vector is generated from the text of the current node and its neighboring nodes, and the node vector is generated from the neighboring nodes. Then the scholar vectors are retrieved using a product quantization model, and the relevant scholars are returned. Testing shows that, compared with traditional models, the word-embedding-based retrieval model can improve the recall ratio while maintaining the precision ratio.
In order to achieve the above object, the technical solution of the present invention is as follows: a scholar retrieval method based on a cooperative network, the method comprising the following steps (a pipeline sketch follows the list):
step one: building a scholar cooperative network;
step two: optimizing the scholar cooperative network structure;
step three: training a scholar node representation model;
step four: establishing a scholar vector index;
step five: scholar retrieval.
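The five steps can be organized as a single pipeline, sketched below in Python; every function and attribute name in this sketch (read_scholar_data, build_cooperation_network, and so on) is a hypothetical placeholder for the components detailed in the embodiments, not an API defined by the invention.

```python
# Hedged pipeline sketch of the five steps; all names are illustrative placeholders.
def scholar_retrieval_pipeline(db, query, top_n=10):
    scholars, outputs = read_scholar_data(db)             # step one: read data
    network = build_cooperation_network(outputs, len(scholars))
    network = optimize_network_structure(network)         # step two
    model = train_node_representation(network)            # step three
    index = build_vector_index(model.scholar_vectors)     # step four
    q = model.encode_text(query)                          # step five: retrieval
    return index.search(q, top_n)
```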
As an improvement of the invention, step one, building the scholar cooperative network, specifically comprises the following. Scholar data and scholar cooperative output data are read from a database, comprising: scholar data (ID, author, organization, age, job title, profile);
scholar cooperative output data (ID, title, author, organization, abstract, journal, year, keywords). After the data is read, it is preprocessed and the scholar cooperative network is constructed. The scholar cooperative network takes scholars as nodes and the quantity of cooperative outputs between scholars as edge weights. The construction process mainly uses a network toolkit: the cooperative outputs of the scholars, such as papers and patents, are input first; the adjacency matrix of scholar cooperation is initialized according to the number of scholars counted from the input; the participants of each paper or project are processed in a loop to update the adjacency matrix; and the adjacency matrix finally output is the scholar cooperative network.
As an improvement of the invention, step two, the optimization of the scholar cooperative network structure, is as follows:
Step 1) logarithmically normalize the weights of the edges.
Step 2) calculate the pairwise topological similarity, text similarity, and path distance between scholars.
Step 3) calculate the pairwise total similarity between scholars and sort by similarity.
Step 4) select the scholar pairs in the top 10 percent of similarity and add cooperation relationships between them.
Step 5) select the scholar pairs in the bottom 10 percent of similarity and delete the cooperation relationships between them.
Step 6) update the scholar cooperative network and its adjacency matrix.
The edge weights are normalized in step 1) to narrow the gap in the number of cooperations between different scholars: the more cooperations, the higher the similarity of the scholars and the smaller the distance between them should be. The calculation formula is shown in formula (1):
p_ij = 1/ln(p_ij + 1)   (1)
where p_ij is the weight of the edge.
Topology-based similarity represents the similarity between nodes in the graph in terms of topological structure: if two nodes have common neighbor nodes, they are more likely to be similar and to have a cooperation relationship. The calculation formula is shown in formula (2):
topoSim(u,v) = 2*|N(u)∩N(v)|/(d(u)+d(v))   (2)
where N(u) is the set of collaborators of scholar u, N(u)∩N(v) is the set of common collaborators of scholars u and v, d(u) is the degree of scholar u, and d(v) is the degree of scholar v.
Text similarity represents the similarity of papers between scholars: the more similar the papers published by two scholars, the more potential there is for a cooperation relationship. The calculation formula is shown in formula (3), the cosine similarity of the text vectors:
textSim(u,v) = (x_u · x_v)/(|x_u|*|x_v|)   (3)
where x_u is the text vector of scholar u; the text information is encoded with the BERT model.
The scholar similarity combines the topological structure of the network, node attribute information, and the path distance between scholars. The calculation formula is shown in formula (4):
authorSim(u,v) = textSim(u,v)*topoSim(u,v)/dist(u,v)   (4)
where authorSim(u,v) is the similarity between scholars u and v and dist(u,v) is the shortest path distance between them. When computing the shortest path between scholars u and v, if an edge already exists between the two scholar nodes, it should be deleted before the shortest path is computed.
The algorithm deletes the cooperation relationships between scholars with low similarity and adds cooperation relationships between scholars with high similarity, so the optimized cooperative network is closer to a real scholar community. Better results can be achieved in the tests of the scholar representation model and the scholar community discovery model.
As an improvement of the invention, step three, training the scholar node representation model, specifically comprises the following:
The CANE model fuses text context information into the network embedding vectors and improves the model's ability to represent nodes in the network. However, that model focuses on mining the network topology information of the nodes, and the text information is only used as a supplement to the network information. The CANE model is therefore suitable for network-related tasks such as community discovery and link prediction, but difficult to apply to information retrieval. In addition, CANE adopts a bag-of-words model and ignores word order information.
Using the BERT model and drawing on the idea of the CANE model, a scholar node representation model suitable for information retrieval is provided. Network topology information of the scholars is added to the original word embedding model to obtain scholar representation vectors that represent the scholars more accurately. Let the scholar cooperative network be G = (V, E), where each vertex represents a scholar and each edge e = (u, v) represents the relationship between scholar u and scholar v. The input of the model is {Tu, Tv, Su, Sv}, representing the text information of scholar u, the text information of scholar v, the node of scholar u, and the node of scholar v, respectively. The calculation flow of the model is as follows:
(1) encode the text information of scholar u, the text information of scholar v, the node of scholar u, and the node of scholar v, where the text information of scholars u and v is encoded with BERT and the initial encodings of the scholar nodes are randomly generated;
(2) send each encoding into a convolution layer to obtain the convolved matrices;
(3) concatenate the convolved matrices to obtain a matrix T;
(4) encode the matrix T with self-attention to obtain a matrix M; the Transformer can learn the implicit information between texts, between nodes, and between texts and nodes;
(5) finally, multiply the matrix M by the respective weights to output u_t, v_t, u_s and v_s, which represent the text vector of scholar u, the text vector of scholar v, the network structure vector of scholar u, and the network structure vector of scholar v, respectively.
Unlike the CANE model, in which text embedding and node embedding are learned separately, the node representation model implemented here can learn text information and network structure information at the same time through a self-attention layer, so the text and network structure information can be mapped into the same vector space. In the model training process, scholar pairs (u, v) are input and the model is updated. In addition, special data, namely empty node pairs, such as (scholar u, empty node) and (scholar v, empty node), are added to the training input so that a scholar can also be encoded independently.
Second, the loss function of the CANE model focuses more on the network information of the nodes than on the text information. The model implemented here places the emphasis of the optimization objective on the text information. The overall objective of the model is shown in formula (5):
ε = Σ_{e∈E} L(e)   (5)
The scholar vector is composed of a paper information vector and a network structure vector. L_s(e) and L_t(e) are the optimization objective based on the network structure and the optimization objective based on the text, respectively, combined as shown in formula (6):
L(e) = L_s(e) + L_t(e)   (6)
For the network structure objective L_s(e), neighboring nodes are assumed to have similar structures. Assuming there is an edge between scholar nodes u and v, with v_s the structure vector of node v, the specific formula is shown in formula (7):
L_s(e) = w_{u,v} * log p(v_s | u_s)   (7)
For the text objective L_t(e), the model takes scholar u and paper p as input to predict whether u wrote paper p, so that the node representation of a scholar becomes closer to its text representation; since the papers of the current scholar and those of the collaborators have a certain similarity, the specific formulas are shown in formulas (8)-(10):
L_p(e) = w_{u,v} * log p(v_t | v_s)   (8)
L_tt(e) = w_{u,v} * log p(v_t | u_t)   (9)
L_t(e) = α*L_tt(e) + β*L_p(e)   (10)
where v_t and u_t are the text representations of scholars v and u, respectively. The loss function implemented here pays more attention to the text information of the scholars than to the network information; in actual tests, the scholar node representations of this model achieve better results in retrieval.
As an improvement of the invention, step four, the establishment of the scholar vector index, is as follows:
Scholar retrieval is mainly realized through vector retrieval. The time complexity of computing the vector similarities by cosine values is O(n) = n*D, where n is the number of vectors and D is the dimension of the vectors; the larger n and D, the higher the complexity. Building a vector index compresses the vectors and accelerates retrieval.
(1) Scholar node vector indexing;
The scholar vectors are indexed with the product quantization method integrated in Milvus. The principle of the algorithm is to divide the original vector space into different subspaces, cluster each subspace, represent the original vector by the cluster centers of the subspaces, and obtain the similarity of the original vectors by computing the similarities within the different subspaces.
The main calculation processes of model training are vector clustering, distance calculation between cluster centers, and mapping of the original vectors to the cluster centers. Vector clustering adopts the Kmeans algorithm; according to the principle of Kmeans, the time complexity of clustering is shown in formula (11):
O(n) = l*n*k*d   (11)
where l is the number of iterations, n is the number of vectors, k is the number of cluster centers, and d is the dimension of the vectors.
From the vector distance calculation formula, the time complexity of computing the distances between cluster centers is O(n) = k*k*d, and the time complexity of mapping the original vectors to their cluster centers is O(n) = n*k*d. The training uses m subspaces in total, each of dimension d = D/m, where D is the dimension of the original vector. Therefore, the total complexity of training is shown in formula (12):
O(n) = m*(l*n*k*D/m) + m*(k*k*D/m) + m*(n*k*D/m)
     = D*k*(n*l + k + n)   (12)
As shown in formula (12), when the number of iterations l and the number of cluster centers k of the clustering model are constant, the time complexity simplifies to O(n) = D*n, so the time complexity of product quantization model training is only related to the vector dimension and the number of vectors. After training, the clustering models of the different subspaces are saved, and the distances between cluster centers are stored in a table for lookup during vector retrieval.
As an improvement of the invention, step five, scholar vector retrieval, is as follows:
After the scholar vector index is established, the user inputs a retrieval subject and the system returns the relevant scholars. The specific retrieval steps are:
(1) vectorize the user's retrieval subject through the scholar representation model and output a retrieval vector Q;
(2) read the scholar vector index and the trained cluster centers;
(3) divide the retrieval vector Q into m subspaces and find the cluster center corresponding to each subspace;
(4) look up, in the index table, the approximate distances between the retrieval vector Q and the indexed vectors in the m subspaces;
(5) add the approximate distances of all subspaces to obtain the approximate distance between the retrieval vector Q and each scholar to be retrieved;
(6) sort the approximate distances of all scholars and return the top n results.
From the above, the time complexity of product quantization retrieval is O(n) = n*m; when D is much larger than m, the time complexity of product quantization is much smaller than that of directly computing cosine values.
The device comprises a data storage module, a background management module, and a user retrieval module. The data storage module stores the data required by the system: scholar vector data for scholar retrieval, and unstructured data for displaying scholars, institutions, and outputs. According to the data characteristics, storage is divided into two components: a Milvus component that stores the vector indexes, and an ES component that stores the unstructured data. The background management module preprocesses the collected scholar and scientific research data, constructs the scholar cooperative network, trains the model, and builds the data index. The data preprocessing component reads new data from the database; its main work includes data cleaning and scholar cooperative network construction. The model training component trains the scholar node representation model and generates the scholar vectors, and the index component passes the scholar vectors through the product quantization model to generate the vector index. The user retrieval module provides scholar retrieval and information display: the retrieval component is responsible for similarity calculation and ranking, and the visualization component is responsible for information display and user interaction.
Compared with the prior art, the invention has the following advantages:
1) The scheme realizes a retrieval model based on word embedding; compared with the traditional keyword retrieval mode, it can make full use of semantic information and improve the recall rate of retrieval.
2) The scheme realizes a scholar retrieval model based on the cooperative network; a traditional text-based scholar retrieval model considers only text information and ignores the cooperation relationships between scholars, so the cooperative-network-based model achieves a better effect.
3) The scheme adopts a product quantization algorithm to build the vector index, which occupies less space and retrieves faster than traditional vector retrieval.
Drawings
FIG. 1 is a schematic diagram of the algorithm process;
FIG. 2 is a flow chart of scholar cooperative network construction;
FIG. 3 is a partial view of a scholar cooperative network;
FIG. 4 is a schematic diagram of the scholar node representation model;
FIG. 5 is a flow chart of vector index training;
FIG. 6 is a pseudocode diagram of the product quantization algorithm;
FIG. 7 is a pseudocode diagram of the product quantization similarity calculation algorithm;
FIG. 8 is a functional block diagram of the system;
FIG. 9 is a system implementation class diagram;
FIG. 10 is a comparison diagram of scholar node representation models;
FIG. 11 is a comparison of P-R curves of different query models;
FIG. 12 is a comparison diagram of the indexes of each model under query;
FIG. 13 is a comparison diagram of model retrieval times;
Detailed Description
For the purposes of promoting an understanding and appreciation of the invention, reference will now be made in detail to the present embodiments of the invention.
Example 1: A scholar retrieval method based on a cooperative network comprises the following steps:
Step one: build the scholar cooperative network.
Step two: optimize the scholar cooperative network structure.
Step three: train the scholar node representation model.
Step four: establish the scholar vector index.
Step five: scholar retrieval.
The schematic diagram of the algorithm process is shown in fig. 1.
Each step is described in detail below.
The method comprises the following steps: the student cooperative network is constructed as follows:
reading scholars data and scholars cooperative output data from a database, comprising:
scholars data (ID, author, organization, age, job title, profile);
scholars collaboratively producing data (ID, title, author, organization, abstract, journal, year, keyword); after the data is read, the data needs to be preprocessed, and a student cooperation network is constructed. The student cooperation network takes the students as nodes, and the quantity of cooperative output of the students is the weight of edges. The network tool kit is mainly used in the building process of the student cooperative network, and the building process is shown in fig. 2.
As can be seen from FIG. 2, the cooperative outcome of the learner, such as data of articles, patents, etc., is first entered. And initializing an adjacency matrix of student cooperation according to the input statistic student number. The participants in each paper or project are then processed in a loop to update the adjacency matrix. And finally outputting the adjacency matrix, namely the learner cooperative network. The effect of the student cooperative network construction is shown in fig. 3.
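A minimal Python sketch of this construction follows; the input format (one list of scholar IDs per paper, patent, or project) and the function name are assumptions made for illustration.

```python
import numpy as np

def build_cooperation_network(outputs, num_scholars):
    # outputs: list of cooperative outputs, each a list of scholar IDs
    # (integers in [0, num_scholars)) who co-produced it.
    adj = np.zeros((num_scholars, num_scholars), dtype=np.int32)
    for authors in outputs:             # loop over papers, patents, projects
        for i, u in enumerate(authors):
            for v in authors[i + 1:]:   # every unordered co-author pair
                adj[u, v] += 1          # edge weight = number of cooperations
                adj[v, u] += 1
    return adj                          # adjacency matrix = cooperative network

# Example: three outputs among four scholars.
print(build_cooperation_network([[0, 1], [0, 1, 2], [2, 3]], 4))
```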
Step two: the scholar cooperative network structure is optimized as follows:
The scholar cooperative network generated from the raw data suffers from missing data, noise, and similar problems, so the constructed network cannot reflect the real scholar community. The network structure of the original cooperative network is therefore optimized according to a node similarity algorithm: cooperation relationships are added between scholars with high similarity and deleted between scholars with low similarity. The algorithm combines the topological structure of the network, node attribute information, and the path distance between scholars. The specific process is as follows:
Step 1) logarithmically normalize the weights of the edges.
Step 2) calculate the pairwise topological similarity, text similarity, and path distance between scholars.
Step 3) calculate the pairwise total similarity between scholars and sort by similarity.
Step 4) select the scholar pairs in the top 10 percent of similarity and add cooperation relationships between them.
Step 5) select the scholar pairs in the bottom 10 percent of similarity and delete the cooperation relationships between them.
Step 6) update the scholar cooperative network and its adjacency matrix.
The edge weights are normalized in step 1) to narrow the gap in the number of cooperations between different scholars: the more cooperations, the higher the similarity of the scholars and the smaller the distance between them should be. The calculation formula is shown in formula (1):
p_ij = 1/ln(p_ij + 1)   (1)
where p_ij is the weight of the edge.
Topology-based similarity represents the similarity between nodes in the graph in terms of topological structure. If two nodes have common neighbor nodes, they are more likely to be similar and to have a cooperation relationship. The calculation formula is shown in formula (2):
topoSim(u,v) = 2*|N(u)∩N(v)|/(d(u)+d(v))   (2)
where N(u) is the set of collaborators of scholar u, N(u)∩N(v) is the set of common collaborators of scholars u and v, d(u) is the degree of scholar u, and d(v) is the degree of scholar v.
Text similarity represents the similarity of papers between scholars: the more similar the papers published by two scholars, the more potential there is for a cooperation relationship. The calculation formula is shown in formula (3), the cosine similarity of the text vectors:
textSim(u,v) = (x_u · x_v)/(|x_u|*|x_v|)   (3)
where x_u is the text vector of scholar u; the text information is encoded with the BERT model.
The scholar similarity combines the topological structure of the network, node attribute information, and the path distance between scholars. The calculation formula is shown in formula (4):
authorSim(u,v) = textSim(u,v)*topoSim(u,v)/dist(u,v)   (4)
where authorSim(u,v) is the similarity between scholars u and v and dist(u,v) is the shortest path distance between them. When computing the shortest path between scholars u and v, if an edge already exists between the two scholar nodes, it should be deleted before the shortest path is computed.
The algorithm deletes the cooperation relationships between scholars with low similarity and adds cooperation relationships between scholars with high similarity, so the optimized cooperative network is closer to a real scholar community. Better results can be achieved in the tests of the scholar representation model and the scholar community discovery model.
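The following Python sketch implements formulas (1)-(4) under the stated assumptions (in particular, that formula (3) is the cosine similarity of the BERT text vectors); the networkx graph interface and the function names are illustrative choices, and the handling of an existing direct edge follows the deletion rule above.

```python
import numpy as np
import networkx as nx

def normalize_weight(p):
    # Formula (1): log-normalized edge weight.
    return 1.0 / np.log(p + 1)

def topo_sim(G, u, v):
    # Formula (2): common-neighbour similarity.
    nu, nv = set(G[u]), set(G[v])
    return 2 * len(nu & nv) / (G.degree(u) + G.degree(v))

def text_sim(x_u, x_v):
    # Formula (3), assumed to be cosine similarity of BERT text vectors.
    return float(x_u @ x_v / (np.linalg.norm(x_u) * np.linalg.norm(x_v)))

def author_sim(G, u, v, x):
    # Formula (4); an existing direct edge is removed before the
    # shortest-path distance is computed, as required above.
    H = G.copy()
    if H.has_edge(u, v):
        H.remove_edge(u, v)
    try:
        dist = nx.shortest_path_length(H, u, v)
    except nx.NetworkXNoPath:
        return 0.0                      # unreachable pair: no similarity
    return text_sim(x[u], x[v]) * topo_sim(G, u, v) / dist
```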
Step three: the scholar node representation model is trained as follows:
The CANE model fuses text context information into the network embedding vectors and improves the model's ability to represent nodes in the network. However, that model focuses on mining the network topology information of the nodes, and the text information is only used as a supplement to the network information. The CANE model is therefore suitable for network-related tasks such as community discovery and link prediction, but difficult to apply to information retrieval. In addition, CANE adopts a bag-of-words model and ignores word order information.
Using the BERT model and drawing on the idea of the CANE model, a scholar node representation model suitable for information retrieval is provided. Network topology information of the scholars is added to the original word embedding model to obtain scholar representation vectors that represent the scholars more accurately. Assume a scholar cooperative network G = (V, E), where each vertex represents a scholar and each edge e = (u, v) represents the relationship between scholar u and scholar v. The model is shown in FIG. 4; its input is {Tu, Tv, Su, Sv}, representing the text information of scholar u, the text information of scholar v, the node of scholar u, and the node of scholar v, respectively. The calculation flow of the model is as follows:
(1) Encode the text information of scholar u, the text information of scholar v, the node of scholar u, and the node of scholar v, where the text information of scholars u and v is encoded with BERT and the initial encodings of the scholar nodes are randomly generated.
(2) Send each encoding into a convolution layer to obtain the convolved matrices.
(3) Concatenate the convolved matrices to obtain a matrix T.
(4) Encode the matrix T with self-attention to obtain a matrix M; the Transformer can learn the implicit information between texts, between nodes, and between texts and nodes.
(5) Finally, multiply the matrix M by the respective weights to output u_t, v_t, u_s and v_s, which represent the text vector of scholar u, the text vector of scholar v, the network structure vector of scholar u, and the network structure vector of scholar v, respectively.
Unlike the CANE model, in which text embedding and node embedding are learned separately, the node representation model implemented here can learn text information and network structure information at the same time through a self-attention layer, so the text and network structure information can be mapped into the same vector space. In the model training process, scholar pairs (u, v) are input and the model is updated. In addition, special data, namely empty node pairs, such as (scholar u, empty node) and (scholar v, empty node), are added to the training input so that a scholar can also be encoded independently.
Second, the loss function of the CANE model focuses more on the network information of the nodes than on the text information. The model implemented here places the emphasis of the optimization objective on the text information. The overall objective of the model is shown in formula (5):
ε = Σ_{e∈E} L(e)   (5)
The scholar vector is composed of a paper information vector and a network structure vector. L_s(e) and L_t(e) are the optimization objective based on the network structure and the optimization objective based on the text, respectively, combined as shown in formula (6):
L(e) = L_s(e) + L_t(e)   (6)
For the network structure objective L_s(e), neighboring nodes are assumed to have similar structures. Assuming there is an edge between scholar nodes u and v, with v_s the structure vector of node v, the specific formula is shown in formula (7):
L_s(e) = w_{u,v} * log p(v_s | u_s)   (7)
For the text objective L_t(e), the model takes scholar u and paper p as input to predict whether u wrote paper p. The objective is to make the node representation of a scholar closer to its text representation; since the papers of the current scholar and those of the collaborators have a certain similarity, the specific formulas are shown in formulas (8)-(10):
L_p(e) = w_{u,v} * log p(v_t | v_s)   (8)
L_tt(e) = w_{u,v} * log p(v_t | u_t)   (9)
L_t(e) = α*L_tt(e) + β*L_p(e)   (10)
where v_t and u_t are the text representations of scholars v and u, respectively. The loss function implemented here pays more attention to the text information of the scholars than to the network information; in actual tests, the scholar node representations of this model achieve better results in retrieval.
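A condensed PyTorch sketch of the forward pass (1)-(5) follows; the layer sizes, the use of nn.MultiheadAttention for the self-attention step, and the pooling of M into the four output vectors are illustrative assumptions, and the BERT token encodings are taken as precomputed inputs.

```python
import torch
import torch.nn as nn

class ScholarNodeModel(nn.Module):
    # Sketch of steps (1)-(5): encode, convolve, concatenate into T,
    # self-attend into M, project to (u_t, v_t, u_s, v_s).
    def __init__(self, num_scholars, dim=128):
        super().__init__()
        self.node_emb = nn.Embedding(num_scholars, dim)  # random initial node codes
        self.conv = nn.Conv1d(dim, dim, kernel_size=3, padding=1)
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.out = nn.Linear(dim, dim)                   # the output weights

    def forward(self, t_u, t_v, s_u, s_v):
        # t_u, t_v: BERT encodings (batch, seq, dim); s_u, s_v: node IDs (batch,)
        conv = lambda t: self.conv(t.transpose(1, 2)).transpose(1, 2)  # step (2)
        T = torch.cat([conv(t_u), conv(t_v),
                       self.node_emb(s_u).unsqueeze(1),
                       self.node_emb(s_v).unsqueeze(1)], dim=1)        # step (3)
        M, _ = self.attn(T, T, T)                                      # step (4)
        M = self.out(M)                                                # step (5)
        n_u, n_v = t_u.size(1), t_v.size(1)
        u_t = M[:, :n_u].mean(dim=1)             # scholar u text vector
        v_t = M[:, n_u:n_u + n_v].mean(dim=1)    # scholar v text vector
        u_s, v_s = M[:, -2], M[:, -1]            # network structure vectors
        return u_t, v_t, u_s, v_s
```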
Step four: the scholar vector index is established as follows:
Scholar retrieval is mainly realized through vector retrieval. The time complexity of computing the vector similarities by cosine values is O(n) = n*D, where n is the number of vectors and D is the dimension of the vectors; the larger n and D, the higher the complexity. Building a vector index compresses the vectors and accelerates retrieval.
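For reference, the O(n*D) brute-force cosine baseline that the index is meant to replace can be written in a few lines of numpy (a sketch):

```python
import numpy as np

def brute_force_search(q, vectors, top_n=10):
    # O(n*D): cosine similarity of query q (D,) against all n vectors (n, D).
    sims = vectors @ q / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(q))
    return np.argsort(-sims)[:top_n]   # indices of the top_n most similar scholars
```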
(1) Scholar node vector indexing
The scholar vectors are indexed with the product quantization method integrated in Milvus. The principle of the algorithm is to divide the original vector space into different subspaces, cluster each subspace, represent the original vector by the cluster centers of the subspaces, and obtain the similarity of the original vectors by computing the similarities within the different subspaces.
The vector index training process is shown in FIG. 5. The main calculation processes of model training are vector clustering, distance calculation between cluster centers, and mapping of the original vectors to the cluster centers. Vector clustering adopts the Kmeans algorithm; according to the principle of Kmeans, the time complexity of clustering is shown in formula (11):
O(n) = l*n*k*d   (11)
where l is the number of iterations, n is the number of vectors, k is the number of cluster centers, and d is the dimension of the vectors.
From the vector distance calculation formula, the time complexity of computing the distances between cluster centers is O(n) = k*k*d, and the time complexity of mapping the original vectors to their cluster centers is O(n) = n*k*d. The training uses m subspaces in total, each of dimension d = D/m, where D is the dimension of the original vector. The total complexity of training is therefore shown in formula (12):
O(n) = m*(l*n*k*D/m) + m*(k*k*D/m) + m*(n*k*D/m)
     = D*k*(n*l + k + n)   (12)
As shown in formula (12), when the number of iterations l and the number of cluster centers k of the clustering model are constant, the time complexity simplifies to O(n) = D*n, so the time complexity of product quantization model training is only related to the vector dimension and the number of vectors. After training, the clustering models of the different subspaces are saved, and the distances between cluster centers are stored in a table for lookup during vector retrieval.
Algorithm 1: the pseudocode of the product quantization algorithm is shown in FIG. 6.
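Since FIG. 6 is not reproduced here, the following numpy/scikit-learn sketch follows the training description above: the D-dimensional vectors are split into m subspaces, Kmeans is run in each, and the center-to-center distance tables are precomputed. The parameter values and the use of scikit-learn's KMeans in place of the Milvus-integrated implementation are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def train_pq(vectors, m=8, k=256):
    # vectors: (n, D) scholar vectors; D is assumed divisible by m.
    n, D = vectors.shape
    d = D // m                          # each subspace has dimension d = D/m
    models, codes, tables = [], [], []
    for i in range(m):
        sub = vectors[:, i * d:(i + 1) * d]
        km = KMeans(n_clusters=k, n_init=4).fit(sub)   # cluster the subspace
        models.append(km)
        codes.append(km.labels_)        # map each vector to its cluster center
        c = km.cluster_centers_
        # k x k table of squared distances between cluster centers
        tables.append(((c[:, None, :] - c[None, :, :]) ** 2).sum(axis=-1))
    return models, np.stack(codes, axis=1), tables     # codes: (n, m)
```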
Step five: scholar vector retrieval is performed as follows:
After the scholar vector index is established, the user inputs a retrieval subject and the system returns the relevant scholars. The specific retrieval steps are:
(1) Vectorize the user's retrieval subject through the scholar representation model and output a retrieval vector Q.
(2) Read the scholar vector index and the trained cluster centers.
(3) Divide the retrieval vector Q into m subspaces and find the cluster center corresponding to each subspace.
(4) Look up, in the index table, the approximate distances between the retrieval vector Q and the indexed vectors in the m subspaces.
(5) The sum of the approximate distances of all subspaces is the approximate distance between the retrieval vector Q and each scholar to be retrieved.
(6) Sort the approximate distances of all scholars and return the top n results.
From the above, the time complexity of product quantization retrieval is O(n) = n*m; when D is much larger than m, the time complexity of product quantization is much smaller than that of directly computing cosine values.
Algorithm 2: the pseudocode of the product quantization similarity calculation algorithm is shown in FIG. 7.
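Likewise, in place of the FIG. 7 pseudocode, the following sketch implements retrieval steps (1)-(6) on top of the train_pq output above; the use of the symmetric (center-to-center) table-lookup distance is an assumption about the variant intended.

```python
import numpy as np

def pq_search(q, models, codes, tables, top_n=10):
    # Quantize query q per subspace, accumulate table-lookup distances
    # to every indexed scholar, sort, and return the top n.
    m = len(models)
    d = q.shape[0] // m
    q_codes = [models[i].predict(q[i * d:(i + 1) * d].reshape(1, -1))[0]
               for i in range(m)]                      # step (3)
    dist = np.zeros(codes.shape[0])
    for i in range(m):                                 # steps (4)-(5)
        dist += tables[i][q_codes[i], codes[:, i]]
    return np.argsort(dist)[:top_n]                    # step (6)
```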
Example 2: Referring to FIG. 8, a scholar retrieval device based on a cooperative network is shown. The overall functional structure of the system can be divided, by function, into a data storage module, a background management module, and a user retrieval module. This section explains the functional design of the three modules.
1. Data storage module. It stores the data required by the system: scholar vector data for scholar retrieval, and unstructured data for displaying scholars, institutions, and outputs. According to the data characteristics, storage is divided into two components: a Milvus component that stores the vector indexes, and an ES component that stores the unstructured data.
2. Background management module. It preprocesses the collected scholar and scientific research data, constructs the scholar cooperative network, trains the model, and builds the data index. The data preprocessing component reads new data from the database; its main work includes data cleaning and scholar cooperative network construction. The model training component trains the scholar node representation model and generates the scholar vectors. The index component passes the scholar vectors through the product quantization model to generate the vector index.
3. User retrieval module. Its main functions are scholar retrieval and information display. The retrieval component is responsible for similarity calculation and ranking, and the visualization component is responsible for information display and user interaction.
The specific functions of each class are described below:
The View class parses data, renders it into visual pages, and presents it to the user.
The Controller class receives user request parameters, parses them, calls the relevant services in the system, and returns the results to the View class.
The User class encapsulates user data and controls user login, registration, and permissions.
The TextPreprocess class performs data preprocessing, such as data consistency checks and scholar cooperative network construction.
The Model class is mainly responsible for training and saving the model.
The Index class handles the versioning and loading of the model.
The Timer class handles the system's timed tasks, such as model training and updating.
The Connect class is responsible for linking to and reading data, e.g. connections to databases such as MySQL and ES.
The Index class is also responsible for loading the model and the index.
The Retrieval class is the core of retrieval; it is mainly responsible for similarity calculation and result ranking.
Experiment design and result analysis:
In order to demonstrate the advancement of the retrieval method based on the scholar cooperative network, comparison experiments were designed to compare the experimental effects.
Test environment:
1. Hardware environment
CPU model: Intel(R) Core(TM) i5-6500 CPU
Memory capacity: 16.0GB
Hard disk information: 240GB SSD
2. Software environment
Operating system: Microsoft Windows 10
Databases: Milvus, MySQL, ES
Development tool: PyCharm (Python IDE)
Programming language: Python 3.6
Browser: Chrome 75.0.3770.100
3. Test tools and data
Test tools: pytest, BurnTest, JMeter, etc.
Cora dataset: 2277 nodes, 2277 texts, 5214 edges
Dataset collected by the system: 6,000 scholars, 161 universities, and 80,000 scholarly outputs (treatises, patents, projects, etc.)
System algorithm tests:
The system algorithm tests aim to verify the advancement and feasibility of the algorithms. The main algorithms tested are the scholar node representation model and the word-embedding-based scholar retrieval model.
(1) Scholar node representation model test
The scholar node vector representation model test verifies whether the model represents scholars more accurately. The dataset for this test is the dataset collected by the system. In a word embedding model, words with similar meanings are distributed relatively close together in space; therefore, the quality of the model can be judged from the spatial distribution of the scholars in the vector space. The silhouette coefficient is one such evaluation method; it combines the two factors of cohesion and separation to evaluate the model. The silhouette coefficient of a vector i in a cluster is calculated as shown in formula (11):
S(i) = (b(i) - a(i))/max{a(i), b(i)}   (11)
where a(i) is the average distance from vector i to the other points in the same cluster, and b(i) is the minimum average distance from vector i to the points of any other cluster. The silhouette coefficient lies in [-1, 1]; a value close to 1 means that both cohesion and separation are relatively good. The average of the silhouette coefficients of all vectors is the evaluation index of the model.
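The average silhouette coefficient of formula (11) over all vectors can be computed directly with scikit-learn, as sketched below; the random stand-in data and the use of discipline labels as cluster assignments are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import silhouette_score

vectors = np.random.rand(100, 64)            # stand-in scholar vectors (n, dim)
labels = np.random.randint(0, 5, size=100)   # stand-in cluster (discipline) labels
# Mean S(i) over all vectors, in [-1, 1]; higher is better.
print(silhouette_score(vectors, labels))
```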
In comparison experiments with other models, Table 1 shows that the scholar node representation model implemented here obtains a higher silhouette coefficient at different vector dimensions, indicating that its clustering effect is better than that of the other models.
TABLE 1. Silhouette coefficient comparison
The vector space is reduced to two dimensions by the TSNE algorithm so that the model effect can be observed more intuitively. The specific effect is shown in FIG. 10: scholars of different disciplines are distributed in clusters in the vector space, forming a clustering effect automatically. Compared with the BERT model, under the same topic the scholars gather better, scholars of different disciplines overlap less, and the discrimination is larger. The model implemented here adds node information on the basis of word embedding, and scholars in the same discipline have more cooperation relationships; therefore scholars in the same discipline are spatially closer in the vector representation, and the clustering effect is better.
(2) Scholar retrieval model based on word embedding;
The word-embedding-based scholar retrieval model test verifies whether the model can guarantee retrieval precision and speed compared with the traditional retrieval mode. The test data is the dataset collected by the system. In a single query test, results can be classified into four categories: retrieved and relevant (RR), retrieved and not relevant (RN), not retrieved but relevant (NR), and not retrieved and not relevant (NN). The precision ratio and recall ratio of a retrieval result are then defined as shown in formulas (12) and (13):
precision = RR/(RR + RN)   (12)
recall = RR/(RR + NR)   (13)
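A minimal sketch of formulas (12) and (13) for a single query, with the retrieved and relevant results given as ID sets:

```python
def precision_recall(retrieved, relevant):
    # RR = retrieved and relevant; formulas (12) and (13).
    rr = len(retrieved & relevant)
    precision = rr / len(retrieved) if retrieved else 0.0   # RR / (RR + RN)
    recall = rr / len(relevant) if relevant else 0.0        # RR / (RR + NR)
    return precision, recall

print(precision_recall({1, 2, 3, 4}, {2, 4, 5}))  # -> (0.5, 0.666...)
```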
On the basis of the given evaluation indexes, comparison experiments between the word-embedding-based scholar retrieval model and the traditional retrieval model were designed. Neither precision nor recall alone reflects retrieval performance, while a PR curve integrates precision and recall to evaluate the model and represents retrieval performance more accurately. The test results are shown in FIG. 11. The experimental results show that, at almost equal precision ratios, the word-embedding-based retrieval mode is superior or equal to the traditional retrieval mode under different conditions, and it maintains a higher precision ratio as the recall ratio increases. Overall, the upper curve in the PR graph represents the superior system.
P@5, P@10, and P@20 were calculated for each query result, where P@n is the precision of the first n returned results and R-pre is the recall; the experimental results are shown in FIG. 12. The results show that the word-embedding-based retrieval mode guarantees high accuracy for different numbers of retrieval results, and its recall ratio is superior to the other models.
The performance of the different models was verified at different data scales, and the test results are shown in FIG. 13. Although the retrieval time of the word-embedding-based retrieval model is slightly higher than that of the other models, it can still meet users' requirements.
Comparison of the three experiments shows that the word-embedding-based retrieval mode meets the precision requirement while improving retrieval recall, at the cost of slightly slower retrieval. The traditional retrieval model adopts keyword matching with a bag-of-words model and therefore cannot retrieve data that is related but worded differently. The word-embedding-based scholar retrieval model makes full use of semantic information, can retrieve related data, and improves the recall ratio of the retrieval model.
It should be noted that the above-mentioned embodiments are only preferred embodiments of the present invention and are not intended to limit the scope of the present invention; all equivalent substitutions or replacements made on the basis of the above technical solutions fall within the scope of the present invention.

Claims (7)

1. A scholar retrieval method based on a cooperative network, the method comprising the following steps:
step one: building a scholar cooperative network;
step two: optimizing the scholar cooperative network structure;
step three: training a scholar node representation model;
step four: establishing a scholar vector index;
step five: scholar retrieval.
2. The scholar retrieval method based on a cooperative network as claimed in claim 1, wherein step one, building the scholar cooperative network, specifically comprises: reading scholar data and scholar cooperative output data from a database, the scholar data comprising ID, author, organization, age, job title, and profile;
the scholar cooperative output data comprising ID, title, author, organization, abstract, journal, year, and keywords; after the data is read, it is preprocessed and the scholar cooperative network is constructed; the scholar cooperative network takes scholars as nodes and the quantity of cooperative outputs between scholars as edge weights; the construction process mainly uses a network toolkit: the cooperative outputs of the scholars are input first, the adjacency matrix of scholar cooperation is initialized according to the number of scholars counted from the input, the participants of each paper or project are then processed in a loop to update the adjacency matrix, and the adjacency matrix finally output is the scholar cooperative network.
3. The scholar retrieval method based on a cooperative network as claimed in claim 1, wherein step two, the optimization of the scholar cooperative network structure, is as follows:
step 1) logarithmically normalizing the weights of the edges,
step 2) calculating the pairwise topological similarity, text similarity, and path distance between scholars,
step 3) calculating the pairwise total similarity between scholars and sorting by similarity,
step 4) selecting the scholar pairs in the top 10 percent of similarity and adding cooperation relationships between them,
step 5) selecting the scholar pairs in the bottom 10 percent of similarity and deleting the cooperation relationships between them,
step 6) updating the scholar cooperative network and its adjacency matrix,
wherein the edge weights are normalized in step 1) to narrow the gap in the number of cooperations between different scholars: the more cooperations, the higher the similarity of the scholars and the smaller the distance between them should be; the calculation formula is shown in formula (1):
p_ij = 1/ln(p_ij + 1)   (1)
where p_ij is the weight of the edge;
the topology-based similarity represents the similarity between nodes in the graph in terms of topological structure: if two nodes have common neighbor nodes, they are more likely to be similar and to have a cooperation relationship; the calculation formula is shown in formula (2):
topoSim(u,v) = 2*|N(u)∩N(v)|/(d(u)+d(v))   (2)
where N(u) is the set of collaborators of scholar u, N(u)∩N(v) is the set of common collaborators of scholars u and v, d(u) is the degree of scholar u, and d(v) is the degree of scholar v;
the text similarity represents the similarity of papers between scholars: the more similar the papers published by two scholars, the more potential there is for a cooperation relationship; the calculation formula is shown in formula (3), the cosine similarity of the text vectors:
textSim(u,v) = (x_u · x_v)/(|x_u|*|x_v|)   (3)
where x_u is the text vector of scholar u, the text information being encoded with the BERT model;
the scholar similarity combines the topological structure of the network, node attribute information, and the path distance between scholars; the calculation formula is shown in formula (4):
authorSim(u,v) = textSim(u,v)*topoSim(u,v)/dist(u,v)   (4)
where authorSim(u,v) is the similarity between scholars u and v and dist(u,v) is the shortest path distance between them; when computing the shortest path between scholars u and v, if an edge already exists between the two scholar nodes, it should be deleted before the shortest path is computed.
4. The scholar retrieval method based on a cooperative network as claimed in claim 1, wherein step three, training the scholar node representation model, specifically comprises:
adding network topology information of the scholars to the original word embedding model to obtain scholar representation vectors that represent the scholars more accurately; letting the scholar cooperative network be G = (V, E), where each vertex represents a scholar and each edge e = (u, v) represents the relationship between scholar u and scholar v; the input of the model being {Tu, Tv, Su, Sv}, representing the text information of scholar u, the text information of scholar v, the node of scholar u, and the node of scholar v, respectively; the calculation flow of the model being as follows:
(1) encode the text information of scholar u, the text information of scholar v, the node of scholar u, and the node of scholar v, where the text information of scholars u and v is encoded with BERT and the initial encodings of the scholar nodes are randomly generated;
(2) send each encoding into a convolution layer to obtain the convolved matrices;
(3) concatenate the convolved matrices to obtain a matrix T;
(4) encode the matrix T with self-attention to obtain a matrix M; the Transformer can learn the implicit information between texts, between nodes, and between texts and nodes;
(5) finally, multiply the matrix M by the respective weights to output u_t, v_t, u_s and v_s, which represent the text vector of scholar u, the text vector of scholar v, the network structure vector of scholar u, and the network structure vector of scholar v, respectively;
the model places the emphasis of the optimization objective on the text information; the optimization objective of the model is shown in formula (5):
ε = Σ_{e∈E} L(e)   (5)
the scholar vector is composed of a paper information vector and a network structure vector; L_s(e) and L_t(e) are the optimization objective based on the network structure and the optimization objective based on the text, respectively, combined as shown in formula (6):
L(e) = L_s(e) + L_t(e)   (6)
for the network structure objective L_s(e), neighboring nodes have similar structures; assuming there is an edge between scholar nodes u and v, with v_s the structure vector of node v, the specific formula is shown in formula (7):
L_s(e) = w_{u,v} * log p(v_s | u_s)   (7)
for the text objective L_t(e), the model takes scholar u and paper p as input to predict whether u wrote paper p; the specific formulas are shown in formulas (8)-(10):
L_p(e) = w_{u,v} * log p(v_t | v_s)   (8)
L_tt(e) = w_{u,v} * log p(v_t | u_t)   (9)
L_t(e) = α*L_tt(e) + β*L_p(e)   (10)
where v_t and u_t are the text representations of scholars v and u, respectively.
5. The scholar retrieval method based on a cooperative network as claimed in claim 1, wherein step four, the establishment of the scholar vector index, is as follows:
scholar retrieval is mainly realized through vector retrieval; the time complexity of computing the vector similarities by cosine values is O(n) = n*D, where n is the number of vectors and D is the dimension of the vectors, and the larger n and D, the higher the complexity; building a vector index compresses the vectors and accelerates retrieval;
(1) scholar node vector indexing;
the scholar vectors are indexed with the product quantization method integrated in Milvus; the principle of the algorithm is to divide the original vector space into different subspaces, cluster each subspace, represent the original vector by the cluster centers of the subspaces, and obtain the similarity of the original vectors by computing the similarities within the different subspaces;
the main calculation processes of model training are vector clustering, distance calculation between cluster centers, and mapping of the original vectors to the cluster centers; vector clustering adopts the Kmeans algorithm, and according to the principle of Kmeans the time complexity of clustering is shown in formula (11):
O(n) = l*n*k*d   (11)
where l is the number of iterations, n is the number of vectors, k is the number of cluster centers, and d is the dimension of the vectors;
from the vector distance calculation formula, the time complexity of computing the distances between cluster centers is O(n) = k*k*d, and the time complexity of mapping the original vectors to their cluster centers is O(n) = n*k*d; the training uses m subspaces in total, each of dimension d = D/m, where D is the dimension of the original vector; therefore, the total complexity of training is shown in formula (12):
O(n) = m*(l*n*k*D/m) + m*(k*k*D/m) + m*(n*k*D/m)
     = D*k*(n*l + k + n)   (12)
as shown in formula (12), when the number of iterations l and the number of cluster centers k of the clustering model are constant, the time complexity simplifies to O(n) = D*n, so the time complexity of product quantization model training is only related to the vector dimension and the number of vectors; after training, the clustering models of the different subspaces are saved, and the distances between cluster centers are stored in a table for lookup during vector retrieval.
6. The learner retrieval method based on a cooperative network as claimed in claim 1, wherein in step five the learner vector retrieval is performed as follows:
after the learner vector index has been established, the user inputs a retrieval topic and the system returns the relevant learners; the specific retrieval steps are as follows:
(1) the user's retrieval topic is vectorized by the learner representation model, outputting a retrieval vector Q;
(2) the learner vector index and the trained cluster centers are loaded;
(3) the retrieval vector Q is divided into m subspaces, and the cluster center corresponding to each subspace is found;
(4) the approximate distances between the retrieval vector Q and the indexed learner vectors in the m subspaces are obtained by looking up the precomputed distance table;
(5) the approximate distances of all subspaces are summed to obtain the approximate distance between the retrieval vector Q and each learner to be retrieved;
(6) the approximate distances of all learners are sorted and the top n results are returned;
from the above, product quantization retrieval costs roughly $O(n \cdot m)$ table look-ups per query (plus the cost of quantizing the query itself), so when D is much larger than m the time complexity of product quantization is much smaller than the direct computation of cosine values, as the sketch below illustrates.
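Continuing the training sketch above (and reusing its codebooks, codes and tables), the table-lookup search of steps (1) to (6) can be sketched as follows; the symmetric-distance variant shown, in which the query is itself quantized per subspace, follows the steps as written, and all names are illustrative:

import numpy as np

def pq_search(query, codebooks, codes, tables, topn=10):
    m = len(codebooks)
    d = codebooks[0].shape[1]
    dist = np.zeros(codes.shape[0], dtype=np.float32)
    for i in range(m):
        sub_q = query[i*d:(i+1)*d]
        # step (3): nearest cluster center of subspace i for the query
        q_code = ((codebooks[i] - sub_q) ** 2).sum(axis=1).argmin()
        # steps (4)-(5): one O(1) table look-up per indexed vector
        dist += tables[i][q_code, codes[:, i]]
    return np.argsort(dist)[:topn]               # step (6): top-n learners

query = np.random.rand(64).astype(np.float32)
print(pq_search(query, codebooks, codes, tables, topn=5))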
7. A device for realizing the learner retrieval method based on a cooperative network according to any one of claims 1 to 6, characterized in that the device comprises a data storage module, a background management module and a user retrieval module;
the data storage module stores the data required by the system, namely the learner vector data used for learner retrieval and the unstructured data used to display learners, institutions and research achievements; according to the data characteristics it is divided into two parts: a Milvus component that stores the vector index, and an ES component that stores the unstructured data;
the background management module preprocesses the collected learner and scientific-research data, constructs the learner cooperation network, trains the model and builds the data index; the data preprocessing component reads new data from the database, with data cleaning and construction of the learner cooperation network as its main work; the model training component trains the learner node representation model and generates the learner vectors; the indexing component passes the learner vectors through the product quantization model to generate the vector index;
the main functions of the user retrieval module are learner retrieval and information display; the retrieval component is responsible for similarity calculation and ranking, and the visualization component is responsible for information display and user interaction.
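A schematic wiring of the three modules might look as follows; every class and method name is a hypothetical placeholder standing in for the Milvus/ES-backed components, not the patent's actual interfaces:

from dataclasses import dataclass, field

@dataclass
class DataStorage:
    vector_index: dict = field(default_factory=dict)   # stands in for Milvus
    documents: dict = field(default_factory=dict)      # stands in for ES

@dataclass
class BackgroundManager:
    store: DataStorage
    def refresh(self, raw_records):
        cleaned = [r for r in raw_records if r.get("title")]  # data cleaning
        # ... build the cooperation network, train the representation model,
        # run product quantization, then publish the refreshed index:
        self.store.vector_index = {r["id"]: r["vector"] for r in cleaned}
        self.store.documents = {r["id"]: r for r in cleaned}

@dataclass
class UserSearch:
    store: DataStorage
    def query(self, q_vector, topn=10):
        # similarity calculation and ranking (a brute-force distance scan
        # stands in here for the product-quantization index)
        ranked = sorted(self.store.vector_index.items(),
                        key=lambda kv: sum((a - b) ** 2
                                           for a, b in zip(kv[1], q_vector)))
        return [self.store.documents[i] for i, _ in ranked[:topn]]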
CN202011420372.1A 2020-12-07 2020-12-07 Student retrieval method and device based on cooperative network Pending CN112732889A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011420372.1A CN112732889A (en) 2020-12-07 2020-12-07 Student retrieval method and device based on cooperative network

Publications (1)

Publication Number Publication Date
CN112732889A true CN112732889A (en) 2021-04-30

Family

ID=75598317

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011420372.1A Pending CN112732889A (en) 2020-12-07 2020-12-07 Student retrieval method and device based on cooperative network

Country Status (1)

Country Link
CN (1) CN112732889A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110295903A1 (en) * 2010-05-28 2011-12-01 Drexel University System and method for automatically generating systematic reviews of a scientific field
WO2017210949A1 (en) * 2016-06-06 2017-12-14 北京大学深圳研究生院 Cross-media retrieval method
CN110717043A (en) * 2019-09-29 2020-01-21 三螺旋大数据科技(昆山)有限公司 Academic team construction method based on network representation learning training
CN111078873A (en) * 2019-11-22 2020-04-28 北京市科学技术情报研究所 Domain expert selection method based on citation network and scientific research cooperation network
CN110929044A (en) * 2019-12-03 2020-03-27 山西大学 Community detection method and device for academic cooperation network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
S RAO CHINTALAPUDI et al.: "A survey on community detection algorithms in large scale real world networks", 2015 2ND INTERNATIONAL CONFERENCE ON COMPUTING FOR SUSTAINABLE GLOBAL DEVELOPMENT (INDIACOM), 4 May 2015 (2015-05-04), pages 1323 - 1327 *
ZHANG Yujie et al.: "Research on group recommendation systems and their applications", Chinese Journal of Computers, vol. 39, no. 4, 30 April 2016 (2016-04-30), pages 745 - 764 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113420328A (en) * 2021-06-23 2021-09-21 鹤壁国立光电科技股份有限公司 Big data batch sharing exchange system
CN113420328B (en) * 2021-06-23 2023-04-28 鹤壁国立光电科技股份有限公司 Big data batch sharing exchange system

Similar Documents

Publication Publication Date Title
CN109271506A (en) A kind of construction method of the field of power communication knowledge mapping question answering system based on deep learning
CN109885672A (en) A kind of question and answer mode intelligent retrieval system and method towards online education
WO2020010834A1 (en) Faq question and answer library generalization method, apparatus, and device
CN111444348A (en) Method, system and medium for constructing and applying knowledge graph architecture
CN110737805B (en) Method and device for processing graph model data and terminal equipment
CN114238653B (en) Method for constructing programming education knowledge graph, completing and intelligently asking and answering
CN115526590B (en) Efficient person post matching and re-pushing method combining expert knowledge and algorithm
CN115982338B (en) Domain knowledge graph question-answering method and system based on query path sorting
CN116975256B (en) Method and system for processing multisource information in construction process of underground factory building of pumped storage power station
CN117743315B (en) Method for providing high-quality data for multi-mode large model system
CN112131261A (en) Community query method and device based on community network and computer equipment
CN113988071A (en) Intelligent dialogue method and device based on financial knowledge graph and electronic equipment
CN114329181A (en) Question recommendation method and device and electronic equipment
CN113066358B (en) Science teaching auxiliary system
CN114511085A (en) Entity attribute value identification method, apparatus, device, medium, and program product
CN112396092B (en) Crowdsourcing developer recommendation method and device
CN117875412A (en) Method for constructing computer education knowledge graph based on knowledge graph
CN118035440A (en) Enterprise associated archive management target knowledge feature recommendation method
CN112732889A (en) Student retrieval method and device based on cooperative network
CN117391497A (en) News manuscript quality subjective and objective scoring consistency evaluation method and system
CN117494760A (en) Semantic tag-rich data augmentation method based on ultra-large-scale language model
CN115660695A (en) Customer service personnel label portrait construction method and device, electronic equipment and storage medium
CN112200474A (en) Teaching quality evaluation method, terminal device and computer readable storage medium
Yu et al. The application of data mining technology in employment analysis of university graduates
CN117993876B (en) Resume evaluation system, method, device and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination