CN114818737B - Method, system and storage medium for extracting semantic features of scientific and technological paper data text - Google Patents
- Publication number: CN114818737B (application CN202210745539.4A)
- Authority
- CN
- China
- Prior art keywords
- scientific
- matrix
- paper
- semantic
- technological
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Artificial Intelligence (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Life Sciences & Earth Sciences (AREA)
- Evolutionary Computation (AREA)
- Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Computing Systems (AREA)
- Molecular Biology (AREA)
- Biophysics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Biomedical Technology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Machine Translation (AREA)
Abstract
The invention provides a method, a system, and a storage medium for extracting semantic features from the text of scientific and technical paper data. The method comprises the following steps: acquiring text information of a scientific and technical paper, and constructing an entity relationship graph based on the acquired text information, wherein the text information comprises the paper title and keywords, each node in the entity relationship graph is a paper title or a keyword, and each edge is an association relation between nodes; extracting semantic features from the acquired text information to obtain a semantic feature matrix; determining an original adjacency matrix from the entity relationship graph, and inputting the semantic feature matrix and the original adjacency matrix into a graph network model to obtain a spatial feature matrix; and fusing the semantic feature matrix with the spatial feature matrix to obtain the final semantic features of the paper. On the basis of extracting semantic features from the paper corpus, the method exploits the spatial correlations of the knowledge graph, and can therefore extract the semantic features of scientific and technical papers well.
Description
Technical Field
The invention relates to the field of computer technology, and in particular to a method, a system, and a storage medium for extracting semantic features from the text of scientific and technical paper data.
Background
Scientific and technical papers are an important vehicle for presenting research results and obtaining information: a large number of papers are published almost every day, and these academic results contain much of the latest information in specialized fields, so acquiring papers efficiently and learning semantic feature representations for them is particularly important. However, paper data often contain many complex attributes, such as abstracts, keywords, and cited documents; the associations between papers are tight; and the specialized knowledge involved covers a wide range of subjects, so feature extraction for scientific and technical papers requires a large amount of domain knowledge. Effectively extracting features from paper data can support further processing of scientific and technical paper data.
TF-IDF (term frequency-inverse document frequency) is a traditional text feature extraction method. It uses term frequency and inverse document frequency to represent a document as a multidimensional vector of keyword weights, a typical vector space model. Mikolov et al. introduced the word vector representation model Word2Vec, based on the continuous bag-of-words (CBOW) and Skip-Gram models, after which the NLP field quickly entered the embedding era. Traditional encoding relied mainly on one-hot coding, which yields sparse vectors; the word vectors trained by Word2Vec are low-dimensional and dense, make effective use of a word's context, and carry richer semantic information. Li et al. used Word2Vec to bridge the semantic gap and applied TF-IDF weighting to HTTP traffic, constructing low-dimensional paragraph vector representations to reduce complexity. However, once Word2Vec is trained, the semantic vector of each word is fixed and cannot vary with context. To address this lack of context adaptation, Peters et al. proposed the ELMo model: unlike static word-embedding models, whose semantic vectors never change, ELMo is first pre-trained on a large-scale corpus and then fine-tuned for a specific application field, achieving domain adaptation, so that a word obtains a context-dependent vector.
GPT likewise obtains a pre-trained model from a large corpus and is then fine-tuned on a small-scale corpus. Its main difference from ELMo is the network structure used for feature extraction: GPT uses the Transformer while ELMo uses an LSTM. The Transformer is an end-to-end sequence model proposed by Google, and many improved variants of it have been widely applied in natural language processing and even in image processing. Unlike traditional sequence models, the Transformer is built entirely on attention mechanisms, the whole network consisting of encoder and decoder blocks. On this basis the BERT model was further proposed: through a masking mechanism, BERT hides some words in the corpus and predicts them as a pre-training task, and its bidirectional encoding effectively extracts the contextual semantics of text.
Traditional pre-trained models mainly use a unidirectional network or simply concatenate two unidirectional networks, structures that cannot effectively mine the contextual relationships in text. BERT instead obtains contextual information in all directions: the whole model is built from deep bidirectional Transformer units, producing a deep bidirectional language representation that fuses contextual information. During pre-training, BERT uses two tasks: the Masked Language Model and Next Sentence Prediction. The Masked Language Model randomly replaces some words in the corpus and has the model predict them, so that the model effectively learns single-sentence features; however, for sentence pairs such as questions and answers, the contextual relationship between sentences must be captured, and the Masked Language Model task tends to extract word-level representations and cannot directly obtain sentence-level features. To let the model understand the relationship between sentences, BERT also pre-trains with Next Sentence Prediction, i.e., judging whether two sentences are contextually related. Through these two pre-training tasks, carried out with a deep network on large-scale corpora, BERT obtains rich text semantic representations and, on that basis, achieved the best performance at the time on many downstream NLP tasks.
Existing methods that extract features with a pre-trained corpus model extract semantic features from the textual context. For scientific and technical papers, however, there are many associations among the attributes of a paper, especially between keywords and titles, and papers can also be related through keyword co-occurrence; these attributes carry the main semantic information of a paper. Existing feature extraction methods ignore the associations among paper attributes and cannot extract features from the context and the inter-paper relationships at the same time, so they cannot extract the semantic features of scientific and technical papers well. How to better extract these semantic features is therefore a technical problem urgently needing a solution.
Disclosure of Invention
In view of the above, the present invention provides a method, a system, and a storage medium for extracting semantic features of a scientific and technological paper data text, so as to solve one or more problems in the prior art.
According to one aspect of the invention, the invention discloses a method for extracting semantic features of a scientific and technical paper data text, which comprises the following steps:
acquiring text information of a scientific and technical paper, and constructing an entity relationship graph based on the acquired text information, wherein the text information comprises a paper title and keywords, each node in the entity relationship graph is a paper title or a keyword, and each edge in the entity relationship graph is an association relation between nodes;
extracting semantic features based on the acquired text information of the scientific and technological thesis to obtain a semantic feature matrix;
determining an original adjacency matrix based on the entity relationship graph, and inputting the semantic feature matrix and the original adjacency matrix into a graph network model to obtain a spatial feature matrix;
and performing feature fusion on the semantic feature matrix and the spatial feature matrix to obtain the final semantic features of the scientific and technological paper.
In some embodiments of the present invention, extracting semantic features based on the acquired text information of the scientific and technical paper to obtain a semantic feature matrix comprises:
and inputting the acquired text information of the scientific and technological paper to a BERT model to obtain a semantic feature matrix.
In some embodiments of the present invention, constructing an entity relationship graph based on the obtained text information of the scientific and technical paper includes:
and calculating the correlation between any two nodes through a point-to-point mutual information algorithm based on the acquired text information of the scientific and technological thesis, and constructing an entity relation graph based on the calculated correlation.
In some embodiments of the present invention, the pointwise mutual information is computed as:

PMI(W_i, W_j) = log( p(W_i, W_j) / ( p(W_i) · p(W_j) ) ), with p(W_i, W_j) = C(W_i, W_j)/M, p(W_i) = C(W_i)/M, p(W_j) = C(W_j)/M;

where W_i denotes node i, W_j denotes node j, M denotes the number of paper abstracts, C(W_i, W_j) denotes the number of times nodes i and j co-occur in the abstract of the same paper, C(W_i) denotes the total number of occurrences of node i in the abstracts, and C(W_j) the total number of occurrences of node j in the abstracts.
In some embodiments of the invention, the graph network model comprises a plurality of convolutional layers, the output of each convolutional layer being:

L^(i) = ρ( Â L^(i-1) W^(i) );

where L^(i) is the output of layer i, L^(i-1) is the output of layer i-1, ρ is the activation function, W^(i) is a model parameter matrix, Â = D^(-1/2) (A + I) D^(-1/2) is the normalized (Laplacian-smoothed) adjacency matrix, and D is the degree matrix used for normalization.
In some embodiments of the invention, the method further comprises:
calculating the cosine similarity of any two nodes in the entity relationship graph based on the obtained final semantic features;
obtaining a reconstructed adjacency matrix based on the cosine similarity;
and calculating loss values of the original adjacency matrix and the reconstructed adjacency matrix, and optimizing parameters of the graph network model based on the loss values.
In some embodiments of the invention, the loss function of the graph network model is:

Loss = || S − A ||_F^2;

where S is the cosine similarity matrix, A is the original adjacency matrix, and || · ||_F denotes the Frobenius (F) norm.
In some embodiments of the invention, the semantic feature matrix and the spatial feature matrix are fused by the formula Z = λZ_GCN + (1−λ)Z_BERT; where Z_GCN is the spatial feature matrix, Z_BERT is the semantic feature matrix, and λ is a hyperparameter with λ ∈ (0, 1).
according to another aspect of the present invention, a scientific paper data text semantic feature extraction system is also disclosed, which comprises a processor and a memory, wherein the memory stores computer instructions, the processor is used for executing the computer instructions stored in the memory, and when the computer instructions are executed by the processor, the system realizes the steps of the method according to any one of the above embodiments.
According to yet another aspect of the present invention, a computer-readable storage medium is also disclosed, on which a computer program is stored, characterized in that the program, when executed by a processor, implements the steps of the method according to any of the embodiments above.
The invention discloses a method and a system for extracting semantic features from the text of scientific and technical paper data: a semantic feature matrix of the paper is obtained first, a spatial feature matrix is then determined through a graph network model, and finally the semantic feature matrix and the spatial feature matrix are fused to obtain the final semantic features of the paper. On the basis of extracting semantic features from the paper corpus, the method and the system enrich the semantic representation of the paper with the spatial correlations of the knowledge graph, and can therefore better extract the semantic features of scientific and technical papers.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
It will be appreciated by those skilled in the art that the objects and advantages that can be achieved with the present invention are not limited to the specific details set forth above, and that these and other objects that can be achieved with the present invention will be more clearly understood from the detailed description that follows.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the principles of the invention. The components in the figures are not necessarily to scale, emphasis instead being placed upon illustrating the principles of the invention. For purposes of illustrating and describing some portions of the present invention, corresponding parts may be exaggerated in the drawings, i.e., may be larger relative to other components in an exemplary device actually made according to the present invention. In the drawings:
fig. 1 is a schematic flow chart of a method for extracting semantic features of a scientific and technical paper data text according to an embodiment of the present invention.
Fig. 2 is a schematic flow chart of a method for extracting semantic features of a scientific paper data text according to another embodiment of the present invention.
Fig. 3 is a schematic structural diagram of a system for extracting semantic features of a scientific paper data text according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the embodiments of the present invention are further described in detail below with reference to the accompanying drawings. The exemplary embodiments and descriptions of the present invention are provided to explain the present invention, but not to limit the present invention.
It should be noted that, in order to avoid obscuring the present invention with unnecessary details, only the structures and/or processing steps closely related to the scheme according to the present invention are shown in the drawings, and other details not so related to the present invention are omitted.
It should be emphasized that the terms "comprises"/"comprising"/"having", when used herein, specify the presence of stated features, elements, steps, or components, but do not preclude the presence or addition of one or more other features, elements, steps, or components.
Hereinafter, embodiments of the present invention will be described with reference to the accompanying drawings. In the drawings, the same reference numerals denote the same or similar parts, or the same or similar steps.
Fig. 1 is a flowchart illustrating a method for extracting semantic features of a text of a scientific and technological paper according to an embodiment of the present invention, and as shown in fig. 1, the method for extracting semantic features of a text of a scientific and technological paper at least includes steps S10 to S40.
Step S10: the method comprises the steps of obtaining text information of a scientific and technological thesis, and constructing an entity relationship graph based on the obtained text information of the scientific and technological thesis, wherein the text information comprises a thesis title and a keyword, nodes in the entity relationship graph are the thesis title or the keyword, and edges in the entity relationship graph are incidence relations among the nodes.
In this step, the text data of the scientific and technical paper may first be preprocessed, and the keywords of the paper then extracted, yielding text information that comprises the title content and the keyword content. Furthermore, based on the structural characteristics of the paper, associations between the paper-title entity and the keywords are established. Specifically, an entity relationship graph G is constructed from the acquired text information; G is an undirected graph whose nodes are titles or keywords and whose edges represent the association relationship between the two corresponding nodes. In constructing the relationship graph G, edges between title nodes and keyword nodes are created from the association relationships between a paper's title and its keywords, or between keywords.
Constructing the entity relationship graph based on the acquired text information may comprise: calculating the correlation between any two nodes with a pointwise mutual information algorithm, and constructing the entity relationship graph based on the calculated correlations. It should be understood that constructing the graph from inter-node correlations is merely an example; the entity relationship graph can also be constructed by other methods, and algorithms other than pointwise mutual information can be used to calculate the correlation between two nodes.
Further, the pointwise mutual information is computed as:

PMI(W_i, W_j) = log( p(W_i, W_j) / ( p(W_i) · p(W_j) ) ), with p(W_i, W_j) = C(W_i, W_j)/M, p(W_i) = C(W_i)/M, p(W_j) = C(W_j)/M;

where W_i denotes node i and W_j denotes node j; illustratively, node i is a title node and node j is a keyword node. PMI(W_i, W_j) is the correlation score between node i and node j, M is the number of titles in the entity relationship graph (which can also be regarded as the number of documents), C(W_i, W_j) is the number of times nodes i and j co-occur in the abstract of the same paper, C(W_i) is the total number of occurrences of node i in the abstracts, and C(W_j) the total number of occurrences of node j in the abstracts. In this embodiment, the association between a paper and a keyword is weighted by a term frequency-inverse document frequency (TF-IDF) index, i.e., the weight of the edge between a title node i and a keyword node j is defined by the TF-IDF score of the keyword in that paper. For associations between keywords, scores are computed with pointwise mutual information (PMI): two nodes whose score exceeds a threshold are considered associated, and an edge is generated between them; conversely, two nodes whose score falls below the threshold are considered unassociated, and no edge is generated between them.
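The PMI computation above can be sketched as follows. This is a minimal illustration, not the patented implementation: the co-occurrence counts C(W_i, W_j) are taken over abstracts (treated as the M "documents"), and the sample token lists are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

def pmi_scores(abstracts):
    """PMI between word pairs co-occurring in the same abstract.

    abstracts: list of token lists, one per paper. M = len(abstracts),
    matching the document-count interpretation of M in the text.
    """
    M = len(abstracts)
    count = Counter()       # C(W_i): number of abstracts containing W_i
    pair_count = Counter()  # C(W_i, W_j): abstracts containing both
    for tokens in abstracts:
        vocab = sorted(set(tokens))
        count.update(vocab)
        pair_count.update(combinations(vocab, 2))
    scores = {}
    for (wi, wj), c_ij in pair_count.items():
        p_ij = c_ij / M
        p_i, p_j = count[wi] / M, count[wj] / M
        # PMI(W_i, W_j) = log( p(W_i, W_j) / (p(W_i) * p(W_j)) )
        scores[(wi, wj)] = math.log(p_ij / (p_i * p_j))
    return scores

abstracts = [["graph", "network", "semantic"],
             ["graph", "semantic"],
             ["network", "model"]]
scores = pmi_scores(abstracts)
# "graph" and "semantic" always co-occur, so their PMI is positive
```

Edges would then be kept only for pairs whose score exceeds the chosen threshold, as described above.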
Step S20: and extracting semantic features based on the acquired text information of the scientific and technical paper to obtain a semantic feature matrix.
In this step, the semantic features of the scientific paper may be extracted based on the pre-trained language model, that is, the text information of the scientific paper acquired in step S10 is input into the pre-trained language model to obtain the semantic feature matrix of the scientific paper.
The pre-trained language model may be a BERT model, a Word2Vec model, an ELMo model, etc. When the pre-trained language model is a BERT model, extracting semantic features from the acquired text information to obtain the semantic feature matrix specifically comprises: inputting the acquired text information of the scientific and technical paper into the BERT model to obtain the semantic feature matrix. The network structure of the BERT model consists mainly of Transformer units and is trained in an unsupervised manner. To train the model, a set of data samples is determined and an initial network model is trained on these samples to optimize the network parameters, yielding the trained BERT model.
Illustratively, the output of the BERT model may be written as X, the semantic feature matrix of the paper: X = [X_doc; X_word] ∈ R^((n_doc + n_word) × d), where d is the embedding dimension, X_doc is the semantic feature matrix corresponding to the titles in the text information, X_word is the semantic feature matrix corresponding to the keywords, n_doc is the number of titles, and n_word is the number of keywords; n_doc + n_word can also be regarded as the total number of nodes in the entity relationship graph.
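The shape of the node feature matrix X can be illustrated with placeholder embeddings; the dimensions n_doc, n_word, and d below are hypothetical (d = 768 matches BERT-base, but any real run would produce X_doc and X_word from the actual model):

```python
import numpy as np

# Hypothetical sizes: 3 title embeddings and 5 keyword embeddings of
# dimension d, stacked into the node feature matrix X = [X_doc; X_word].
n_doc, n_word, d = 3, 5, 768
rng = np.random.default_rng(0)
X_doc = rng.standard_normal((n_doc, d))    # stands in for BERT title vectors
X_word = rng.standard_normal((n_word, d))  # stands in for BERT keyword vectors
X = np.vstack([X_doc, X_word])             # shape (n_doc + n_word, d)
```

The row count n_doc + n_word equals the number of nodes in the entity relationship graph, so X can serve directly as the initial node features of the graph network.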
Step S30: and determining an original adjacency matrix based on the entity relationship diagram, and inputting the semantic feature matrix and the original adjacency matrix into a diagram network model to obtain a spatial feature matrix.
The original adjacency matrix may be written as A; A is an N×N matrix, where N is the number of nodes in the entity relationship graph, so A ∈ R^(N×N). In the adjacency matrix A, if node i and node j are connected then A_ij = 1; if node i and node j are not connected then A_ij = 0; every element of A is therefore 0 or 1. It should be understood that setting the elements of the adjacency matrix to 0 or 1 is only a preferred example; in other application scenarios the values may be adjusted according to the actual application.
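Building the binary adjacency matrix from an edge list is straightforward; a minimal sketch (the edge list is hypothetical):

```python
import numpy as np

def build_adjacency(num_nodes, edges):
    """Binary adjacency matrix A of an undirected entity relationship graph:
    A[i, j] = 1 if nodes i and j are connected, else 0."""
    A = np.zeros((num_nodes, num_nodes))
    for i, j in edges:
        A[i, j] = A[j, i] = 1.0  # undirected graph: symmetric entries
    return A

A = build_adjacency(4, [(0, 1), (1, 2), (2, 3)])
# A is symmetric, with zeros for unconnected pairs such as (0, 3)
```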
In this step, the original adjacency matrix A and the semantic feature matrix obtained in step S20 are input into a trained graph network model for learning, yielding the spatial feature matrix of the paper. The embedded features obtained from the BERT model are treated as the feature representation of the nodes in the relationship graph G, i.e., X is input into the trained graph network model as the initial feature matrix of the nodes. Training the graph network model is similar to training the BERT model: a training sample set containing a plurality of samples is first constructed, and the initial graph network model is trained on these samples to update its network parameters, yielding the trained graph network model.
Illustratively, the graph network model comprises a plurality of convolutional layers; the initial feature matrix and the original adjacency matrix form the input of the graph network model, specifically the input of its first convolutional layer. The output of the i-th convolutional layer of the graph network model can be expressed as:

L^(i) = ρ( Â L^(i-1) W^(i) );

where L^(i) is the output of layer i, L^(i-1) is the output of layer i-1, ρ is the activation function, W^(i) is a model parameter matrix, Â = D^(-1/2) (A + I) D^(-1/2) is the normalized adjacency matrix, and D is the degree matrix. The final output of the graph network model can be expressed as Z_GCN = g(X, A), where A is the original adjacency matrix, X is the semantic feature matrix, and Z_GCN is the spatial feature matrix.
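A forward pass through such convolutional layers can be sketched with dense numpy operations. This is an illustrative sketch only — it assumes ρ = ReLU and random weights, whereas the patent's model would use trained parameters:

```python
import numpy as np

def normalize_adjacency(A):
    """Symmetric normalization D^{-1/2} (A + I) D^{-1/2} with self-loops,
    the standard GCN propagation matrix."""
    A_hat = A + np.eye(A.shape[0])
    d = A_hat.sum(axis=1)                  # degree of each node
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A_hat @ D_inv_sqrt

def gcn_layer(A_norm, L_prev, W):
    """One convolutional layer: L^{(i)} = rho(A_norm @ L^{(i-1)} @ W^{(i)}),
    with rho taken as ReLU for illustration."""
    return np.maximum(0.0, A_norm @ L_prev @ W)

rng = np.random.default_rng(0)
A = np.array([[0.0, 1.0], [1.0, 0.0]])     # toy 2-node graph
X = rng.standard_normal((2, 4))            # initial node features
W1 = rng.standard_normal((4, 8))           # untrained illustrative weights
W2 = rng.standard_normal((8, 3))
A_norm = normalize_adjacency(A)
Z_gcn = gcn_layer(A_norm, gcn_layer(A_norm, X, W1), W2)  # spatial features
```

Stacking two layers, as shown, lets each node's output aggregate information from its two-hop neighborhood.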
Step S40: and performing feature fusion on the semantic feature matrix and the spatial feature matrix to obtain the final semantic features of the scientific and technological thesis.
In this step, the semantic feature matrix obtained in step S20 and the spatial feature matrix obtained in step S30 are fused to obtain the final semantic features of the paper. By combining the strengths of the BERT module and the GCN module, this step exploits large-scale pre-training on a large amount of original paper data and, on the basis of the semantic features extracted from the paper corpus, enriches the semantic representation of the paper with the spatial correlations of the knowledge graph.
In fact, when the GCN output is used directly for downstream classification tasks, the convergence speed and the final classification result are inferior to those of the original BERT feature vectors; the invention therefore constructs a weighted output in the manner of a residual network to combine the GCN feature representation with the semantic feature representation, i.e., the semantic feature matrix and the spatial feature matrix are fused by the formula Z = λZ_GCN + (1−λ)Z_BERT, where Z_GCN is the spatial feature matrix, Z_BERT is the semantic feature matrix, and λ is a hyperparameter with λ ∈ (0, 1). λ controls the weight between the GCN output and the original semantic features: λ = 1 would mean performing representation learning on the paper with the GCN model alone, and λ = 0 would mean using only the BERT module; with λ ∈ (0, 1), the predictions of the two models are balanced, so that the BERT-GCN model can better optimize the output result. The direct BERT branch alongside the GCN ensures that the λ-weighted output is adjusted and optimized towards the target, which helps the multilayer GCN model overcome inherent defects such as vanishing gradients and over-smoothing, thereby achieving better performance. From this embodiment it can be seen that the method weights the original semantic features obtained by BERT and the features output by the graph convolutional network via a residual connection, fully mining both the semantic information and the association information in the paper text data.
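The fusion formula itself is a one-line weighted sum; a minimal sketch (the value λ = 0.7 and the toy matrices are hypothetical):

```python
import numpy as np

def fuse_features(Z_gcn, Z_bert, lam=0.7):
    """Residual-style weighted fusion Z = lam*Z_GCN + (1 - lam)*Z_BERT,
    with the hyperparameter lam restricted to (0, 1)."""
    assert 0.0 < lam < 1.0, "lam must lie strictly between 0 and 1"
    return lam * Z_gcn + (1.0 - lam) * Z_bert

Z_gcn = np.ones((3, 2))    # stand-in spatial feature matrix
Z_bert = np.zeros((3, 2))  # stand-in semantic feature matrix
Z = fuse_features(Z_gcn, Z_bert, lam=0.7)
# every entry of Z equals 0.7, reflecting the 0.7 / 0.3 weighting
```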
In an embodiment, the method for extracting semantic features of a scientific and technological paper data text further includes: calculating the cosine similarity of any two nodes in the entity relationship graph based on the obtained final semantic features; obtaining a reconstructed adjacency matrix based on the cosine similarity; and calculating loss values of the original adjacency matrix and the reconstructed adjacency matrix, and optimizing parameters of the graph network model based on the loss values.
That is, during pre-training, link prediction is adopted as the pre-training task, so that the final output Z can learn the structural features of the graph. For nodes i and j, whose encoded features are Z_i and Z_j, the cosine similarity S_ij of the two points is computed, finally yielding a reconstructed adjacency matrix S. The loss value between the original adjacency matrix A and the reconstructed matrix S is then calculated as the optimization target. Illustratively, the loss function of the graph network model is: Loss = ||S − A||_F²; wherein S is the cosine similarity matrix, A is the adjacency matrix, and ||·||_F denotes the F-norm (Frobenius norm).
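The link-prediction pre-training objective above can be sketched as follows; the helper names and the toy random graph are assumptions, while the squared Frobenius-norm loss follows the F-norm formulation given in the claims:

```python
import numpy as np

def cosine_similarity_matrix(Z):
    """S[i, j] = cosine similarity between node embeddings Z[i] and Z[j]."""
    norms = np.linalg.norm(Z, axis=1, keepdims=True)
    Zn = Z / np.clip(norms, 1e-12, None)  # guard against zero rows
    return Zn @ Zn.T

def reconstruction_loss(S, A):
    """Squared Frobenius-norm distance between reconstructed and original adjacency."""
    return float(np.linalg.norm(S - A, ord="fro") ** 2)

rng = np.random.default_rng(1)
Z = rng.normal(size=(5, 16))                   # final node embeddings (toy values)
A = (rng.random((5, 5)) > 0.5).astype(float)   # toy original adjacency matrix
S = cosine_similarity_matrix(Z)
loss = reconstruction_loss(S, A)
```

In training, `loss` would be minimized with respect to the model parameters that produce Z, pushing the reconstructed adjacency S towards the original graph structure A.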
Fig. 2 is a schematic flow chart of a method for extracting semantic features of scientific paper data text according to another embodiment of the present invention. As shown in Fig. 2, the method first obtains the original scientific paper data; it then extracts the keywords of the scientific papers and constructs keyword relations based on the relations among the keywords; it further extracts the original semantic features of the scientific papers to obtain a semantic feature matrix; spatial correlation features of the scientific papers are then extracted based on the adjacency matrix corresponding to the entity relationship graph and the obtained semantic feature matrix, yielding a spatial feature matrix; finally, the spatial feature matrix and the semantic feature matrix are fused to obtain the final semantic representation of the scientific papers. On the basis of extracting semantic features from the scientific paper corpus, the method enriches the semantic representation of the papers with the spatial correlations of the knowledge graph, thereby better extracting the semantic features of scientific papers.
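The keyword-relation construction step can be sketched with the pointwise mutual information (PMI) weighting formalized in claim 3. Treating each paper's keyword set as one co-occurrence window and thresholding edges at PMI > 0 are assumptions of this sketch, not specified by the patent:

```python
import math
from itertools import combinations
from collections import Counter

def pmi_edges(abstracts, threshold=0.0):
    """Build keyword co-occurrence edges weighted by pointwise mutual information.

    `abstracts` is a list of keyword lists, one per paper; only pairs whose
    PMI exceeds `threshold` become edges (an assumed convention).
    """
    M = len(abstracts)        # number of co-occurrence windows
    count = Counter()         # C(W_i): windows containing keyword W_i
    pair_count = Counter()    # C(W_i, W_j): windows containing both keywords
    for kws in abstracts:
        uniq = sorted(set(kws))
        count.update(uniq)
        pair_count.update(combinations(uniq, 2))
    edges = {}
    for (wi, wj), c_ij in pair_count.items():
        # PMI = log( p(W_i, W_j) / (p(W_i) * p(W_j)) )
        pmi = math.log((c_ij / M) / ((count[wi] / M) * (count[wj] / M)))
        if pmi > threshold:
            edges[(wi, wj)] = pmi
    return edges

docs = [["graph", "bert"], ["graph", "bert"], ["cnn"], ["cnn"]]
edges = pmi_edges(docs)   # "graph" and "bert" always co-occur, so PMI = log 2 > 0
```

A positive PMI indicates the two keywords co-occur more often than chance, which is when an edge is added to the entity relationship graph.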
Correspondingly, the invention also provides a system for extracting semantic features of scientific paper data text, which comprises a processor and a memory, wherein the memory stores computer instructions and the processor is configured to execute the computer instructions stored in the memory; when the computer instructions are executed by the processor, the system implements the steps of the method according to any of the above embodiments. Fig. 3 is a schematic diagram of the architecture of a text semantic feature extraction system for scientific paper data according to another embodiment of the present invention. As shown in Fig. 3, the system first extracts semantic features of the scientific paper data based on a BERT model, then extracts spatial features of the scientific papers based on a GCN network model, and finally performs feature fusion on the semantic features and the spatial features to obtain the final semantic features of the scientific papers.
Through the above embodiments, when performing feature extraction on scientific paper data, the associations between scientific papers are taken into account: a heterogeneous network of paper documents and keywords is constructed using keyword co-occurrence relations, and a graph convolutional network (GCN) is applied to the triples of this paper-keyword heterogeneous network to obtain vector representations of the papers on top of the original semantic features extracted by BERT. That is, in the BERT-GCN model proposed by the invention, the BERT model initializes the representations of the document nodes in the text graph, which serve as inputs to the GCN; the document representations are then iteratively updated by the GCN based on the graph structure, and its outputs are treated as the final representations of the document nodes, so that the model combines the advantages of pre-trained models and graph models.
In addition, the knowledge graph is constructed and trained on corpora from scientific-paper fields that demand a high degree of professional knowledge, which improves the model's feature representation capability in those professional domains. First, a document-keyword heterogeneous graph is constructed for the corpus, in which the nodes are words or documents; the node vectors are initialized with a pre-trained BERT model, and features are aggregated using a graph convolutional network (GCN). By jointly training the BERT module and the GCN module, the method can exploit the advantages of both: large-scale pre-training on a large amount of original scientific paper data, and, on top of the semantic features extracted from the scientific paper corpus, the spatial correlations of the knowledge graph to enrich the semantic representation of the papers.
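The GCN feature aggregation described here (and formalized in claim 4) can be sketched with a single propagation layer. Adding self-loops before normalization follows the common Kipf-Welling convention and is an assumption of this sketch, as are the toy graph and the tanh activation:

```python
import numpy as np

def gcn_layer(A, L_prev, W, rho=np.tanh):
    """One graph-convolution layer: L_i = rho(A_norm @ L_{i-1} @ W).

    A_norm = D^{-1/2} (A + I) D^{-1/2} is the degree-normalized adjacency;
    the self-loop term (A + I) is an assumed convention, not from the patent.
    """
    A_hat = A + np.eye(A.shape[0])
    d = A_hat.sum(axis=1)                      # node degrees
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))     # D^{-1/2}
    A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt   # normalized adjacency
    return rho(A_norm @ L_prev @ W)

rng = np.random.default_rng(2)
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])                   # toy 3-node path graph
X = rng.normal(size=(3, 4))                    # initial node features (e.g. BERT embeddings)
W = rng.normal(size=(4, 2))                    # layer parameters
H = gcn_layer(A, X, W)                         # aggregated, transformed features
```

Stacking several such layers lets each node's representation absorb information from increasingly distant neighbors in the document-keyword graph.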
In addition, the invention also discloses a computer readable storage medium, on which a computer program is stored, which when executed by a processor implements the steps of the method according to any of the above embodiments.
Those of ordinary skill in the art will appreciate that the various illustrative components, systems, and methods described in connection with the embodiments disclosed herein may be implemented as hardware, software, or combinations thereof. Whether this is done in hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention. When implemented in hardware, it may be, for example, an electronic circuit, an application-specific integrated circuit (ASIC), suitable firmware, a plug-in, a function card, or the like. When implemented in software, the elements of the invention are the programs or code segments used to perform the required tasks. The program or code segments may be stored in a machine-readable medium or transmitted by a data signal carried in a carrier wave over a transmission medium or a communication link. A "machine-readable medium" may include any medium that can store or transfer information. Examples of a machine-readable medium include electronic circuits, semiconductor memory devices, ROM, flash memory, erasable ROM (EROM), floppy disks, CD-ROMs, optical disks, hard disks, fiber optic media, radio frequency (RF) links, and so forth. The code segments may be downloaded via computer networks such as the internet, an intranet, etc.
It should also be noted that the exemplary embodiments mentioned in this patent describe some methods or systems based on a series of steps or devices. However, the present invention is not limited to the order of the above-described steps, that is, the steps may be performed in the order mentioned in the embodiments, may be performed in an order different from the order in the embodiments, or may be performed simultaneously.
Features described and/or illustrated with respect to one embodiment may be used in the same or a similar way in one or more other embodiments, and/or in combination with or in place of the features of other embodiments of the present invention.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the present invention, and various modifications and changes may be made to the embodiment of the present invention by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
Claims (6)
1. A method for extracting semantic features of scientific and technological thesis data texts is characterized by comprising the following steps:
acquiring text information of a scientific and technological thesis, and constructing an entity relationship graph based on the acquired text information of the scientific and technological thesis, wherein the text information comprises a thesis title and a keyword, nodes in the entity relationship graph are the thesis title or the keyword, edges in the entity relationship graph are association relations between the nodes, and the entity relationship graph comprises the association relations between the thesis title and the keyword of the scientific and technological thesis;
extracting semantic features based on the acquired text information of the scientific and technical paper to obtain a semantic feature matrix;
determining an original adjacency matrix based on the entity relationship graph, and inputting the semantic feature matrix and the original adjacency matrix into a graph network model to obtain a spatial feature matrix;
performing feature fusion on the semantic feature matrix and the spatial feature matrix to obtain a final semantic feature of the scientific and technological paper;
performing feature fusion on the semantic feature matrix and the spatial feature matrix by the formula Z = λZ_GCN + (1−λ)Z_BERT; wherein Z_GCN is the spatial feature matrix, Z_BERT is the semantic feature matrix, and λ is a hyperparameter with λ ∈ (0, 1);
calculating the cosine similarity of any two nodes in the entity relationship graph based on the obtained final semantic features;
obtaining a reconstructed adjacency matrix based on the cosine similarity;
calculating loss values of the original adjacency matrix and the reconstructed adjacency matrix, and optimizing parameters of the graph network model based on the loss values;
the loss function of the graph network model is as follows:
wherein;Sis a matrix of the cosine similarity, and,Ain the form of an original adjacency matrix,FrepresentsFA norm;
inputting the acquired text information of the scientific and technological thesis into a BERT model to obtain a semantic feature matrix;
wherein the semantic feature matrix is represented as Z_BERT = [X_doc; X_word] ∈ R^((n_doc + n_word) × d); d represents the embedding dimension, X_doc represents the semantic feature matrix corresponding to the paper titles in the text information, X_word represents the semantic feature matrix corresponding to the keywords in the text information, n_doc indicates the number of paper titles in the text information, and n_word represents the number of keywords in the text information.
2. The method for extracting semantic features of scientific paper data text according to claim 1, wherein constructing the entity relationship graph based on the acquired text information of the scientific papers comprises:
calculating the correlation between any two nodes through a pointwise mutual information algorithm based on the acquired text information of the scientific papers, and constructing the entity relationship graph based on the calculated correlation.
3. The method of claim 2, wherein the computation formula of the pointwise mutual information algorithm is as follows:
PMI(W_i, W_j) = log( p(W_i, W_j) / ( p(W_i) · p(W_j) ) );
wherein W_i represents node i, W_j represents node j, p(W_i, W_j) = C(W_i, W_j)/M, p(W_i) = C(W_i)/M, p(W_j) = C(W_j)/M; M represents the length of the text abstract of the scientific paper, C(W_i, W_j) represents the number of co-occurrences of node i and node j in the text abstract of the same scientific paper, C(W_i) represents the number of occurrences of node i in the abstract, and C(W_j) represents the number of occurrences of node j in the abstract.
4. The method as claimed in claim 1, wherein the graph network model comprises a plurality of convolutional layers, and the output of each convolutional layer is:
L^(i) = ρ( Â L^(i−1) W^(i) );
wherein L^(i) is the output of the i-th layer, L^(i−1) is the output of the (i−1)-th layer, ρ is the activation function, W^(i) is the model parameter, Â = D^(−1/2) A D^(−1/2) is the Laplacian transformation of the adjacency matrix, and D is the degree matrix used for normalization.
5. A scientific and technological paper data text semantic feature extraction system, which comprises a processor and a memory, wherein the memory stores computer instructions, the processor is used for executing the computer instructions stored in the memory, when the computer instructions are executed by the processor, the system realizes the steps of the method according to any one of claims 1 to 4.
6. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210745539.4A CN114818737B (en) | 2022-06-29 | 2022-06-29 | Method, system and storage medium for extracting semantic features of scientific and technological paper data text |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114818737A CN114818737A (en) | 2022-07-29 |
CN114818737B true CN114818737B (en) | 2022-11-18 |
Family
ID=82523321
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210745539.4A Active CN114818737B (en) | 2022-06-29 | 2022-06-29 | Method, system and storage medium for extracting semantic features of scientific and technological paper data text |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114818737B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115858725B (en) * | 2022-11-22 | 2023-07-04 | 广西壮族自治区通信产业服务有限公司技术服务分公司 | Text noise screening method and system based on unsupervised graph neural network |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110688474A (en) * | 2019-09-03 | 2020-01-14 | 西北工业大学 | Embedded representation obtaining and citation recommending method based on deep learning and link prediction |
CN110705260A (en) * | 2019-09-24 | 2020-01-17 | 北京工商大学 | Text vector generation method based on unsupervised graph neural network structure |
CN113254616A (en) * | 2021-06-07 | 2021-08-13 | 佰聆数据股份有限公司 | Intelligent question-answering system-oriented sentence vector generation method and system |
CN113378547A (en) * | 2021-06-16 | 2021-09-10 | 武汉大学 | GCN-based Chinese compound sentence implicit relation analysis method and device |
CN113704415A (en) * | 2021-09-09 | 2021-11-26 | 北京邮电大学 | Vector representation generation method and device for medical text |
CN114048350A (en) * | 2021-11-08 | 2022-02-15 | 湖南大学 | Text-video retrieval method based on fine-grained cross-modal alignment model |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11625540B2 (en) * | 2020-02-28 | 2023-04-11 | Vinal AI Application and Research Joint Stock Co | Encoder, system and method for metaphor detection in natural language processing |
JP2022035314A (en) * | 2020-08-20 | 2022-03-04 | 富士フイルムビジネスイノベーション株式会社 | Information processing unit and program |
Also Published As
Publication number | Publication date |
---|---|
CN114818737A (en) | 2022-07-29 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||
GR01 | Patent grant ||