CN114818737B - Method, system and storage medium for extracting semantic features of scientific and technological paper data text - Google Patents


Info

Publication number
CN114818737B
CN114818737B (application CN202210745539.4A)
Authority
CN
China
Prior art keywords
scientific
matrix
paper
semantic
technological
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210745539.4A
Other languages
Chinese (zh)
Other versions
CN114818737A (en)
Inventor
薛哲
杜军平
郑长伟
李文玲
梁美玉
邵蓥侠
寇菲菲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications
Priority to CN202210745539.4A
Publication of CN114818737A
Application granted
Publication of CN114818737B
Legal status: Active

Classifications

    • G PHYSICS › G06 COMPUTING; CALCULATING OR COUNTING › G06F ELECTRIC DIGITAL DATA PROCESSING › G06F 40/00 Handling natural language data › G06F 40/30 Semantic analysis
    • G06F 18/00 Pattern recognition › G06F 18/20 Analysing › G06F 18/22 Matching criteria, e.g. proximity measures
    • G06F 18/25 Fusion techniques › G06F 18/253 Fusion techniques of extracted features
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS › G06N 3/00 Computing arrangements based on biological models › G06N 3/02 Neural networks › G06N 3/04 Architecture, e.g. interconnection topology › G06N 3/045 Combinations of networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a method, a system, and a storage medium for extracting semantic features from scientific paper text data, the method comprising the following steps: acquiring text information of a scientific paper and constructing an entity relationship graph based on the acquired text information, wherein the text information comprises paper titles and keywords, the nodes in the entity relationship graph are paper titles or keywords, and the edges in the entity relationship graph are association relationships between the nodes; extracting semantic features based on the acquired text information of the scientific paper to obtain a semantic feature matrix; determining an original adjacency matrix based on the entity relationship graph, and inputting the semantic feature matrix and the original adjacency matrix into a graph network model to obtain a spatial feature matrix; and performing feature fusion on the semantic feature matrix and the spatial feature matrix to obtain the final semantic features of the scientific paper. On the basis of extracting semantic features from the scientific paper corpus, the feature extraction method exploits the spatial correlations of the knowledge graph, and can therefore extract the semantic features of scientific papers well.

Description

Method, system and storage medium for extracting semantic features of scientific and technological paper data text
Technical Field
The invention relates to the field of computer technology, and in particular to a method, a system, and a storage medium for extracting semantic features from scientific paper text data.
Background
Scientific papers are an important medium for presenting research results and obtaining information: a large number of papers are published almost every day, and these academic results contain much of the latest information in specialized fields, so efficiently and quickly acquiring papers and learning their semantic feature representations is particularly important. However, scientific paper data often contains a large number of complex attributes, such as abstracts, keywords, and cited documents; the associations between papers are tight; and the specialized knowledge involved spans a wide range of subjects, so feature extraction for scientific papers requires substantial domain knowledge. Effectively extracting the features of paper data can provide support for downstream processing of scientific paper data.
TF-IDF (term frequency-inverse document frequency) is a traditional text feature extraction method. It uses term frequency and inverse document frequency to represent a document as a multidimensional vector of keyword weights, and is a typical vector space model. Mikolov et al. introduced the word vector representation model Word2Vec, based on the continuous bag-of-words (CBOW) and Skip-Gram models, after which the whole NLP field quickly entered the era of embeddings. Traditional encoding mainly relied on one-hot coding, whose vectors are often sparse; word vectors trained by Word2Vec are low-dimensional and dense, make effective use of the context of words, and carry richer semantic information. Li et al. used the Word2Vec algorithm to bridge the semantic gap and applied term frequency-inverse document frequency (TF-IDF) weighted mapping of HTTP traffic to construct low-dimensional paragraph vector representations and reduce complexity. However, once Word2Vec has been trained, the semantic vector of each word is fixed, and different vectors cannot be obtained from different contexts. To address this lack of context adaptation in Word2Vec, Peters et al. proposed the ELMo model; unlike static word-embedding models, whose semantic representation vectors never change, ELMo is first pre-trained on a large-scale corpus and then fine-tuned for a specific application field, so that a word obtains a context-dependent vector.
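As a concrete illustration of the TF-IDF vector-space representation described above, the following is a minimal sketch; the whitespace tokenization and the natural-log IDF are simplifying assumptions, and the function name is illustrative:

```python
import math

def tfidf_vectors(docs):
    """Classic TF-IDF document vectors: each document becomes a
    vocabulary-length vector of term-frequency x inverse-document-frequency
    keyword weights, as in the traditional vector space model."""
    N = len(docs)
    tokenized = [d.split() for d in docs]
    vocab = sorted({w for toks in tokenized for w in toks})
    # document frequency: number of documents containing each word
    df = {w: sum(w in toks for toks in tokenized) for w in vocab}
    vectors = []
    for toks in tokenized:
        tf = {w: toks.count(w) / len(toks) for w in set(toks)}
        vectors.append([tf.get(w, 0.0) * math.log(N / df[w]) for w in vocab])
    return vocab, vectors

vocab, vecs = tfidf_vectors(["graph network model", "language model"])
```

Note how "model", which appears in every document, gets IDF log(1) = 0 and thus weight 0 — exactly the sparsity-inducing down-weighting of common words that motivates TF-IDF.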
GPT likewise obtains a pre-trained model from a large corpus and then fine-tunes it on a small-scale corpus. Compared with ELMo, the main difference lies in the network structure used for feature extraction: GPT adopts the Transformer, while ELMo adopts an LSTM. The Transformer is an end-to-end sequence model proposed by Google, and many improved methods based on it have been widely applied in natural language processing and even in images. Unlike traditional sequence models, the Transformer builds its network entirely on the attention mechanism, the whole network consisting of encoder and decoder structures. On this basis the BERT model was further proposed: through a mask mechanism, BERT hides some words in the corpus and predicts them as a pre-training task, and its bidirectional encoding effectively extracts the contextual semantics of the text.
Traditional pre-trained models mainly adopt a unidirectional network or simply concatenate two unidirectional networks, structures that cannot effectively mine the contextual relations of a text. BERT emphasizes obtaining context information in all directions: the whole model is built from deep bidirectional Transformer units, finally producing a deep bidirectional language representation that fuses context information. During pre-training, BERT uses a Masked Language Model task and a Next Sentence Prediction task. The Masked Language Model randomly replaces some words in the corpus and lets the model predict the replaced words, so that the model effectively learns single-sentence features; however, for common questions, answers, and similar sentences, the contextual relationship between sentences must be captured, and the Masked Language Model task tends to extract word-level representations and cannot directly acquire sentence-level features. To let the model understand the relationship between sentences, BERT uses the Next Sentence Prediction task for pre-training, i.e., training by judging whether two sentences are contextually related. By performing these two pre-training tasks with a deep network on large-scale corpora, BERT obtains rich text semantic representations and, on this basis, achieved the best performance at the time on many downstream NLP tasks.
Existing feature extraction methods based on pre-trained corpus models extract semantic features from text context. For scientific papers, however, a large number of associations exist among the various attributes of a paper, especially between keywords and titles, and papers can also be associated through keyword co-occurrence relations; these attributes cover the main semantic information of the papers. Existing feature extraction methods ignore the associations among paper attributes and cannot extract features from both the context and the inter-paper association relations at the same time, so they cannot extract the semantic features of scientific papers well. How to better extract these semantic features is therefore a technical problem to be solved urgently.
Disclosure of Invention
In view of the above, the present invention provides a method, a system, and a storage medium for extracting semantic features of a scientific and technological paper data text, so as to solve one or more problems in the prior art.
According to one aspect of the invention, the invention discloses a method for extracting semantic features of a scientific and technical paper data text, which comprises the following steps:
acquiring text information of a scientific paper, and constructing an entity relationship graph based on the acquired text information, wherein the text information comprises paper titles and keywords, the nodes in the entity relationship graph are paper titles or keywords, and the edges in the entity relationship graph are the association relations among the nodes;
extracting semantic features based on the acquired text information of the scientific paper to obtain a semantic feature matrix;
determining an original adjacency matrix based on the entity relationship graph, and inputting the semantic feature matrix and the original adjacency matrix into a graph network model to obtain a spatial feature matrix;
and performing feature fusion on the semantic feature matrix and the spatial feature matrix to obtain the final semantic features of the scientific and technological paper.
In some embodiments of the present invention, extracting semantic features based on the obtained text information of the scientific paper to obtain a semantic feature matrix comprises:
and inputting the acquired text information of the scientific and technological paper to a BERT model to obtain a semantic feature matrix.
In some embodiments of the present invention, constructing an entity relationship graph based on the obtained text information of the scientific and technical paper includes:
and calculating the correlation between any two nodes through a point-to-point mutual information algorithm based on the acquired text information of the scientific and technological thesis, and constructing an entity relation graph based on the calculated correlation.
In some embodiments of the present invention, the pointwise mutual information is calculated as:

PMI(W_i, W_j) = log( p(W_i, W_j) / ( p(W_i) · p(W_j) ) )

wherein W_i denotes node i and W_j denotes node j, with

p(W_i, W_j) = C(W_i, W_j) / M,  p(W_i) = C(W_i) / M,  p(W_j) = C(W_j) / M;

M denotes the number of scientific-paper abstracts (documents), C(W_i, W_j) denotes the number of times node i and node j co-occur in the abstract of the same scientific paper, C(W_i) denotes the total number of occurrences of node i in the abstracts, and C(W_j) denotes the total number of occurrences of node j in the abstracts.
In some embodiments of the invention, the graph network model comprises a plurality of convolutional layers, the output of each convolutional layer being:

L^(i) = ρ( Â · L^(i-1) · W^(i) )

wherein L^(i) is the output of the i-th layer, L^(i-1) is the output of the (i-1)-th layer, ρ is the activation function, W^(i) are the model parameters, Â = D^(-1/2) (A + I) D^(-1/2) is the normalized (Laplacian-transformed) adjacency matrix, and D is the degree matrix used for normalization.
In some embodiments of the invention, the method further comprises:
calculating the cosine similarity of any two nodes in the entity relationship graph based on the obtained final semantic features;
obtaining a reconstructed adjacency matrix based on the cosine similarity;
and calculating loss values of the original adjacency matrix and the reconstructed adjacency matrix, and optimizing parameters of the graph network model based on the loss values.
In some embodiments of the invention, the loss function of the graph network model is:

Loss = || S - A ||_F^2

wherein S is the cosine similarity matrix, A is the original adjacency matrix, and ||·||_F denotes the F (Frobenius) norm.
In some embodiments of the invention, the semantic feature matrix and the spatial feature matrix are fused by the formula Z = λ·Z_GCN + (1-λ)·Z_BERT, wherein Z_GCN is the spatial feature matrix, Z_BERT is the semantic feature matrix, and λ is a hyperparameter with λ ∈ (0, 1).
according to another aspect of the present invention, a scientific paper data text semantic feature extraction system is also disclosed, which comprises a processor and a memory, wherein the memory stores computer instructions, the processor is used for executing the computer instructions stored in the memory, and when the computer instructions are executed by the processor, the system realizes the steps of the method according to any one of the above embodiments.
According to yet another aspect of the present invention, a computer-readable storage medium is also disclosed, on which a computer program is stored, characterized in that the program, when executed by a processor, implements the steps of the method according to any of the embodiments above.
The invention discloses a method and a system for extracting semantic features from scientific paper text data. A semantic feature matrix of the papers is obtained first, a spatial feature matrix is then determined through a graph network model, and finally the semantic feature matrix and the spatial feature matrix are fused to obtain the final semantic features of the scientific papers. On the basis of extracting semantic features from the scientific paper corpus, the method and the system enrich the semantic representation of the papers by using the spatial correlations of the knowledge graph, and can therefore better extract the semantic features of scientific papers.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
It will be appreciated by those skilled in the art that the objects and advantages that can be achieved with the present invention are not limited to the specific details set forth above, and that these and other objects that can be achieved with the present invention will be more clearly understood from the detailed description that follows.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the principles of the invention. The components in the figures are not necessarily to scale, emphasis instead being placed upon illustrating the principles of the invention. For purposes of illustrating and describing some portions of the present invention, corresponding parts may be exaggerated in the drawings, i.e., may be larger relative to other components in an exemplary device actually made according to the present invention. In the drawings:
fig. 1 is a schematic flow chart of a method for extracting semantic features of a scientific and technical paper data text according to an embodiment of the present invention.
Fig. 2 is a schematic flow chart of a method for extracting semantic features of a scientific paper data text according to another embodiment of the present invention.
Fig. 3 is a schematic structural diagram of a system for extracting semantic features of a scientific paper data text according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the embodiments of the present invention are further described in detail below with reference to the accompanying drawings. The exemplary embodiments and descriptions of the present invention are provided to explain the present invention, but not to limit the present invention.
It should be noted that, in order to avoid obscuring the present invention with unnecessary details, only the structures and/or processing steps closely related to the scheme according to the present invention are shown in the drawings, and other details not so related to the present invention are omitted.
It should be emphasized that the term "comprises/comprising/comprises/having" when used herein, is taken to specify the presence of stated features, elements, steps or components, but does not preclude the presence or addition of one or more other features, elements, steps or components.
Hereinafter, embodiments of the present invention will be described with reference to the accompanying drawings. In the drawings, the same reference numerals denote the same or similar parts, or the same or similar steps.
Fig. 1 is a flowchart illustrating a method for extracting semantic features of a text of a scientific and technological paper according to an embodiment of the present invention, and as shown in fig. 1, the method for extracting semantic features of a text of a scientific and technological paper at least includes steps S10 to S40.
Step S10: the method comprises the steps of obtaining text information of a scientific and technological thesis, and constructing an entity relationship graph based on the obtained text information of the scientific and technological thesis, wherein the text information comprises a thesis title and a keyword, nodes in the entity relationship graph are the thesis title or the keyword, and edges in the entity relationship graph are incidence relations among the nodes.
In this step, the text data of the scientific papers may first be preprocessed and the keywords of each paper extracted, to obtain text information comprising the title content and the keyword content. Then, based on the structural characteristics of the scientific paper, associations between paper-entity titles and keywords are established. Specifically, an entity relationship graph G is constructed based on the acquired text information; G is an undirected graph whose nodes are titles or keywords, and an edge between two nodes represents the association relationship between them. In constructing the relationship graph G, edges between title nodes and keyword nodes of paper entities are established using the association relationship between a title and a keyword, or between a keyword and a keyword.
Constructing the entity relationship graph based on the acquired text information may include: calculating the correlation between any two nodes through a pointwise mutual information algorithm based on the acquired text information, and constructing the entity relationship graph based on the calculated correlations. It should be understood that constructing the entity relationship graph from inter-node correlations is merely an example; the graph can also be constructed by other methods, and algorithms other than pointwise mutual information can be used to calculate the correlation between two nodes.
Further, the pointwise mutual information is calculated as:

PMI(W_i, W_j) = log( p(W_i, W_j) / ( p(W_i) · p(W_j) ) )

wherein W_i denotes node i and W_j denotes node j (for example, node i is a title node and node j is a keyword node), and PMI(W_i, W_j) is the correlation score between them, with

p(W_i, W_j) = C(W_i, W_j) / M,  p(W_i) = C(W_i) / M,  p(W_j) = C(W_j) / M.

M denotes the number of titles in the entity relationship graph, which can also be regarded as the number of documents; C(W_i, W_j) denotes the number of times node i and node j co-occur in the abstract of the same scientific paper; C(W_i) and C(W_j) denote the total numbers of occurrences of node i and node j in the abstracts. In this embodiment, the association between a paper and a keyword is weighted with a term frequency-inverse document frequency (TF-IDF) based index, i.e., the weight of the edge between nodes i and j is defined by the TF-IDF score. The association between keywords is scored with pointwise mutual information (PMI): two nodes whose score exceeds a threshold are considered to have an association relationship, and an edge is generated between them; likewise, two nodes whose score falls below the threshold are considered to have no association relationship, and no edge is generated between them.
Step S20: and extracting semantic features based on the acquired text information of the scientific and technical paper to obtain a semantic feature matrix.
In this step, the semantic features of the scientific paper may be extracted based on the pre-trained language model, that is, the text information of the scientific paper acquired in step S10 is input into the pre-trained language model to obtain the semantic feature matrix of the scientific paper.
The pre-training language model may be a BERT model, a Word2Vec model, an ELMo model, etc. When the pre-training language model is a BERT model, extracting semantic features based on the acquired text information of the scientific and technological thesis to obtain a semantic feature matrix, which specifically comprises the following steps: and inputting the acquired text information of the scientific and technological paper to a BERT model to obtain a semantic feature matrix. The network structure of the BERT model is mainly formed by a Transformer and is trained in an unsupervised mode. When the model is trained, a data sample set is determined, and an initial network model is trained on the basis of the data samples to optimize network parameters, so that a trained BERT model is obtained.
Illustratively, the output of the BERT model may be expressed as X, the semantic feature matrix of the scientific papers, composed of the document-node embeddings:

X = [X_doc ; X_word] ∈ R^((n_doc + n_word) × d)

wherein d is the embedding dimension, X_doc denotes the semantic feature matrix corresponding to the titles in the text information, X_word denotes the semantic feature matrix corresponding to the keywords, n_doc denotes the number of titles, and n_word denotes the number of keywords; n_doc + n_word may also be regarded as the total number of nodes in the entity relationship graph.
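The block structure of X can be illustrated with random stand-ins for the BERT embeddings (the sizes n_doc, n_word, and d below are arbitrary choices, not values from the invention):

```python
import numpy as np

# Hypothetical sizes: n_doc title nodes, n_word keyword nodes, embedding dim d.
n_doc, n_word, d = 3, 5, 8
rng = np.random.default_rng(0)
X_doc = rng.normal(size=(n_doc, d))    # stand-in for BERT title embeddings
X_word = rng.normal(size=(n_word, d))  # stand-in for BERT keyword embeddings

# Node feature matrix of the entity relationship graph: titles stacked on keywords,
# one row per node, so X has (n_doc + n_word) rows and d columns.
X = np.vstack([X_doc, X_word])
```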
Step S30: and determining an original adjacency matrix based on the entity relationship diagram, and inputting the semantic feature matrix and the original adjacency matrix into a diagram network model to obtain a spatial feature matrix.
The original adjacency matrix may be written as A, an N×N matrix, where N is the number of nodes in the entity relationship graph, i.e., A ∈ R^(N×N). In the adjacency matrix A, if node i and node j are connected then A_ij = 1, and if they are not connected then A_ij = 0; every element of A is therefore 0 or 1. It should be understood that setting the elements of the adjacency matrix to 0 or 1 is merely a preferred example; in other application scenarios the values may be adjusted according to the actual application.
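A minimal sketch of building the 0/1 original adjacency matrix A from an undirected edge list (the node indices and edges are illustrative):

```python
import numpy as np

def build_adjacency(num_nodes, edges):
    """0/1 adjacency matrix A of the undirected entity relationship graph:
    A[i, j] = 1 iff nodes i and j are connected by an edge."""
    A = np.zeros((num_nodes, num_nodes))
    for i, j in edges:
        A[i, j] = 1.0
        A[j, i] = 1.0  # undirected graph: A is symmetric
    return A

# e.g. a small path graph title-keyword-keyword-title
A = build_adjacency(4, [(0, 1), (1, 2), (2, 3)])
```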
In this step, the original adjacency matrix A and the semantic feature matrix acquired in step S20 are input into the trained graph network model for learning, thereby obtaining the spatial feature matrix of the scientific papers. The embedded features obtained from the BERT model are regarded as the feature representations of the nodes in the relationship graph G, i.e., X is input into the trained graph network model as the initial feature matrix of the nodes. Training the graph network model is similar to training the BERT model: a training sample set containing a plurality of samples is first constructed, and the initial graph network model is trained on these samples to update its network parameters, thereby obtaining the trained graph network model.
Illustratively, the graph network model comprises a plurality of convolutional layers; the initial feature matrix and the original adjacency matrix are the input of the graph network model, specifically the input of its first convolutional layer. The output of the i-th convolutional layer of the graph network model can be expressed as:

L^(i) = ρ( Â · L^(i-1) · W^(i) )

wherein L^(i) is the output of the i-th layer, L^(i-1) is the output of the (i-1)-th layer, ρ is the activation function, W^(i) are the model parameters, Â = D^(-1/2) (A + I) D^(-1/2) is the normalized (Laplacian-transformed) adjacency matrix, and D is the degree matrix. The final output of the graph network model can be expressed as Z_GCN = g(X, A), where A is the original adjacency matrix, X is the semantic feature matrix, and Z_GCN is the spatial feature matrix.
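One propagation layer can be sketched with NumPy as follows; the normalization Â = D^(-1/2)(A + I)D^(-1/2) follows the formula above, while the choice of ReLU for the activation ρ is an assumption (the source only names ρ an activation function):

```python
import numpy as np

def gcn_layer(L_prev, A, W):
    """One graph-convolution layer: rho(A_hat @ L_prev @ W), where
    A_hat = D^{-1/2} (A + I) D^{-1/2} and rho is taken to be ReLU."""
    A_tilde = A + np.eye(A.shape[0])           # add self-loops
    d = A_tilde.sum(axis=1)                    # degree vector of A_tilde
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt  # symmetric normalization
    return np.maximum(A_hat @ L_prev @ W, 0.0)

rng = np.random.default_rng(1)
A = np.array([[0.0, 1.0], [1.0, 0.0]])  # two connected nodes
X = rng.normal(size=(2, 4))             # initial node features (BERT embeddings)
W1 = rng.normal(size=(4, 3))            # layer parameters W^(1)
Z = gcn_layer(X, A, W1)
```

Stacking such layers, with the first layer fed X and A, yields the spatial feature matrix Z_GCN = g(X, A).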
Step S40: and performing feature fusion on the semantic feature matrix and the spatial feature matrix to obtain the final semantic features of the scientific and technological thesis.
In this step, the semantic feature vectors obtained in step S20 and the spatial feature vectors obtained in step S30 are fused to obtain the final semantic features of the scientific paper. By combining the advantages of the BERT module and the GCN module, this step exploits large-scale pre-training on a large amount of raw scientific paper data and, on the basis of the semantic features extracted from the paper corpus, enriches the semantic representation of the papers with the spatial correlations of the knowledge graph.
In fact, when Z_GCN is used directly for downstream classification tasks, both the convergence speed and the final classification result are inferior to those of the original BERT feature vectors. The invention therefore uses a residual network to construct a weighted output that combines the GCN feature representation with the semantic feature representation; that is, feature fusion of the semantic feature matrix and the spatial feature matrix is performed by the formula Z = λZ_GCN + (1 − λ)Z_BERT, where Z_GCN is the spatial feature matrix, Z_BERT is the semantic feature matrix, and λ is a hyperparameter with λ ∈ (0, 1). λ controls the weight between the feature representation output by the GCN and the original semantic features: λ = 1 means that only the GCN model is used for representation learning on the scientific paper, and λ = 0 means that only the BERT module is used. Therefore, in the present invention λ ∈ (0, 1); with such a λ, the predictions of the two models can be balanced, so that the BERT-GCN model can better optimize the output result. The BERT features act directly on the GCN output values, ensuring that the GCN output values are adjusted and optimized toward the target; this helps the multi-layer GCN model overcome inherent defects such as vanishing gradients or over-smoothing, thereby achieving better performance. From this embodiment it can be seen that the method for extracting text semantic features of scientific paper data weights the original semantic features obtained by BERT and the features output by the graph convolutional network using a residual network, fully mining the semantic information and the association information in the paper text data.
In an embodiment, the method for extracting semantic features of a scientific and technological paper data text further includes: calculating the cosine similarity of any two nodes in the entity relationship graph based on the obtained final semantic features; obtaining a reconstructed adjacency matrix based on the cosine similarity; and calculating loss values of the original adjacency matrix and the reconstructed adjacency matrix, and optimizing parameters of the graph network model based on the loss values.
That is, in the pre-training process, link prediction is adopted as the pre-training task, so that the final output Z can learn the structural features of the graph. For nodes i and j whose encoded features are Z_i and Z_j, the cosine similarity S_ij of the two nodes is calculated, and a reconstructed adjacency matrix S is finally obtained. The loss value between the original adjacency matrix A and the reconstruction matrix S is calculated as the optimization target. Illustratively, the loss function of the graph network model is:

Loss = ‖S − A‖_F²;

where S is the cosine similarity matrix, A is the adjacency matrix, and ‖·‖_F is the F-norm (Frobenius norm).
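The link-prediction pre-training objective above can be sketched as follows: the reconstructed adjacency matrix S is built from pairwise cosine similarities of the final node features, and the loss is the squared Frobenius-norm distance to the original adjacency matrix A. The embedding size and the toy adjacency matrix are illustrative assumptions.

```python
import numpy as np

def cosine_similarity_matrix(z):
    """S_ij = cosine similarity between node embeddings z_i and z_j."""
    norms = np.linalg.norm(z, axis=1, keepdims=True)
    z_unit = z / np.clip(norms, 1e-12, None)   # guard against zero-norm rows
    return z_unit @ z_unit.T

def reconstruction_loss(s, a):
    """Squared Frobenius-norm distance ||S - A||_F^2 used as the optimization target."""
    return float(np.linalg.norm(s - a, ord="fro") ** 2)

z = np.random.default_rng(0).normal(size=(4, 6))   # final semantic features of 4 nodes
a = np.array([[0, 1, 1, 0],                         # original adjacency matrix
              [1, 0, 0, 1],
              [1, 0, 0, 0],
              [0, 1, 0, 0]], dtype=float)
s = cosine_similarity_matrix(z)
loss = reconstruction_loss(s, a)
```

Minimizing this loss pushes connected node pairs toward high cosine similarity and unconnected pairs toward low similarity, which is how the final output learns the structural features of the graph.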
Fig. 2 is a schematic flow chart of a method for extracting semantic features of scientific paper data text according to another embodiment of the present invention. As shown in Fig. 2, the method first obtains the original scientific paper data; keywords of the scientific papers are then extracted, and a keyword relationship graph is constructed based on the relations among the keywords; the original semantic features of the scientific papers are further extracted to obtain a semantic feature matrix; spatial correlation features of the scientific papers are extracted based on the adjacency matrix corresponding to the entity relationship graph and the obtained semantic feature matrix to obtain a spatial feature matrix; finally, vector fusion is performed on the spatial feature matrix and the semantic feature matrix to obtain the final semantic representation of the scientific paper. On the basis of extracting semantic features from the scientific paper corpus, the method enriches the semantic representation of the paper with the spatial correlations of the knowledge graph, thereby better extracting the semantic features of the scientific paper.
Correspondingly, the invention also provides a system for extracting semantic features of a scientific paper data text, which comprises a processor and a memory, wherein the memory stores computer instructions, the processor is used for executing the computer instructions stored in the memory, and when the computer instructions are executed by the processor, the system realizes the steps of the method according to any one of the above embodiments. Fig. 3 is a schematic diagram of an architecture of a text semantic feature extraction system of scientific and technological paper data according to another embodiment of the present invention, and as shown in fig. 3, the system first extracts semantic features of the scientific and technological paper data based on a BERT model, further extracts spatial features of the scientific and technological paper based on a GCN network model, and further performs feature fusion on the semantic features and the spatial features to finally obtain final semantic features of the scientific and technological paper.
Through the above embodiments, when feature extraction is performed on scientific paper data, the associations between scientific papers are taken into account: the co-occurrence relations of keywords are used to construct a heterogeneous network of paper documents and keywords, and a graph convolutional network (GCN) obtains the vector representation of a paper from the triples of the paper-keyword heterogeneous network on the basis of the original semantic features extracted by BERT. That is, in the BERT-GCN model proposed by the present invention, the BERT model is used to initialize the representations of the document nodes in the text graph, and these representations serve as the input to the GCN; the document representations are then iteratively updated by the GCN based on the graph structure, and its outputs are treated as the final representations of the document nodes, so that the model can exploit the advantages of both the pre-trained model and the graph model.
In addition, the knowledge graph is constructed for training from corpora in scientific paper fields that place extremely high demands on professional knowledge, so as to improve the feature representation capability of the model in professional domains. First, a document-keyword heterogeneous graph is constructed for the corpus, in which the nodes are words or documents; the node vectors are initialized with a pre-trained BERT model, and features are aggregated with a graph convolutional network (GCN). By jointly training the BERT module and the GCN module, the method can exploit the advantages of both, utilizing large-scale pre-training on a large amount of original scientific paper data and, on the basis of extracting semantic features from the scientific paper corpus, enriching the semantic representation of the paper with the spatial correlations of the knowledge graph.
In addition, the invention also discloses a computer readable storage medium, on which a computer program is stored, which when executed by a processor implements the steps of the method according to any of the above embodiments.
Those of ordinary skill in the art will appreciate that the various illustrative components, systems, and methods described in connection with the embodiments disclosed herein may be implemented as hardware, software, or combinations thereof. Whether this is done in hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention. When implemented in hardware, it may be, for example, an electronic circuit, an Application Specific Integrated Circuit (ASIC), suitable firmware, plug-in, function card, or the like. When implemented in software, the elements of the invention are the programs or code segments used to perform the required tasks. The program or code segments may be stored in a machine-readable medium or transmitted by a data signal carried in a carrier wave over a transmission medium or a communication link. A "machine-readable medium" may include any medium that can store or transfer information. Examples of a machine-readable medium include electronic circuits, semiconductor memory devices, ROM, flash memory, erasable ROM (EROM), floppy disks, CD-ROMs, optical disks, hard disks, fiber optic media, radio Frequency (RF) links, and so forth. The code segments may be downloaded via computer networks such as the internet, intranet, etc.
It should also be noted that the exemplary embodiments mentioned in this patent describe some methods or systems based on a series of steps or devices. However, the present invention is not limited to the order of the above-described steps, that is, the steps may be performed in the order mentioned in the embodiments, may be performed in an order different from the order in the embodiments, or may be performed simultaneously.
Features that are described and/or illustrated with respect to one embodiment may be used in the same way or in a similar way in one or more other embodiments and/or in combination with or instead of the features of the other embodiments in the present invention.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the present invention, and various modifications and changes may be made to the embodiment of the present invention by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (6)

1. A method for extracting semantic features of scientific and technological thesis data texts is characterized by comprising the following steps:
acquiring text information of a scientific and technological thesis, and constructing an entity relationship graph based on the acquired text information of the scientific and technological thesis, wherein the text information comprises a thesis title and a keyword, nodes in the entity relationship graph are the thesis title or the keyword, edges in the entity relationship graph are association relations between the nodes, and the entity relationship graph comprises the association relations between the thesis title and the keyword of the scientific and technological thesis;
extracting semantic features based on the acquired text information of the scientific and technical paper to obtain a semantic feature matrix;
determining an original adjacency matrix based on the entity relationship graph, and inputting the semantic feature matrix and the original adjacency matrix into a graph network model to obtain a spatial feature matrix;
performing feature fusion on the semantic feature matrix and the spatial feature matrix to obtain a final semantic feature of the scientific and technological paper;
performing feature fusion on the semantic feature matrix and the spatial feature matrix by the formula Z = λZ_GCN + (1 − λ)Z_BERT; wherein Z_GCN is the spatial feature matrix, Z_BERT is the semantic feature matrix, and λ is a hyperparameter with λ ∈ (0, 1);
calculating the cosine similarity of any two nodes in the entity relationship graph based on the obtained final semantic features;
obtaining a reconstructed adjacency matrix based on the cosine similarity;
calculating loss values of the original adjacency matrix and the reconstructed adjacency matrix, and optimizing parameters of the graph network model based on the loss values;
the loss function of the graph network model is:

Loss = ‖S − A‖_F²;

wherein S is the cosine similarity matrix, A is the original adjacency matrix, and F represents the F-norm;
inputting the acquired text information of the scientific and technological thesis into a BERT model to obtain a semantic feature matrix;
wherein the semantic feature matrix is represented as X = [X_doc; X_word] ∈ R^((n_doc + n_word) × d), where d represents the embedding dimension, X_doc represents the semantic feature matrix corresponding to the paper titles in the text information, X_word represents the semantic feature matrix corresponding to the keywords in the text information, n_doc represents the number of paper titles in the text information, and n_word represents the number of keywords in the text information.
2. The method for extracting semantic features of the text of the scientific paper data according to claim 1, wherein constructing an entity relationship diagram based on the acquired text information of the scientific paper comprises:
calculating the correlation between any two nodes through a pointwise mutual information (PMI) algorithm based on the acquired text information of the scientific and technological paper, and constructing an entity relationship graph based on the calculated correlation.
3. The method of claim 2, wherein the calculation formula of the pointwise mutual information algorithm is:

PMI(W_i, W_j) = log( p(W_i, W_j) / (p(W_i) · p(W_j)) );

wherein W_i represents node i and W_j represents node j;

p(W_i, W_j) = C(W_i, W_j) / M;

p(W_i) = C(W_i) / M;

p(W_j) = C(W_j) / M;

M represents the length of the text abstract of the scientific paper, C(W_i, W_j) represents the number of co-occurrences of node i and node j in the text abstract of the same scientific paper, C(W_i) represents the number of occurrences of node i in the abstract, and C(W_j) represents the number of occurrences of node j in the abstract.
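As an illustrative sketch (not part of the claims), the pointwise mutual information defined in claim 3 can be computed as follows. Treating each abstract as a single co-occurrence window and M as the number of abstracts is a simplifying assumption made here for brevity; the sample abstracts are invented.

```python
import math
from collections import Counter
from itertools import combinations

def pmi_scores(abstracts):
    """PMI for keyword pairs co-occurring in the same abstract (one window per abstract)."""
    m = len(abstracts)                 # number of co-occurrence windows
    count = Counter()                  # C(W_i): windows containing word W_i
    co_count = Counter()               # C(W_i, W_j): windows containing both words
    for words in abstracts:
        vocab = sorted(set(words))
        count.update(vocab)
        co_count.update(combinations(vocab, 2))
    scores = {}
    for (wi, wj), c_ij in co_count.items():
        p_ij = c_ij / m
        p_i, p_j = count[wi] / m, count[wj] / m
        scores[(wi, wj)] = math.log(p_ij / (p_i * p_j))
    return scores

abstracts = [["graph", "network", "semantic"],
             ["graph", "semantic"],
             ["network", "retrieval"]]
scores = pmi_scores(abstracts)
# A positive PMI means the pair co-occurs more often than independence would predict,
# which is the criterion for adding an edge to the entity relationship graph.
```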
4. The method as claimed in claim 1, wherein the graph network model comprises a plurality of convolutional layers, and the output of each convolutional layer is:

L^(i) = ρ( Ã · L^(i−1) · W^(i) );

wherein L^(i) is the output of the i-th layer, L^(i−1) is the output of the (i−1)-th layer, ρ is the activation function, W^(i) is a model parameter, Ã is the Laplace transform of the adjacency matrix, Ã = D^(−1/2) A D^(−1/2), and D is the degree matrix used for normalization.
5. A scientific and technological paper data text semantic feature extraction system, which comprises a processor and a memory, wherein the memory stores computer instructions, the processor is used for executing the computer instructions stored in the memory, when the computer instructions are executed by the processor, the system realizes the steps of the method according to any one of claims 1 to 4.
6. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 4.
CN202210745539.4A 2022-06-29 2022-06-29 Method, system and storage medium for extracting semantic features of scientific and technological paper data text Active CN114818737B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210745539.4A CN114818737B (en) 2022-06-29 2022-06-29 Method, system and storage medium for extracting semantic features of scientific and technological paper data text


Publications (2)

Publication Number Publication Date
CN114818737A CN114818737A (en) 2022-07-29
CN114818737B true CN114818737B (en) 2022-11-18


Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115858725B (en) * 2022-11-22 2023-07-04 广西壮族自治区通信产业服务有限公司技术服务分公司 Text noise screening method and system based on unsupervised graph neural network

Citations (6)

Publication number Priority date Publication date Assignee Title
CN110688474A (en) * 2019-09-03 2020-01-14 西北工业大学 Embedded representation obtaining and citation recommending method based on deep learning and link prediction
CN110705260A (en) * 2019-09-24 2020-01-17 北京工商大学 Text vector generation method based on unsupervised graph neural network structure
CN113254616A (en) * 2021-06-07 2021-08-13 佰聆数据股份有限公司 Intelligent question-answering system-oriented sentence vector generation method and system
CN113378547A (en) * 2021-06-16 2021-09-10 武汉大学 GCN-based Chinese compound sentence implicit relation analysis method and device
CN113704415A (en) * 2021-09-09 2021-11-26 北京邮电大学 Vector representation generation method and device for medical text
CN114048350A (en) * 2021-11-08 2022-02-15 湖南大学 Text-video retrieval method based on fine-grained cross-modal alignment model

Family Cites Families (2)

Publication number Priority date Publication date Assignee Title
US11625540B2 (en) * 2020-02-28 2023-04-11 Vinal AI Application and Research Joint Stock Co Encoder, system and method for metaphor detection in natural language processing
JP2022035314A (en) * 2020-08-20 2022-03-04 富士フイルムビジネスイノベーション株式会社 Information processing unit and program




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant