CN114841173A - Academic text semantic feature extraction method and system based on pre-training model and storage medium - Google Patents

Academic text semantic feature extraction method and system based on pre-training model and storage medium

Info

Publication number
CN114841173A
CN114841173A
Authority
CN
China
Prior art keywords
model
academic
training
training model
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210778073.8A
Other languages
Chinese (zh)
Other versions
CN114841173B (en)
Inventor
杜军平
王岳
薛哲
梁美玉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications filed Critical Beijing University of Posts and Telecommunications
Priority to CN202210778073.8A priority Critical patent/CN114841173B/en
Publication of CN114841173A publication Critical patent/CN114841173A/en
Application granted granted Critical
Publication of CN114841173B publication Critical patent/CN114841173B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F18/2135Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods based on approximation criteria, e.g. principal component analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/211Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a method, a system and a storage medium for extracting academic text semantic features based on a pre-training model. The method comprises the following steps: acquiring academic resource text data; inputting the acquired academic resource text data into a pre-training model to obtain a multi-dimensional academic text semantic feature vector, wherein the pre-training model is a student pre-training model obtained by fine-tuning a Bert pre-training model based on a multiple-negatives (multiple negative examples) loss function and then, with the fine-tuned Bert pre-training model as a teacher model, training the student model through knowledge distillation; and performing dimension-reduction compression on the multi-dimensional academic text semantic feature vector and outputting the final academic text semantic features. The method and the device improve the quality of vector generation, accelerate vector generation, and are suitable for text vector generation in academic big-data scenarios.

Description

Method, system and storage medium for academic text semantic feature extraction based on a pre-training model

Technical Field

The invention relates to the technical field of big data, and in particular to a method, a system and a storage medium for extracting semantic features of academic texts based on a pre-training model.

Background Art

Compared with traditional Internet data, academic resources exhibit more complex characteristics. For extracting textual representation features of academic resources, word-frequency-based statistical models, topic models, or deep-learning word-vector representation methods are usually used. TF-IDF (term frequency-inverse document frequency) is a typical text vector representation method: it extracts text features with statistical techniques, computes term frequencies, weights document terms by inverse document frequency, and constructs a document vector representation from the set of weights of all terms in the document. Wu Zhe et al. proposed the TTF-LDA (Time + TF-IDF + Latent Dirichlet Allocation) algorithm, which builds on TF-IDF and LDA and processes the abstracts of academic literature with topic analysis. Mikolov et al. introduced the Word2Vec model, which uses the continuous bag-of-words model and the Skip-Gram method to obtain a word's hidden-layer vector representation through a word-prediction task.

With the rapid development of artificial intelligence, deep learning has also been applied to feature extraction from academic texts. Autoencoders can effectively learn semantic representations of text data, and Eisa et al. therefore proposed using deep autoencoders to extract lexical feature sets. The network structure of the recurrent neural network (RNN) can process inputs at different time steps, is well suited to extracting features from sequential text data, and is widely used in text-processing tasks. On this basis, to address the vanishing-gradient problem that RNNs suffer over long distances, RNNs were improved into LSTM (long short-term memory) and GRU (gated recurrent unit) models, which retain long-distance semantic information through memory, forgetting and output stages, so that the models achieve better results in feature extraction from long texts. Devlin et al. further proposed the Bert (Bidirectional Encoder Representation from Transformers) pre-training model, which makes maximal use of the multi-head self-attention mechanism to capture contextual semantics and achieves good results on multiple natural-language tasks.

The Bert pre-training model is an encoder built from bidirectional transformers. Compared with traditional models, the Bert model uses a masked language model (Masked LM) during pre-training to capture word-level semantic representations in the text, and uses next sentence prediction to capture the semantic relations between sentences and obtain sentence-level semantic representations. To guarantee the quality of pre-training, the Bert pre-training model probabilistically selects adjacent and non-adjacent sentences as input, ensuring that the model can understand the relations between different sentences. The bidirectional transformer of the Bert pre-training model introduces a self-attention mechanism that learns the relations within a sentence, within the target sentence, and between the source sentence and the target sentence; it employs a multi-head attention mechanism and a full-attention structure.

The pre-trained Bert model can be fine-tuned for specific NLP (natural language processing) tasks, so that a single model can be adapted to multiple different NLP tasks, saving computing resources. Although the Bert pre-training model can be fine-tuned for NLP tasks, the text vectors provided by the current Bert model perform poorly in the scenario of academic resource text feature representation: the generated vector representations "collapse", i.e. the Bert model tends to encode all sentences into a small region of the space, so that most sentence pairs receive high similarity scores, even pairs that are semantically completely unrelated. In addition, vector generation with the current Bert model is too slow, which greatly limits the speed of academic text feature extraction.

Therefore, how to improve the quality of vector generation while accelerating vector generation for text vectors in academic big-data scenarios is an urgent problem to be solved.

Summary of the Invention

Aiming at the problems in the prior art, the present invention proposes a method and a system for extracting semantic features of academic texts based on a pre-training model, so as to improve both the quality and the speed of vector generation.

One aspect of the present invention provides a method for extracting semantic features of academic texts based on a pre-training model, the method comprising the following steps:

acquiring academic resource text data;

inputting the obtained academic resource text data into a pre-training model to obtain multi-dimensional academic text semantic feature vectors, wherein the pre-training model is a student pre-training model obtained by fine-tuning a Bert pre-training model based on a multiple-negatives loss function and then, with the fine-tuned Bert pre-training model as a teacher model, training the student model through knowledge distillation;

and performing dimension-reduction compression on the multi-dimensional academic text semantic feature vectors and outputting the final academic text semantic feature vectors.

In some embodiments of the present invention, web-page academic resource data is crawled by crawler technology to obtain academic resource text data. During the crawling of web-page academic resource text data with a scrapy crawler, for a page to be crawled that has an anti-crawling mechanism, the document ID in the original URL of the page is extracted, a new URL is constructed from the extracted document ID, and the crawler is directed to a detail page without the anti-crawling mechanism, thereby obtaining the complete document information of the page to be crawled.

In some embodiments of the present invention, fine-tuning the Bert pre-training model based on the multiple-negatives loss function and training the student model through knowledge distillation with the fine-tuned Bert pre-training model as the teacher model comprises: fine-tuning the Bert pre-training model based on the multiple-negatives loss function using a natural language inference data set or a semantic textual similarity benchmark data set; and training the student model through knowledge distillation using a wiki data set, with the fine-tuned Bert pre-training model as the teacher model. The input of the Bert pre-training model is sentence pairs with entailment labels from the natural language inference data set.

In some embodiments of the present invention, the multiple-negatives loss function satisfies the following formula:

\[
\mathcal{L}(u, v) = -\frac{1}{K}\sum_{i=1}^{K}\left( S(u_i, v_i) - \log\sum_{j=1}^{K} e^{S(u_i, v_j)} \right);
\]

where u and v respectively denote the sentence-vector sequences [u_1, …, u_i, …, u_K] and [v_1, …, v_i, …, v_K] obtained from the Bert pre-training model, S(u_i, v_j) denotes the dot product between the sentence vectors u_i and v_j, S denotes the pre-training model used when computing the dot products between sentence vectors, and K denotes the number of sentence pairs input to the Bert pre-training model.

In some embodiments of the present invention, the loss function used in training the student model is the MSE loss function, which is expressed as:

\[
\mathcal{L}_{\mathrm{MSE}} = \frac{1}{n}\sum_{i=1}^{n}\left( \hat{y}_i^{T} - \hat{y}_i^{S} \right)^{2};
\]

where \hat{y}_i^{T} denotes the sentence vector generated by the teacher model, \hat{y}_i^{S} denotes the sentence vector generated by the student model, and n denotes the number of sentence vectors.

In some embodiments of the present invention, the method further comprises a step of training the student pre-training model, which comprises: fine-tuning the Bert pre-training model with the multiple-negatives loss function based on a natural language inference data set or a semantic textual similarity benchmark data set to obtain the fine-tuned Bert pre-training model; and, with the fine-tuned Bert pre-training model as the teacher model, training the student model through knowledge distillation to obtain the student pre-training model.

In some embodiments of the present invention, performing dimension-reduction compression on the multi-dimensional academic text semantic feature vectors comprises: performing dimension-reduction compression on the multi-dimensional academic text feature vectors output by the pre-training model with a principal component analysis dimension-reduction algorithm.

In some embodiments of the present invention, the academic resource text data comprises structured academic resource text data and/or unstructured academic resource text data.

In some embodiments of the present invention, when the fine-tuned Bert pre-training model is used as the teacher model to train the student model through knowledge distillation, the teacher model comprises 12 hidden layers, and hidden layers [1, 4, 7, 10] of the teacher model are retained as the hidden layers of the student model.

Another aspect of the present invention provides a system for extracting semantic features of academic texts based on a pre-training model. The system comprises a processor and a memory, the memory stores computer instructions, and the processor is configured to execute the computer instructions stored in the memory; when the computer instructions are executed by the processor, the system implements the steps of the method described above.

Another aspect of the present invention provides a computer storage medium on which a computer program is stored; when the program is executed by a processor, the steps of the method described above are implemented.

The method and system for extracting academic text semantic features based on a pre-training model of the present invention improve the quality of vector generation while accelerating vector generation, and are suitable for text vector generation in academic big-data scenarios.

Additional advantages, objects and features of the present invention will be set forth in part in the description that follows, and in part will become apparent to those of ordinary skill in the art upon study of the following, or may be learned from practice of the invention. The objects and other advantages of the invention may be realized and attained by the structures particularly pointed out in the description and the drawings.

Those skilled in the art will appreciate that the objects and advantages that can be achieved with the present invention are not limited to those specifically described above, and that the above and other objects achievable by the present invention will be understood more clearly from the following detailed description.

Brief Description of the Drawings

The accompanying drawings described herein are provided to give a further understanding of the present invention and constitute a part of the present application; they do not limit the present invention.

FIG. 1 is a schematic diagram of fine-tuning the Bert pre-training model according to an embodiment of the present invention.

FIG. 2 is a schematic diagram of compressing the fine-tuned Bert model through knowledge distillation according to an embodiment of the present invention.

FIG. 3 is a schematic flowchart of a method for extracting semantic features of academic texts based on a pre-training model according to an embodiment of the present invention.

FIG. 4 is a schematic diagram of collecting, preprocessing and storing academic resources according to an embodiment of the present invention.

Detailed Description of the Embodiments

To make the objects, technical solutions and advantages of the present invention clearer, the present invention is further described in detail below with reference to the embodiments and the accompanying drawings. The exemplary embodiments of the present invention and their descriptions are used to explain the present invention, but do not limit it.

It should be noted that, in the drawings and in the description, the same reference numbers are used for similar or identical parts, and the drawings are simplified for convenience of illustration. Implementations not shown or described in the drawings take forms known to those of ordinary skill in the art. In addition, although examples of parameters with specific values may be provided herein, the parameters need not be exactly equal to the corresponding values; rather, they may approximate the corresponding values within acceptable error tolerances or design constraints.

It should also be noted that, in order not to obscure the present invention with unnecessary detail, only the structures and/or processing steps closely related to the solution according to the present invention are shown in the drawings, and other details of little relevance to the present invention are omitted.

It should be emphasized that the term "comprising/including", as used herein, refers to the presence of a feature, element, step or component, but does not exclude the presence or addition of one or more other features, elements, steps or components.

Aiming at the problem that the text vectors provided by the existing Bert pre-training model (Bert model for short) perform poorly in the scenario of academic resource text feature representation, the present invention proposes a new pre-training model based on the Bert pre-training model, and, based on this new pre-training model, proposes a method for extracting semantic features of academic texts. The new pre-training model proposed by the present invention is a deep learning model generated on the basis of the existing Bert model by combining model fine-tuning (Finetune), knowledge distillation and principal component analysis (PCA) dimension-reduction optimization. With this new pre-training model, semantic features can be extracted from academic resource texts accurately and quickly, so that text vectors are generated more precisely and more rapidly. In the embodiments of the present invention, the proposed new pre-training model is called the Bert-FDP pre-training model, where FDP is an acronym of Finetune, Distilling and PCA, representing fine-tuning, knowledge distillation and PCA dimension-reduction optimization.

The Bert-FDP pre-training model proposed in the embodiments of the present invention is described below.

Model fine-tuning means training an already-trained model again in a new task scenario to adjust the model parameters so that the model achieves better results on its original basis. In one embodiment of the present invention, the Bert model is fine-tuned on a natural language inference (NLI) data set to learn the latent semantic entailment relations in the data set. The natural language inference data set used as training data may adopt the training samples of existing natural language inference data sets, which consist of a large number of sentence pairs. FIG. 1 is a schematic diagram of fine-tuning the Bert pre-training model in an embodiment of the present invention. As shown in FIG. 1, the Bert pre-training model is trained on the natural language inference data set with a two-tower (dual-encoder) structure, and the training data consists of multiple sentence pairs with entailment relations. The two groups of inputs of the two-tower structure are sentence sequences: the first group is the sentence sequence consisting of sentence 1 and its subsequent sentences, and the second group is the sentence sequence consisting of sentence 2 and its subsequent sentences. Sentence 1 and sentence 2 form a similar sentence pair; sentence 1 is not similar to the other sentences in the second group, and sentence 2 is not similar to the other sentences in the first group. The input sentence pairs pass through the basic Bert network structure, which is a stack of transformer encoder layers, each consisting of a multi-head attention layer and a feed-forward layer; since the basic Bert network structure of the Bert pre-training model being fine-tuned in the present invention is an existing structure, it is not described in detail here. The Bert pre-training model captures the bidirectional relations between sentences; the sentence vectors output by the basic Bert network structure are reduced in dimension by a pooling layer, yielding the sentence-vector sequences [u_1, u_2, …, u_n] and [v_1, v_2, …, v_n], and the similarity between the sentences of the two sequences is then computed, for example as a dot product (a minimal sketch of this is given below).
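The following Python sketch illustrates the two-tower encoding and dot-product scoring just described; the Hugging Face transformers library, the "bert-base-uncased" checkpoint name and mean pooling are illustrative assumptions and are not prescribed by the embodiment.

```python
# Illustrative sketch only: library, checkpoint name and pooling strategy are assumptions.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

def encode(sentences):
    # Tokenize a batch of sentences and mean-pool the last hidden states into sentence vectors.
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    hidden = encoder(**batch).last_hidden_state            # (B, L, H)
    mask = batch["attention_mask"].unsqueeze(-1).float()   # (B, L, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)    # (B, H)

u = encode(["sentence 1", "a follow-up sentence"])    # first tower
v = encode(["sentence 2", "an unrelated sentence"])   # second tower (weights are shared)
scores = u @ v.T   # S(u_i, v_j): dot products between every sentence pair in the batch
```

Both towers share the same encoder weights, so one forward pass per group of sentences suffices; the resulting score matrix feeds the loss function described below.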

In the embodiments of the present invention, the training data of the multiple-negatives loss function consists of sentence pairs (a_i, b_i), where (a_i, b_i) is a pair of similar sentences and (a_i, b_j) with i != j are dissimilar; the multiple-negatives loss function minimizes the distance between a_i and b_i while maximizing the distance between all sentence pairs (a_i, b_j) with i != j. For the natural language inference data set, the present invention inputs the sentence pairs carrying entailment labels into the pre-training model of the present invention as positive examples. For a batch formed by a set of sentence pairs, the loss function \mathcal{L} takes the following form:

\[
\mathcal{L}(u, v) = -\frac{1}{K}\sum_{i=1}^{K}\left( S(u_i, v_i) - \log\sum_{j=1}^{K} e^{S(u_i, v_j)} \right);
\]

where u and v respectively denote the sentence-vector sequences [u_1, …, u_i, …, u_K] and [v_1, …, v_i, …, v_K], S(u_i, v_j) denotes the dot product between the sentence vectors u_i and v_j, S denotes the pre-training model used when computing the dot products between sentence vectors, and K denotes the number of sentence pairs input to the Bert pre-training model.
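Under the assumption that S is the in-batch dot product and that the i-th pair is the only positive for row i, the loss above can be sketched in PyTorch as follows; this is a sketch, not the embodiment's exact implementation.

```python
import torch
import torch.nn.functional as F

def multiple_negatives_loss(u, v):
    """u, v: (K, H) sentence vectors; (u_i, v_i) are positive pairs and
    (u_i, v_j) with i != j serve as in-batch negatives."""
    scores = u @ v.T                          # S(u_i, v_j), shape (K, K)
    labels = torch.arange(scores.size(0))     # the positives lie on the diagonal
    # cross_entropy averages  log-sum-exp_j S(u_i, v_j) - S(u_i, v_i)  over i,
    # which matches the loss formula given above.
    return F.cross_entropy(scores, labels)
```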

In the embodiments of the present invention, the vectors generated by the Bert model are used to construct the set of dot products S(u_i, v_j), and the multiple-negatives loss function is then used to fine-tune the Bert model. Compared with existing models that construct the combined vector (u, v, |u - v|) and feed it to a Softmax classifier built from a fully connected layer, the embodiments of the present invention fine-tune the Bert model with the multiple-negatives loss function and thereby obtain better sentence representations.

While reducing the distance between the vectors of similar sentences, the multiple-negatives loss function pushes the sentence vectors of negative examples further apart, so that the Bert-FDP model can generate more accurate and reasonable sentence vectors.

For the pre-training model fine-tuned with the multiple-negatives loss function, in order to guarantee the speed of sentence-embedding generation in big-data scenarios, the present invention further compresses the model by knowledge distillation, generating a student network model, and uses the student network model as the pre-training model to generate sentence-embedding vectors of academic texts (such as titles of scientific research results), thereby optimizing the speed of vector generation while preserving the quality of the sentence vectors. Knowledge distillation means using the outputs provided by a teacher network model (teacher model for short) to train a student network model (student model for short) that has fewer layers and runs faster. In the embodiments of the present invention, the teacher network model is the Bert pre-training model fine-tuned with the multiple-negatives loss function. FIG. 2 is a schematic diagram of compressing the fine-tuned Bert model through knowledge distillation according to an embodiment of the present invention. In one embodiment of the present invention, the student model is trained on a wiki data set; the teacher model may, for example, comprise 12 hidden layers, and in selecting the student model, hidden layers [1, 4, 7, 10] of the teacher model are retained as the hidden layers of the student model (see the layer-copy sketch below). On this basis an external sentence corpus is used for sentence-vector generation, and the MSE (mean square error) loss function is used to compare the generated vectors. Retaining hidden layers [1, 4, 7, 10] of the teacher model as the hidden layers of the student model is merely an example; the present invention is not limited to this, and more, fewer or other layers may serve as the hidden layers of the student model.
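One possible way to realize this layer selection is sketched below: encoder layers [1, 4, 7, 10] of the fine-tuned 12-layer teacher are copied into a 4-layer student. The transformers API calls, the placeholder checkpoint path and the 0-based layer indexing are assumptions for illustration; the embodiment does not prescribe a particular implementation.

```python
# Illustrative sketch: copy selected teacher layers into a smaller student model.
import copy
from transformers import BertModel

teacher = BertModel.from_pretrained("path/to/finetuned-teacher")  # placeholder path

student_config = copy.deepcopy(teacher.config)
student_config.num_hidden_layers = 4
student = BertModel(student_config)

student.embeddings = copy.deepcopy(teacher.embeddings)
# Retain teacher hidden layers [1, 4, 7, 10] (0-based indexing assumed) as the student's layers.
for s_idx, t_idx in enumerate([1, 4, 7, 10]):
    student.encoder.layer[s_idx] = copy.deepcopy(teacher.encoder.layer[t_idx])
```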

In an embodiment of the present invention, the MSE loss function is expressed as:

\[
\mathcal{L}_{\mathrm{MSE}} = \frac{1}{n}\sum_{i=1}^{n}\left( \hat{y}_i^{T} - \hat{y}_i^{S} \right)^{2};
\]

where \hat{y}_i^{T} denotes the sentence vector generated by the teacher model, \hat{y}_i^{S} denotes the sentence vector generated by the student model, and n denotes the number of sentence vectors. In the embodiments of the present invention, after the student model is generated through knowledge distillation on the basis of the fine-tuned Bert pre-training model, the student model is used as the pre-training model to extract the semantic features of academic texts, which improves the quality of vector generation while accelerating vector generation.
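A sketch of one distillation step with this MSE objective follows; `encode_with` stands for a pooling-based encoder like the one sketched earlier and is a hypothetical helper name, not a function named by the embodiment.

```python
import torch

def mse_distillation_loss(teacher_vecs, student_vecs):
    # L_MSE = (1/n) * sum_i || y_i^teacher - y_i^student ||^2
    return ((teacher_vecs - student_vecs) ** 2).sum(dim=-1).mean()

# One training step on a batch of external sentences (e.g. from the wiki corpus):
#   with torch.no_grad():
#       t = encode_with(teacher, batch_sentences)   # teacher vectors, no gradients needed
#   s = encode_with(student, batch_sentences)       # student vectors
#   loss = mse_distillation_loss(t, s)
#   loss.backward(); optimizer.step()
```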

After the trained student model is used as the pre-training model to extract the semantic features of academic texts, the embodiments of the present invention may further apply a dimension-reduction algorithm to the vectors output by the student model to reduce the vector size and thereby further reduce the time consumed by similarity comparison. A dimension-reduction algorithm may be selected from existing dimension-reduction algorithms to compress the feature vectors output by the student model. Existing dimension-reduction algorithms include the principal component analysis (PCA), independent component analysis (ICA) and linear discriminant analysis (LDA) dimension-reduction algorithms, among others, but the present invention is not limited to these.

Taking the PCA dimension-reduction algorithm as an example, PCA reduces the dimensionality of high-dimensional data, after data preprocessing (data cleaning), mainly by a linear transformation (decomposition of the feature matrix), realizing the projection of high-dimensional data onto a low-dimensional space. Assume the data output by the student model are n items of d-dimensional data; arranged in order, they form a matrix X with n rows and d columns. Each column of the matrix X is zero-meaned (i.e. the mean of the column is subtracted), the covariance matrix is then computed, and its eigenvalues and corresponding eigenvectors are obtained. The eigenvectors are arranged as rows of a matrix from top to bottom in order of decreasing eigenvalue, and the first k rows are taken to form a matrix P; the data reduced to k dimensions are then obtained as Y = PX (a sketch of these steps is given below). Since the PCA dimension-reduction algorithm in the embodiments of the present invention may adopt an existing PCA dimension-reduction algorithm, it is not described in further detail here.
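The PCA steps described above can be sketched in NumPy roughly as follows; in practice an off-the-shelf PCA implementation may equally be used.

```python
import numpy as np

def pca_reduce(X, k):
    """X: (n, d) matrix of student-model sentence vectors; returns the (n, k) projection."""
    X_centered = X - X.mean(axis=0)            # zero-mean every column
    cov = np.cov(X_centered, rowvar=False)     # (d, d) covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigen-decomposition of the symmetric matrix
    order = np.argsort(eigvals)[::-1]          # sort by descending eigenvalue
    P = eigvecs[:, order[:k]].T                # top-k eigenvectors as rows of P, shape (k, d)
    return X_centered @ P.T                    # Y = P X, data reduced to k dimensions
```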

By effectively combining fine-tuning of the pre-training model, knowledge distillation of the fine-tuned model, and principal component analysis (PCA) compression of the dimensionality of the vectors generated by the student model, the embodiments of the present invention both guarantee the quality of the sentence embeddings generated by the pre-training model and speed up sentence-embedding generation and vector-similarity computation.

The present invention has verified the practical effect of the sentence vectors generated by Bert-FDP on multiple Chinese and English text-matching data sets. In the comparative experiments on English data sets, the SNLI (Stanford Natural Language Inference) and MultiNLI (Multi-Genre Natural Language Inference) corpora were used to construct a data set containing 980,000 sentence pairs as training data for fine-tuning the Bert pre-training model; in the knowledge-distillation process, a wiki data set containing 7.87 million sentences was used to train the student model; finally, the sentence vectors obtained after PCA dimension reduction were used in comparative experiments on public semantic-similarity matching data sets.

The experimental results show that, compared with commonly used text feature extraction models, the Bert-FDP model trained by the present invention achieves higher correlation coefficient values on every data set.

The above use of a natural language inference data set as training data for fine-tuning the Bert pre-training model based on the multiple-negatives loss function is only an example. In another embodiment of the present invention, the Semantic Textual Similarity Benchmark (STS-B) data set may also be used as training data for fine-tuning the Bert pre-training model based on the multiple-negatives loss function. Experiments show that, compared with the Bert pre-training model, tuning on the STS-B data set improves the feature-vector extraction accuracy by 13%, indicating that the Bert-FDP model can generate better text vectors and effectively alleviates the problems that academic text vectors are flattened and poorly discriminated.

The following describes an implementation of extracting semantic features of academic texts based on the pre-training model generated as above in the embodiments of the present invention. FIG. 3 is a schematic flowchart of a method for extracting semantic features of academic texts based on a pre-training model according to an embodiment of the present invention. As shown in FIG. 3, the method comprises the following steps:

Step S110: acquiring academic resource text data.

As an example, for the acquisition of academic resource text data, crawler technology (such as the Scrapy crawler) may be used to crawl and organize online academic data (i.e. web-page academic resource data). Regular expressions, XPath or CSS (Cascading Style Sheets) selectors may be used to formulate corresponding crawling rules for different types of web-page academic resource data, so as to obtain academic information such as papers, scholars, patents and projects from the Internet and various knowledge bases. For example, as shown in FIG. 4, for web-page academic resource data such as paper index pages and patent publication pages, paper details and patent details can be crawled with URL-parsing and URL-splicing crawling rules; for subject-area-tree pages, the contents of the subject-area tree can be crawled recursively. In the acquisition and processing of academic resource data, when the scrapy crawler is used to crawl web-page academic resource data related to scientific research results such as papers and patents, the original web pages often have an anti-crawling mechanism, so the data items of the original article detail page are incomplete. To solve this problem, during crawling the present invention extracts the document ID from the original URL of the page to be crawled and constructs a new URL from it, directing the crawler to a detail page without the anti-crawling mechanism, thereby obtaining the complete document information of the page to be crawled, so that complete scientific-research data can be acquired (a sketch of this URL reconstruction is given below).
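The following is a hedged sketch of the URL reconstruction; the ID pattern and the detail-page template are placeholders, since the real formats are specific to the crawled sites and are not spelled out here.

```python
import re

def to_detail_url(original_url):
    """Extract the document ID from the original URL and rebuild a detail-page URL
    that bypasses the anti-crawling mechanism (patterns below are placeholders)."""
    match = re.search(r"[?&]id=([A-Za-z0-9]+)", original_url)   # assumed ID parameter
    if match is None:
        return None
    doc_id = match.group(1)
    return f"https://example.org/detail/{doc_id}"               # placeholder template

# Inside a scrapy spider one would then request the rebuilt URL, e.g.:
#   yield scrapy.Request(to_detail_url(response.url), callback=self.parse_detail)
```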

Based on the crawled document content, academic resource text data can be obtained, which may include structured text data and unstructured text data. Here, structured text data refers to text data suitable for logical representation in two-dimensional tables, while unstructured text data refers to text data that is inconvenient to represent logically in two-dimensional tables. For example, for academic resource text data, the unstructured text data may include paper abstracts, patent abstracts and other texts; the structured text data obtained by parsing reports may include tabular text data under attributes such as scholar, research institution, subject area, patent, paper and conference. The scrapy crawler technology can preliminarily realize the screening and crawling of academic resources.

During the crawling of academic resource text data, the obtained structured text data and their attributes are saved directly into a MySql database; the structured text data may also include structured entity data. To facilitate text retrieval by the knowledge-service components, the unstructured text data are saved into an Elasticsearch database based on an inverted index. Further, the present invention may also use an information extraction algorithm based on deep semantic representation to extract academic subject-term entities, subject areas and the association relations between academic entities from academic texts, and then, combined with the structured entity data saved in the MySql database and the relations between entities, construct entity-relation-entity triples that are further saved into a Neo4j database (see the sketch after this paragraph).
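A sketch of storing the extracted (entity, relation, entity) triples with the official Neo4j Python driver is given below; the connection details, node labels and property names are illustrative assumptions rather than the embodiment's actual schema.

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))  # placeholder

def save_triple(tx, head, relation, tail):
    # MERGE keeps entities unique and links them with the extracted relation.
    tx.run(
        "MERGE (h:Entity {name: $head}) "
        "MERGE (t:Entity {name: $tail}) "
        "MERGE (h)-[:RELATED {type: $relation}]->(t)",
        head=head, relation=relation, tail=tail,
    )

with driver.session() as session:
    session.execute_write(save_triple, "knowledge distillation", "technique_of", "model compression")
driver.close()
```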

Step S110 of the present invention thus covers the collection and storage of academic resource text data, including a text-data acquisition part and a text-data storage part; structured and unstructured academic resource text data can be obtained and stored in the corresponding databases, realizing the structuring of multi-source, multi-field academic achievements. In this step S110, text information on academic resources such as papers, scholars, patents and projects in the Internet and various knowledge bases can be acquired.

Step S120: inputting the obtained academic resource text data into the pre-training model to obtain multi-dimensional academic text semantic feature vectors.

In this step, the pre-training model is the student pre-training model obtained through the steps described above, i.e. the student pre-training model obtained by fine-tuning the Bert pre-training model based on the multiple-negatives loss function and then, with the fine-tuned Bert pre-training model as the teacher model, training the student model through knowledge distillation.

The academic resource text data obtained in step S110 may be input into the student pre-training model, which outputs multi-dimensional academic text semantic feature vectors.

In the embodiments of the present invention, the academic resource text data input to the student pre-training model may be the acquired structured text data or unstructured text data. For example, when the input data are unstructured text data, they may be text data such as paper abstracts or patent abstracts, or other academic text data; when the input data are structured text data, they may be text data under a specific attribute or entity-content text data.

Step S130: performing dimension-reduction compression on the multi-dimensional academic text semantic feature vectors output by the student pre-training model, and outputting the final academic text semantic features.

In the embodiments of the present invention, a dimension-reduction algorithm may be selected from existing dimension-reduction algorithms to compress the feature vectors. Existing dimension-reduction algorithms include the principal component analysis (PCA), independent component analysis (ICA) and linear discriminant analysis (LDA) dimension-reduction algorithms, among others, but the present invention is not limited to these.

As an example, the multi-dimensional vector data output by the student pre-training model are reduced in dimension by the principal component analysis (PCA) dimension-reduction algorithm to obtain the final academic text semantic feature vectors.

By using the Bert pre-training model, the present invention pre-trains the transformer model on a large-scale corpus before encoding for a specific downstream task, then, starting from the pre-trained parameters, fine-tunes the Bert pre-training model for the specific downstream task based on the multiple-negatives loss function, and uses the fine-tuned Bert pre-training model as the teacher model to train the student model through knowledge distillation. Using the pre-trained student model as the pre-training model for the academic text semantic feature extraction of the present invention achieves better results, improves the quality of vector generation while accelerating it, and is suitable for text vector generation in academic big-data scenarios.

In the embodiments of the present invention, since the Bert pre-training model is trained on a natural language inference data set, the proposed pre-training model fine-tuned on the basis of the Bert pre-training model can contain more semantic information; at the same time, the multiple-negatives loss function reduces the distance between the vectors of similar sentences while pushing the sentence vectors of negative examples further apart, so that the pre-training model of the present invention can generate more accurate and reasonable sentence vectors. In addition, PCA dimension reduction corrects the distribution of the sentence vectors to a certain extent, making the computed similarities more reasonable. Owing to the reduced number of model layers, the model also generates vectors faster than the original model.

The embodiments of the present invention effectively combine pre-training-model fine-tuning, knowledge distillation and PCA dimension reduction, improving the quality of vector generation while accelerating it, and are suitable for text vector generation in academic big-data scenarios. Because the Bert pre-training model is fine-tuned on the natural language inference data set based on the multiple-negatives loss function, the distribution of semantic vectors in academic big-data scenarios is significantly improved. The Bert-FDP pre-training model proposed by the present invention realizes an accurate vector representation of academic resource texts while guaranteeing the speed of sentence-embedding generation in big-data scenarios.

The experimental results show that, compared with commonly used text feature extraction models, the Bert-FDP model trained by the present invention achieves higher correlation coefficient values on every data set; the Bert-FDP model can generate better text vectors, effectively alleviating the problems that academic text vectors are flattened and poorly discriminated.

Corresponding to the above method, the present invention also provides a system for extracting semantic features of academic texts based on a pre-training model. The system comprises a computer device that includes a processor and a memory; the memory stores computer instructions, the processor is configured to execute the computer instructions stored in the memory, and when the computer instructions are executed by the processor the system implements the steps of the method described above.

The embodiments of the present invention also provide a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the steps of the method described above are implemented. The computer-readable storage medium may be a tangible storage medium such as a random access memory (RAM), a memory, a read-only memory (ROM), an electrically programmable ROM, an electrically erasable programmable ROM, a register, a floppy disk, a hard disk, a removable storage disk, a CD-ROM, or any other form of storage medium known in the art.

Those of ordinary skill in the art should understand that the exemplary components, systems and methods described in connection with the embodiments disclosed herein can be implemented in hardware, software or a combination of the two. Whether they are implemented in hardware or software depends on the specific application and the design constraints of the technical solution. Skilled artisans may use different methods to implement the described functions for each particular application, but such implementations should not be considered to go beyond the scope of the present invention. When implemented in hardware, the implementation may be, for example, an electronic circuit, an application-specific integrated circuit (ASIC), appropriate firmware, a plug-in, a function card, and so on. When implemented in software, the elements of the present invention are the programs or code segments used to perform the required tasks. The programs or code segments may be stored in a machine-readable medium, or transmitted over a transmission medium or communication link by a data signal carried in a carrier wave.

It should be clear that the present invention is not limited to the specific configurations and processes described above and shown in the figures. For brevity, detailed descriptions of known methods are omitted here. In the above embodiments, several specific steps are described and shown as examples, but the method of the present invention is not limited to the specific steps described and shown; those skilled in the art may, after grasping the spirit of the present invention, make various changes, modifications and additions, or change the order of the steps.

In the present invention, features described and/or illustrated for one embodiment may be used in the same or a similar way in one or more other embodiments, and/or combined with or substituted for features of other embodiments.

The above are only preferred embodiments of the present invention and are not intended to limit the present invention; for those skilled in the art, the embodiments of the present invention may have various modifications and variations. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present invention shall be included within the protection scope of the present invention.

Claims (10)

1. A semantic feature extraction method for academic texts based on a pre-training model is characterized by comprising the following steps:
acquiring academic resource text data;
inputting the acquired academic resource text data into a pre-training model to obtain a multi-dimensional academic text semantic feature vector, wherein the pre-training model is a student model obtained by fine-tuning a Bert pre-training model based on a multiple negative sample loss function and then training the student model by knowledge distillation with the fine-tuned Bert pre-training model as the teacher model;
and performing dimensionality-reduction compression on the multi-dimensional academic text semantic feature vector, and outputting the final academic text semantic feature vector.
2. The method of claim 1, wherein acquiring academic resource text data comprises: crawling web academic resource data with a web crawler to obtain the academic resource text data;
in the process of crawling the web academic resource text data, for a target page protected by an anti-crawling mechanism, extracting the document ID from the original URL of the target page, constructing a new URL from the extracted document ID, and directing the crawler to a detail page that is not protected by the anti-crawling mechanism, thereby obtaining the complete document information of the target page.
3. The method of claim 1, wherein fine-tuning the Bert pre-training model based on the multiple negative sample loss function and training the student model by knowledge distillation with the fine-tuned Bert pre-training model as the teacher model comprises: fine-tuning the Bert pre-training model on a natural language inference data set or a semantic textual similarity benchmark data set based on the multiple negative sample loss function, and, with the fine-tuned Bert pre-training model as the teacher model, training the student model by knowledge distillation on a Wikipedia data set;
wherein the input of the Bert pre-training model is a sentence pair with a relation label from the natural language inference data set.
4. The method of claim 3, wherein the multiple negative sample loss function satisfies the following equation:

$$\mathcal{L} = -\frac{1}{K}\sum_{i=1}^{K}\left[S(u_i, v_i) - \log\sum_{j=1}^{K}\exp\big(S(u_i, v_j)\big)\right]$$

wherein u and v respectively denote the sentence vector sequences [u_1, …, u_i, …, u_K] and [v_1, …, v_i, …, v_K] obtained from the Bert pre-training model, S(u_i, v_j) denotes the dot product between sentence vectors u_i and v_j computed with the pre-training model, and K denotes the number of sentence pairs input to the Bert pre-training model.
5. The method of claim 1, wherein the loss function used in the student model training process is an MSE loss function, expressed as:

$$\mathcal{L}_{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left\|t_i - s_i\right\|^2$$

wherein t_i denotes a sentence vector generated by the teacher model, s_i denotes the corresponding sentence vector generated by the student model, and N denotes the number of sentence vectors.
6. The method of any one of claims 1-5, wherein performing dimensionality-reduction compression on the multi-dimensional academic text semantic feature vector comprises: performing dimensionality-reduction compression on the multi-dimensional academic text feature vector output by the pre-training model by using a principal component analysis dimensionality reduction algorithm.
7. The method of any of claims 1-5, wherein the academic resource text data comprises structured academic resource text data and/or unstructured academic resource text data.
8. The method of claim 1, wherein, when the student model is trained by knowledge distillation with the fine-tuned Bert pre-training model as the teacher model, the teacher model comprises 12 hidden layers, and hidden layers [1, 4, 7, 10] of the teacher model are retained as the hidden layers of the student model.
9. An academic text semantic feature extraction system based on a pre-training model, the system comprising a processor and a memory, wherein the memory stores computer instructions, the processor is configured to execute the computer instructions stored in the memory, and when the computer instructions are executed by the processor, the system implements the steps of the method according to any one of claims 1 to 8.
10. A computer storage medium, on which a computer program is stored, which program, when being executed by a processor, is adapted to carry out the steps of the method according to any one of claims 1 to 8.
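The following sketches are illustrative only and are not part of the claims; they show, under stated assumptions, how the techniques recited in claims 1 to 8 might be realized in Python. First, the URL-rewriting strategy of claim 2: extracting the document ID from the original URL of a page protected by an anti-crawling mechanism and rebuilding a detail-page URL. The site name, URL pattern, and query key below are purely hypothetical placeholders; the claims do not specify a target site or URL format.

```python
import re
from typing import Optional

# Hypothetical detail-page pattern assumed not to be protected by the
# anti-crawling mechanism; the real pattern depends on the target site.
DETAIL_URL_TEMPLATE = "https://example.org/document/detail?id={doc_id}"

def rebuild_detail_url(original_url: str) -> Optional[str]:
    """Extract the document ID from the original URL and build a new detail-page URL."""
    match = re.search(r"[?&]docid=([A-Za-z0-9]+)", original_url)  # assumed query key
    if match is None:
        return None
    return DETAIL_URL_TEMPLATE.format(doc_id=match.group(1))

# The crawler is then pointed at the rebuilt URL instead of the protected page,
# so the complete document information can be fetched.
print(rebuild_detail_url("https://example.org/search/item?docid=AB12345&page=3"))
# -> https://example.org/document/detail?id=AB12345
```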
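Next, a minimal PyTorch sketch of the multiple negative sample loss of claims 3 and 4, assuming the in-batch-negatives reading of the formula (every non-matching sentence in the batch acts as a negative) and dot-product scores between the two sentence-vector sequences u and v. The function name is an assumption and does not come from the patent.

```python
import torch
import torch.nn.functional as F

def multiple_negative_sample_loss(u: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """u, v: [K, d] sentence vectors from the Bert pre-training model for the two
    sides of K sentence pairs; (u_i, v_i) is a positive pair and every other v_j
    in the batch serves as a negative."""
    scores = u @ v.T                                    # S(u_i, v_j): K x K dot-product matrix
    labels = torch.arange(u.size(0), device=u.device)   # positives sit on the diagonal
    # cross_entropy reproduces -1/K * sum_i [ S(u_i, v_i) - log sum_j exp(S(u_i, v_j)) ]
    return F.cross_entropy(scores, labels)
```

When the score is the dot product, this coincides with the multiple-negatives ranking loss commonly used for sentence-embedding fine-tuning, so an off-the-shelf implementation of that loss could likely be substituted.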
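Claims 5 and 8 describe distilling the fine-tuned 12-layer teacher into a student that keeps only hidden layers [1, 4, 7, 10] and learns to reproduce the teacher's sentence vectors under an MSE loss. The sketch below assumes a Hugging Face BertModel with mean pooling over token states; the layer-copying helper, the 0-versus-1-based layer indexing, and the training step are illustrative assumptions rather than the patented implementation.

```python
import copy
import torch
import torch.nn as nn
from transformers import BertModel

def build_student(teacher: BertModel, keep_layers=(1, 4, 7, 10)) -> BertModel:
    """Assumed helper: copy the teacher and keep only the listed encoder layers."""
    student = copy.deepcopy(teacher)
    student.encoder.layer = nn.ModuleList(
        [copy.deepcopy(teacher.encoder.layer[i]) for i in keep_layers]
    )
    student.config.num_hidden_layers = len(student.encoder.layer)
    return student

def mean_pool(hidden_states, attention_mask):
    mask = attention_mask.unsqueeze(-1).float()
    return (hidden_states * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)

def distillation_step(teacher, student, batch, optimizer):
    """One MSE distillation step: student sentence vectors regress onto the teacher's."""
    with torch.no_grad():
        t = mean_pool(teacher(**batch).last_hidden_state, batch["attention_mask"])
    s = mean_pool(student(**batch).last_hidden_state, batch["attention_mask"])
    loss = ((t - s) ** 2).sum(dim=1).mean()   # (1/N) * sum_i ||t_i - s_i||^2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```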
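Finally, the end-to-end extraction pipeline of claims 1 and 6: encode academic texts with the distilled student model, then compress the multi-dimensional vectors with a principal component analysis reduction. The use of scikit-learn's PCA, the encode() interface on the student model, and the 128-dimensional target size are assumptions for illustration; the claims do not fix a library or an output dimensionality.

```python
import numpy as np
from sklearn.decomposition import PCA

def extract_semantic_features(texts, student_encoder, target_dim=128):
    """Sketch of claims 1 and 6: encode academic texts, then PCA-compress.

    `student_encoder` stands for the distilled student pre-training model and is
    assumed to expose an encode(list_of_texts) -> array method, e.g. a
    sentence-embedding wrapper around the student BERT with mean pooling."""
    high_dim = np.asarray(student_encoder.encode(texts))  # multi-dimensional semantic vectors
    pca = PCA(n_components=target_dim)                    # principal component analysis reduction
    return pca.fit_transform(high_dim)                    # final academic text semantic feature vectors
```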
CN202210778073.8A 2022-07-04 2022-07-04 Academic text semantic feature extraction method and system based on pre-training model and storage medium Active CN114841173B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210778073.8A CN114841173B (en) 2022-07-04 2022-07-04 Academic text semantic feature extraction method and system based on pre-training model and storage medium

Publications (2)

Publication Number Publication Date
CN114841173A true CN114841173A (en) 2022-08-02
CN114841173B CN114841173B (en) 2022-11-18

Family

ID=82573934

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210778073.8A Active CN114841173B (en) 2022-07-04 2022-07-04 Academic text semantic feature extraction method and system based on pre-training model and storage medium

Country Status (1)

Country Link
CN (1) CN114841173B (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110852426A (en) * 2019-11-19 2020-02-28 成都晓多科技有限公司 Pre-training model integration acceleration method and device based on knowledge distillation
CN111062489A (en) * 2019-12-11 2020-04-24 北京知道智慧信息技术有限公司 Knowledge distillation-based multi-language model compression method and device
US20210182662A1 (en) * 2019-12-17 2021-06-17 Adobe Inc. Training of neural network based natural language processing models using dense knowledge distillation

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
LU, W. H. et al.: "TwinBERT: Distilling Knowledge to Twin-Structured Compressed BERT Models for Large-Scale Retrieval", Proceedings of the 29th ACM International Conference on Information & Knowledge Management (CIKM '20), 2020 *
YUE, Zengying et al.: "A Survey of Pre-training Techniques Based on Language Models" (基于语言模型的预训练技术研究综述), Journal of Chinese Information Processing (中文信息学报) *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116187163A (en) * 2022-12-20 2023-05-30 北京知呱呱科技服务有限公司 Construction method and system of pre-training model for patent document processing
CN116187163B (en) * 2022-12-20 2024-02-20 北京知呱呱科技有限公司 Construction method and system of pre-training model for patent document processing
CN117116408A (en) * 2023-10-25 2023-11-24 湖南科技大学 A relationship extraction method for electronic medical record parsing
CN117116408B (en) * 2023-10-25 2024-01-26 湖南科技大学 A relationship extraction method for electronic medical record parsing

Also Published As

Publication number Publication date
CN114841173B (en) 2022-11-18

Similar Documents

Publication Publication Date Title
Zheng et al. Knowledge base graph embedding module design for Visual question answering model
Liu et al. A survey of sentiment analysis based on transfer learning
CN108717408B (en) A sensitive word real-time monitoring method, electronic equipment, storage medium and system
Karpathy et al. Deep visual-semantic alignments for generating image descriptions
CN113392209B (en) Text clustering method based on artificial intelligence, related equipment and storage medium
Gao et al. Convolutional neural network based sentiment analysis using Adaboost combination
Zhang et al. A review on entity relation extraction
Rahimi et al. An overview on extractive text summarization
CN104834747A (en) Short text classification method based on convolution neutral network
CN115329101A (en) A method and device for constructing a standard knowledge graph of the power Internet of things
CN113435190B (en) A text relation extraction method that combines multi-level information extraction and noise reduction
CN114841173B (en) Academic text semantic feature extraction method and system based on pre-training model and storage medium
Panchenko Best of both worlds: Making word sense embeddings interpretable
CN114897167A (en) Method and device for constructing knowledge graph in biological field
CN117708336B (en) A multi-strategy sentiment analysis method based on topic enhancement and knowledge distillation
Tao et al. News text classification based on an improved convolutional neural network
CN119294506A (en) A method for managing multimodal knowledge graphs of intelligent media assets based on large models
Imad et al. Automated Arabic News Classification using the Convolutional Neural Network.
CN112100382B (en) Clustering method and device, computer readable storage medium and processor
CN114510568A (en) Author name disambiguation method and author name disambiguation device
Mutinda Sentiment Lexicon-Augmented Text Representation Model for Social Media Text Sentiment Analysis
Zhu et al. A sample extension method based on Wikipedia and its application in text classification
Altaf et al. Efficient natural language classification algorithm for detecting duplicate unsupervised features
Qi et al. Bie—Modernism with Cultural Calculations in Multiple Dimensions
El Bazzi et al. Toward a Complex System for Context Discovery to Index Arabic Documents.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant