WO2023092985A1 - Automatic entity knowledge extraction method, computer device, and computer-readable medium - Google Patents

Automatic entity knowledge extraction method, computer device, and computer-readable medium

Info

Publication number
WO2023092985A1
WO2023092985A1 PCT/CN2022/097154 CN2022097154W WO2023092985A1 WO 2023092985 A1 WO2023092985 A1 WO 2023092985A1 CN 2022097154 W CN2022097154 W CN 2022097154W WO 2023092985 A1 WO2023092985 A1 WO 2023092985A1
Authority
WO
WIPO (PCT)
Prior art keywords
layer
output
entity
representation vector
input
Prior art date
Application number
PCT/CN2022/097154
Other languages
English (en)
French (fr)
Inventor
夏振涛
谈辉
李艳
朱立烨
石雁
Original Assignee
永中软件股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 永中软件股份有限公司 filed Critical 永中软件股份有限公司
Publication of WO2023092985A1 publication Critical patent/WO2023092985A1/zh

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the invention relates to the field of text processing, in particular to a method for automatically extracting entity knowledge, a computer device, and a computer-readable medium.
  • information extraction is a text processing technology that extracts meaningful entity, attribute, relationship, event and other factual structured information from original unstructured natural language text.
  • in official document writing, entity knowledge plays an important role and can assist the writing process, for example in content review.
  • the current automatic entity knowledge extraction methods still have the disadvantages of low accuracy and difficulty in optimization. Therefore, it is necessary to propose an improved automatic entity knowledge extraction method.
  • the object of the present invention is to provide a method for automatically extracting entity knowledge, a computer device, and a computer-readable medium, which can improve the feature extraction capability of the BERT model for entity knowledge.
  • the present invention provides a method for automatic extraction of entity knowledge, which includes: inputting the input text H_0 into the first K layers of a BERT model composed of N layers for processing, so as to output the context representation vector H_K at the K-th layer; inputting the context representation vector H_K output by the K-th layer into the remaining N-K layers of the BERT model a first time, for the first task, so as to output the context representation vector H_N^{N1} at the N-th layer, where each of the remaining N-K layers processes its input based on the first mask matrix, and performing first-layer entity recognition based on H_N^{N1} to obtain the first-layer entities in the input text H_0; and inputting the context representation vector H_K output by the K-th layer into the remaining N-K layers of the BERT model a second time, for the second task, so as to output the context representation vector H_N^{N2} at the N-th layer, where each of the remaining N-K layers processes its input based on the second mask matrix, and performing second-layer entity recognition based on H_N^{N2} to obtain the second-layer entities in the input text H_0, the elements of the second mask matrix being 1 at positions belonging to first-layer entities and 0 elsewhere.
  • the elements of the first mask matrix are 1 within the sentence length n and 0 beyond the sentence length; the N layers of the BERT model are connected in series in sequence; N is greater than K, K is greater than or equal to 2, and N and K are positive integers.
  • Each layer of the first K layers processes the input based on the global mask matrix.
  • the elements of the global mask matrix are 1 within the sentence length and 0 beyond the sentence length.
  • the present invention provides a computing device, which includes a processor and a memory, wherein program instructions are stored in the memory, and the program instructions are executed by the processor to implement the above-mentioned automatic entity knowledge extraction method.
  • the automatic entity knowledge extraction method includes: inputting the input text H_0 into the first K layers of the BERT model composed of N layers for processing, so as to output the context representation vector H_K at the K-th layer; inputting the context representation vector H_K output by the K-th layer into the remaining N-K layers of the BERT model a first time, for the first task, so as to output the context representation vector H_N^{N1} at the N-th layer, where each of the remaining N-K layers processes its input based on the first mask matrix, and performing first-layer entity recognition based on H_N^{N1} to obtain the first-layer entities in the input text H_0; and inputting H_K into the remaining N-K layers of the BERT model a second time, for the second task, so as to output the context representation vector H_N^{N2} at the N-th layer, where each of the remaining N-K layers processes its input based on the second mask matrix, and performing second-layer entity recognition based on H_N^{N2} to obtain the second-layer entities in the input text H_0.
  • the present invention provides a computer-readable medium having program instructions stored therein, the program instructions being executed to implement: inputting an input text H_0 into the first K layers of a BERT model consisting of N layers for processing, so as to output the context representation vector H_K at the K-th layer; inputting the context representation vector H_K output by the K-th layer into the remaining N-K layers of the BERT model a first time, for the first task, so as to output the context representation vector H_N^{N1} at the N-th layer, where each of the remaining N-K layers processes its input based on the first mask matrix, and performing first-layer entity recognition based on H_N^{N1} to obtain the first-layer entities in the input text H_0; and inputting H_K into the remaining N-K layers of the BERT model a second time, for the second task, so as to output the context representation vector H_N^{N2} at the N-th layer, where each of the remaining N-K layers processes its input based on the second mask matrix, and performing second-layer entity recognition based on H_N^{N2} to obtain the second-layer entities in the input text H_0.
  • the present invention can perform two-layer entity recognition, thereby improving the feature extraction ability of the model for entity knowledge.
  • Fig. 1 is a schematic flow chart of the method for automatically extracting entity knowledge of the present invention
  • FIG. 2 is a schematic diagram of the principle of the automatic entity knowledge extraction method of the present invention.
  • according to an analysis of the dataset, the entity categories defined herein are interrelated within sentence expressions; for example, consider the sentence "The Fifth Plenary Session of the Nineteenth Central Committee of the Communist Party of China was held in Beijing, and ** delivered an important speech."
  • the entities that can be extracted are "Fifth Plenary Session of the 19th Central Committee" and "Beijing", where the category of the entity "Fifth Plenary Session of the 19th Central Committee" is "events and activities" and the category of the entity "Beijing" is "regional place".
  • the event activity here is closely related to regional places and persons, and regional places and persons play an auxiliary role in identifying event activities; therefore, the entity categories can be divided into two layers.
  • the first-layer entities are "person", "regional place", "time legislation", "organization", "laws and regulations" and "position", and the second-layer entities are "events and activities" and "ideological theory".
  • the invention provides an improved automatic extraction method of entity knowledge, which can perform two-layer entity recognition, thereby improving the feature extraction ability of the BERT (Bidirectional Encoder Representation from Transformers) model for entity knowledge.
  • FIG. 1 is a schematic flowchart of a method 100 for automatically extracting entity knowledge in the present invention.
  • FIG. 2 is a schematic diagram of the principle of the automatic entity knowledge extraction method of the present invention.
  • the entity knowledge automatic extraction method 100 includes the following steps.
  • Step 110 input the input text H 0 to the first K layers of the BERT model composed of N layers for processing, so as to output the context representation vector H K at the Kth layer.
  • the BERT model 210 may also be called a BERT pre-trained language model.
  • the N layers of the BERT model are sequentially connected in series, N is greater than K, K is greater than or equal to 2, and N and K are positive integers.
  • the input text may be an ordinary piece of natural language text.
  • each of the first K layers processes its input based on the global mask matrix MASK_all.
  • the elements of the global mask matrix MASK_all are 1 within the sentence length and 0 beyond the sentence length.
  • attention is used to capture context information, and the context representation vector H_m output by the m-th layer is calculated from the context representation vector H_{m-1} output by the (m-1)-th layer:
  • H′_m = LN(H_{m-1} + MultiHead_h(H_{m-1}, MASK_all))
  • H_m = LN(H′_m + FFN(H′_m))
  • MASK_all[i, j] = 1 if i ≤ n and j ≤ n, and 0 otherwise
  • MASK_all is the global mask matrix
  • i, j are the positions of elements in the global mask matrix
  • n is the length of the sentence
  • m is greater than or equal to 1 and less than or equal to K.
  • Step 120: the context representation vector H_K output by the K-th layer is input into the remaining N-K layers of the BERT model 210 a first time, for the first task, so as to output the context representation vector H_N^{N1} at the N-th layer.
  • each of the remaining N-K layers processes its input based on the first mask matrix MASK_N1, and first-layer entity recognition 220 is performed based on the context representation vector H_N^{N1} output by the N-th layer to obtain the first-layer entities in the input text H_0.
  • the elements of the first mask matrix MASK_N1 are 1 within the sentence length n and 0 beyond the sentence length.
  • Step 130: the context representation vector H_K output by the K-th layer is input into the remaining N-K layers of the BERT model a second time, for the second task, so as to output the context representation vector H_N^{N2} at the N-th layer.
  • each of the remaining N-K layers processes its input based on the second mask matrix MASK_N2, and second-layer entity recognition 230 is performed based on the context representation vector H_N^{N2} output by the N-th layer to obtain the second-layer entities in the input text H_0, the elements of MASK_N2 being 1 at positions belonging to first-layer entities and 0 elsewhere.
  • the context representation vector H_K is used as a shared feature of the joint model and is input into the remaining N-K layers.
  • different mask matrices MASK are then set in the multi-head self-attention layer to obtain different context representation vectors for the two downstream tasks of first-layer entity recognition and second-layer entity recognition (a high-level Python sketch of this two-pass flow is given at the end of this list).
  • task denotes the first task or the second task
  • the first task is first-layer entity recognition, denoted N1
  • the second task is second-layer entity recognition, denoted N2
  • MASK_N1 is the first mask matrix
  • MASK_N2 is the second mask matrix
  • P_entities denotes the positions of the first-layer entities already recognized in the input text
  • the multi-head self-attention formula MultiHead_h is: MultiHead_h(X, MASK) = [head_1; ……; head_h]W^M
  • the formula sets different MASKs according to different tasks.
  • in the first K layers, the contextual representation vector H_K is used as a shared feature for joint learning, and every word in the sentence is effective for feature expression; therefore, the matrix MASK_all does not need to mask out any information.
  • in the remaining N-K layers, different matrices MASK_task need to be set for the two different downstream tasks of first-layer entity recognition and second-layer entity recognition; this matrix filters out information that the downstream task does not need, so as to enhance the feature expression capability of the structured information in the BERT model for the two downstream tasks.
  • for the first-layer entity recognition sub-model, the present invention uses the "BIO" notation to assign sequence labels to entities; to improve accuracy, the correct attention weights should be optimized through parameters rather than by restricting the attention range of each character.
  • therefore, each word in a sentence can compute attention with any other word, and the matrix MASK_N1 only needs to mask information for positions beyond the sentence length, with the remaining positions set to "1".
  • for the second-layer entity recognition sub-model, the first-layer entity label information can help second-layer entity recognition; therefore, the matrix MASK_N2 restricts attention to all first-layer entity positions, and other positions are filtered with "0".
  • depending on the task, the context representation vector output by the N-th layer of the BERT model is H_N^{N1} for the first task and H_N^{N2} for the second task.
  • This layered fine-tuning structure can improve the feature extraction ability of the BERT pre-trained language model for knowledge, and obtain contextual representation vectors for different downstream tasks.
  • the fine-tuned BERT pre-trained model is easier to optimize due to the use of structured features.
  • the fine-tuning structure does not require major adjustments to the original BERT model, so the linguistic knowledge contained in the pre-trained language model can be directly utilized.
  • the present invention uses the standard BIO (begin, inside, outside) notation method to label each word in the sentence with a named entity label, and the label B represents the position of the beginning word in the entity , the label I represents the position of the non-initial word in the entity, and the label O represents the position of the non-entity word in the sentence.
  • the CRF (Conditional Random Fields) layer first calculates the emission probabilities H_ner by linearly transforming the context representation vector H_N output by the BERT model, then scores and ranks the tag sequences according to the transition probabilities, and finally uses the softmax function to obtain the probability distribution of the tags, thereby performing first-layer entity recognition and second-layer entity recognition.
  • H_N is the context representation vector output by the BERT model
  • H_ner is the emission probability matrix of the CRF layer, of size n × k, where n is the sentence length and k is the number of entity type labels
  • Score(X, y) is the score of a label sequence
  • A is the transition probability matrix, whose element A_{y_i, y_{i+1}} represents the transition probability from label y_i to label y_{i+1}
  • Y_X is the set of all possible label sequences.
  • the BERT model and the CRF layer need to use training samples for prior training. Specifically, first use the BIO marking method to mark the training samples, and then use the marked training samples to train the BERT model and the CRF layer.
  • Each training sample can be a piece of labeled text.
  • in the training stage, the goal is to minimize the loss function L_ner.
  • in the entity recognition stage, the label sequence is predicted by maximizing the score function, i.e. y* = argmax_{y ∈ Y_X} Score(X, y).
  • the first-layer and second-layer entity recognition are trained jointly with L = αL_N1 + (1-α)L_N2.
  • the present invention provides a computer-readable medium, in which program instructions are stored, and the program instructions are executed by a processor to implement the above-mentioned automatic entity knowledge extraction method 100 .
  • the present invention provides a computing device, which includes a processor and a memory, wherein program instructions are stored in the memory, and the program instructions are executed by the processor to implement the above-mentioned automatic entity knowledge extraction method 100 .
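As a reading aid, the two-pass flow summarized in these excerpts can be sketched at a high level in Python; every function name and the calling convention below are illustrative placeholders, not identifiers from the patent:

```python
# High-level, hypothetical sketch of the two-pass extraction flow (steps 110-130).
# All names are placeholders chosen for illustration; they do not come from the patent.
def extract_entities(text_h0, bert_first_k, bert_remaining, crf_layer1, crf_layer2,
                     mask_all, mask_n1, build_mask_n2):
    h_k = bert_first_k(text_h0, mask_all)          # step 110: shared feature H_K
    h_n_n1 = bert_remaining(h_k, mask_n1)          # step 120: first pass -> H_N^N1
    layer1_entities = crf_layer1(h_n_n1)           # first-layer entity recognition
    mask_n2 = build_mask_n2(layer1_entities)       # restrict attention to P_entities
    h_n_n2 = bert_remaining(h_k, mask_n2)          # step 130: second pass -> H_N^N2
    layer2_entities = crf_layer2(h_n_n2)           # second-layer entity recognition
    return layer1_entities, layer2_entities
```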

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The present invention provides an automatic entity knowledge extraction method, a computer device, and a computer-readable medium. The automatic entity knowledge extraction method includes: inputting an input text H_0 into the first K layers of a BERT model composed of N layers for processing, so as to output a context representation vector H_K at the K-th layer; inputting H_K into the remaining N-K layers of the BERT model a first time, for a first task, so as to output a context representation vector H_N^{N1} at the N-th layer, each of the remaining N-K layers processing its input based on a first mask matrix, and performing first-layer entity recognition based on H_N^{N1} to obtain first-layer entities; and inputting H_K into the remaining N-K layers of the BERT model a second time, for a second task, so as to output a context representation vector H_N^{N2} at the N-th layer, each of the remaining N-K layers processing its input based on a second mask matrix, and performing second-layer entity recognition based on H_N^{N2} to obtain second-layer entities. In this way, the feature extraction capability of the BERT model for entity knowledge can be improved.

Description

Automatic entity knowledge extraction method, computer device, and computer-readable medium
Technical Field
The present invention relates to the field of text processing, and in particular to an automatic entity knowledge extraction method, a computer device, and a computer-readable medium.
Background Art
As an important task in natural language processing, information extraction is a text processing technology that extracts meaningful factual structured information such as entities, attributes, relations and events from raw unstructured natural language text. In official document writing, entity knowledge plays an important role and can assist the writing process, for example in content review. Current automatic entity knowledge extraction methods still suffer from drawbacks such as low accuracy and difficulty of optimization. It is therefore necessary to propose an improved automatic entity knowledge extraction method.
Summary of the Invention
The object of the present invention is to provide an automatic entity knowledge extraction method, a computer device, and a computer-readable medium that can improve the feature extraction capability of a BERT model for entity knowledge.
According to one aspect of the present invention, there is provided an automatic entity knowledge extraction method, which includes: inputting an input text H_0 into the first K layers of a BERT model composed of N layers for processing, so as to output a context representation vector H_K at the K-th layer; inputting the context representation vector H_K output by the K-th layer into the remaining N-K layers of the BERT model a first time, for a first task, so as to output a context representation vector H_N^{N1} at the N-th layer, each of the remaining N-K layers processing its input based on a first mask matrix, and performing first-layer entity recognition based on the context representation vector H_N^{N1} output by the N-th layer to obtain first-layer entities in the input text H_0; and inputting the context representation vector H_K output by the K-th layer into the remaining N-K layers of the BERT model a second time, for a second task, so as to output a context representation vector H_N^{N2} at the N-th layer, each of the remaining N-K layers processing its input based on a second mask matrix, and performing second-layer entity recognition based on the context representation vector H_N^{N2} output by the N-th layer to obtain second-layer entities in the input text H_0, wherein the elements of the second mask matrix are 1 at positions belonging to first-layer entities and 0 elsewhere.
Further, the elements of the first mask matrix are 1 within the sentence length n and 0 beyond the sentence length; the N layers of the BERT model are connected in series in sequence; N is greater than K, K is greater than or equal to 2, and N and K are positive integers; each of the first K layers processes its input based on a global mask matrix, whose elements are 1 within the sentence length and 0 beyond the sentence length.
According to another aspect of the present invention, there is provided a computing device, which includes a processor and a memory, wherein program instructions are stored in the memory and are executed by the processor to implement the above automatic entity knowledge extraction method. The automatic entity knowledge extraction method includes: inputting an input text H_0 into the first K layers of a BERT model composed of N layers for processing, so as to output a context representation vector H_K at the K-th layer; inputting the context representation vector H_K output by the K-th layer into the remaining N-K layers of the BERT model a first time, for a first task, so as to output a context representation vector H_N^{N1} at the N-th layer, each of the remaining N-K layers processing its input based on a first mask matrix, and performing first-layer entity recognition based on H_N^{N1} to obtain first-layer entities in the input text H_0; and inputting the context representation vector H_K output by the K-th layer into the remaining N-K layers of the BERT model a second time, for a second task, so as to output a context representation vector H_N^{N2} at the N-th layer, each of the remaining N-K layers processing its input based on a second mask matrix, and performing second-layer entity recognition based on H_N^{N2} to obtain second-layer entities in the input text H_0, wherein the elements of the second mask matrix are 1 at positions belonging to first-layer entity positions and 0 elsewhere.
According to another aspect of the present invention, there is provided a computer-readable medium having program instructions stored therein, the program instructions being executed to implement: inputting an input text H_0 into the first K layers of a BERT model composed of N layers for processing, so as to output a context representation vector H_K at the K-th layer; inputting the context representation vector H_K output by the K-th layer into the remaining N-K layers of the BERT model a first time, for a first task, so as to output a context representation vector H_N^{N1} at the N-th layer, each of the remaining N-K layers processing its input based on a first mask matrix, and performing first-layer entity recognition based on H_N^{N1} to obtain first-layer entities in the input text H_0; and inputting the context representation vector H_K output by the K-th layer into the remaining N-K layers of the BERT model a second time, for a second task, so as to output a context representation vector H_N^{N2} at the N-th layer, each of the remaining N-K layers processing its input based on a second mask matrix, and performing second-layer entity recognition based on H_N^{N2} to obtain second-layer entities in the input text H_0, wherein the elements of the second mask matrix are 1 at positions belonging to first-layer entity positions and 0 elsewhere.
Compared with the prior art, the present invention can perform two-layer entity recognition, thereby improving the feature extraction capability of the model for entity knowledge.
Brief Description of the Drawings
Fig. 1 is a schematic flowchart of the automatic entity knowledge extraction method of the present invention;
Fig. 2 is a schematic diagram of the principle of the automatic entity knowledge extraction method of the present invention.
Detailed Description of the Embodiments
To further explain the technical means adopted by the present invention to achieve the intended objects and their effects, specific embodiments, structures, features and effects according to the present invention are described in detail below with reference to the accompanying drawings and preferred embodiments.
The extraction of entity knowledge from official documents is taken here as an example. First, domain word mining is performed on official document text using statistics-based and rule-based methods, and the following entity categories are summarized and defined:
[table image omitted: the defined entity categories]
According to an analysis of the dataset, the entity categories defined herein are interrelated within sentence expressions. For example, from the sentence "中共十九届五中全会在北京举行,**发表重要讲话。" ("The Fifth Plenary Session of the 19th Central Committee of the Communist Party of China was held in Beijing, and ** delivered an important speech."), the entities "十九届五中全会" (Fifth Plenary Session of the 19th Central Committee) and "北京" (Beijing) can be extracted, where the category of the entity "十九届五中全会" is "events and activities" and the category of the entity "北京" is "regional place". At the knowledge level, the event activity here is strongly associated with regional places and persons, and regional places and persons play an auxiliary role in recognizing event activities. Therefore, the entity categories can be divided into two layers: the first-layer entities are "person", "regional place", "time legislation", "organization", "laws and regulations" and "position", and the second-layer entities are "events and activities" and "ideological theory".
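For orientation, the two-layer category split can be written out as plain data; the English names below are the translations used in this text, and the Python listing itself is illustrative rather than part of the original:

```python
# Two-layer entity category split defined above (category names translated for illustration).
FIRST_LAYER_CATEGORIES = [
    "person", "regional place", "time legislation",
    "organization", "laws and regulations", "position",
]
SECOND_LAYER_CATEGORIES = ["events and activities", "ideological theory"]
```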
The present invention provides an improved automatic entity knowledge extraction method, which can perform two-layer entity recognition, thereby improving the feature extraction capability of a BERT (Bidirectional Encoder Representation from Transformers) model for entity knowledge.
Fig. 1 is a schematic flowchart of the automatic entity knowledge extraction method 100 of the present invention. Fig. 2 is a schematic diagram of the principle of the automatic entity knowledge extraction method of the present invention.
As shown in Figs. 1-2, the automatic entity knowledge extraction method 100 includes the following steps.
Step 110: input the input text H_0 into the first K layers of a BERT model composed of N layers for processing, so as to output the context representation vector H_K at the K-th layer.
As shown in Fig. 2, the BERT model 210 may also be called a BERT pre-trained language model. The N layers of the BERT model are connected in series in sequence; N is greater than K, K is greater than or equal to 2, and N and K are positive integers. The input text may be an ordinary piece of natural language text. Each of the first K layers processes its input based on the global mask matrix MASK_all, whose elements are 1 within the sentence length and 0 beyond the sentence length.
In one embodiment, in the first K layers, attention is used to capture context information, and the context representation vector H_m output by the m-th layer is calculated from the context representation vector H_{m-1} output by the (m-1)-th layer:
H′_m = LN(H_{m-1} + MultiHead_h(H_{m-1}, MASK_all))
H_m = LN(H′_m + FFN(H′_m))
MASK_all[i, j] = 1 if i ≤ n and j ≤ n, and 0 otherwise
where MASK_all is the global mask matrix, i and j are the positions of elements in the global mask matrix, n is the sentence length, and m is greater than or equal to 1 and less than or equal to K.
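A minimal numpy sketch of this per-layer update is given below; it assumes a single attention head standing in for MultiHead_h and placeholder weight matrices, so it illustrates the masking behaviour rather than reproducing the exact BERT implementation:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # LN(.) as used in the per-layer update above
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def masked_self_attention(h, mask, wq, wk, wv):
    # Single-head stand-in for MultiHead_h(H, MASK): positions with MASK = 0 are ignored.
    q, k, v = h @ wq, h @ wk, h @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask > 0, scores, -1e9)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

def encoder_layer(h_prev, mask, params):
    # H'_m = LN(H_{m-1} + MultiHead_h(H_{m-1}, MASK)); H_m = LN(H'_m + FFN(H'_m))
    wq, wk, wv, w1, w2 = params
    h_mid = layer_norm(h_prev + masked_self_attention(h_prev, mask, wq, wk, wv))
    ffn = np.maximum(0.0, h_mid @ w1) @ w2  # two-layer feed-forward with ReLU
    return layer_norm(h_mid + ffn)
```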
Step 120: input the context representation vector H_K output by the K-th layer into the remaining N-K layers of the BERT model 210 a first time, for the first task, so as to output the context representation vector H_N^{N1} at the N-th layer. At this time, each of the remaining N-K layers processes its input based on the first mask matrix MASK_N1, and first-layer entity recognition 220 is performed based on the context representation vector H_N^{N1} output by the N-th layer to obtain the first-layer entities in the input text H_0. The elements of the first mask matrix MASK_N1 are 1 within the sentence length n and 0 beyond the sentence length.
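Under the same illustrative assumptions, step 120 can be sketched as running the shared feature H_K through the remaining layers with MASK_N1; this reuses the encoder_layer sketch above, and all names remain placeholders:

```python
def run_remaining_layers(h_k, mask, layer_params_list):
    # Layers K+1 .. N of the sketch: same update as before, but with the task-specific mask.
    h = h_k
    for params in layer_params_list:
        h = encoder_layer(h, mask, params)
    return h

# First pass (step 120): h_n_n1 = run_remaining_layers(h_k, mask_n1, remaining_params)
```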
Step 130: input the context representation vector H_K output by the K-th layer into the remaining N-K layers of the BERT model a second time, for the second task, so as to output the context representation vector H_N^{N2} at the N-th layer. At this time, each of the remaining N-K layers processes its input based on the second mask matrix MASK_N2, and second-layer entity recognition 230 is performed based on the context representation vector H_N^{N2} output by the N-th layer to obtain the second-layer entities in the input text H_0, wherein the elements of the second mask matrix MASK_N2 are 1 at positions belonging to first-layer entities and 0 elsewhere.
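A small numpy sketch of the three mask matrices described in steps 110-130, assuming a padded length L, a sentence length n ≤ L, and a list of recognized first-layer entity positions P_entities; the helper name and argument layout are illustrative:

```python
import numpy as np

def build_masks(L, n, p_entities):
    # MASK_all and MASK_N1: 1 inside the sentence length n, 0 beyond it.
    mask_all = np.zeros((L, L), dtype=np.float32)
    mask_all[:n, :n] = 1.0
    mask_n1 = mask_all.copy()
    # MASK_N2: attention is restricted to recognized first-layer entity positions;
    # every other column is filtered with 0.
    mask_n2 = np.zeros((L, L), dtype=np.float32)
    for j in p_entities:
        if j < n:
            mask_n2[:n, j] = 1.0
    return mask_all, mask_n1, mask_n2

# Example: mask_all, mask_n1, mask_n2 = build_masks(L=8, n=5, p_entities=[1, 2])
# Second pass (step 130): h_n_n2 = run_remaining_layers(h_k, mask_n2, remaining_params)
```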
It can be seen that the context representation vector H_K serves as a shared feature of the joint model and is input into the remaining N-K layers. Next, different mask matrices MASK are set in the multi-head self-attention layer to obtain different context representation vectors for the two downstream tasks of first-layer entity recognition and second-layer entity recognition.
Specifically, in the remaining N-K layers, given the context representation vector H_{m-1}^{task} output by the (m-1)-th layer, the context representation vector H_m^{task} output by the m-th layer is calculated as:
H′_m^{task} = LN(H_{m-1}^{task} + MultiHead_h(H_{m-1}^{task}, MASK_task))
H_m^{task} = LN(H′_m^{task} + FFN(H′_m^{task}))
MASK_N1[i, j] = 1 if i ≤ n and j ≤ n, and 0 otherwise
MASK_N2[i, j] = 1 if j ∈ P_entities, and 0 otherwise
where task is the first task or the second task; the first task is first-layer entity recognition and is denoted N1, and the second task is second-layer entity recognition and is denoted N2; the remaining N-K layers are computed separately for the first task and the second task; MASK_N1 is the first mask matrix, MASK_N2 is the second mask matrix, and P_entities is the set of positions of the first-layer entities already recognized in the input text.
The multi-head self-attention formula MultiHead_h is:
MultiHead_h(X, MASK) = [head_1; ……; head_h]W^M
[formula images omitted: the per-head attention computation and the similarity function similar(i, j)]
The formulas set different MASKs according to different tasks. In the first K layers, the context representation vector H_K is used as a shared feature for joint learning, and every word in the sentence is effective for feature expression; therefore, the matrix MASK_all does not need to mask out any information when computing attention. In the remaining N-K layers, different matrices MASK_task need to be set for the two different downstream tasks of first-layer entity recognition and second-layer entity recognition; this matrix filters out information that the downstream task does not need, thereby enhancing the feature expression capability of the structured information in the BERT model for the two downstream tasks. Specifically, for the first-layer entity recognition sub-model, the present invention uses the "BIO" notation to assign sequence labels to entities; to improve accuracy, the correct attention weights should be optimized through parameters rather than by restricting the attention range of each character (token). Therefore, every word in the sentence can compute attention with any other word, and the matrix MASK_N1 only needs to mask information for positions beyond the sentence length, with the remaining positions set to "1". For the second-layer entity recognition sub-model, the first-layer entity label information can help second-layer entity recognition; therefore, the matrix MASK_N2 restricts attention to all first-layer entity positions, and other positions are filtered with "0". The formula similar(i, j) computes the similarity between the i-th word and the j-th word; if the value MASK_{i,j} = 0 in the matrix MASK, the i-th word does not need to consider the j-th word; conversely, if MASK_{i,j} = 1, the i-th word needs to consider the j-th word.
Depending on the task, the context representation vector output by the N-th layer of the BERT model is H_N^{N1} for the first task and H_N^{N2} for the second task.
This layered fine-tuning structure can improve the feature extraction capability of the BERT pre-trained language model for knowledge and obtain context representation vectors for different downstream tasks. Because structured features are utilized, the fine-tuned BERT pre-trained model is easier to optimize. Moreover, the fine-tuning structure does not require major adjustments to the original BERT model, so the linguistic knowledge contained in the pre-trained language model can be utilized directly.
In the entity recognition of each layer, since entities have boundary problems, the present invention uses the standard BIO (begin, inside, outside) notation to label each word in the sentence with a named entity tag: tag B represents the position of the first word of an entity, tag I represents the position of a non-initial word of an entity, and tag O represents a position in the sentence that is not an entity word.
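As an illustration of the BIO scheme on the example sentence quoted earlier, character-level tag sequences for the two recognition layers might look as follows; the tag names B-LOC/I-LOC and B-EVT/I-EVT are illustrative and are not taken from the patent:

```python
chars = list("十九届五中全会在北京举行")                        # example fragment from the description
layer1_tags = ["O"] * 8 + ["B-LOC", "I-LOC", "O", "O"]       # first layer: "北京" as a regional place
layer2_tags = ["B-EVT"] + ["I-EVT"] * 6 + ["O"] * 5          # second layer: "十九届五中全会" as an event
assert len(chars) == len(layer1_tags) == len(layer2_tags) == 12
```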
In one embodiment, the CRF (Conditional Random Fields) layer first computes the emission probabilities H_ner by a linear transformation of the context representation vector H_N output by the BERT model, then scores and ranks the tag sequences according to the transition probabilities, and finally uses the softmax function to obtain the probability distribution of the tags, thereby performing first-layer entity recognition and second-layer entity recognition.
The specific calculation formulas are as follows:
[formula images omitted: the emission probability, the label sequence score Score(X, y), and the label probability distribution]
H_N is the context representation vector output by the BERT model; H_ner is the emission probability matrix of the CRF layer, of size n × k, where n is the sentence length and k is the number of entity type labels; Score(X, y) is the score of a label sequence; A is the transition probability matrix, whose element A_{y_i, y_{i+1}} represents the transition probability from label y_i to label y_{i+1}; Y_X is the set of all possible label sequences.
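Since the exact CRF formulas are rendered as images in the source, the following is a standard linear-chain CRF scoring sketch consistent with the quantities defined above (emission scores from a linear transform of H_N plus label-transition terms); the parameter names are illustrative:

```python
import numpy as np

def sequence_score(h_n, w, b, transitions, tag_ids):
    # Emission matrix H_ner of size (n, k) from a linear transform of H_N.
    h_ner = h_n @ w + b
    emission = sum(h_ner[i, t] for i, t in enumerate(tag_ids))
    # Transition terms A[y_i, y_{i+1}] between consecutive tags.
    transition = sum(transitions[tag_ids[i], tag_ids[i + 1]]
                     for i in range(len(tag_ids) - 1))
    return emission + transition
```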
Before being put into practical use, the BERT model and the CRF layer need to be trained in advance with training samples. Specifically, the training samples are first annotated with the BIO notation, and the annotated training samples are then used to train the BERT model and the CRF layer. Each training sample may be a piece of annotated text.
In the training stage, the goal is to minimize the loss function L_ner:
[formula image omitted: the loss function L_ner]
In the entity recognition stage, the label sequence is predicted by maximizing the score function:
y* = argmax_{y ∈ Y_X} Score(X, y)
In the training stage, the cross-entropy loss function is optimized; first-layer entity recognition and second-layer entity recognition form a joint learning method, with the formula:
L = αL_N1 + (1-α)L_N2.
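The joint objective can be computed directly from the two task losses; a one-line sketch follows, where the value of α is a hyperparameter and the default used here is an assumption:

```python
def joint_loss(loss_n1, loss_n2, alpha=0.5):
    # L = alpha * L_N1 + (1 - alpha) * L_N2
    return alpha * loss_n1 + (1.0 - alpha) * loss_n2
```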
According to another aspect of the present invention, there is provided a computer-readable medium having program instructions stored therein, the program instructions being executed by a processor to implement the above automatic entity knowledge extraction method 100.
According to another aspect of the present invention, there is provided a computing device, which includes a processor and a memory, wherein program instructions are stored in the memory and are executed by the processor to implement the above automatic entity knowledge extraction method 100.
In this document, the terms "comprise", "include" or any other variants thereof are intended to cover a non-exclusive inclusion, so that in addition to the listed elements, other elements not explicitly listed may also be included.
In this document, orientation terms such as front, rear, upper and lower are defined by the positions of the components in the drawings and their positions relative to one another, and are used only for clarity and convenience in expressing the technical solution. It should be understood that the use of such orientation terms should not limit the scope claimed in this application.
Where no conflict arises, the above embodiments and the features in the embodiments herein may be combined with one another.
The above are only preferred embodiments of the present invention and are not intended to limit the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present invention shall be included within the scope of protection of the present invention.

Claims (8)

  1. An automatic entity knowledge extraction method, characterized in that it comprises:
    inputting an input text H_0 into the first K layers of a BERT model composed of N layers for processing, so as to output a context representation vector H_K at the K-th layer;
    inputting the context representation vector H_K output by the K-th layer into the remaining N-K layers of the BERT model a first time, for a first task, so as to output a context representation vector H_N^{N1} at the N-th layer, each of the remaining N-K layers processing its input based on a first mask matrix, and performing first-layer entity recognition based on the context representation vector H_N^{N1} output by the N-th layer to obtain first-layer entities in the input text H_0; and
    inputting the context representation vector H_K output by the K-th layer into the remaining N-K layers of the BERT model a second time, for a second task, so as to output a context representation vector H_N^{N2} at the N-th layer, each of the remaining N-K layers processing its input based on a second mask matrix, and performing second-layer entity recognition based on the context representation vector H_N^{N2} output by the N-th layer to obtain second-layer entities in the input text H_0, wherein the elements of the second mask matrix are 1 at positions belonging to first-layer entity positions and 0 elsewhere.
  2. The automatic entity knowledge extraction method according to claim 1, characterized in that the elements of the first mask matrix are 1 within the sentence length n and 0 beyond the sentence length,
    the N layers of the BERT model are connected in series in sequence,
    N is greater than K, K is greater than or equal to 2, and N and K are positive integers, and
    each of the first K layers processes its input based on a global mask matrix, whose elements are 1 within the sentence length and 0 beyond the sentence length.
  3. The automatic entity knowledge extraction method according to claim 2, characterized in that,
    in the first K layers, the context representation vector H_m output by the m-th layer is calculated from the context representation vector H_{m-1} output by the (m-1)-th layer:
    H′_m = LN(H_{m-1} + MultiHead_h(H_{m-1}, MASK_all))
    H_m = LN(H′_m + FFN(H′_m))
    MASK_all[i, j] = 1 if i ≤ n and j ≤ n, and 0 otherwise
    where MASK_all is the global mask matrix, i and j are the positions of elements in the global mask matrix, and n is the sentence length;
    in the remaining N-K layers, given the context representation vector H_{m-1}^{task} output by the (m-1)-th layer, the context representation vector H_m^{task} output by the m-th layer is calculated as:
    H′_m^{task} = LN(H_{m-1}^{task} + MultiHead_h(H_{m-1}^{task}, MASK_task))
    H_m^{task} = LN(H′_m^{task} + FFN(H′_m^{task}))
    MASK_N1[i, j] = 1 if i ≤ n and j ≤ n, and 0 otherwise
    MASK_N2[i, j] = 1 if j ∈ P_entities, and 0 otherwise
    where task is the first task or the second task, the first task being denoted N1 and the second task being denoted N2, the remaining N-K layers being computed separately for the first task and the second task, MASK_N1 being the first mask matrix, MASK_N2 being the second mask matrix, and P_entities being the positions of the first-layer entities already recognized in the input text;
    the multi-head self-attention formula MultiHead_h is:
    MultiHead_h(X, MASK) = [head_1; ……; head_h]W^M
    [formula images omitted: the per-head attention computation and the similarity function similar(i, j)]
    and, according to the task, the context representation vector output by the N-th layer of the BERT model is H_N^{N1} for the first task and H_N^{N2} for the second task.
  4. The automatic entity knowledge extraction method according to claim 3, characterized in that
    the CRF layer first computes the emission probabilities H_ner by a linear transformation of the context representation vector H_N output by the BERT model, then scores and ranks the tag sequences according to the transition probabilities, and finally uses the softmax function to obtain the probability distribution of the tags, thereby performing first-layer entity recognition and second-layer entity recognition,
    the specific calculation formulas being as follows:
    [formula images omitted: the emission probability, the label sequence score Score(X, y), and the label probability distribution]
    where H_N is the context representation vector output by the BERT model; H_ner is the emission probability matrix of the CRF layer, of size n × k, n being the sentence length and k being the number of entity type labels; Score(X, y) is the score of a label sequence; A is the transition probability matrix, whose element A_{y_i, y_{i+1}} represents the transition probability from label y_i to label y_{i+1}; and Y_X is the set of all possible label sequences,
    the standard BIO notation is used to label each word in a sentence of the input text with a named entity tag, tag B representing the position of the first word of an entity, tag I representing the position of a non-initial word of an entity, and tag O representing a position in the sentence that is not an entity word.
  5. The automatic entity knowledge extraction method according to claim 4, characterized in that
    training samples are first annotated with the BIO notation, and the annotated training samples are then used for training,
    in the training stage, the goal is to minimize the loss function L_ner:
    [formula image omitted: the loss function L_ner]
    and in the entity recognition stage, the label sequence is predicted by maximizing the score function:
    y* = argmax_{y ∈ Y_X} Score(X, y).
  6. The automatic entity knowledge extraction method according to claim 5, characterized in that,
    in the training stage, the cross-entropy loss function is optimized, and first-layer entity recognition and second-layer entity recognition form a joint learning method, with the formula:
    L = αL_N1 + (1-α)L_N2.
  7. A computing device, comprising a processor and a memory, wherein program instructions are stored in the memory and are executed by the processor to implement:
    inputting an input text H_0 into the first K layers of a BERT model composed of N layers for processing, so as to output a context representation vector H_K at the K-th layer;
    inputting the context representation vector H_K output by the K-th layer into the remaining N-K layers of the BERT model a first time, for a first task, so as to output a context representation vector H_N^{N1} at the N-th layer, each of the remaining N-K layers processing its input based on a first mask matrix, and performing first-layer entity recognition based on the context representation vector H_N^{N1} output by the N-th layer to obtain first-layer entities in the input text H_0; and
    inputting the context representation vector H_K output by the K-th layer into the remaining N-K layers of the BERT model a second time, for a second task, so as to output a context representation vector H_N^{N2} at the N-th layer, each of the remaining N-K layers processing its input based on a second mask matrix, and performing second-layer entity recognition based on the context representation vector H_N^{N2} output by the N-th layer to obtain second-layer entities in the input text H_0, wherein the elements of the second mask matrix are 1 at positions belonging to first-layer entity positions and 0 elsewhere.
  8. A computer-readable medium having program instructions stored therein, the program instructions being executed to implement:
    inputting an input text H_0 into the first K layers of a BERT model composed of N layers for processing, so as to output a context representation vector H_K at the K-th layer;
    inputting the context representation vector H_K output by the K-th layer into the remaining N-K layers of the BERT model a first time, for a first task, so as to output a context representation vector H_N^{N1} at the N-th layer, each of the remaining N-K layers processing its input based on a first mask matrix, and performing first-layer entity recognition based on the context representation vector H_N^{N1} output by the N-th layer to obtain first-layer entities in the input text H_0; and
    inputting the context representation vector H_K output by the K-th layer into the remaining N-K layers of the BERT model a second time, for a second task, so as to output a context representation vector H_N^{N2} at the N-th layer, each of the remaining N-K layers processing its input based on a second mask matrix, and performing second-layer entity recognition based on the context representation vector H_N^{N2} output by the N-th layer to obtain second-layer entities in the input text H_0, wherein the elements of the second mask matrix are 1 at positions belonging to first-layer entity positions and 0 elsewhere.
PCT/CN2022/097154 2021-11-26 2022-06-06 Automatic entity knowledge extraction method, computer device, and computer-readable medium WO2023092985A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111419529.3 2021-11-26
CN202111419529.3A CN114357176B (zh) 2021-11-26 2021-11-26 Automatic entity knowledge extraction method, computer device, and computer-readable medium

Publications (1)

Publication Number Publication Date
WO2023092985A1 true WO2023092985A1 (zh) 2023-06-01

Family

ID=81096296

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/097154 WO2023092985A1 (zh) 2021-11-26 2022-06-06 Automatic entity knowledge extraction method, computer device, and computer-readable medium

Country Status (2)

Country Link
CN (1) CN114357176B (zh)
WO (1) WO2023092985A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117371534A (zh) * 2023-12-07 2024-01-09 同方赛威讯信息技术有限公司 BERT-based knowledge graph construction method and system
CN117891900A (zh) * 2024-03-18 2024-04-16 腾讯科技(深圳)有限公司 Artificial-intelligence-based text processing method and text processing model training method

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114357176B (zh) * 2021-11-26 2023-11-21 永中软件股份有限公司 实体知识自动抽取方法和计算机装置、计算机可读介质

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110570920A (zh) * 2019-08-20 2019-12-13 华东理工大学 Joint entity and relation learning method based on a concentrated attention model
WO2021096571A1 (en) * 2019-11-15 2021-05-20 Intuit Inc. Pre-trained contextual embedding models for named entity recognition and confidence prediction
CN113221571A (zh) * 2021-05-31 2021-08-06 重庆交通大学 Joint entity-relation extraction method based on an entity-related attention mechanism
CN113468888A (zh) * 2021-06-25 2021-10-01 浙江华巽科技有限公司 Neural-network-based joint entity-relation extraction method and apparatus
CN114357176A (zh) * 2021-11-26 2022-04-15 永中软件股份有限公司 Automatic entity knowledge extraction method, computer device, and computer-readable medium

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109165385B (zh) * 2018-08-29 2022-08-09 中国人民解放军国防科技大学 Multi-triple extraction method based on a joint entity-relation extraction model
CN111444717A (zh) * 2018-12-28 2020-07-24 天津幸福生命科技有限公司 Method and apparatus for extracting medical entity information, storage medium, and electronic device
US20200250139A1 (en) * 2018-12-31 2020-08-06 Dathena Science Pte Ltd Methods, personal data analysis system for sensitive personal information detection, linking and purposes of personal data usage prediction
JP7358748B2 (ja) * 2019-03-01 2023-10-11 富士通株式会社 Learning method, extraction method, learning program, and information processing apparatus
CN110781312B (zh) * 2019-09-19 2022-07-15 平安科技(深圳)有限公司 Text classification method and apparatus based on a semantic representation model, and computer device
CN113672770A (zh) * 2020-05-15 2021-11-19 永中软件股份有限公司 Data encapsulation method based on XML files
CN113220844B (zh) * 2021-05-25 2023-01-24 广东省环境权益交易所有限公司 Distantly supervised relation extraction method based on entity features

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110570920A (zh) * 2019-08-20 2019-12-13 华东理工大学 Joint entity and relation learning method based on a concentrated attention model
WO2021096571A1 (en) * 2019-11-15 2021-05-20 Intuit Inc. Pre-trained contextual embedding models for named entity recognition and confidence prediction
CN113221571A (zh) * 2021-05-31 2021-08-06 重庆交通大学 Joint entity-relation extraction method based on an entity-related attention mechanism
CN113468888A (zh) * 2021-06-25 2021-10-01 浙江华巽科技有限公司 Neural-network-based joint entity-relation extraction method and apparatus
CN114357176A (zh) * 2021-11-26 2022-04-15 永中软件股份有限公司 Automatic entity knowledge extraction method, computer device, and computer-readable medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ZHANG LE, LI JIAN; TANG LIANG; YI MIANZH: "Deep Learning Recognition Method for Target Entity in Military Field Based on Pre-Trained BERT", XINXI-GONGCHENG-DAXUE-XUEBAO / JOURNAL OF INFORMATION ENGINEERING UNIVERSITY, vol. 22, no. 3, 30 June 2021 (2021-06-30), pages 331 - 337, XP093068610, ISSN: 1671-0673, DOI: 10.3969/j.issn.1671-0673.2021.03.013 *
ZHANG SUOXIANG; ZHAO MING: "Chinese agricultural diseases named entity recognition based on BERT-CRF", 2020 5TH INTERNATIONAL CONFERENCE ON MECHANICAL, CONTROL AND COMPUTER ENGINEERING (ICMCCE), IEEE, 25 December 2020 (2020-12-25), pages 1148 - 1151, XP033914505, DOI: 10.1109/ICMCCE51767.2020.00252 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117371534A (zh) * 2023-12-07 2024-01-09 同方赛威讯信息技术有限公司 BERT-based knowledge graph construction method and system
CN117371534B (zh) * 2023-12-07 2024-02-27 同方赛威讯信息技术有限公司 BERT-based knowledge graph construction method and system
CN117891900A (zh) * 2024-03-18 2024-04-16 腾讯科技(深圳)有限公司 Artificial-intelligence-based text processing method and text processing model training method

Also Published As

Publication number Publication date
CN114357176B (zh) 2023-11-21
CN114357176A (zh) 2022-04-15

Similar Documents

Publication Publication Date Title
CN111444721B (zh) Chinese text key information extraction method based on a pre-trained language model
WO2023092985A1 (zh) Automatic entity knowledge extraction method, computer device, and computer-readable medium
CN109902293B (zh) Text classification method based on local and global mutual attention mechanisms
CN111310443B (zh) Text error correction method and system
CN109829159B (zh) Integrated automatic lexical analysis method and system for classical Chinese texts
CN107729309B (zh) Method and apparatus for Chinese semantic analysis based on deep learning
CN111709243B (zh) Knowledge extraction method and apparatus based on deep learning
CN111444343B (zh) Cross-border ethnic culture text classification method based on knowledge representation
CN109858041B (zh) Named entity recognition method combining semi-supervised learning with a custom dictionary
WO2018218706A1 (zh) Neural-network-based news event extraction method and system
CN113591483A (zh) Document-level event argument extraction method based on sequence labeling
CN108763510A (zh) Intent recognition method, apparatus, device, and storage medium
CN106569998A (zh) Text named entity recognition method based on Bi-LSTM, CNN and CRF
CN114036933B (zh) Information extraction method based on legal documents
CN111581954B (zh) Text event extraction method and apparatus based on syntactic dependency information
CN111966812A (zh) Automatic question answering method based on dynamic word vectors, and storage medium
CN106682089A (zh) RNN-based method for automatic security review of short messages
CN113177412A (zh) BERT-based named entity recognition method and system, electronic device, and storage medium
CN108829823A (zh) Text classification method
CN111339772B (zh) Russian text sentiment analysis method, electronic device, and storage medium
WO2021128704A1 (zh) Open-set classification method based on classification utility
CN114417851A (zh) Sentiment analysis method based on keyword-weighted information
CN112185361A (zh) Speech recognition model training method and apparatus, electronic device, and storage medium
CN115238693A (zh) Chinese named entity recognition method based on multiple word segmentations and multi-layer bidirectional long short-term memory
CN113901228B (zh) Cross-border ethnic text classification method and apparatus incorporating a domain knowledge graph

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22897083

Country of ref document: EP

Kind code of ref document: A1