WO2023092960A1 - Labeling method and apparatus for named entity recognition of legal documents (一种用于法律文书的命名实体识别的标注方法和装置) - Google Patents

Labeling method and apparatus for named entity recognition of legal documents (一种用于法律文书的命名实体识别的标注方法和装置)

Info

Publication number
WO2023092960A1
Authority
WO
WIPO (PCT)
Prior art keywords
entity
label
entities
template
recognition
Prior art date
Application number
PCT/CN2022/093493
Other languages
English (en)
French (fr)
Inventor
王宏升
鲍虎军
陈光
马超
廖青
Original Assignee
之江实验室
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 之江实验室 filed Critical 之江实验室
Priority to US17/830,786 (US11615247B1)
Publication of WO2023092960A1

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/166Editing, e.g. inserting or deleting
    • G06F40/186Templates
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Definitions

  • The invention relates to the technical field of natural language processing, and in particular to a labeling method and apparatus for named entity recognition of legal documents.
  • Currently, commonly used named entity recognition methods are divided into rule-based methods and statistics-based methods.
  • Rule-based methods are characterized by matching rules and by reliance on dictionaries, templates, and regular expressions, with poor flexibility and poor portability.
  • Statistics-based methods treat named entity recognition as a classification problem that selects the maximum probability, or as sequence labeling, using machine-learning sequence labeling models such as hidden Markov models, maximum-entropy Markov chains, conditional random fields, and long short-term memory networks; these sequence labeling models fail to recognize nested entities well.
  • The object of the present invention is to provide, in view of the deficiencies of the prior art, a labeling method and apparatus for named entity recognition of legal documents.
  • A labeling method for named entity recognition of legal documents comprises the following steps:
  • Step S1: obtain the legal text, annotate entities in the legal text with an annotation tool to obtain annotation information, process the annotation information into sentences and annotations, fill a dictionary built from the entities according to index positions, and save the sentences, annotations, and dictionary as a file;
  • Step S2: convert the sentences in the file into index values recognizable by the BERT pre-trained model, determine the input label information matrix, and create a generator;
  • Step S3: feed a batch of index values produced by the generator into the BERT pre-trained model to extract and capture sentence context features, understand the relationships between sentences and between semantics, and output the sentence feature encoding result;
  • Step S4: feed the sentence feature encoding result into the conditional random field model for training and prediction to obtain the entity recognition label sequence, entity label position information, and score transition matrix;
  • Step S5: feed the score transition matrix into the multi-head neural network layer to obtain, for one entity, multiple entity recognition label sequences, multiple pieces of entity label position information, and multiple score transition matrices;
  • Step S6: copy and transpose the score transition matrix to obtain a transposed matrix, and splice the transposed matrix with the original matrix to obtain the multi-head score transition matrix;
  • Step S7: feed the multi-head score transition matrix into the fully connected layer to obtain the score transition matrix corresponding to the legal text; at this point the construction of the BERT-SPAN training model is complete;
  • Step S8: apply the cross-entropy loss function to the score transition matrix and the label information matrix to obtain the loss value, feed the loss value back into the BERT-SPAN training model for training, and continually adjust the training parameters of the BERT-SPAN training model to obtain the recognized nested entities;
  • Step S9: construct an entity labeling template from the recognized nested entities.
  • Step S1 specifically includes the following sub-steps:
  • Step S11: obtain the legal text, annotate entities in the legal text with an annotation tool to obtain annotation information, decompose the annotation information into sentences and annotations, and store the sentences and annotations in a sentence list and an annotation list respectively;
  • Step S12: for the sentence list and the annotation list, fill in the dictionary built from the entities according to index positions;
  • Step S13: store the sentence list, the annotation list, and the dictionary information in a json file.
  • Step S2 specifically includes the following sub-steps:
  • Step S21: check and process the sentences in the file from step S1 so that only sentences of at most 510 characters remain;
  • Step S22: convert the sentences of at most 510 characters directly into index values with the encoder of the BERT pre-trained model;
  • Step S23: read the file from step S1 and extract the label positions;
  • Step S24: determine the entity start coordinates along the horizontal axis;
  • Step S25: determine the entity end coordinates along the vertical axis;
  • Step S26: merge the entity start coordinates with the entity end coordinates to determine the label information matrix;
  • Step S27: set the batch size to determine the maximum sentence length within the same batch, pad each list in the label information matrix at the end until it matches that maximum length, and likewise pad the index values and the all-zero list generated from the index-value length to the maximum length;
  • Step S28: set the return value to [index values, all-zero list], [entity position information matrix], and create the generator.
  • Step S3 specifically includes the following sub-steps:
  • Step S31: feed a batch of index values produced by the generator into the BERT-SPAN pre-trained model for training to obtain word embedding vectors;
  • Step S32: from the word embedding vectors, extract and capture sentence context features, understand the relationships between sentences and between semantics, and output the sentence feature encoding result.
  • The score transition matrix in step S4 represents the relationship between a predicted entity and multiple pieces of label information; an activation function is applied to obtain the maximum-score output for each entity, and the label with the maximum score is the label corresponding to that entity.
  • The method for constructing the entity labeling template in step S9 uses a markup-language design.
  • The present invention also provides a labeling method for named entity recognition of legal documents in which the attributes of the entity labeling template in step S9 include: the annotated entity's serial number, the entity type and code, and the recognized entity name.
  • Step S9 includes the following sub-steps:
  • Step S91: construct a label set from the recognized nested entities; the label set includes person, plaintiff, defendant, time, place, event, crime, and result;
  • Step S92: from the label set and the constructed horizontal-axis and vertical-axis entity annotation positions and their corresponding entities, construct a single-entity template, which includes: a single person entity, a single plaintiff entity, a single defendant entity, a single time entity, a single place entity, a single event entity, and a single crime entity;
  • Step S93: from the label set, extract the constructed horizontal-axis and vertical-axis entity annotation positions and the corresponding nested entities, and construct a nested-entity template in which multiple entities are separated by enumeration commas; the nested-entity template includes: multiple person entities, multiple plaintiff entities, multiple defendant entities, multiple time entities, multiple place entities, multiple event entities, and multiple crime entities;
  • Step S94: from the label set, construct a non-entity template for text that is unrecognized, or is not an entity, in the constructed horizontal-axis and vertical-axis entity annotations;
  • Step S95: the set comprising the single-entity template, the nested-entity template, and the non-entity template is the entity labeling template.
  • The present invention also provides a labeling apparatus for named entity recognition of legal documents, including a memory and one or more processors; executable code is stored in the memory, and when the one or more processors execute the executable code, they implement the labeling method for named entity recognition of legal documents described in any one of the above embodiments.
  • The present invention also provides a computer-readable storage medium on which a program is stored; when the program is executed by a processor, the labeling method for named entity recognition of legal documents described in any one of the above embodiments is implemented.
  • The beneficial effects of the present invention are: first, the SPAN mode provides a way to handle long-text recognition in the named entity recognition task; second, by changing the input to the BERT model, the invention attempts to accomplish the recognition of nested entity annotations.
  • With the SPAN mode described in the present invention, the difficulty of recognizing long texts and nested entities in NER tasks can be resolved to a large extent, laying practical groundwork for better solutions to such problems in the future.
  • The research of the present invention is based on the BERT pre-trained language model.
  • Specifically, the present invention first applies SPAN-style processing to the corpus, converting its position information from the original plain-annotation (BMES) scheme into position-information labels, while fixing the output format of BERT, and builds the BERT-SPAN model on this basis;
  • a multi-head feed-forward neural network processes the score transition matrix to obtain the multi-head score transition matrix; the multi-head score transition matrix is copied and transposed to obtain a transposed matrix, and the transposed matrix is spliced with the original matrix to obtain the multi-head label position transition matrix;
  • the transition matrix is fed into the fully connected layer to obtain the entity annotation positions and construct the horizontal-axis and vertical-axis position coordinates of the annotations; the recognized entities and annotation positions are used to construct the entity labeling template.
  • The invention resolves the poor recognition of nested entities in long texts under the BERT model and provides a solution for nested entity recognition; compared with machine-learning-based named entity recognition methods, the model framework is simpler and the accuracy is higher.
  • FIG. 1 is an overall architecture diagram of a labeling method for named entity recognition of legal documents according to the present invention;
  • FIG. 2 shows the entity coordinates along the horizontal axis in a labeling method for named entity recognition of legal documents according to the present invention;
  • FIG. 3 shows the entity coordinates along the vertical axis in a labeling method for named entity recognition of legal documents according to the present invention;
  • FIG. 4 shows the actual matrix in the corpus training process of a labeling method for named entity recognition of legal documents according to the present invention;
  • FIG. 5 shows a labeling result of a labeling method for named entity recognition of legal documents according to the present invention;
  • FIG. 6 is a structural diagram of a labeling apparatus for named entity recognition of legal documents according to the present invention.
  • Step S1: obtain the legal text, annotate entities in the legal text with an annotation tool to obtain annotation information, process the annotation information into sentences and annotations, fill a dictionary built from the entities according to index positions, and save the sentences, annotations, and dictionary as a file;
  • Step S11: obtain the legal text, annotate entities in the legal text with an annotation tool to obtain annotation information, decompose the annotation information into sentences and annotations, and store the sentences and annotations in a sentence list and an annotation list respectively;
  • Step S12: for the sentence list and the annotation list, fill in the dictionary built from the entities according to index positions;
  • Step S13: store the sentence list, the annotation list, and the dictionary information in a json file.
  • Step S2: convert the sentences in the file into index values recognizable by the BERT pre-trained model, determine the input label information matrix, and create a generator;
  • Step S21: check and process the sentences in the file from step S1 so that only sentences of at most 510 characters remain;
  • Step S22: convert the sentences of at most 510 characters directly into index values with the encoder of the BERT pre-trained model;
  • Step S23: read the file from step S1 and extract the label positions; for example: "The defendant Zhang San drove drunk, fled, and was detained."
  • The extracted dictionary format is as follows: {"sentence number": 0, "sentence length": 12, "labels": {"defendant": [0,2], [0,4], "event": [5,6], [7,8], "result": [10,11]}}.
  • Step S24: determine the entity start coordinates along the horizontal axis;
  • the categories are converted into numbers via the label index table, giving for the above example sentence: {1: [0,2], [0,4], 2: [5,6], [7,8], 3: [10,11]}. According to the first number in each list, an all-zero list of the same length as the sentence is created in advance, and the 0 at each numbered position in the all-zero list is changed to the corresponding label category index; if a nesting relationship is involved, the numbers of the two categories are placed under the same list, indicating that the position contains the beginnings of several entities, expressed overall as [1,0,0,1,0,2,0,0,0,0,3,0], where each number marks the start of an entity of the corresponding category, see Figure 2.
  • Step S25: determine the entity end coordinates along the vertical axis:
  • the end coordinates are expressed overall as [0,0,1,0,1,0,2,0,2,0,0,3], where each number marks the end of an entity of the corresponding category, see Figure 3.
  • Step S26: merge the entity start coordinates with the entity end coordinates to determine the label information matrix; the matrix represents the exact position of each entity in the corpus. Concretely, the horizontal axis and the vertical axis carry the same text; a horizontal-axis position together with a vertical-axis position determines one point in the matrix; that point is one entity, and the corresponding number indicates the entity's category. To avoid overfitting, positions where the horizontal coordinate exceeds the vertical coordinate are set to -1, and the numbers in the matrix's upper-right part that are unrelated to the position relationship are removed.
  • This scheme serves as the template for forming entity position information: the position information of each entity is extracted in turn, and a two-dimensional matrix is constructed to represent the positions of the entities in the corpus.
  • Step S27: set the batch size to determine the maximum sentence length within the same batch, pad each list in the label information matrix at the end until it matches that maximum length, and likewise pad the index values and the all-zero list generated from the index-value length to the maximum length;
  • Step S28: set the return value to [index values, all-zero list], [entity position information matrix], and create the generator.
  • Step S3: feed a batch of index values produced by the generator into the BERT pre-trained model to extract and capture sentence context features, understand the relationships between sentences and between semantics, and output the sentence feature encoding result;
  • Step S31: feed a batch of index values produced by the generator into the BERT-SPAN pre-trained model for training to obtain word embedding vectors;
  • Step S32: from the word embedding vectors, extract and capture sentence context features, understand the relationships between sentences and between semantics, and output the sentence feature encoding result.
  • Step S4: feed the sentence feature encoding result into the conditional random field model for training and prediction to obtain the entity recognition label sequence, entity label position information, and score transition matrix; the score transition matrix represents the relationship between a predicted entity and multiple pieces of label information;
  • an activation function is applied to obtain the maximum-score output for each entity, and the label with the maximum score is the label corresponding to the entity.
  • Step S5: feed the score transition matrix into the multi-head neural network layer to obtain, for one entity, multiple entity recognition label sequences, multiple pieces of entity label position information, and multiple score transition matrices;
  • Step S6: copy and transpose the score transition matrix to obtain a transposed matrix, and splice the transposed matrix with the original matrix to obtain the multi-head score transition matrix, whose dimensions become [batch size, sequence length, sequence length, two hidden layers];
  • Step S7: feed the multi-head score transition matrix into the fully connected layer to obtain the score transition matrix corresponding to the legal text, whose dimensions become [batch size, sequence length, sequence length, category position information of the sentences in the sequence]; at this point the BERT-SPAN training model is built;
  • Step S8: apply the cross-entropy loss function to the score transition matrix and the label information matrix to obtain the loss value, feed the loss value back into the BERT-SPAN training model for training, and continually adjust the training parameters of the BERT-SPAN training model to obtain the recognized nested entities;
  • span_loss is the loss function used by this model; e is a hyperparameter that can be adjusted freely; n is the number of predictions; i indexes the i-th prediction; x_i is each category; p(x_i) is the true probability distribution; q(x_i) is the predicted probability distribution.
  • Step S9: construct an entity labeling template from the recognized nested entities; the entity labeling template is designed in a markup language (XML), and its attributes include: the annotated entity's serial number, the entity type and code, and the recognized entity name.
  • Step S91: construct a label set from the recognized nested entities; the label set includes person, plaintiff, defendant, time, place, event, crime, and result;
  • Step S92: from the label set and the constructed horizontal-axis and vertical-axis entity annotation positions and their corresponding entities, construct a single-entity template.
  • Step S95: the set comprising the single-entity template, the nested-entity template, and the non-entity template is the entity labeling template.
  • The present invention also provides an embodiment of a labeling apparatus for named entity recognition of legal documents.
  • An embodiment of the present invention provides a labeling apparatus for named entity recognition of legal documents, including a memory and one or more processors; executable code is stored in the memory, and when the one or more processors execute the executable code, they implement the labeling method for named entity recognition of legal documents of the above embodiment.
  • The embodiment of the labeling apparatus for named entity recognition of legal documents of the present invention can be applied to any device with data processing capability, which may be a device or apparatus such as a computer.
  • The apparatus embodiments may be implemented in software, or in hardware or a combination of hardware and software. Taking software implementation as an example, the apparatus in the logical sense is formed by the processor of the device on which it runs reading the corresponding computer program instructions from non-volatile memory into memory for execution.
  • FIG. 6 is a hardware structure diagram of a device with data processing capability on which the labeling apparatus for named entity recognition of legal documents resides; in addition to the processor, memory, network interface, and non-volatile memory shown in FIG. 6, the device on which the apparatus of the embodiment resides may also include other hardware according to its actual function, which will not be repeated here.
  • Since the apparatus embodiments basically correspond to the method embodiments, reference may be made to the description of the method embodiments for the relevant parts.
  • The apparatus embodiments described above are merely illustrative: the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, i.e., they may be located in one place or distributed across multiple network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present invention, which those skilled in the art can understand and implement without creative effort.
  • An embodiment of the present invention also provides a computer-readable storage medium on which a program is stored; when the program is executed by a processor, the labeling method for named entity recognition of legal documents of the above embodiments is implemented.
  • The computer-readable storage medium may be an internal storage unit of any device with data processing capability described in any of the foregoing embodiments, such as a hard disk or memory.
  • The computer-readable storage medium may also be an external storage device of any device with data processing capability, such as a plug-in hard disk, a smart media card (SMC), an SD card, or a flash card provided on the device.
  • Further, the computer-readable storage medium may include both an internal storage unit and an external storage device of any device with data processing capability.
  • The computer-readable storage medium is used to store the computer program and the other programs and data required by any device with data processing capability, and may also be used to temporarily store data that has been output or will be output.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a labeling method and apparatus for named entity recognition of legal documents, comprising the following steps. Step S1: obtain the legal text and convert it into an index table; Step S2: output the sentence feature encoding result; Step S3: training and prediction; Step S4: obtain the set; Step S5: obtain the multi-head score transition matrix; Step S6: derive the score transition matrix corresponding to the legal text; Step S7: determine the recognized nested entities; Step S8: construct the entity labeling template from the recognized nested entities. By changing the input to the BERT model, the invention attempts to accomplish the recognition of nested entity annotations; with the multi-head selection matrix labeling approach described herein, the difficulty of recognizing long texts and nested entities in NER tasks is eased to a large extent.

Description

Labeling method and apparatus for named entity recognition of legal documents
Cross-reference
This application claims priority to Chinese Patent Application No. 202210434737.9, entitled "Labeling method and apparatus for named entity recognition of legal documents" (一种用于法律文书的命名实体识别的标注方法和装置), filed with the China Patent Office on April 24, 2022, the entire contents of which are incorporated herein by reference.
Technical Field
The invention relates to the technical field of natural language processing, and in particular to a labeling method and apparatus for named entity recognition of legal documents.
Background
In recent years, with the substantial increase in hardware computing power, methods based on deep neural networks have been successfully applied to named entity recognition. Such methods are end-to-end: they require no special domain resources (such as dictionaries) or ontology construction and can automatically learn and extract text features from large-scale annotated data.
Currently, commonly used named entity recognition methods are divided into rule-based methods and statistics-based methods. Rule-based methods are characterized by matching rules and by reliance on dictionaries, templates, and regular expressions, with poor flexibility and poor portability. Statistics-based methods treat named entity recognition as a classification problem that selects the maximum probability, or as sequence labeling, using machine-learning sequence labeling models such as hidden Markov models, maximum-entropy Markov chains, conditional random fields, and long short-term memory networks; these sequence labeling models fail to recognize nested entities well.
We therefore propose a labeling method and apparatus for named entity recognition of legal documents to solve the above technical problems.
Summary of the Invention
The object of the invention is to provide, in view of the deficiencies of the prior art, a labeling method and apparatus for named entity recognition of legal documents.
The technical solution adopted by the invention is as follows:
A labeling method for named entity recognition of legal documents comprises the following steps:
Step S1: obtain the legal text, annotate entities in the legal text with an annotation tool to obtain annotation information, process the annotation information into sentences and annotations, fill a dictionary built from the entities according to index positions, and save the sentences, annotations, and dictionary as a file;
Step S2: convert the sentences in the file into index values recognizable by the BERT pre-trained model, determine the input label information matrix, and create a generator;
Step S3: feed a batch of index values produced by the generator into the BERT pre-trained model to extract and capture sentence context features, understand the relationships between sentences and between semantics, and output the sentence feature encoding result;
Step S4: feed the sentence feature encoding result into the conditional random field model for training and prediction to obtain the entity recognition label sequence, entity label position information, and score transition matrix;
Step S5: feed the score transition matrix into the multi-head neural network layer to obtain, for one entity, multiple entity recognition label sequences, multiple pieces of entity label position information, and multiple score transition matrices;
Step S6: copy and transpose the score transition matrix to obtain a transposed matrix, and splice the transposed matrix with the original matrix to obtain the multi-head score transition matrix;
Step S7: feed the multi-head score transition matrix into the fully connected layer to obtain the score transition matrix corresponding to the legal text; at this point the construction of the BERT-SPAN training model is complete;
Step S8: apply the cross-entropy loss function to the score transition matrix and the label information matrix to obtain the loss value, feed the loss value back into the BERT-SPAN training model for training, and continually adjust the training parameters of the BERT-SPAN training model to obtain the recognized nested entities;
Step S9: construct an entity labeling template from the recognized nested entities.
Further, step S1 specifically comprises the following sub-steps:
Step S11: obtain the legal text, annotate entities in the legal text with an annotation tool to obtain annotation information, decompose the annotation information into sentences and annotations, and store the sentences and annotations in a sentence list and an annotation list respectively; Step S12: for the sentence list and the annotation list, fill in the dictionary built from the entities according to index positions;
Step S13: store the sentence list, the annotation list, and the dictionary information in a json file.
Further, step S2 specifically comprises the following sub-steps:
Step S21: check and process the sentences in the file from step S1 so that only sentences of at most 510 characters remain;
Step S22: convert the sentences of at most 510 characters directly into index values with the encoder of the BERT pre-trained model;
Step S23: read the file from step S1 and extract the label positions;
Step S24: determine the entity start coordinates along the horizontal axis;
Step S25: determine the entity end coordinates along the vertical axis;
Step S26: merge the entity start coordinates with the entity end coordinates to determine the label information matrix;
Step S27: set the batch size to determine the maximum sentence length within the same batch, pad each list in the label information matrix at the end until it matches that maximum length, and likewise pad the index values and the all-zero list generated from the index-value length to the maximum length;
Step S28: set the return value to [index values, all-zero list], [entity position information matrix], and create the generator.
Further, step S3 specifically comprises the following sub-steps:
Step S31: feed a batch of index values produced by the generator into the BERT-SPAN pre-trained model for training to obtain word embedding vectors;
Step S32: from the word embedding vectors, extract and capture sentence context features, understand the relationships between sentences and between semantics, and output the sentence feature encoding result.
Further, the score transition matrix in step S4 represents the relationship between a predicted entity and multiple pieces of label information; an activation function is applied to obtain the maximum-score output for each entity, and the label with the maximum score is the label corresponding to the entity.
Further, the method for constructing the entity labeling template in step S9 uses a markup-language design.
The invention also provides a labeling method for named entity recognition of legal documents in which the attributes of the entity labeling template in step S9 include: the annotated entity's serial number, the entity type and code, and the recognized entity name.
Further, step S9 comprises the following sub-steps:
Step S91: construct a label set from the recognized nested entities; the label set includes person, plaintiff, defendant, time, place, event, crime, and result;
Step S92: from the label set and the constructed horizontal-axis and vertical-axis entity annotation positions and their corresponding entities, construct a single-entity template, which includes: a single person entity, a single plaintiff entity, a single defendant entity, a single time entity, a single place entity, a single event entity, and a single crime entity;
Step S93: from the label set, extract the constructed horizontal-axis and vertical-axis entity annotation positions and the corresponding nested entities, and construct a nested-entity template in which multiple entities are separated by enumeration commas; the nested-entity template includes: multiple person entities, multiple plaintiff entities, multiple defendant entities, multiple time entities, multiple place entities, multiple event entities, and multiple crime entities;
Step S94: from the label set, construct a non-entity template for text that is unrecognized, or is not an entity, in the constructed horizontal-axis and vertical-axis entity annotations;
Step S95: the set comprising the single-entity template, the nested-entity template, and the non-entity template is the entity labeling template.
The invention also provides a labeling apparatus for named entity recognition of legal documents, comprising a memory and one or more processors; executable code is stored in the memory, and when the one or more processors execute the executable code, they implement the labeling method for named entity recognition of legal documents described in any one of the above embodiments.
The invention also provides a computer-readable storage medium on which a program is stored; when the program is executed by a processor, it implements the labeling method for named entity recognition of legal documents described in any one of the above embodiments.
The beneficial effects of the invention are as follows. First, the SPAN mode provides a way to handle long-text recognition in the named entity recognition task; second, by changing the input to the BERT model, the invention attempts to accomplish the recognition of nested entity annotations. With the SPAN mode described herein, the difficulty of recognizing long texts and nested entities in NER tasks is resolved to a large extent, laying practical groundwork for better solutions to such problems in the future. The research of the invention is based on the BERT pre-trained language model. Specifically, the invention first applies SPAN-style processing to the corpus, converting its position information from the original plain-annotation (BMES) scheme into position-information labels, while fixing the output format of BERT, and builds the BERT-SPAN model on this basis. A multi-head feed-forward neural network processes the score transition matrix to obtain the multi-head score transition matrix; the multi-head score transition matrix is copied and transposed to obtain a transposed matrix, which is spliced with the original matrix to obtain the multi-head label position transition matrix. The multi-head score transition matrix is fed into the fully connected layer to obtain the entity annotation positions and construct the horizontal-axis and vertical-axis position coordinates of the annotations; the recognized entities and annotation positions are used to construct the entity labeling template. The invention resolves the poor recognition of nested entities in long texts under the BERT model and provides a solution for nested entity recognition; compared with machine-learning-based named entity recognition methods, the model framework is simpler and the accuracy is higher.
Brief Description of the Drawings
FIG. 1 is an overall architecture diagram of the labeling method for named entity recognition of legal documents according to the invention;
FIG. 2 shows the entity coordinates along the horizontal axis in the labeling method for named entity recognition of legal documents according to the invention;
FIG. 3 shows the entity coordinates along the vertical axis in the labeling method for named entity recognition of legal documents according to the invention;
FIG. 4 shows the actual matrix in the corpus training process of the labeling method for named entity recognition of legal documents according to the invention;
FIG. 5 shows a labeling result of the labeling method for named entity recognition of legal documents according to the invention;
FIG. 6 is a structural diagram of the labeling apparatus for named entity recognition of legal documents according to the invention.
Detailed Description of the Embodiments
The following description of at least one exemplary embodiment is merely illustrative and in no way limits the invention or its application or use. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the invention without creative effort fall within the protection scope of the invention.
Referring to FIG. 1, a labeling method for named entity recognition of legal documents comprises the following steps:
Step S1: obtain the legal text, annotate entities in the legal text with an annotation tool to obtain annotation information, process the annotation information into sentences and annotations, fill a dictionary built from the entities according to index positions, and save the sentences, annotations, and dictionary as a file;
Step S11: obtain the legal text, annotate entities in the legal text with an annotation tool to obtain annotation information, decompose the annotation information into sentences and annotations, and store the sentences and annotations in a sentence list and an annotation list respectively;
Step S12: for the sentence list and the annotation list, fill in the dictionary built from the entities according to index positions;
Step S13: store the sentence list, the annotation list, and the dictionary information in a json file.
Step S2: convert the sentences in the file into index values recognizable by the BERT pre-trained model, determine the input label information matrix, and create a generator;
Step S21: check and process the sentences in the file from step S1 so that only sentences of at most 510 characters remain;
Step S22: convert the sentences of at most 510 characters directly into index values with the encoder of the BERT pre-trained model;
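The sentence-to-index conversion of steps S21-S22 can be sketched as follows. This is a minimal illustration that assumes the Hugging Face transformers implementation of a Chinese BERT encoder (the patent does not name a toolkit); the 510-character limit leaves room for the [CLS] and [SEP] tokens inside BERT's 512-token input window.

```python
# Sketch of steps S21-S22; assumes the Hugging Face `transformers` library.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")

def sentence_to_ids(sentence: str) -> list:
    # Step S21: keep at most 510 characters so that [CLS] and [SEP]
    # still fit within BERT's 512-token limit.
    sentence = sentence[:510]
    # Step S22: the BERT encoder maps the sentence directly to index values.
    return tokenizer.encode(sentence)  # prepends [CLS], appends [SEP]

print(sentence_to_ids("被告人张三酒驾逃逸被拘留"))
```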
Step S23: read the file from step S1 and extract the label positions. For example, for the sentence "被告人张三酒驾逃逸被拘留" ("The defendant Zhang San drove drunk, fled, and was detained"), the extracted dictionary format is:
{"sentence number": 0, "sentence length": 12, "labels": {"defendant (被告)": [0,2], [0,4], "event (事件)": [5,6], [7,8], "result (结果)": [10,11]}};
Step S24: determine the entity start coordinates along the horizontal axis;
at this point the categories are converted into numbers via the label index table, giving for the example sentence above: {1: [0,2], [0,4], 2: [5,6], [7,8], 3: [10,11]}. According to the first number in each list, an all-zero list of the same length as the sentence is created in advance, and the 0 at each numbered position in the all-zero list is changed to the corresponding label category index. If a nesting relationship is involved, the numbers of the two categories are placed under the same list, indicating that the position contains the beginnings of several entities. Overall this is expressed as:
[1,0,0,1,0,2,0,0,0,0,3,0]; this list expresses the position and annotation information of the entities and thus builds the entity coordinates along the horizontal axis, where each number marks the start of an entity of the corresponding category, see FIG. 2.
Step S25: determine the entity end coordinates along the vertical axis:
according to the second number of each entity position list in the above dictionary, the 0 at that position index in another all-zero list of the same length as the sentence is changed to the corresponding label category index, giving overall:
[0,0,1,0,1,0,2,0,2,0,0,3]; this builds the entity coordinates along the vertical axis, where each number marks the end of an entity of the corresponding category, see FIG. 3.
Step S26: merge the entity start coordinates with the entity end coordinates to determine the label information matrix. The matrix represents the exact position of each entity in the corpus. Concretely, the horizontal axis and the vertical axis carry the same text; a horizontal-axis position together with a vertical-axis position determines one point in the matrix; that point is one entity, and the corresponding number indicates the entity's category. To avoid overfitting, positions where the horizontal coordinate exceeds the vertical coordinate are set to -1, and the numbers in the matrix's upper-right part that are unrelated to the position relationship are removed. This scheme serves as the template for forming entity position information: the position information of each entity is extracted in turn, and a two-dimensional matrix is constructed to represent the positions of the entities in the corpus; for the actual matrix during corpus training see FIG. 4, and for the labeling result see FIG. 5. This two-dimensional matrix is fed into the model as the entity position information matrix.
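The construction in steps S24-S26 can be sketched as below; the helper name is hypothetical, the spans are written as nested lists to be valid Python, and marking cells whose horizontal (start) coordinate exceeds the vertical (end) coordinate with -1 follows the description above.

```python
import numpy as np

def build_label_matrix(length: int, spans: dict) -> np.ndarray:
    """Sketch of steps S24-S26. `spans` maps a label index to [start, end]
    character positions, e.g. {1: [[0, 2], [0, 4]], 2: [[5, 6], [7, 8]], ...}."""
    starts = np.zeros(length, dtype=int)  # horizontal-axis start coordinates
    ends = np.zeros(length, dtype=int)    # vertical-axis end coordinates
    matrix = np.zeros((length, length), dtype=int)
    for label, positions in spans.items():
        for start, end in positions:
            starts[start] = label         # step S24 (nested starts share a slot)
            ends[end] = label             # step S25
            matrix[start, end] = label    # step S26: one point per entity
    # Cells with start > end cannot hold an entity; mark them -1 so the
    # model ignores them (the overfitting guard described above).
    matrix[np.tril_indices(length, k=-1)] = -1
    return matrix

m = build_label_matrix(12, {1: [[0, 2], [0, 4]], 2: [[5, 6], [7, 8]], 3: [[10, 11]]})
```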
Step S27: set the batch size to determine the maximum sentence length within the same batch, pad each list in the label information matrix at the end until it matches that maximum length, and likewise pad the index values and the all-zero list generated from the index-value length to the maximum length;
Step S28: set the return value to [index values, all-zero list], [entity position information matrix], and create the generator.
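Steps S27-S28 amount to a batch generator with trailing padding; a sketch follows, in which the sample structure and names are hypothetical while the padding targets and the return value [index values, all-zero list], [entity position information matrix] follow the description.

```python
import numpy as np

def batch_generator(samples, batch_size=8):
    """Sketch of steps S27-S28. `samples` is a list of (index_values,
    label_matrix) pairs, with index_values a Python list of token ids."""
    for i in range(0, len(samples), batch_size):
        batch = samples[i:i + batch_size]
        max_len = max(len(ids) for ids, _ in batch)  # longest sentence in batch
        ids_out, zeros_out, mats_out = [], [], []
        for ids, mat in batch:
            pad = max_len - len(ids)
            ids_out.append(ids + [0] * pad)              # pad index values
            zeros_out.append([0] * max_len)              # all-zero list
            mats_out.append(np.pad(mat, ((0, pad), (0, pad))))  # pad label matrix
        yield [np.array(ids_out), np.array(zeros_out)], np.array(mats_out)
```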
Step S3: feed a batch of index values produced by the generator into the BERT pre-trained model to extract and capture sentence context features, understand the relationships between sentences and between semantics, and output the sentence feature encoding result;
Step S31: feed a batch of index values produced by the generator into the BERT-SPAN pre-trained model for training to obtain word embedding vectors;
Step S32: from the word embedding vectors, extract and capture sentence context features, understand the relationships between sentences and between semantics, and output the sentence feature encoding result.
Step S4: feed the sentence feature encoding result into the conditional random field model for training and prediction to obtain the entity recognition label sequence, the entity label position information, and the score transition matrix. The score transition matrix represents the relationship between a predicted entity and multiple pieces of label information; to handle the case where one entity belongs to several labels, an activation function is applied to obtain the maximum-score output for each entity, and the label with the maximum score is the label corresponding to the entity.
Step S5: feed the score transition matrix into the multi-head neural network layer to obtain, for one entity, multiple entity recognition label sequences, multiple pieces of entity label position information, and multiple score transition matrices;
Step S6: copy and transpose the score transition matrix to obtain a transposed matrix, and splice the transposed matrix with the original matrix to obtain the multi-head score transition matrix, whose dimensions become [batch size, sequence length, sequence length, two hidden layers];
Step S7: feed the multi-head score transition matrix into the fully connected layer to obtain the score transition matrix corresponding to the legal text, whose dimensions become [batch size, sequence length, sequence length, category position information of the sentences in the sequence]; at this point the construction of the BERT-SPAN training model is complete;
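The tensor manipulation of steps S5-S7 (copy, transpose, splice, then a fully connected projection) can be sketched in PyTorch as follows. This is an illustrative reconstruction rather than the patent's reference code: "transpose" is read as swapping the two sequence axes so that one feature copy varies with the span start and the other with the span end, and the nine output labels assume the eight-label set of step S91 plus a non-entity class.

```python
import torch
import torch.nn as nn

class SpanHead(nn.Module):
    """Sketch of steps S5-S7: per-token features [B, L, H] are expanded
    into scores over all (start, end) pairs, [B, L, L, num_labels]."""
    def __init__(self, hidden: int, num_labels: int):
        super().__init__()
        self.dense = nn.Linear(2 * hidden, num_labels)  # fully connected layer

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        B, L, H = h.shape
        start = h.unsqueeze(2).expand(B, L, L, H)  # copy varying with the start
        end = h.unsqueeze(1).expand(B, L, L, H)    # transposed copy (end axis)
        # Step S6: splice the transposed copy with the original; the dimensions
        # become [batch size, seq length, seq length, two hidden layers].
        multi_head = torch.cat([start, end], dim=-1)
        # Step S7: project to per-span label scores.
        return self.dense(multi_head)

scores = SpanHead(hidden=768, num_labels=9)(torch.randn(2, 12, 768))
```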
Step S8: apply the cross-entropy loss function to the score transition matrix and the label information matrix to obtain the loss value, feed the loss value back into the BERT-SPAN training model for training, and continually adjust the training parameters of the BERT-SPAN training model to obtain the recognized nested entities;
the loss is corrected to prevent the BERT-SPAN model from overfitting, and the specific formula of the cross-entropy loss function is as follows:
[Cross-entropy loss formula, rendered as an image in the original document (Figure PCTCN2022093493-appb-000001).]
where span_loss is the loss function used by this model; e is a hyperparameter that can be adjusted freely; n is the number of predictions; i indexes the i-th prediction; x_i is each category; p(x_i) is the true probability distribution; and q(x_i) is the predicted probability distribution.
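A sketch of step S8's loss, under the assumptions that the -1 cells of the label information matrix are excluded from the loss and that the hyperparameter e simply scales the cross-entropy (the original formula is only available as an image, so its exact form is not reproduced here):

```python
import torch
import torch.nn.functional as F

def span_loss(scores: torch.Tensor, labels: torch.Tensor, e: float = 1.0):
    """scores: [B, L, L, C] from the span head; labels: [B, L, L] label
    information matrix with -1 on the invalid (start > end) cells."""
    scores = scores.reshape(-1, scores.size(-1))  # one prediction per span
    labels = labels.reshape(-1).long()
    # ignore_index drops the -1 cells, matching the masking described above.
    return e * F.cross_entropy(scores, labels, ignore_index=-1)
```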
Through batch-by-batch parameter optimization, the positions of the true labels are adjusted and the recognized nested entities are determined.
Step S9: construct the entity labeling template from the recognized nested entities. The entity labeling template is designed in a markup language (XML), and its attributes include: the annotated entity's serial number, the entity type and code, and the recognized entity name.
Step S91: construct a label set from the recognized nested entities; the label set includes person, plaintiff, defendant, time, place, event, crime, and result;
Step S92: from the label set and the constructed horizontal-axis and vertical-axis entity annotation positions and their corresponding entities, construct a single-entity template, which includes: a single person entity, a single plaintiff entity, a single defendant entity, a single time entity, a single place entity, a single event entity, and a single crime entity. Single-entity template: <NER id="1" label="被告" code="0001">entity</NER>.
Step S93: from the label set, extract the constructed horizontal-axis and vertical-axis entity annotation positions and the corresponding nested entities, and construct a nested-entity template in which multiple entities are separated by the enumeration comma "、"; the nested-entity template includes: multiple person entities, multiple plaintiff entities, multiple defendant entities, multiple time entities, multiple place entities, multiple event entities, and multiple crime entities. Nested-entity template: <NER_MORE id="2" label="事件" code="0002">entity 1、entity 2</NER_MORE>.
Step S94: from the label set, construct a non-entity template for text that is unrecognized, or is not an entity, in the constructed horizontal-axis and vertical-axis entity annotations. Non-entity template: <NER_NO id="3" label="NULL" code="NULL">text</NER_NO>.
Step S95: the set comprising the single-entity template, the nested-entity template, and the non-entity template is the entity labeling template.
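Since the three templates of steps S92-S94 are plain XML strings, their assembly into the entity labeling template of step S95 can be sketched as follows; the helper names are hypothetical, while the tag names, attributes, and enumeration-comma separator follow the examples above.

```python
def single_entity(idx: int, label: str, code: str, entity: str) -> str:
    # Step S92: single-entity template.
    return f'<NER id="{idx}" label="{label}" code="{code}">{entity}</NER>'

def nested_entities(idx: int, label: str, code: str, entities: list) -> str:
    # Step S93: nested-entity template; entities separated by "、".
    return (f'<NER_MORE id="{idx}" label="{label}" code="{code}">'
            + "、".join(entities) + "</NER_MORE>")

def non_entity(idx: int, text: str) -> str:
    # Step S94: non-entity template for unrecognized or non-entity text.
    return f'<NER_NO id="{idx}" label="NULL" code="NULL">{text}</NER_NO>'

print(single_entity(1, "被告", "0001", "张三"))
print(nested_entities(2, "事件", "0002", ["酒驾", "逃逸"]))
```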
Corresponding to the foregoing embodiments of the labeling method for named entity recognition of legal documents, the invention also provides embodiments of a labeling apparatus for named entity recognition of legal documents.
Referring to FIG. 6, the labeling apparatus for named entity recognition of legal documents provided by an embodiment of the invention comprises a memory and one or more processors; executable code is stored in the memory, and when the one or more processors execute the executable code, they implement the labeling method for named entity recognition of legal documents of the above embodiments.
The embodiments of the labeling apparatus for named entity recognition of legal documents of the invention can be applied to any device with data processing capability, which may be a device or apparatus such as a computer. The apparatus embodiments may be implemented in software, or in hardware, or in a combination of both. Taking software implementation as an example, the apparatus in the logical sense is formed by the processor of the device on which it runs reading the corresponding computer program instructions from non-volatile memory into memory for execution. At the hardware level, FIG. 6 is a hardware structure diagram of a device with data processing capability on which the labeling apparatus for named entity recognition of legal documents resides; in addition to the processor, memory, network interface, and non-volatile memory shown in FIG. 6, the device on which the apparatus of the embodiments resides may also include other hardware according to its actual function, which will not be elaborated here.
The implementation of the functions and roles of each unit of the above apparatus is detailed in the implementation of the corresponding steps of the above method and will not be repeated here.
Since the apparatus embodiments basically correspond to the method embodiments, reference may be made to the description of the method embodiments for the relevant parts. The apparatus embodiments described above are merely illustrative: the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, i.e., they may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the invention, which those of ordinary skill in the art can understand and implement without creative effort.
An embodiment of the invention also provides a computer-readable storage medium on which a program is stored; when the program is executed by a processor, it implements the labeling method for named entity recognition of legal documents of the above embodiments.
The computer-readable storage medium may be an internal storage unit of any device with data processing capability described in any of the foregoing embodiments, such as a hard disk or memory. It may also be an external storage device of any device with data processing capability, such as a plug-in hard disk, a smart media card (SMC), an SD card, or a flash card provided on the device. Further, the computer-readable storage medium may include both an internal storage unit and an external storage device of any device with data processing capability. It is used to store the computer program and the other programs and data required by the device, and may also be used to temporarily store data that has been output or will be output.
The above are merely preferred embodiments of the invention and are not intended to limit it; any modification, equivalent replacement, or improvement made within the spirit and principles of the invention shall fall within its protection scope.

Claims (10)

  1. A labeling method for named entity recognition of legal documents, characterized by comprising the following steps:
    Step S1: obtain the legal text, annotate entities in the legal text with an annotation tool to obtain annotation information, process the annotation information into sentences and annotations, fill a dictionary built from the entities according to index positions, and save the sentences, annotations, and dictionary as a file;
    Step S2: convert the sentences in the file into index values recognizable by the BERT pre-trained model, determine the input label information matrix, and create a generator;
    Step S3: feed a batch of index values produced by the generator into the BERT pre-trained model to extract and capture sentence context features, understand the relationships between sentences and between semantics, and output the sentence feature encoding result;
    Step S4: feed the sentence feature encoding result into the conditional random field model for training and prediction to obtain the entity recognition label sequence, entity label position information, and score transition matrix;
    Step S5: feed the score transition matrix into the multi-head neural network layer to obtain, for one entity, multiple entity recognition label sequences, multiple pieces of entity label position information, and multiple score transition matrices;
    Step S6: copy and transpose the score transition matrix to obtain a transposed matrix, and splice the transposed matrix with the original matrix to obtain the multi-head score transition matrix;
    Step S7: feed the multi-head score transition matrix into the fully connected layer to obtain the score transition matrix corresponding to the legal text; at this point the construction of the BERT-SPAN training model is complete;
    Step S8: apply the cross-entropy loss function to the score transition matrix and the label information matrix to obtain the loss value, feed the loss value back into the BERT-SPAN training model for training, and continually adjust the training parameters of the BERT-SPAN training model to obtain the recognized nested entities;
    Step S9: construct an entity labeling template from the recognized nested entities.
  2. The labeling method for named entity recognition of legal documents according to claim 1, characterized in that step S1 specifically comprises the following sub-steps:
    Step S11: obtain the legal text, annotate entities in the legal text with an annotation tool to obtain annotation information, decompose the annotation information into sentences and annotations, and store the sentences and annotations in a sentence list and an annotation list respectively; Step S12: for the sentence list and the annotation list, fill in the dictionary built from the entities according to index positions;
    Step S13: store the sentence list, the annotation list, and the dictionary information in a json file.
  3. The labeling method for named entity recognition of legal documents according to claim 1, characterized in that step S2 specifically comprises the following sub-steps:
    Step S21: check and process the sentences in the file from step S1 so that only sentences of at most 510 characters remain;
    Step S22: convert the sentences of at most 510 characters directly into index values with the encoder of the BERT pre-trained model;
    Step S23: read the file from step S1 and extract the label positions;
    Step S24: determine the entity start coordinates along the horizontal axis;
    Step S25: determine the entity end coordinates along the vertical axis;
    Step S26: merge the entity start coordinates with the entity end coordinates to determine the label information matrix;
    Step S27: set the batch size to determine the maximum sentence length within the same batch, pad each list in the label information matrix at the end until it matches that maximum length, and likewise pad the index values and the all-zero list generated from the index-value length to the maximum length;
    Step S28: set the return value to [index values, all-zero list], [entity position information matrix], and create the generator.
  4. The labeling method for named entity recognition of legal documents according to claim 1, characterized in that step S3 specifically comprises the following sub-steps:
    Step S31: feed a batch of index values produced by the generator into the BERT-SPAN pre-trained model for training to obtain word embedding vectors;
    Step S32: from the word embedding vectors, extract and capture sentence context features, understand the relationships between sentences and between semantics, and output the sentence feature encoding result.
  5. The labeling method for named entity recognition of legal documents according to claim 1, characterized in that the score transition matrix in step S4 represents the relationship between a predicted entity and multiple pieces of label information; an activation function is applied to obtain the maximum-score output for each entity, and the label with the maximum score is the label corresponding to the entity.
  6. The labeling method for named entity recognition of legal documents according to claim 1, characterized in that the method for constructing the entity labeling template in step S9 uses a markup-language design.
  7. The labeling method for named entity recognition of legal documents according to claim 1, characterized in that the attributes of the entity labeling template in step S9 include: the annotated entity's serial number, the entity type and code, and the recognized entity name.
  8. The labeling method for named entity recognition of legal documents according to claim 1, characterized in that step S9 comprises the following sub-steps:
    Step S91: construct a label set from the recognized nested entities; the label set includes person, plaintiff, defendant, time, place, event, crime, and result;
    Step S92: from the label set and the constructed horizontal-axis and vertical-axis entity annotation positions and their corresponding entities, construct a single-entity template, which includes: a single person entity, a single plaintiff entity, a single defendant entity, a single time entity, a single place entity, a single event entity, and a single crime entity;
    Step S93: from the label set, extract the constructed horizontal-axis and vertical-axis entity annotation positions and the corresponding nested entities, and construct a nested-entity template in which multiple entities are separated by enumeration commas; the nested-entity template includes: multiple person entities, multiple plaintiff entities, multiple defendant entities, multiple time entities, multiple place entities, multiple event entities, and multiple crime entities;
    Step S94: from the label set, construct a non-entity template for text that is unrecognized, or is not an entity, in the constructed horizontal-axis and vertical-axis entity annotations;
    Step S95: the set comprising the single-entity template, the nested-entity template, and the non-entity template is the entity labeling template.
  9. A labeling apparatus for named entity recognition of legal documents, characterized by comprising a memory and one or more processors, wherein executable code is stored in the memory, and when the one or more processors execute the executable code, they implement the labeling method for named entity recognition of legal documents according to any one of claims 1-8.
  10. A computer-readable storage medium, characterized in that a program is stored thereon, and when the program is executed by a processor, the labeling method for named entity recognition of legal documents according to any one of claims 1-8 is implemented.
PCT/CN2022/093493 2022-04-24 2022-05-18 Labeling method and apparatus for named entity recognition of legal documents WO2023092960A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US17/830,786 US11615247B1 (en) 2022-04-24 2022-06-02 Labeling method and apparatus for named entity recognition of legal instrument

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210434737.9A CN114580424B (zh) 2022-04-24 2022-04-24 Labeling method and apparatus for named entity recognition of legal documents
CN202210434737.9 2022-04-24

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/830,786 Continuation US11615247B1 (en) 2022-04-24 2022-06-02 Labeling method and apparatus for named entity recognition of legal instrument

Publications (1)

Publication Number Publication Date
WO2023092960A1 true WO2023092960A1 (zh) 2023-06-01

Family

ID=81784813

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/093493 WO2023092960A1 (zh) 2022-04-24 2022-05-18 Labeling method and apparatus for named entity recognition of legal documents

Country Status (2)

Country Link
CN (1) CN114580424B (zh)
WO (1) WO2023092960A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117151117A (zh) * 2023-10-30 2023-12-01 国网浙江省电力有限公司营销服务中心 电网轻量级非结构化文档内容自动识别方法、装置及介质
CN117236335A (zh) * 2023-11-13 2023-12-15 江西师范大学 基于提示学习的两阶段命名实体识别方法

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115098617A (zh) * 2022-06-10 2022-09-23 杭州未名信科科技有限公司 三元组关系抽取任务的标注方法、装置、设备及存储介质

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111680511A (zh) * 2020-04-21 2020-09-18 华东师范大学 一种多神经网络协作的军事领域命名实体识别方法
CN112214966A (zh) * 2020-09-04 2021-01-12 拓尔思信息技术股份有限公司 基于深度神经网络的实体及关系联合抽取方法
US20210319216A1 (en) * 2020-04-13 2021-10-14 Ancestry.Com Operations Inc. Topic segmentation of image-derived text
CN113536793A (zh) * 2020-10-14 2021-10-22 腾讯科技(深圳)有限公司 一种实体识别方法、装置、设备以及存储介质
CN113743119A (zh) * 2021-08-04 2021-12-03 中国人民解放军战略支援部队航天工程大学 中文命名实体识别模块、方法、装置及电子设备
US20220114476A1 (en) * 2020-10-14 2022-04-14 Adobe Inc. Utilizing a joint-learning self-distillation framework for improving text sequential labeling machine-learning models

Family Cites Families (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107391485A (zh) * 2017-07-18 2017-11-24 中译语通科技(北京)有限公司 基于最大熵和神经网络模型的韩语命名实体识别方法
CN107783960B (zh) * 2017-10-23 2021-07-23 百度在线网络技术(北京)有限公司 用于抽取信息的方法、装置和设备
CN110209807A (zh) * 2018-07-03 2019-09-06 腾讯科技(深圳)有限公司 一种事件识别的方法、模型训练的方法、设备及存储介质
CN109815339B (zh) * 2019-01-02 2022-02-08 平安科技(深圳)有限公司 基于TextCNN知识抽取方法、装置、计算机设备及存储介质
CN109992782B (zh) * 2019-04-02 2023-07-07 深圳市华云中盛科技股份有限公司 法律文书命名实体识别方法、装置及计算机设备
CN110287961B (zh) * 2019-05-06 2024-04-09 平安科技(深圳)有限公司 中文分词方法、电子装置及可读存储介质
US11748613B2 (en) * 2019-05-10 2023-09-05 Baidu Usa Llc Systems and methods for large scale semantic indexing with deep level-wise extreme multi-label learning
WO2021000362A1 (zh) * 2019-07-04 2021-01-07 浙江大学 一种基于深度神经网络模型的地址信息特征抽取方法
CN112329465A (zh) * 2019-07-18 2021-02-05 株式会社理光 一种命名实体识别方法、装置及计算机可读存储介质
CN111222317B (zh) * 2019-10-16 2022-04-29 平安科技(深圳)有限公司 序列标注方法、系统和计算机设备
CN110807328B (zh) * 2019-10-25 2023-05-05 华南师范大学 面向法律文书多策略融合的命名实体识别方法及系统
CN111310471B (zh) * 2020-01-19 2023-03-10 陕西师范大学 一种基于bblc模型的旅游命名实体识别方法
CN111460807B (zh) * 2020-03-13 2024-03-12 平安科技(深圳)有限公司 序列标注方法、装置、计算机设备和存储介质
CN111597810B (zh) * 2020-04-13 2024-01-05 广东工业大学 一种半监督解耦的命名实体识别方法
CN111651992A (zh) * 2020-04-24 2020-09-11 平安科技(深圳)有限公司 命名实体标注方法、装置、计算机设备和存储介质
CN111782768B (zh) * 2020-06-30 2021-04-27 首都师范大学 基于双曲空间表示和标签文本互动的细粒度实体识别方法
CN112434531A (zh) * 2020-10-27 2021-03-02 西安交通大学 一种有格式法律文书的命名实体和属性识别方法及系统
CN113743118A (zh) * 2021-07-22 2021-12-03 武汉工程大学 基于融合关系信息编码的法律文书中的实体关系抽取方法
CN113609859A (zh) * 2021-08-04 2021-11-05 浙江工业大学 一种基于预训练模型的特种设备中文命名实体识别方法
CN114169330B (zh) * 2021-11-24 2023-07-14 匀熵智能科技(无锡)有限公司 融合时序卷积与Transformer编码器的中文命名实体识别方法
CN114266254A (zh) * 2021-12-24 2022-04-01 上海德拓信息技术股份有限公司 一种文本命名实体识别方法与系统
CN114372153A (zh) * 2022-01-05 2022-04-19 重庆大学 基于知识图谱的法律文书结构化入库方法及系统

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210319216A1 (en) * 2020-04-13 2021-10-14 Ancestry.Com Operations Inc. Topic segmentation of image-derived text
CN111680511A (zh) * 2020-04-21 2020-09-18 华东师范大学 一种多神经网络协作的军事领域命名实体识别方法
CN112214966A (zh) * 2020-09-04 2021-01-12 拓尔思信息技术股份有限公司 基于深度神经网络的实体及关系联合抽取方法
CN113536793A (zh) * 2020-10-14 2021-10-22 腾讯科技(深圳)有限公司 一种实体识别方法、装置、设备以及存储介质
US20220114476A1 (en) * 2020-10-14 2022-04-14 Adobe Inc. Utilizing a joint-learning self-distillation framework for improving text sequential labeling machine-learning models
CN113743119A (zh) * 2021-08-04 2021-12-03 中国人民解放军战略支援部队航天工程大学 中文命名实体识别模块、方法、装置及电子设备

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117151117A (zh) * 2023-10-30 2023-12-01 国网浙江省电力有限公司营销服务中心 电网轻量级非结构化文档内容自动识别方法、装置及介质
CN117151117B (zh) * 2023-10-30 2024-03-01 国网浙江省电力有限公司营销服务中心 电网轻量级非结构化文档内容自动识别方法、装置及介质
CN117236335A (zh) * 2023-11-13 2023-12-15 江西师范大学 基于提示学习的两阶段命名实体识别方法
CN117236335B (zh) * 2023-11-13 2024-01-30 江西师范大学 基于提示学习的两阶段命名实体识别方法

Also Published As

Publication number Publication date
CN114580424B (zh) 2022-08-05
CN114580424A (zh) 2022-06-03

Similar Documents

Publication Publication Date Title
CN109299273B (zh) 基于改进seq2seq模型的多源多标签文本分类方法及其系统
CN111274394B (zh) 一种实体关系的抽取方法、装置、设备及存储介质
WO2023092960A1 (zh) 一种用于法律文书的命名实体识别的标注方法和装置
CN111753081B (zh) 基于深度skip-gram网络的文本分类的系统和方法
WO2020224219A1 (zh) 中文分词方法、装置、电子设备及可读存储介质
CN111930942B (zh) 文本分类方法、语言模型训练方法、装置及设备
CN112036184A (zh) 基于BiLSTM网络模型及CRF模型的实体识别方法、装置、计算机装置及存储介质
CN112380863A (zh) 一种基于多头自注意力机制的序列标注方法
US11615247B1 (en) Labeling method and apparatus for named entity recognition of legal instrument
CN112560506B (zh) 文本语义解析方法、装置、终端设备及存储介质
CN113255320A (zh) 基于句法树和图注意力机制的实体关系抽取方法及装置
CN114528827A (zh) 一种面向文本的对抗样本生成方法、系统、设备及终端
CN112101031A (zh) 一种实体识别方法、终端设备及存储介质
CN116150361A (zh) 一种财务报表附注的事件抽取方法、系统及存储介质
CN111950281B (zh) 一种基于深度学习和上下文语义的需求实体共指检测方法和装置
CN117035084A (zh) 一种基于语法分析的医疗文本实体关系抽取方法和系统
CN116975292A (zh) 信息识别方法、装置、电子设备、存储介质及程序产品
Wang et al. Named entity recognition for Chinese telecommunications field based on Char2Vec and Bi-LSTMs
Desai et al. Lightweight convolutional representations for on-device natural language processing
CN115204142A (zh) 开放关系抽取方法、设备及存储介质
CN114896404A (zh) 文档分类方法及装置
CN117371447A (zh) 命名实体识别模型的训练方法、装置及存储介质
CN114911940A (zh) 文本情感识别方法及装置、电子设备、存储介质
CN113673247A (zh) 基于深度学习的实体识别方法、装置、介质及电子设备
Ma et al. Study of Tibetan Text Classification based on fastText

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22897059

Country of ref document: EP

Kind code of ref document: A1