WO2020215870A1 - Named entity recognition method and device - Google Patents

Named entity recognition method and device

Info

Publication number
WO2020215870A1
WO2020215870A1 · PCT/CN2020/076196 · CN2020076196W
Authority
WO
WIPO (PCT)
Prior art keywords
feature vector
word
named entity
vector
processed
Prior art date
Application number
PCT/CN2020/076196
Other languages
English (en)
French (fr)
Inventor
张露 (Zhang Lu)
Original Assignee
BOE Technology Group Co., Ltd. (京东方科技集团股份有限公司)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BOE Technology Group Co., Ltd. (京东方科技集团股份有限公司)
Priority to EP20796429.7A (EP 3961475 A4)
Priority to US16/959,381 (US 11574124 B2)
Publication of WO2020215870A1

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N7/00Computing arrangements based on specific mathematical models
    • G06N7/01Probabilistic graphical models, e.g. probabilistic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/01Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/02Knowledge representation; Symbolic representation
    • G06N5/022Knowledge engineering; Knowledge acquisition

Definitions

  • the present disclosure relates to the field of information technology, in particular to a method and device for identifying named entities.
  • Named entity recognition refers to the recognition of entities with specific meanings in the text, including names of people, places, organizations, proper nouns, etc.
  • automatic identification of named entities from electronic medical records plays an important role in the construction of medical knowledge bases and clinical decision support.
  • named entity recognition in Chinese electronic medical records is less accurate because the records contain short sentences and many abbreviations.
  • a method for identifying named entities including:
  • the electronic text to be processed includes words, characters, and/or symbols;
  • a feature vector is generated from the word vectors and the character vectors using a bidirectional long short-term memory model;
  • the feature vector is input into a random field model to identify the named entity and obtain the type of the named entity.
  • said generating a feature vector using a bidirectional long short-term memory model according to the word vector and the character vector further includes: inputting the word vector into the model to generate a first high-level feature vector; concatenating the character vector with the first high-level feature vector to obtain a first transition feature vector; inputting the first transition feature vector into the model to generate a second high-level feature vector; concatenating the first transition feature vector with the second high-level feature vector to obtain a second transition feature vector; inputting the second transition feature vector into the model to generate a third high-level feature vector; and using the third high-level feature vector as the feature vector.
  • the named entity recognition method of the present disclosure also includes:
  • the training data including historical electronic texts, historical named entities, and corresponding historical named entity types, and the conditional random field model being optimized accordingly;
  • the named entity recognition method of the present disclosure also includes:
  • optimizing the conditional random field model through multiple iterations of the L-BFGS algorithm.
  • the text to be processed includes: Chinese electronic medical records.
  • the embodiment of the present disclosure also provides a method for constructing a knowledge graph, including: identifying the named entity through the named entity recognition method; and constructing a knowledge graph according to the identified named entity.
  • the embodiment of the present disclosure also provides a named entity recognition device, including: a memory, a processor, and a computer program stored in the memory and running on the processor.
  • when executed by the processor, the computer program implements the steps of the named entity recognition method described above.
  • the embodiment of the present disclosure also provides a computer-readable storage medium on which a computer program is stored, and when the computer program is executed by a processor, the steps in the method for identifying a named entity as described above are implemented.
  • FIG. 1 is a schematic flowchart of a named entity recognition method according to an embodiment of the disclosure
  • Fig. 2 is a structural block diagram of a named entity recognition device according to an embodiment of the disclosure.
  • the embodiment of the present disclosure provides a named entity recognition method, as shown in FIG. 1, including:
  • Step 101 Obtain an electronic text to be processed, where the electronic text to be processed includes words, characters, and/or symbols. The text to be processed includes: Chinese electronic medical records.
  • the named entity recognition method of the present disclosure further includes: performing data preprocessing on the electronic text to be processed.
  • Data preprocessing includes data cleaning, data integration, data specification and data transformation.
  • Data preprocessing can improve data quality, including accuracy, completeness, and consistency.
  • after data preprocessing, the Chinese electronic medical record is easier for the word segmentation tool to segment.
  • the word segmentation tool is used to segment the electronic text to be processed to obtain the words, characters, and/or symbols in the electronic text.
  • word segmentation tools include the Jieba segmentation tool.
  • the Jieba segmentation tool is an algorithmic model that computes language probabilities: by scoring the probability, i.e., the plausibility, of each candidate segmentation, it obtains the segmentation that best matches speaking or writing habits.
  • the Jieba tool segments Chinese text well, with high segmentation accuracy.
  • the database used by the word segmentation tool includes the International Classification of Diseases (ICD) database.
  • each time the Jieba segmentation tool starts, it first imports its default, general-purpose database or dictionary.
  • during use, a database or dictionary suited to the application can be imported; it is appended after the default database or dictionary and normally does not overwrite it.
  • when segmenting, the tool looks up whether a word exists in the database or dictionary; for example, the default database or dictionary may not contain the word "mouth ulcer" (口腔溃疡).
  • during segmentation the tool cuts a sentence into many slices and finds the slicing with the highest probability of being correct; in this process it looks up in the database or dictionary whether each candidate sub-slice exists.
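The lookup-and-best-cut behaviour described above can be sketched as a toy maximum-probability segmenter. The word frequencies below are invented for illustration; a real Jieba installation ships a large default dictionary and lets a user dictionary (e.g., ICD disease terms) be appended.

```python
import math

# Toy word-frequency dictionary (invented counts). In Jieba, a custom
# dictionary such as ICD-10 terms would be appended to the default one.
FREQ = {"口腔": 20, "溃疡": 15, "口腔溃疡": 5, "口": 50, "腔": 10}
TOTAL = sum(FREQ.values())

def segment(text):
    """Dynamic programming over all dictionary cuts: for each prefix,
    keep the segmentation with the highest total log-probability."""
    n = len(text)
    best = [(0.0, [])] + [(-math.inf, []) for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n + 1):
            word = text[i:j]
            if word in FREQ:
                score = best[i][0] + math.log(FREQ[word] / TOTAL)
                if score > best[j][0]:
                    best[j] = (score, best[i][1] + [word])
    return best[n][1]

print(segment("口腔溃疡"))  # the single dictionary entry wins over 口腔 + 溃疡
```

With these counts, the one-word cut 口腔溃疡 (probability 0.05) beats 口腔 + 溃疡 (0.2 × 0.15 = 0.03), which is exactly why adding domain terms to the dictionary changes segmentation.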
  • the International Classification of Diseases database needs to be trained before use to ensure the accuracy and completeness of the keywords in the database; this training can be completed by establishing a training database.
  • the training database includes a large number of Chinese electronic medical records. Professionals can be asked to mark the named entities and entity types in the Chinese electronic medical records in detail.
  • the entity types can be divided into five categories: body parts, diseases and diagnoses, symptoms and signs, examinations and tests, and treatments.
  • {B, D, S, C, T} are used as the labels of the five entity categories, and non-entities are labeled {X}.
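The tagging scheme can be illustrated with a hypothetical annotated fragment; the token labels below are invented for illustration and are not from the patent's training data.

```python
# The five entity categories plus X for non-entities, as defined above.
TAGS = {"B": "body part", "D": "disease/diagnosis", "S": "symptom/sign",
        "C": "examination/test", "T": "treatment", "X": "non-entity"}

# Hypothetical annotated tokens from a segmented medical record.
sample = [("右下肢", "B"), ("持续", "X"), ("牵引", "T")]
for token, tag in sample:
    print(f"{token}\t{tag}\t{TAGS[tag]}")
```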
  • Step 102 Generate a corresponding word vector based on each word in the electronic text to be processed, and generate a corresponding character vector based on each character or symbol in the electronic text to be processed.
  • the word2vec algorithm may be used to convert the word segmentation results into word vectors.
  • the word vector may be a 300-dimensional vector.
  • each character or symbol in the training text is converted into a character vector.
  • the word2vec algorithm may be used to convert the character or symbol into a character vector.
  • the character vector may be a 128-dimensional vector.
  • each character or symbol is represented by a character vector (such as a character embedding vector), that is, different dense vectors represent different characters or symbols.
  • because a computer can only compute on numeric types while the input words, characters, and symbols are textual, they must be converted into numeric vectors; training the preset long short-term memory model with the word vectors and character vectors extracts character-granularity features as well as word-vector features fused with semantic information.
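A minimal sketch of the two lookups: random embedding tables stand in for trained word2vec models, but the 300- and 128-dimensional sizes from the text are kept.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy embedding tables standing in for word2vec output; the text uses
# 300-dim word vectors and 128-dim character vectors.
WORD_DIM, CHAR_DIM = 300, 128
word_vocab = {"口腔溃疡": 0, "持续": 1}
char_vocab = {c: i for i, c in enumerate("口腔溃疡持续")}
word_emb = rng.normal(size=(len(word_vocab), WORD_DIM))
char_emb = rng.normal(size=(len(char_vocab), CHAR_DIM))

def word_vector(word):
    """One dense 300-dim vector per word."""
    return word_emb[word_vocab[word]]

def char_vectors(word):
    """One dense 128-dim vector per character of the word."""
    return np.stack([char_emb[char_vocab[c]] for c in word])

print(word_vector("口腔溃疡").shape)   # (300,)
print(char_vectors("口腔溃疡").shape)  # (4, 128)
```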
  • Step 103 Generate a feature vector from the word vectors and the character vectors using a bidirectional long short-term memory model.
  • generating the feature vector using the bidirectional long short-term memory model further includes: inputting the word vector into the model to generate a first high-level feature vector; concatenating the character vector with the first high-level feature vector to obtain a first transition feature vector; inputting the first transition feature vector into the model to generate a second high-level feature vector; concatenating the first transition feature vector with the second high-level feature vector to obtain a second transition feature vector; inputting the second transition feature vector into the model to generate a third high-level feature vector; and using the third high-level feature vector as the feature vector.
  • a BiLSTM (Bi-directional Long Short-Term Memory) model is used to learn the word vectors and character vectors produced by the word2vec algorithm to obtain the corresponding word-vector features and character-vector features.
  • combining feature H1 with feature Y2 as the input of the next BiLSTM layer is residual learning.
  • three BiLSTM layers are used in total, forming the stacked BiLSTM.
  • the number of times the long short-term memory model is applied is not limited to three; other numbers may be used, and three is only an example.
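The stacked residual arrangement above can be sketched with toy NumPy recurrences. This is only a shape-level illustration: random weights stand in for trained BiLSTM gates, and the character vectors are assumed to be aligned one per time step with the word vectors.

```python
import numpy as np

rng = np.random.default_rng(1)

def bilstm_like(x, hidden=16):
    """Stand-in for one BiLSTM layer: a forward and a backward tanh
    recurrence whose outputs are concatenated per time step. Weights
    are random; a real model would learn them."""
    def run(seq):
        W = rng.normal(scale=0.1, size=(seq.shape[1], hidden))
        U = rng.normal(scale=0.1, size=(hidden, hidden))
        h, out = np.zeros(hidden), []
        for t in range(seq.shape[0]):
            h = np.tanh(seq[t] @ W + h @ U)
            out.append(h)
        return np.stack(out)
    fwd = run(x)
    bwd = run(x[::-1])[::-1]
    return np.concatenate([fwd, bwd], axis=1)      # (T, 2*hidden)

T = 5
word_vecs = rng.normal(size=(T, 300))   # word-granularity inputs
char_vecs = rng.normal(size=(T, 128))   # character-granularity inputs

y0 = bilstm_like(word_vecs)                    # first high-level features
h1 = np.concatenate([char_vecs, y0], axis=1)   # first transition (residual concat)
y2 = bilstm_like(h1)                           # second high-level features
h2 = np.concatenate([h1, y2], axis=1)          # second transition (residual concat)
y3 = bilstm_like(h2)                           # third high-level features = output
print(y3.shape)
```

The concatenations are what carries earlier features past each layer, which is the residual behaviour the text attributes to combining H1 with Y2.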
  • Step 104 Input the feature vector into a random field model to identify the named entity, and obtain the type of the named entity.
  • the conditional random field model can predict the entity type of each word and character.
  • the input of the conditional random field model is the high-level features, and the output is the input text and the type corresponding to the text, namely non-entity (represented by X) or an entity type (B, D, S, C, T).
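A linear-chain CRF assigns each token sequence the tag sequence with the highest total emission-plus-transition score, decoded with the Viterbi algorithm. The sketch below uses random scores as stand-ins for trained CRF potentials over the six tags.

```python
import numpy as np

LABELS = ["X", "B", "D", "S", "C", "T"]
rng = np.random.default_rng(2)

def viterbi(emissions, transitions):
    """Most-likely tag sequence under a linear-chain CRF score:
    per-token emission scores plus pairwise transition scores."""
    T, K = emissions.shape
    score = emissions[0].copy()
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        total = score[:, None] + transitions + emissions[t][None, :]
        back[t] = total.argmax(axis=0)
        score = total.max(axis=0)
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return [LABELS[i] for i in reversed(path)]

# Random stand-ins: emissions would come from the BiLSTM stack,
# transitions from CRF training.
emissions = rng.normal(size=(4, len(LABELS)))
transitions = rng.normal(size=(len(LABELS), len(LABELS)))
tags = viterbi(emissions, transitions)
print(tags)
```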
  • the named entity recognition method of the present disclosure further includes: obtaining training data, the training data including historical electronic text, historical named entities and corresponding historical named entity types;
  • the named entity recognition method of the present disclosure further includes: optimizing the conditional random field model through multiple iterations of the LBFGS algorithm.
  • the L-BFGS algorithm is the BFGS algorithm performed with limited memory.
  • the LBFGS algorithm is an optimization algorithm for neural networks. It is suitable for processing large-scale data, has a fast convergence speed, and can save a lot of storage space and computing resources.
  • when a Conditional Random Field (CRF) model is used, the weight coefficients are initialized first. Under the initial weight coefficients there is an error between the predicted output and the ground truth; if the error is greater than the error threshold, the CRF model needs to be optimized, specifically by optimizing the initial weight coefficients.
  • the optimization algorithm is the LBFGS algorithm.
  • based on the error output by the CRF model, the L-BFGS algorithm computes and returns a series of parameters.
  • a technician can adjust the initial weight coefficients of the CRF model according to these parameters to obtain optimized weight coefficients. If, under the optimized weights, the output error of the CRF model is still greater than the error threshold, the model is optimized repeatedly, i.e., the L-BFGS algorithm is iterated multiple times, until the error of the CRF model falls below the error threshold.
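As an illustration of the iterative weight refinement, SciPy's L-BFGS implementation can minimize a stand-in convex loss; the real objective would be the CRF's training error, which is not reproduced here.

```python
import numpy as np
from scipy.optimize import minimize

# Convex quadratic standing in for the CRF training loss; the true
# optimum is the solution of A @ w = b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])

def loss(w):
    return 0.5 * w @ A @ w - b @ w

def grad(w):
    return A @ w - b

# L-BFGS iterates from the initial weights until the error is small,
# mirroring the repeated optimization described above.
res = minimize(loss, x0=np.zeros(2), jac=grad, method="L-BFGS-B")
print(res.x)  # iteratively refined weights
```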
  • the named entity recognition method of the present disclosure further includes evaluating the combination of the bidirectional long short-term memory model and the conditional random field model; the evaluation parameters include precision, recall, and F1 score.
  • precision, recall, and F1 score can be used to measure the named entity recognition performance of the long short-term memory model combined with the conditional random field model.
  • Table 1 shows the model precision and recall obtained from historical model outputs and ground-truth data; those skilled in the art can evaluate the model based on the data in Table 1 and, further, optimize the model based on this data.
  • the feature vector is extracted jointly from the word vectors and the character vectors, so the features of both characters and words are obtained at the same time while word-segmentation errors are greatly reduced; in addition, combining the long short-term memory model with the conditional random field model
  • for named entity recognition absorbs more character and word features, which further improves the accuracy of entity recognition.
  • the embodiment of the present disclosure also provides a named entity recognition device, as shown in FIG. 2, including:
  • the text acquisition module 21 is configured to acquire the electronic text to be processed, the electronic text to be processed including words, characters, and/or symbols;
  • the vector generation module 22 is configured to generate a corresponding word vector based on each word in the electronic text to be processed, and a corresponding character vector based on each character or symbol in the electronic text to be processed;
  • the feature vector generation module 23 uses a bidirectional long short-term memory model to generate feature vectors from the word vectors and the character vectors;
  • the named entity recognition module 24 inputs the feature vector into a random field model to recognize the named entity and obtain the type of the named entity.
  • the feature vectors are jointly extracted from the word vectors and the character vectors, which captures the characteristics of both characters and words while greatly reducing word-segmentation errors; in addition, combining the long short-term memory model with the conditional random field model for named entity
  • recognition absorbs more character and word features, which further improves the accuracy of entity recognition.
  • the feature vector generation module 23 further includes:
  • a first high-level feature vector generation unit, configured to input the word vector into the bidirectional long short-term memory model to generate a first high-level feature vector;
  • a first transition feature vector acquisition unit, configured to concatenate the character vector with the first high-level feature vector to obtain a first transition feature vector;
  • a second high-level feature vector generation unit, configured to input the first transition feature vector into the bidirectional long short-term memory model to generate a second high-level feature vector;
  • a second transition feature vector unit, configured to concatenate the first transition feature vector with the second high-level feature vector to obtain a second transition feature vector;
  • a third high-level feature vector generation unit, configured to input the second transition feature vector into the bidirectional long short-term memory model to generate a third high-level feature vector;
  • a feature vector unit, configured to use the third high-level feature vector as the feature vector.
  • the named entity recognition device of the present disclosure further includes:
  • a training data acquisition module for acquiring training data, the training data including historical electronic texts, historical named entities and corresponding historical named entity types;
  • the model optimization module is used to optimize the conditional random field model according to the historical electronic text, historical named entity and corresponding historical named entity type.
  • the named entity recognition device of the present disclosure further includes: an algorithm iteration unit for optimizing the conditional random field model through multiple iterations of the LBFGS algorithm.
  • the named entity recognition device of the present disclosure further includes: a word segmentation module, which is used to segment the electronic text to be processed through a word segmentation tool to obtain words, characters, and/or characters in the electronic text to be processed.
  • the named entity recognition device of the present disclosure further includes: a preprocessing module, which is used to perform data preprocessing on the electronic text to be processed.
  • the named entity recognition device of the present disclosure further includes:
  • the model evaluation module is used to evaluate the conditional random field model through evaluation parameters, the evaluation parameters including precision, recall, and F1 score.
  • the long short-term memory model is applied three times, forming the stacked long short-term memory model.
  • the use of the stacked long- and short-term memory model can solve the problem of key information loss during network training and transmission, and is beneficial to the extraction of key features.
  • the number of times that the long and short-term memory model is used is not limited to three, and other numbers can be used, and only three times are used as an example for illustration.
  • the trained long short-term memory model and the optimized conditional random field model can be used to identify the named entities in the electronic text to be processed: inputting the text to be processed into the trained long short-term memory model and the conditional random field model outputs the named entities in the text.
  • the technical solution of this embodiment provides a stacked-residual-BiLSTM named entity recognition method for Chinese electronic medical records that combines character features and word features, which not only increases the richness of the input feature information but also reduces the loss of feature information during training, thereby improving the accuracy of named entity recognition in Chinese electronic medical records.
  • the feature vector is extracted jointly from the word vectors and the character vectors, which captures the characteristics of both characters and words while greatly reducing word-segmentation errors.
  • combining the long short-term memory model with the conditional random field model for named entity recognition absorbs more character and word features, which further improves the accuracy of entity recognition.
  • the embodiment of the present disclosure also provides a method for constructing a knowledge graph, including: identifying the named entity by a named entity recognition method; and constructing a knowledge graph according to the identified named entity.
  • all named entities associated with the named entity can be obtained according to the identified named entity, including but not limited to the first-degree and second-degree associated named entities.
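Retrieving the first- and second-degree associations of an identified entity amounts to a bounded breadth-first search over the graph. The mini-graph below is hypothetical, purely to illustrate the traversal.

```python
from collections import deque

# Hypothetical mini knowledge graph over recognized entities.
graph = {
    "口腔溃疡": ["维生素B2", "口腔"],
    "维生素B2": ["核黄素"],
    "口腔": [],
    "核黄素": [],
}

def neighbors_within(graph, start, max_degree=2):
    """Entities reachable from `start` within `max_degree` hops,
    i.e., the first- and second-degree associations."""
    seen, out = {start}, {}
    queue = deque([(start, 0)])
    while queue:
        node, degree = queue.popleft()
        if degree == max_degree:
            continue
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                out[nxt] = degree + 1
                queue.append((nxt, degree + 1))
    return out

print(neighbors_within(graph, "口腔溃疡"))
```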
  • the embodiment of the present disclosure also provides a named entity recognition device, including: a memory, a processor, and a computer program stored in the memory and running on the processor.
  • when executed by the processor, the computer program implements the steps of the named entity recognition method described above.
  • the embodiment of the present disclosure also provides a computer-readable storage medium on which a computer program is stored, and when the computer program is executed by a processor, the steps in the method for identifying a named entity as described above are implemented.
  • the embodiments described herein can be implemented by hardware, software, firmware, middleware, microcode, or a combination thereof.
  • the processing unit can be implemented in one or more Application-Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field-Programmable Gate Arrays (FPGAs), general-purpose processors, controllers, microcontrollers, microprocessors, other electronic units for performing the functions described in this application, or a combination thereof.
  • the technology described herein can be implemented through modules (such as procedures, functions, etc.) that perform the functions described herein.
  • the software codes can be stored in the memory and executed by the processor.
  • the memory can be implemented in the processor or external to the processor.
  • the embodiments of the present disclosure may be provided as methods, devices, or computer program products. Therefore, the embodiments may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the embodiments may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) containing computer-usable program code.
  • These computer program instructions can also be stored in a computer-readable memory that can direct a computer or other programmable data processing user equipment to work in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including the instruction device.
  • the instruction device implements the functions specified in one process or multiple processes in the flowchart and/or one block or multiple blocks in the block diagram.
  • These computer program instructions can also be loaded onto a computer or other programmable data processing user equipment, so that a series of operation steps are executed on the computer or other programmable user equipment to produce computer-implemented processing; the instructions executed on the computer or other programmable user equipment thereby provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Computational Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Optimization (AREA)
  • Mathematical Analysis (AREA)
  • Algebra (AREA)
  • Machine Translation (AREA)
  • Character Discrimination (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present disclosure provides a named entity recognition method and device, belonging to the field of information technology. The named entity recognition method includes: obtaining electronic text to be processed, the electronic text including words, characters, and/or symbols; generating corresponding word vectors based on each word in the electronic text, and generating corresponding character vectors based on each character or symbol in the electronic text; generating a feature vector from the word vectors and/or the character vectors using a bidirectional long short-term memory model; and inputting the feature vector into a random field model to identify the named entities and obtain their types.

Description

Named entity recognition method and device
Cross-reference to related applications
This application claims priority to Chinese Patent Application No. 201910325442.6, filed in China on April 22, 2019, the entire contents of which are incorporated herein by reference.
Technical field
The present disclosure relates to the field of information technology, and in particular to a named entity recognition method and device.
Background
Named entity recognition refers to identifying entities with specific meanings in text, mainly including names of people, places, and organizations, proper nouns, and the like. In the medical field, automatically identifying named entities from electronic medical records plays an important role in building medical knowledge bases and supporting clinical decisions. Compared with named entity recognition in general domains, named entity recognition in Chinese electronic medical records is less accurate because the records contain short sentences and many abbreviations.
Summary
The present disclosure provides the following technical solutions.
In one aspect, a named entity recognition method is provided, including:
obtaining electronic text to be processed, the electronic text including words, characters, and/or symbols;
generating a corresponding word vector based on each word in the electronic text, and generating a corresponding character vector based on each character or symbol in the electronic text;
generating a feature vector from the word vectors and the character vectors using a bidirectional long short-term memory model; and
inputting the feature vector into a random field model to identify the named entities and obtain their types.
Generating the feature vector from the word vectors and the character vectors using the bidirectional long short-term memory model further includes:
inputting the word vector into the bidirectional long short-term memory model to generate a first high-level feature vector;
concatenating the character vector with the first high-level feature vector to obtain a first transition feature vector;
inputting the first transition feature vector into the bidirectional long short-term memory model to generate a second high-level feature vector;
concatenating the first transition feature vector with the second high-level feature vector to obtain a second transition feature vector;
inputting the second transition feature vector into the bidirectional long short-term memory model to generate a third high-level feature vector; and
using the third high-level feature vector as the feature vector.
The named entity recognition method of the present disclosure further includes:
obtaining training data, the training data including historical electronic texts, historical named entities, and corresponding historical named entity types; and
optimizing the conditional random field model according to the historical electronic texts, historical named entities, and corresponding historical named entity types.
The named entity recognition method of the present disclosure further includes:
optimizing the conditional random field model through multiple iterations of the L-BFGS algorithm.
The text to be processed includes Chinese electronic medical records.
An embodiment of the present disclosure further provides a knowledge graph construction method, including: identifying named entities through the named entity recognition method; and constructing a knowledge graph according to the identified named entities.
An embodiment of the present disclosure further provides a named entity recognition apparatus, including a memory, a processor, and a computer program stored in the memory and executable on the processor; when executed by the processor, the computer program implements the steps of the named entity recognition method described above.
An embodiment of the present disclosure further provides a computer-readable storage medium storing a computer program; when executed by a processor, the computer program implements the steps of the named entity recognition method described above.
Brief description of the drawings
FIG. 1 is a schematic flowchart of a named entity recognition method according to an embodiment of the present disclosure.
FIG. 2 is a structural block diagram of a named entity recognition device according to an embodiment of the present disclosure.
Detailed description
To make the technical problems, technical solutions, and advantages of the embodiments of the present disclosure clearer, they are described in detail below with reference to the drawings and specific embodiments.
In named entity recognition for Chinese electronic medical records, short sentences and frequent abbreviations lead to low recognition accuracy; the technical solution of the present disclosure can therefore be used to improve the accuracy of named entity recognition in Chinese electronic medical records.
An embodiment of the present disclosure provides a named entity recognition method, as shown in FIG. 1, including:
Step 101: Obtain electronic text to be processed, the electronic text including words, characters, and/or symbols. The text to be processed includes Chinese electronic medical records.
The named entity recognition method of the present disclosure further includes performing data preprocessing on the electronic text to be processed. Data preprocessing includes data cleaning, data integration, data reduction, and data transformation; it improves data quality, including accuracy, completeness, and consistency.
After the Chinese electronic medical record has been preprocessed, it is easier for the word segmentation tool to segment it.
The electronic text to be processed is segmented by a word segmentation tool to obtain the words, characters, and/or symbols in the text. Word segmentation tools include the Jieba segmentation tool.
The Jieba segmentation tool is an algorithmic model that computes language probabilities: by scoring the probability, i.e., the plausibility, of each candidate segmentation, it obtains the segmentation that best matches speaking or writing habits. Jieba segments Chinese text well, with high segmentation accuracy.
The database used by the segmentation tool includes the International Classification of Diseases database.
According to its startup routine, each time the Jieba tool starts it first imports its default, general-purpose database or dictionary. During use, the user can import a database or dictionary suited to the application; it is appended after the default database or dictionary and normally does not overwrite it. When segmenting, the tool looks up whether a word exists in the database or dictionary; for example, the default database or dictionary may not contain the word "口腔溃疡" (mouth ulcer). During segmentation the tool cuts a sentence into many slices and finds the slicing with the highest probability of being correct; in this process it looks up in the database or dictionary whether each sub-slice exists.
Take the International Classification of Diseases database ICD-10 used by the Jieba tool as an example, i.e., ICD-10 is imported into Jieba. For example, if a Chinese electronic medical record reads "给与右下肢持续皮牵引" (apply continuous skin traction to the right lower limb), segmenting it with the ICD-10 database in Jieba yields "给与", "右下肢", "持续", "皮", "牵引".
The International Classification of Diseases database needs to be trained before use to ensure the accuracy and completeness of the keywords in the database; this training can be completed by establishing a training database.
The training database includes a large number of Chinese electronic medical records, in which professionals can be asked to annotate the named entities and entity types in detail. The entity types fall into five categories: body parts, diseases and diagnoses, symptoms and signs, examinations and tests, and treatments, labeled {B, D, S, C, T} respectively; non-entities are labeled {X}.
Step 102: Generate a corresponding word vector based on each word in the electronic text to be processed, and generate a corresponding character vector based on each character or symbol in the text.
As an example, the word segmentation results can be converted into word vectors with the word2vec algorithm; a word vector may be, for example, a 300-dimensional vector.
Each character or symbol in the training text is converted into a character vector; specifically, the word2vec algorithm may be used, and a character vector may be, for example, a 128-dimensional vector. Character-based vectors of the Chinese electronic medical record are thus obtained, each character or symbol represented by a character vector (such as a character embedding vector), i.e., different dense vectors represent different characters or symbols.
Because a computer can only compute on numeric types while the input words, characters, and symbols are textual, they must be converted into numeric vectors. Training the preset long short-term memory model with the word vectors and character vectors extracts character-granularity features as well as word-vector features fused with semantic information.
Step 103: Generate a feature vector from the word vectors and the character vectors using a bidirectional long short-term memory model.
Generating the feature vector from the word vectors and/or the character vectors using the bidirectional long short-term memory model further includes:
inputting the word vector into the bidirectional long short-term memory model to generate a first high-level feature vector;
concatenating the character vector with the first high-level feature vector to obtain a first transition feature vector;
inputting the first transition feature vector into the bidirectional long short-term memory model to generate a second high-level feature vector;
concatenating the first transition feature vector with the second high-level feature vector to obtain a second transition feature vector;
inputting the second transition feature vector into the bidirectional long short-term memory model to generate a third high-level feature vector; and
using the third high-level feature vector as the feature vector.
As an example, a BiLSTM (Bi-directional Long Short-Term Memory) model is used to learn the word vectors and character vectors produced by the word2vec algorithm to obtain the corresponding word-vector features and character-vector features.
The word vector is input into the bidirectional long short-term memory model to generate the first high-level feature vector Y0.
The character vector is concatenated with the first high-level feature vector Y0 to obtain the first transition feature vector H1.
The first transition feature vector H1 is input into the bidirectional long short-term memory model to generate the second high-level feature vector Y2.
The first transition feature vector H1 is concatenated with the second high-level feature vector Y2 to obtain the second transition feature vector H2.
The second transition feature vector H2 is input into the bidirectional long short-term memory model for training to generate the third high-level feature vector Y3.
Combining feature H1 with feature Y2 as the input of the BiLSTM model is residual learning. In the feature-processing steps above, BiLSTM is applied three times in total; this is the stacked BiLSTM.
Of course, in the technical solution of the present disclosure, the number of times the long short-term memory model is applied is not limited to three; other numbers may be used, and three is only an example.
Step 104: Input the feature vector into the random field model to identify the named entities and obtain their types.
The conditional random field model can predict the entity type of each word and character. Its input is the high-level features, and its output is the input text and the type corresponding to the text, namely non-entity (denoted X) or an entity type (B, D, S, C, T).
The named entity recognition method of the present disclosure further includes: obtaining training data, the training data including historical electronic texts, historical named entities, and corresponding historical named entity types;
and optimizing the conditional random field model according to the historical electronic texts, historical named entities, and corresponding historical named entity types.
The named entity recognition method of the present disclosure further includes optimizing the conditional random field model through multiple iterations of the L-BFGS algorithm.
The L-BFGS algorithm is the BFGS algorithm performed in limited memory. It is an optimization algorithm for neural networks; it is suitable for large-scale data, converges quickly, and saves considerable storage space and computing resources.
When a Conditional Random Field (CRF) model is used, its weight coefficients are initialized first. Under the initial weight coefficients there is an error between the predicted output and the ground truth; if this error exceeds the error threshold, the CRF model needs to be optimized, specifically by optimizing the initial weight coefficients.
In the embodiment of the present disclosure, the optimization algorithm is L-BFGS. Based on the error output by the CRF model, the L-BFGS algorithm computes and returns a series of parameters. A technician can adjust the initial weight coefficients of the CRF model according to these parameters to obtain optimized weight coefficients. If, under the optimized weights, the output error of the CRF model is still greater than the error threshold, the model is optimized repeatedly, i.e., the L-BFGS algorithm is iterated multiple times until the error of the CRF model falls below the error threshold.
The named entity recognition method of the present disclosure further includes evaluating the combination of the bidirectional long short-term memory model and the conditional random field model; the evaluation parameters include precision, recall, and F1 score.
After named entity recognition with the long short-term memory model and the conditional random field model, precision, recall, and F1 score can be used to measure the recognition performance of the combined models. Table 1 shows the model precision and recall obtained from historical model outputs and ground-truth data; those skilled in the art can evaluate the model based on the data in Table 1 and, further, optimize the model based on this data.
Table 1

Label   Precision   Recall   F1 score
D       0.600       0.061    0.111
S       0.752       0.820    0.784
C       0.881       0.904    0.892
B       0.523       0.832    0.642
T       0.891       0.948    0.919
In the embodiments of the present disclosure, the feature vector is extracted jointly from the word vectors and the character vectors, so the features of both characters and words are obtained at the same time while word-segmentation errors are greatly reduced; in addition, combining the long short-term memory model with the conditional random field model for named entity recognition absorbs more character and word features, further improving the accuracy of entity recognition.
An embodiment of the present disclosure further provides a named entity recognition device, as shown in FIG. 2, including:
a text acquisition module 21, configured to obtain electronic text to be processed, the electronic text including words, characters, and/or symbols;
a vector generation module 22, configured to generate a corresponding word vector based on each word in the electronic text and a corresponding character vector based on each character or symbol in the electronic text;
a feature vector generation module 23, configured to generate a feature vector from the word vectors and the character vectors using a bidirectional long short-term memory model; and
a named entity recognition module 24, configured to input the feature vector into a random field model to identify the named entities and obtain their types.
In this embodiment, the feature vector is extracted jointly from the word vectors and the character vectors, so the features of both characters and words are obtained while word-segmentation errors are greatly reduced; combining the long short-term memory model with the conditional random field model further absorbs more character and word features and improves the accuracy of entity recognition.
The feature vector generation module 23 further includes:
a first high-level feature vector generation unit, configured to input the word vector into the bidirectional long short-term memory model to generate a first high-level feature vector;
a first transition feature vector acquisition unit, configured to concatenate the character vector with the first high-level feature vector to obtain a first transition feature vector;
a second high-level feature vector generation unit, configured to input the first transition feature vector into the bidirectional long short-term memory model to generate a second high-level feature vector;
a second transition feature vector unit, configured to concatenate the first transition feature vector with the second high-level feature vector to obtain a second transition feature vector;
a third high-level feature vector generation unit, configured to input the second transition feature vector into the bidirectional long short-term memory model to generate a third high-level feature vector; and
a feature vector unit, configured to use the third high-level feature vector as the feature vector.
The named entity recognition device of the present disclosure further includes:
a training data acquisition module, configured to obtain training data, the training data including historical electronic texts, historical named entities, and corresponding historical named entity types; and
a model optimization module, configured to optimize the conditional random field model according to the historical electronic texts, historical named entities, and corresponding historical named entity types.
The named entity recognition device further includes an algorithm iteration unit, configured to optimize the conditional random field model through multiple iterations of the L-BFGS algorithm.
The device further includes a word segmentation module, configured to segment the electronic text to be processed with a word segmentation tool to obtain the words, characters, and/or symbols in the text.
The device further includes a preprocessing module, configured to perform data preprocessing on the electronic text to be processed.
The device further includes a model evaluation module, configured to evaluate the conditional random field model through evaluation parameters, the evaluation parameters including precision, recall, and F1 score.
This embodiment applies the long short-term memory model three times, which constitutes a stacked long short-term memory model. Using a stacked long short-term memory model mitigates the loss of key information as features propagate through the network during training, which benefits the extraction of key features. Of course, in the technical solution of the present disclosure, the number of times the long short-term memory model is applied is not limited to three; other numbers are possible, and three is used here only as an example.
With the trained long short-term memory model and the optimized conditional random field model, named entity recognition can be performed on the electronic text to be processed: the text to be processed is input into the trained long short-term memory model and conditional random field model, which then output the named entities in the text.
The technical solution of this embodiment provides a stacked residual BiLSTM method for named entity recognition in Chinese electronic medical records that combines character features and word features. It not only enriches the input feature information but also reduces the loss of feature information during training, thereby improving the accuracy of named entity recognition in Chinese electronic medical records.
Embodiments of the present disclosure have the following beneficial effects:
In the above solution, extracting feature vectors jointly from the word vectors and the character vectors captures both character-level and word-level features while greatly reducing word segmentation errors. In addition, combining the long short-term memory model with the conditional random field model for named entity recognition absorbs more character and word features, further improving the precision of entity recognition.
An embodiment of the present disclosure further provides a method for constructing a knowledge graph, including: recognizing the named entity by the named entity recognition method; and constructing a knowledge graph from the recognized named entity.
By constructing a knowledge graph, all named entities associated with a recognized named entity can be obtained, including but not limited to first-degree and second-degree associated named entities.
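The first- and second-degree association lookup can be sketched as a breadth-first traversal over an entity graph. The entities and edges below are invented for illustration; in practice they would come from the recognition step.

```python
from collections import defaultdict

edges = [
    ("diabetes", "insulin"),
    ("insulin", "hypoglycemia"),
    ("diabetes", "blood glucose"),
]
graph = defaultdict(set)
for a, b in edges:          # undirected association edges
    graph[a].add(b)
    graph[b].add(a)

def associated(entity, degree=2):
    """Entities reachable within `degree` hops (1st-, 2nd-degree, ...)."""
    seen, frontier = {entity}, {entity}
    for _ in range(degree):
        frontier = {n for e in frontier for n in graph[e]} - seen
        seen |= frontier
    return seen - {entity}
```

Here `associated("diabetes", 1)` returns the directly linked entities, while degree 2 also reaches entities linked through them.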
An embodiment of the present disclosure further provides a named entity recognition apparatus, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the steps of the named entity recognition method described above.
An embodiment of the present disclosure further provides a computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the steps of the named entity recognition method described above.
It should be understood that the embodiments described herein may be implemented in hardware, software, firmware, middleware, microcode, or a combination thereof. For a hardware implementation, the processing units may be implemented in one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field-Programmable Gate Arrays (FPGAs), general-purpose processors, controllers, microcontrollers, microprocessors, other electronic units for performing the functions described in this application, or a combination thereof.
For a software implementation, the techniques described herein may be implemented by modules (e.g., procedures, functions, and so on) that perform the functions described herein. Software code may be stored in a memory and executed by a processor. The memory may be implemented within the processor or external to the processor.
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for the same or similar parts the embodiments may be referred to one another.
Those skilled in the art will appreciate that the embodiments of the present disclosure may be provided as a method, a device, or a computer program product. Accordingly, the embodiments of the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the embodiments of the present disclosure may take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
The embodiments of the present disclosure are described with reference to flowcharts and/or block diagrams of methods, user equipment (systems), and computer program products according to the embodiments of the present disclosure. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks therein, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data-processing user equipment to produce a machine, such that the instructions executed by the processor of the computer or other programmable data-processing user equipment produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data-processing user equipment to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data-processing user equipment, causing a series of operational steps to be performed on the computer or other programmable user equipment to produce computer-implemented processing, such that the instructions executed on the computer or other programmable user equipment provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Although optional embodiments of the present disclosure have been described, those skilled in the art, once apprised of the basic inventive concept, may make additional changes and modifications to these embodiments. Therefore, the appended claims are intended to be construed as covering the optional embodiments and all changes and modifications falling within the scope of the embodiments of the present disclosure.
It should also be noted that, herein, relational terms such as "first" and "second" are used merely to distinguish one entity or operation from another entity or operation, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Moreover, the terms "comprise", "include", or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or user equipment that includes a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or user equipment. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or user equipment that includes the element.
The above are optional implementations of the present disclosure. It should be noted that a person of ordinary skill in the art can make several improvements and refinements without departing from the principles described in the present disclosure, and these improvements and refinements also fall within the protection scope of the present disclosure.

Claims (20)

  1. A named entity recognition method, comprising:
    acquiring electronic text to be processed, the electronic text to be processed comprising words, characters, and/or symbols;
    generating a corresponding word vector based on each word in the electronic text to be processed, and generating a corresponding character vector based on each character or symbol in the electronic text to be processed;
    generating a feature vector from the word vectors and the character vectors using a bidirectional long short-term memory model; and
    inputting the feature vector into a conditional random field model to recognize the named entity and obtain the type of the named entity.
  2. The method of claim 1, wherein generating the feature vector from the word vectors and the character vectors using the bidirectional long short-term memory model further comprises:
    inputting the word vectors into the bidirectional long short-term memory model to generate a first high-level feature vector;
    concatenating the character vectors with the first high-level feature vector to obtain a first transition feature vector;
    inputting the first transition feature vector into the bidirectional long short-term memory model to generate a second high-level feature vector;
    concatenating the first transition feature vector with the second high-level feature vector to obtain a second transition feature vector;
    inputting the second transition feature vector into the bidirectional long short-term memory model to generate a third high-level feature vector; and
    taking the third high-level feature vector as the feature vector.
  3. The method of claim 1, further comprising:
    acquiring training data, the training data comprising historical electronic text, historical named entities, and corresponding historical named entity types; and
    optimizing a conditional random field model according to the historical electronic text, historical named entities, and corresponding historical named entity types.
  4. The method of claim 3, further comprising:
    optimizing the conditional random field model by iterating the LBFGS algorithm multiple times.
  5. The method of claim 1, wherein the electronic text to be processed comprises a Chinese electronic medical record.
  6. The method of claim 1, further comprising:
    segmenting the electronic text to be processed with a word segmentation tool to obtain the words, characters, and/or symbols in the electronic text to be processed.
  7. The method of claim 6, wherein the word segmentation tool comprises the Jieba word segmentation tool.
  8. The method of claim 1, further comprising:
    performing data preprocessing on the electronic text to be processed.
  9. The method of claim 1, further comprising: evaluating the combination of the bidirectional long short-term memory model and the conditional random field model using evaluation parameters, the evaluation parameters comprising precision, recall, and F1 score.
  10. The method of claim 7, wherein the database used by the word segmentation tool is the International Classification of Diseases database.
  11. A named entity recognition device, comprising:
    a text acquisition module for acquiring electronic text to be processed, the electronic text to be processed comprising words, characters, and/or symbols;
    a vector generation module for generating a corresponding word vector based on each word in the electronic text to be processed, and generating a corresponding character vector based on each character or symbol in the electronic text to be processed;
    a feature vector generation module for generating a feature vector from the word vectors and the character vectors using a bidirectional long short-term memory model; and
    a named entity recognition module for inputting the feature vector into a conditional random field model to recognize the named entity and obtain the type of the named entity.
  12. The device of claim 11, wherein the feature vector generation module further comprises:
    a first high-level feature vector generation unit for inputting the word vectors into the bidirectional long short-term memory model to generate a first high-level feature vector;
    a first transition feature vector acquisition unit for concatenating the character vectors with the first high-level feature vector to obtain a first transition feature vector;
    a second high-level feature vector generation unit for inputting the first transition feature vector into the bidirectional long short-term memory model to generate a second high-level feature vector;
    a second transition feature vector unit for concatenating the first transition feature vector with the second high-level feature vector to obtain a second transition feature vector;
    a third high-level feature vector generation unit for inputting the second transition feature vector into the bidirectional long short-term memory model to generate a third high-level feature vector; and
    a feature vector unit for taking the third high-level feature vector as the feature vector.
  13. The device of claim 11, further comprising:
    a training data acquisition module for acquiring training data, the training data comprising historical electronic text, historical named entities, and corresponding historical named entity types; and
    a model optimization module for optimizing a conditional random field model according to the historical electronic text, historical named entities, and corresponding historical named entity types.
  14. The device of claim 13, further comprising:
    an algorithm iteration unit for optimizing the conditional random field model by iterating the LBFGS algorithm multiple times.
  15. The device of claim 11, further comprising:
    a word segmentation module for segmenting the electronic text to be processed with a word segmentation tool to obtain the words, characters, and/or symbols in the electronic text to be processed.
  16. The device of claim 11, further comprising:
    a preprocessing module for performing data preprocessing on the electronic text to be processed.
  17. The device of claim 11, further comprising:
    a model evaluation module for evaluating the combination of the bidirectional long short-term memory model and the conditional random field model, the evaluation parameters comprising precision, recall, and F1 score.
  18. A method for constructing a knowledge graph, comprising:
    recognizing the named entity by the named entity recognition method of any one of claims 1 to 10; and
    constructing a knowledge graph from the recognized named entity.
  19. A named entity recognition apparatus, comprising: a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the steps of the named entity recognition method of any one of claims 1 to 10.
  20. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the steps of the named entity recognition method of any one of claims 1 to 10.
PCT/CN2020/076196 2019-04-22 2020-02-21 Named entity recognition method and device WO2020215870A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP20796429.7A EP3961475A4 (en) 2019-04-22 2020-02-21 NAMED ENTITY IDENTIFICATION METHOD AND APPARATUS
US16/959,381 US11574124B2 (en) 2019-04-22 2020-02-21 Method and apparatus of recognizing named entity

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910325442.6 2019-04-22
CN201910325442.6A CN109871545B (zh) 2019-04-22 Named entity recognition method and device

Publications (1)

Publication Number Publication Date
WO2020215870A1 true WO2020215870A1 (zh) 2020-10-29

Family

ID=66922955

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/076196 WO2020215870A1 (zh) 2019-04-22 2020-02-21 Named entity recognition method and device

Country Status (4)

Country Link
US (1) US11574124B2 (zh)
EP (1) EP3961475A4 (zh)
CN (1) CN109871545B (zh)
WO (1) WO2020215870A1 (zh)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112488194A (zh) * 2020-11-30 2021-03-12 上海寻梦信息技术有限公司 Address abbreviation generation method, model training method, and related equipment
CN112836056A (zh) * 2021-03-12 2021-05-25 南宁师范大学 Text classification method based on network feature fusion
CN112883730A (zh) * 2021-03-25 2021-06-01 平安国际智慧城市科技股份有限公司 Similar text matching method and device, electronic device, and storage medium

Families Citing this family (30)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109871545B (zh) 2019-04-22 2022-08-05 京东方科技集团股份有限公司 Named entity recognition method and device
CN110232192A (zh) * 2019-06-19 2019-09-13 中国电力科学研究院有限公司 Named entity recognition method and device for electric power terms
CN110298044B (zh) * 2019-07-09 2023-04-18 广东工业大学 Entity relation recognition method
CN110334357A (zh) * 2019-07-18 2019-10-15 北京香侬慧语科技有限责任公司 Named entity recognition method and device, storage medium, and electronic device
CN110414395B (zh) * 2019-07-18 2022-08-02 北京字节跳动网络技术有限公司 Content recognition method and device, server, and storage medium
CN110597970B (zh) * 2019-08-19 2023-04-07 华东理工大学 Method and device for joint recognition of multi-granularity medical entities
CN110532570A (zh) * 2019-09-10 2019-12-03 杭州橙鹰数据技术有限公司 Named entity recognition method and device, and model training method and device
CN110555102A (zh) * 2019-09-16 2019-12-10 青岛聚看云科技有限公司 Media title recognition method and device, and storage medium
CN111339764A (zh) * 2019-09-18 2020-06-26 华为技术有限公司 Chinese named entity recognition method and device
CN110909548B (zh) * 2019-10-10 2024-03-12 平安科技(深圳)有限公司 Chinese named entity recognition method and device, and computer-readable storage medium
CN112906370B (zh) * 2019-12-04 2022-12-20 马上消费金融股份有限公司 Intention recognition model training method, intention recognition method, and related devices
CN111145914B (zh) * 2019-12-30 2023-08-04 四川大学华西医院 Method and device for determining text entities in a lung cancer clinical disease database
CN111523316A (zh) * 2020-03-04 2020-08-11 平安科技(深圳)有限公司 Machine-learning-based drug recognition method and related equipment
CN111444718A (zh) * 2020-03-12 2020-07-24 泰康保险集团股份有限公司 Insurance product requirement document processing method and device, and electronic device
CN111581972A (zh) * 2020-03-27 2020-08-25 平安科技(深圳)有限公司 Method, device, equipment, and medium for recognizing symptom and body-part correspondences in text
CN113742523B (zh) * 2020-05-29 2023-06-27 北京百度网讯科技有限公司 Method and device for labeling core text entities
CN111783466A (zh) * 2020-07-15 2020-10-16 电子科技大学 Named entity recognition method for Chinese medical records
CN112151183A (zh) * 2020-09-23 2020-12-29 上海海事大学 Entity recognition method for Chinese electronic medical records based on a Lattice LSTM model
CN112269911A (zh) * 2020-11-11 2021-01-26 深圳视界信息技术有限公司 Device information recognition method, model training method, apparatus, device, and medium
CN113157727B (zh) * 2021-05-24 2022-12-13 腾讯音乐娱乐科技(深圳)有限公司 Method, device, and storage medium for providing recall results
CN113283242B (zh) * 2021-05-31 2024-04-26 西安理工大学 Named entity recognition method combining clustering with a pre-trained model
CN113255356B (zh) * 2021-06-10 2021-09-28 杭州费尔斯通科技有限公司 Entity recognition method and device based on an entity word list
CN113408273B (zh) * 2021-06-30 2022-08-23 北京百度网讯科技有限公司 Training of a text entity recognition model, and text entity recognition method and device
CN113343692B (zh) * 2021-07-15 2023-09-12 杭州网易云音乐科技有限公司 Search intention recognition method, model training method, device, medium, and equipment
CN113486154A (zh) * 2021-07-27 2021-10-08 中国银行股份有限公司 Text content recognition method and device
CN113656555B (zh) * 2021-08-19 2024-03-12 云知声智能科技股份有限公司 Training method, device, equipment, and medium for a nested named entity recognition model
CN113505599B (zh) * 2021-09-10 2021-12-07 北京惠每云科技有限公司 Method and device for extracting entity concepts from medical record documents, and readable storage medium
CN114048744A (zh) * 2021-10-28 2022-02-15 盐城金堤科技有限公司 Method, device, and equipment for generating employment records based on entity extraction
CN114330343B (zh) * 2021-12-13 2023-07-25 广州大学 Part-of-speech-aware nested named entity recognition method, system, equipment, and storage medium
CN117151222B (zh) * 2023-09-15 2024-05-24 大连理工大学 Domain-knowledge-guided method for extracting emergency case entity attributes and their relations, electronic device, and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107644014A (zh) * 2017-09-25 2018-01-30 南京安链数据科技有限公司 Named entity recognition method based on bidirectional LSTM and CRF
CN107908614A (zh) * 2017-10-12 2018-04-13 北京知道未来信息技术有限公司 Bi-LSTM-based named entity recognition method
CN108229582A (zh) * 2018-02-01 2018-06-29 浙江大学 Multi-task adversarial training method for named entity recognition in the medical field
CN109522546A (zh) * 2018-10-12 2019-03-26 浙江大学 Context-dependent medical named entity recognition method
CN109871545A (zh) * 2019-04-22 2019-06-11 京东方科技集团股份有限公司 Named entity recognition method and device

Family Cites Families (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA2375355A1 (en) * 2002-03-11 2003-09-11 Neo Systems Inc. Character recognition system and method
US9916538B2 (en) * 2012-09-15 2018-03-13 Z Advanced Computing, Inc. Method and system for feature detection
KR102240279B1 (ko) * 2014-04-21 2021-04-14 삼성전자주식회사 컨텐트 처리 방법 및 그 전자 장치
US20180268015A1 (en) * 2015-09-02 2018-09-20 Sasha Sugaberry Method and apparatus for locating errors in documents via database queries, similarity-based information retrieval and modeling the errors for error resolution
US10509860B2 (en) * 2016-02-10 2019-12-17 Weber State University Research Foundation Electronic message information retrieval system
US11139081B2 (en) * 2016-05-02 2021-10-05 Bao Tran Blockchain gene system
CN106569998A (zh) * 2016-10-27 2017-04-19 浙江大学 Text named entity recognition method based on Bi-LSTM, CNN, and CRF
US10255269B2 (en) * 2016-12-30 2019-04-09 Microsoft Technology Licensing, Llc Graph long short term memory for syntactic relationship discovery
US11593558B2 (en) * 2017-08-31 2023-02-28 Ebay Inc. Deep hybrid neural network for named entity recognition
CN107885721A (zh) * 2017-10-12 2018-04-06 北京知道未来信息技术有限公司 LSTM-based named entity recognition method
CN108628823B (zh) * 2018-03-14 2022-07-01 中山大学 Named entity recognition method combining an attention mechanism and multi-task co-training
CN108536679B (zh) * 2018-04-13 2022-05-20 腾讯科技(成都)有限公司 Named entity recognition method, apparatus, device, and computer-readable storage medium
CN108829681B (zh) * 2018-06-28 2022-11-11 鼎富智能科技有限公司 Named entity extraction method and device
CN109165384A (zh) * 2018-08-23 2019-01-08 成都四方伟业软件股份有限公司 Named entity recognition method and device
CN109471895B (zh) * 2018-10-29 2021-02-26 清华大学 Method and system for phenotype extraction and phenotype name normalization from electronic medical records
CN109117472A (zh) * 2018-11-12 2019-01-01 新疆大学 Deep-learning-based Uyghur named entity recognition method


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP3961475A4 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112488194A (zh) * 2020-11-30 2021-03-12 上海寻梦信息技术有限公司 Address abbreviation generation method, model training method, and related equipment
CN112836056A (zh) * 2021-03-12 2021-05-25 南宁师范大学 Text classification method based on network feature fusion
CN112836056B (zh) * 2021-03-12 2023-04-18 南宁师范大学 Text classification method based on network feature fusion
CN112883730A (zh) * 2021-03-25 2021-06-01 平安国际智慧城市科技股份有限公司 Similar text matching method and device, electronic device, and storage medium

Also Published As

Publication number Publication date
CN109871545B (zh) 2022-08-05
EP3961475A1 (en) 2022-03-02
US20210103701A1 (en) 2021-04-08
EP3961475A4 (en) 2023-05-03
US11574124B2 (en) 2023-02-07
CN109871545A (zh) 2019-06-11

Similar Documents

Publication Publication Date Title
WO2020215870A1 (zh) Named entity recognition method and device
CN109472033B Method and system for extracting entity relations from text, storage medium, and electronic device
CN111949787B Knowledge-graph-based automatic question answering method, device, equipment, and storage medium
US11132370B2 Generating answer variants based on tables of a corpus
WO2019153737A1 Method, device, equipment, and storage medium for evaluating comments
US9621601B2 User collaboration for answer generation in question and answer system
CN104636466B Entity attribute extraction method and system for open web pages
Luo et al. Synthesizing natural language to visualization (NL2VIS) benchmarks from NL2SQL benchmarks
WO2017092380A1 Method, neural network system, and user equipment for human-machine dialogue
CN110675944A Triage method and device, computer equipment, and medium
CN111241294A Relation extraction method using a graph convolutional network based on dependency parsing and keywords
CN105843897A Intelligent question answering system for vertical domains
CN111460787A Topic extraction method and device, terminal equipment, and storage medium
CN110765277B Knowledge-graph-based online equipment fault diagnosis method for mobile terminals
Xie et al. Topic enhanced deep structured semantic models for knowledge base question answering
CN101539907A Part-of-speech tagging model training device, part-of-speech tagging system, and method therefor
CN104199965A Semantic information retrieval method
CN111581474A Method for extracting evaluation objects from case-related microblog comments based on a multi-head attention mechanism
Zhang et al. Effective subword segmentation for text comprehension
CN112328800A System and method for automatically generating answers to programming-standard questions
CN109522396B Knowledge processing method and system for the defense science and technology field
CN111737420A Similar-case retrieval method, system, device, and medium based on dispute focuses
NEAMAH et al. QUESTION ANSWERING SYSTEM SUPPORTING VECTOR MACHINE METHOD FOR HADITH DOMAIN.
Li et al. Approach of intelligence question-answering system based on physical fitness knowledge graph
CN113705207A Grammatical error recognition method and device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20796429

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 2020796429

Country of ref document: EP

Effective date: 20211122