CN113221576A - Named entity identification method based on sequence-to-sequence architecture - Google Patents


Info

Publication number
CN113221576A
CN113221576A CN202110608812.4A CN202110608812A CN113221576A CN 113221576 A CN113221576 A CN 113221576A CN 202110608812 A CN202110608812 A CN 202110608812A CN 113221576 A CN113221576 A CN 113221576A
Authority
CN
China
Prior art keywords
sequence
named entity
named
text
entity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110608812.4A
Other languages
Chinese (zh)
Other versions
CN113221576B (en)
Inventor
邱锡鹏
颜航
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fudan University
Original Assignee
Fudan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2021-06-01
Filing date
2021-06-01
Publication date
2021-08-06
Application filed by Fudan University
Priority to CN202110608812.4A
Publication of CN113221576A
Application granted
Publication of CN113221576B
Legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Character Discrimination (AREA)

Abstract

The invention relates to the technical field of recognition, and provides a named entity recognition method based on a sequence-to-sequence architecture.

Description

Named entity identification method based on sequence-to-sequence architecture
Technical Field
The invention relates to the technical field of identification, in particular to a named entity identification method based on a sequence-to-sequence architecture.
Background
The named entity recognition task is the task of extracting text segments of specific types from a given text, for example the persons, places and symptoms mentioned in it. For example, for the sentence "Zhang San will start a job in 2021", two tuples need to be extracted: (Zhang San, person) and (2021, time). The first element of each tuple is the content taken from the sentence, and the second element is the type of named entity that the content belongs to.
Named entity recognition is one of the basic technologies of information extraction and is widely applied in question answering systems, dialogue systems, translation systems and other areas of natural language processing. In the most common named entity task, different entities do not overlap, and each entity must be a contiguous piece of text. However, in some application scenarios entities may be nested. For example, the phrase "Lu Xun Memorial Hall" contains at least the following entities: (Lu Xun, person) and (Lu Xun Memorial Hall, venue), and the first entity is nested inside the second. Furthermore, named entity recognition in the medical field may involve discontinuous entities. For example, when extracting patient symptoms, both (muscle pain, symptom) and (muscle soreness, symptom) need to be extracted from "patient muscle pain and soreness", where "muscle soreness" is not a contiguous text segment in the original sentence.
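The difference between contiguous and discontinuous entities can be illustrated with a small sketch. It assumes a word-level tokenisation of the medical example and 1-based positions; the token list and the helper name `entity_text` are hypothetical and are not taken from the patent.

```python
# Hypothetical word-level tokens for "patient muscle pain and soreness".
tokens = ["patient", "muscle", "pain", "and", "soreness"]


def entity_text(tokens, positions):
    """Join the tokens found at the given 1-based positions."""
    return " ".join(tokens[p - 1] for p in positions)


# Each entity is a (positions, label) pair.
entities = [
    ([2, 3], "symptom"),  # contiguous span: "muscle pain"
    ([2, 5], "symptom"),  # discontinuous span: "muscle" ... "soreness"
]

for positions, label in entities:
    print(entity_text(tokens, positions), "->", label)
```

Representing an entity by the positions of its tokens, rather than by a start/end pair, is what allows the second entity to skip over the words in between.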
At present, ordinary named entity recognition is generally solved by sequence labeling. For nested and discontinuous named entity recognition, however, sequence labeling requires a complicated tagging scheme to be designed. Moreover, recognizing named entities by sequence labeling is very limiting: different types of named entity recognition must be handled with different model structures, so the range of application is narrow.
Disclosure of Invention
The present invention is made to solve the above problems, and an object of the present invention is to provide a named entity recognition method based on a sequence-to-sequence architecture.
The invention provides a named entity recognition method based on a sequence-to-sequence architecture, characterized by comprising the following steps:
step S1, constructing a named entity recognition model;
step S2, training the named entity recognition model with preset samples, wherein the entity sequence of each preset sample is obtained according to a preset ordering rule;
step S3, inputting the text to be tested into the named entity recognition model to obtain a recognition result sequence;
step S4, decoding the recognition result sequence output by the named entity recognition model to obtain a plurality of named entities and the text label corresponding to each named entity,
wherein the named entity recognition model comprises an encoder and a decoder, and the output of the decoder is named entity positions and text labels,
during training, the decoder outputs, for the preset samples, the named entity positions and the output labels as sample labels, the named entities corresponding to the named entity positions are retrieved from the preset samples as sample entities, and the decoder is trained on the sample entities and the sample labels,
and the named entity sequence consists of the named entity positions and text labels output by the named entity recognition model for the text to be tested.
In the named entity recognition method based on the sequence-to-sequence architecture provided by the invention, the method can also have the following characteristics: the input of the encoder is a text to be recognized, and the output of the encoder is a high-dimensional vector of words.
In the named entity recognition method based on the sequence-to-sequence architecture provided by the invention, the method can also have the following characteristics: wherein the input of the decoder is the output of the encoder and the output of the decoder is the named entity sequence.
In the named entity recognition method based on the sequence-to-sequence architecture provided by the invention, the method can also have the following characteristics: in the named entity sequence, the named entity position is used for indicating the position of the named entity in the text to be identified, and the text label is the category corresponding to the named entity.
In the named entity recognition method based on the sequence-to-sequence architecture provided by the invention, the method can also have the following characteristics: the predetermined ordering rule is that the named entities are ordered by their starting positions, and named entities with the same starting position are ordered by their entity lengths.
In the named entity recognition method based on the sequence-to-sequence architecture provided by the invention, the method can also have the following characteristics: the named entity position is a pointer pointing to the sequence number of the character in the text.
Action and Effect of the invention
According to the named entity recognition method based on the sequence-to-sequence architecture of the invention, the constructed named entity recognition model comprises an encoder and a decoder, and the output of the decoder is named entity positions and text labels. After the model is trained with preset samples, the text to be tested is input into it to obtain a recognition result sequence, and this sequence is decoded to obtain a plurality of named entities and the text label corresponding to each named entity.
In addition, during training the decoder outputs, for the preset samples, the named entity positions and the output labels as sample labels; the named entities corresponding to the named entity positions are retrieved from the preset samples as sample entities, and the decoder is trained on the sample entities and the sample labels. This prevents the training effect from being degraded by feeding in named entity positions, which carry no semantic information.
Drawings
FIG. 1 is a flow diagram of a named entity identification method based on a sequence-to-sequence architecture in an embodiment of the present invention;
FIG. 2 is a diagram of a named entity recognition model in an embodiment of the invention.
Detailed Description
In order to make the technical means, the creation features, the achievement objectives and the efficacy of the present invention easy to understand, the following embodiments specifically describe the named entity recognition method based on sequence-to-sequence architecture in conjunction with the accompanying drawings.
<Example>
This embodiment details the named entity recognition method based on the sequence-to-sequence architecture.
Fig. 1 is a flowchart of a named entity identification method based on a sequence-to-sequence architecture in this embodiment.
As shown in fig. 1, the named entity identification method based on sequence-to-sequence architecture includes the following steps:
Step S1, constructing a named entity recognition model.
The named entity recognition model comprises an encoder and a decoder, wherein the input of the encoder is a text to be recognized, and the output of the encoder is a high-dimensional vector of words. The input of the decoder is the output of the encoder and the output of the decoder is the named entity sequence.
Step S2, training the named entity recognition model through a preset sample, wherein the entity sequence of the preset sample is obtained according to a preset ordering rule.
During training, the decoder outputs, for the preset samples, the named entity positions and the output labels as sample labels; the named entities corresponding to the named entity positions are retrieved from the preset samples as sample entities, and the decoder is trained on the sample entities and the sample labels.
Step S3, inputting the text to be tested into the named entity recognition model to obtain a recognition result sequence.
Step S4, decoding the recognition result sequence output by the named entity recognition model to obtain a plurality of named entities and a text label corresponding to each named entity.
In the named entity sequence, the named entity position is used for indicating the position of the named entity in the text to be identified, and the text label is the category corresponding to the named entity. The named entity location is a pointer to the sequence number of the character in the text.
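Step S4 can be sketched as a small parser over the recognition result sequence. This is an illustrative sketch rather than the patent's implementation: it assumes positions are 1-based integers, labels are strings, and each run of positions is terminated by its label. The function name `decode_sequence` is hypothetical.

```python
def decode_sequence(sequence, tokens):
    """Split [pos, ..., label, pos, ..., label] into (entity_tokens, label) pairs.

    Positions are assumed to be 1-based indices into `tokens`.
    Runs containing an out-of-range position are skipped.
    """
    entities = []
    positions = []
    for item in sequence:
        if isinstance(item, int):
            positions.append(item)  # still inside the current entity
            continue
        # A label closes the current entity.
        if positions and all(1 <= p <= len(tokens) for p in positions):
            entities.append(([tokens[p - 1] for p in positions], item))
        positions = []
    return entities


tokens = ["x1", "x2", "x3", "x4", "x5", "x6", "x7"]
result = decode_sequence([1, 2, "e1", 5, 6, "e2"], tokens)
print(result)
```

Because every entity is closed by its label, the same parser applies whether an entity's positions are adjacent or not.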
The predetermined ordering rule is: the named entities are ordered by their starting positions, and named entities with the same starting position are ordered by their entity lengths.
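The ordering rule amounts to a two-part sort key. The sketch below assumes each entity is a `(positions, label)` pair with ascending 1-based positions, takes "entity length" to mean the number of positions, and places the shorter entity first as the embodiment's nested example states. The name `order_entities` is hypothetical.

```python
def order_entities(entities):
    """Order by starting position, then by entity length (shorter first)."""
    return sorted(entities, key=lambda e: (e[0][0], len(e[0])))


unordered = [
    ([5, 6], "e3"),
    ([1, 2, 3], "e2"),
    ([1, 2], "e1"),
]
ordered = order_entities(unordered)
print(ordered)
```

A fixed ordering gives every training sample exactly one target sequence, so the decoder does not have to learn an arbitrary permutation of the same entities.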
FIG. 2 is a diagram of a named entity recognition model in this embodiment.
As shown in fig. 2, the named entity recognition model includes an encoder and a decoder.
After the text to be tested is input into the encoder, the named entities are converted into a sequence as follows:
When the named entities in the text to be tested are conventional named entities, they are arranged in the order in which they appear in the text. For example, for the text to be tested $[x_1, x_2, x_3, x_4, x_5, x_6, x_7]$, suppose that $[x_1, x_2]$ and $[x_5, x_6]$ are entities of classes $e_1$ and $e_2$ respectively; the named entity sequence of the text is then represented as $[1, 2, e_1, 5, 6, e_2]$. In this embodiment, an entity is expressed by its named entity positions in the text to be tested rather than by the text segment itself, which avoids the ambiguity that arises when the same text segment occurs more than once in the text. For example, the entity sequence corresponding to "Zhang San was born in Hunan" is [1, person, 4, place].
When the named entities in the text to be tested are nested named entities, the conversion rule is that entities that start earlier are placed first and, among entities that start at the same position, the shorter one is placed first. For example, if in the sentence $[x_1, x_2, x_3, x_4, x_5, x_6, x_7]$ the segments $[x_1, x_2]$, $[x_1, x_2, x_3]$ and $[x_5, x_6]$ are entities of classes $e_1$, $e_2$ and $e_3$, then the corresponding entity sequence is $[1, 2, e_1, 1, 2, 3, e_2, 5, 6, e_3]$, with each named entity expressed by its positions in the text to be tested.
When the named entities in the text to be tested are discontinuous named entities, the conversion rule is that entities that start earlier are placed first and entities that start at the same position are ordered by entity length, with the shorter one first. For example, if in the sentence $[x_1, x_2, x_3, x_4, x_5, x_6, x_7]$ the segments $[x_1, x_3]$, $[x_1, x_2, x_3, x_5]$ and $[x_5, x_6]$ are entities of classes $e_1$, $e_2$ and $e_3$, then the corresponding entity sequence is $[1, 3, e_1, 1, 2, 3, 5, e_2, 5, 6, e_3]$, with each named entity expressed by its positions in the text to be tested. Here $x_1$ to $x_n$ denote the identified entity, with $n > 1$.
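The three conversion cases share one procedure: order the entities, then write out each entity's positions followed by its class. The sketch below reproduces the three worked examples from the text; the function name `build_target_sequence` and the `(positions, label)` input format are assumptions made for illustration.

```python
def build_target_sequence(entities):
    """Flatten (positions, label) pairs into [pos, ..., label, pos, ..., label]."""
    # Earlier start first; for equal starts, fewer positions first.
    ordered = sorted(entities, key=lambda e: (e[0][0], len(e[0])))
    target = []
    for positions, label in ordered:
        target.extend(positions)
        target.append(label)
    return target


flat = build_target_sequence([([5, 6], "e2"), ([1, 2], "e1")])
nested = build_target_sequence([([1, 2, 3], "e2"), ([1, 2], "e1"), ([5, 6], "e3")])
discontinuous = build_target_sequence([([1, 2, 3, 5], "e2"), ([1, 3], "e1"), ([5, 6], "e3")])
print(flat)
print(nested)
print(discontinuous)
```

Since only the list of positions changes between the cases, a single target format covers conventional, nested and discontinuous entities without a separate tagging scheme for each.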
The calculation process of the encoder is as follows:
$$H^e = \mathrm{Encoder}([x_1, \dots, x_n]),$$
where $H^e$ is the latent vector of each word after encoding.
The calculation process of the decoder is as follows:
[equation available only as an image in the original: Figure BDA0003094705890000071]
$$E^e = \mathrm{TokenEmbed}(X),$$
[equation available only as an image in the original: Figure BDA0003094705890000072]
[equation available only as an image in the original: Figure BDA0003094705890000073]
$$C^d = \mathrm{TokenEmbed}(C),$$
[equation available only as an image in the original: Figure BDA0003094705890000074]
where [symbol, Figure BDA0003094705890000075] is the content that has already been generated, $\alpha$ is a hyperparameter, $C$ is the set of entity classes, [symbol, Figure BDA0003094705890000076] denotes the dot product, $P_t$ is the distribution over output words at the current time step, $X$ is the input words, [symbol, Figure BDA0003094705890000077] is the decoder hidden state at time $t$, $E^e$ is the embedding vector of the input words, and $C^d$ is the embedding vector of the classes.
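Several of the decoder equations above are available only as figure references, so their exact form cannot be read from the text. The block below is one possible reconstruction that is consistent with the symbol definitions given (a hyperparameter $\alpha$, a dot product, token and class embeddings, and a distribution $P_t$). The symbol names $h_t^d$, $\hat{Y}_{<t}$, $\hat{H}^e$ and $\bar{H}^e$, the MLP projection, and the way $\alpha$ mixes the two representations are all assumptions and are not confirmed by the source text.

```latex
\begin{aligned}
h_t^d     &= \mathrm{Decoder}\bigl(H^e;\ \hat{Y}_{<t}\bigr), \\
E^e       &= \mathrm{TokenEmbed}(X), \\
\hat{H}^e &= \mathrm{MLP}(H^e), \\
\bar{H}^e &= \alpha \cdot \hat{H}^e + (1 - \alpha) \cdot E^e, \\
C^d       &= \mathrm{TokenEmbed}(C), \\
P_t       &= \mathrm{Softmax}\bigl(\bigl[\bar{H}^e \otimes h_t^d;\ C^d \otimes h_t^d\bigr]\bigr).
\end{aligned}
```

Under this reading, the first $n$ entries of $P_t$ score the positions of the input words and the remaining entries score the entity classes, which would match the statement that the decoder outputs either a named entity position or a text label.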
In this embodiment, during decoding, the output obtained from $P_t$ is either a pointer index or a category index, and before either is used as the decoder input at the next time step it must be converted into the corresponding word. The target sequence may contain word indices of the input text, but an index by itself carries no semantic information, so during autoregressive generation the index cannot be fed directly to the decoder as input; it must first be mapped back to the specific word in the input.
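The mapping from generated indices back to words can be sketched as follows. The index layout is an assumption made for illustration: indices 1 to n point at the n input words, and indices n+1 onward select entity classes. The patent text does not specify this layout, and the name `index_to_token` is hypothetical.

```python
def index_to_token(outputs, tokens, classes):
    """Replace each generated index by the word or class name it refers to.

    Assumed layout: 1..n are pointers into `tokens`, n+1.. index `classes`.
    """
    n = len(tokens)
    restored = []
    for idx in outputs:
        if 1 <= idx <= n:
            restored.append(tokens[idx - 1])        # pointer to an input word
        elif n < idx <= n + len(classes):
            restored.append(classes[idx - n - 1])   # entity class
        else:
            raise ValueError(f"index {idx} is outside the output vocabulary")
    return restored


tokens = ["Zhang", "San", "was", "born", "in", "Hunan"]
classes = ["person", "place"]
restored = index_to_token([1, 2, 7, 6, 8], tokens, classes)
print(restored)
```

The restored words, rather than the raw indices, would then be embedded and fed to the decoder at the next step, so that the decoder input carries semantic information.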
Action and Effect of the Embodiment
According to the named entity recognition method based on the sequence-to-sequence architecture of this embodiment, the constructed named entity recognition model comprises an encoder and a decoder, and the output of the decoder is named entity positions and text labels. After the model is trained with preset samples, the text to be tested is input into it to obtain the recognition result, and the recognition result is then ordered according to the preset entity ordering rule to obtain the named entity sequence.
In addition, during training the decoder outputs, for the preset samples, the named entity positions and the output labels as sample labels; the named entities corresponding to the named entity positions are retrieved from the preset samples as sample entities, and the decoder is trained on the sample entities and the sample labels. This prevents the training effect from being degraded by feeding in named entity positions, which carry no semantic information.
The above embodiments are preferred examples of the present invention, and are not intended to limit the scope of the present invention.

Claims (6)

1. A named entity identification method based on a sequence-to-sequence architecture is characterized by comprising the following steps:
step S1, constructing a named entity recognition model;
step S2, training the named entity recognition model through a preset sample, wherein the entity sequence of the preset sample is obtained according to a preset sequencing rule;
step S3, inputting the text to be detected into the named entity recognition model to obtain a recognition result sequence;
step S4, decoding the recognition result sequence output by the named entity recognition model to obtain a plurality of named entities and the text labels corresponding to the named entities,
wherein the named entity recognition model comprises an encoder and a decoder,
the output of the decoder is the named entity location and the text label,
in the training process, enabling the decoder to output a named entity position and an output label as a sample label according to the preset sample, acquiring a corresponding named entity from the preset sample as a sample entity according to the named entity position, and training the decoder according to the sample entity and the sample label,
and the named entity sequence consists of the named entity position and the text label which are output by the named entity recognition model according to the text to be tested.
2. The named entity recognition method based on sequence-to-sequence architecture as claimed in claim 1, wherein:
the input of the encoder is a text to be recognized, and the output of the encoder is a high-dimensional vector of words.
3. The named entity recognition method based on sequence-to-sequence architecture as claimed in claim 1, wherein:
the input of the decoder is the output of the encoder, and the output of the decoder is the named entity sequence.
4. The named entity recognition method based on sequence-to-sequence architecture as claimed in claim 1, wherein:
in the named entity sequence, the named entity position is used for indicating the position of the named entity in the text to be recognized, and the text label is a category corresponding to the named entity.
5. The named entity recognition method based on sequence-to-sequence architecture as claimed in claim 1, wherein:
the predetermined ordering rule is:
the named entities are ordered by their starting positions, and named entities with the same starting position are ordered by their entity lengths.
6. The named entity recognition method based on sequence-to-sequence architecture as claimed in claim 1, wherein:
the named entity position is a pointer pointing to the sequence number of the character in the text.
CN202110608812.4A 2021-06-01 2021-06-01 Named entity identification method based on sequence-to-sequence architecture Active CN113221576B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110608812.4A CN113221576B (en) 2021-06-01 2021-06-01 Named entity identification method based on sequence-to-sequence architecture

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110608812.4A CN113221576B (en) 2021-06-01 2021-06-01 Named entity identification method based on sequence-to-sequence architecture

Publications (2)

Publication Number Publication Date
CN113221576A (en) 2021-08-06
CN113221576B (en) 2023-01-13

Family

ID=77082195

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110608812.4A Active CN113221576B (en) 2021-06-01 2021-06-01 Named entity identification method based on sequence-to-sequence architecture

Country Status (1)

Country Link
CN (1) CN113221576B (en)

Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107680580A (en) * 2017-09-28 2018-02-09 百度在线网络技术(北京)有限公司 Text transformation model training method and device, text conversion method and device
CN107705784A (en) * 2017-09-28 2018-02-16 百度在线网络技术(北京)有限公司 Text regularization model training method and device, text regularization method and device
CN109543667A (en) * 2018-11-14 2019-03-29 北京工业大学 A kind of text recognition method based on attention mechanism
CN109684452A (en) * 2018-12-25 2019-04-26 中科国力(镇江)智能技术有限公司 A kind of neural network problem generation method based on answer Yu answer location information
CN110162795A (en) * 2019-05-30 2019-08-23 重庆大学 A kind of adaptive cross-cutting name entity recognition method and system
CN110362823A (en) * 2019-06-21 2019-10-22 北京百度网讯科技有限公司 The training method and device of text generation model are described
CN110704633A (en) * 2019-09-04 2020-01-17 平安科技(深圳)有限公司 Named entity recognition method and device, computer equipment and storage medium
CN111310485A (en) * 2020-03-12 2020-06-19 南京大学 Machine translation method, device and storage medium
CN111539229A (en) * 2019-01-21 2020-08-14 波音公司 Neural machine translation model training method, neural machine translation method and device
CN111581361A (en) * 2020-04-22 2020-08-25 腾讯科技(深圳)有限公司 Intention identification method and device
CN112069328A (en) * 2020-09-08 2020-12-11 中国人民解放军国防科技大学 Establishment method of entity relation joint extraction model based on multi-label classification
CN112417902A (en) * 2020-12-04 2021-02-26 北京有竹居网络技术有限公司 Text translation method, device, equipment and storage medium
CN112784602A (en) * 2020-12-03 2021-05-11 南京理工大学 News emotion entity extraction method based on remote supervision

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
MIKE LEWIS 等: "BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension", 《PROCEEDINGS OF THE 58TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS》 *
王卫红 等: "基于Seq2Seq模型的命名实体识别方法", 《智能计算机与应用》 *
王少敬 等: "基于序列图模型的多标签序列标注", 《中文信息学报》 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113886522A (en) * 2021-09-13 2022-01-04 苏州空天信息研究院 Non-continuous entity identification method based on path expansion
CN113886522B (en) * 2021-09-13 2022-12-02 苏州空天信息研究院 Discontinuous entity identification method based on path expansion
CN115983271A (en) * 2022-12-12 2023-04-18 北京百度网讯科技有限公司 Named entity recognition method and named entity recognition model training method
CN115983271B (en) * 2022-12-12 2024-04-02 北京百度网讯科技有限公司 Named entity recognition method and named entity recognition model training method

Also Published As

Publication number Publication date
CN113221576B (en) 2023-01-13

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant