CN115859979A - Legal document named entity identification method, device and storage medium - Google Patents

Legal document named entity identification method, device and storage medium

Info

Publication number
CN115859979A
CN115859979A CN202211464487.XA
Authority
CN
China
Prior art keywords
text
named entity
marked
label
labeled
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211464487.XA
Other languages
Chinese (zh)
Inventor
肖熊锋
李庆
杜向阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Qingdun Information Technology Co ltd
Original Assignee
Beijing Qingdun Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Qingdun Information Technology Co ltd filed Critical Beijing Qingdun Information Technology Co ltd
Priority to CN202211464487.XA priority Critical patent/CN115859979A/en
Publication of CN115859979A publication Critical patent/CN115859979A/en
Pending legal-status Critical Current


Abstract

The invention relates to the technical field of data processing, and in particular to a legal document named entity identification method, device and storage medium. The method comprises the following steps: acquiring an initial marked text and an unmarked text of a legal document; carrying out named entity labeling on the unmarked text through a preset rule base to obtain a first labeled text; performing data enhancement on the initial marked text to obtain a data-enhanced marked text; preprocessing the data-enhanced marked text and the first labeled text to obtain a processed labeled text; performing iterative training using the processed labeled text and a BERT model to obtain a named entity recognition model; and performing named entity recognition on an unmarked legal document to be recognized using the named entity recognition model to obtain a named entity recognition result. Through this technical scheme, the labor cost of acquiring data is reduced, domain adaptability is improved, and fine-grained application scenarios are better accommodated.

Description

Legal document named entity identification method, device and storage medium
Technical Field
The invention relates to the technical field of data processing, and in particular to a legal document named entity identification method, device and storage medium.
Background
Named entity recognition (NER) is one of the basic tasks of natural language processing. Its goal is to extract named entities from text and classify them into types such as person names, place names, organizations, times, currencies and percentages. It is widely used in tasks such as information extraction, question-answering systems, syntactic analysis, information retrieval and sentiment analysis.
Overall, the general trend of NER research in recent years is as follows. Early methods built NER systems mainly on rules and dictionaries, such as LaSIE-II at the University of Sheffield and NetOwl at IsoQuest. By the early 2000s, probabilistic graphical models such as CRF were widely used. Later, with the rise of deep learning, BiLSTM-CRF became a research focus, and until recently many methods were variants of BiLSTM-CRF. Both LSTM and CRF need to merge context information within a sequence, and the attention mechanism, which adaptively computes weights for different context objects, is therefore a useful complement. In addition, as pre-training technology has developed, NER models based on pre-trained models such as BERT now dominate. Deep learning methods generally need a large amount of data to learn well, but in actual production, missing data sets and scarce labels are common; transfer learning and semi-supervision can alleviate these problems to some extent, but the best-performing existing methods are still based on supervised learning.
The prior art mainly has the following defects:
1) Owing to the semantic complexity and diversity of legal texts, existing methods, which mainly target general domains, generalize poorly to the legal domain;
2) High-quality labeled corpora are scarce and usually depend on domain experts, and even where such corpora exist, their small quantity harms generalization;
3) There are many entity categories to extract and high similarity between them, so category identification errors occur easily; deep learning models handle fine-grained entity categories poorly;
4) Pre-trained language models are mainly built on general-domain corpora, and few pre-trained models specific to the legal field exist.
Disclosure of Invention
In order to overcome the problems in the related art, the invention provides a legal document named entity identification method, device and storage medium, so that the labor cost of acquiring data is reduced, domain adaptability is improved, and fine-grained application scenarios are better accommodated.
According to a first aspect of an embodiment of the present invention, there is provided a legal document named entity identification method, including:
acquiring an initial marked text and an unmarked text of a legal document, wherein the named entity is marked in the initial marked text, and the named entity is not marked in the unmarked text;
carrying out named entity labeling on the label-free text through a preset rule base to obtain a first labeled text;
performing data enhancement on the initial marked text to obtain a marked text after data enhancement;
preprocessing the label text after the data enhancement and the first label text to obtain a processed label text;
performing iterative training by using the processed labeling text and the BERT model to obtain a named entity recognition model;
and performing named entity recognition on an unmarked legal document to be recognized using the named entity recognition model to obtain a named entity recognition result.
In one embodiment, preferably, the data enhancement of the initially annotated text comprises:
and performing data enhancement on the initial marked text by using a random splicing method, a random exchange method and/or a random erasing method.
In one embodiment, preferably, the random splicing method includes: randomly extracting a single sequence from each of two texts having a context relationship, and splicing the sequences into a new labeled text;
the random exchange method includes randomly exchanging named entities in two texts having a context relationship to obtain a new labeled text;
the random erasing method includes erasing characters other than named entities in each text with a preset probability, and taking the remaining text content after erasing as a new labeled sample.
In one embodiment, preferably, the preprocessing includes performing simplified/traditional Chinese conversion, word segmentation, stop-word removal and removal of uncommon punctuation on the data-enhanced labeled text and the first labeled text.
In one embodiment, preferably, the method further comprises:
carrying out named entity labeling on the label-free text by using the named entity recognition model to obtain a second labeled text;
performing confidence calculation and manual verification on the first labeling text and the second labeling text;
and adding labeled text whose confidence is greater than a preset value and which passes manual verification, as a manual corpus, to the initial labeled text in a preset proportion, so as to update the initial labeled text.
In one embodiment, preferably, the method further comprises:
marking the legal document to be identified without marking according to the named entity identification result to obtain a marked legal document;
carrying out confidence calculation and manual verification on the marked legal documents;
and adding the legal documents with the confidence degrees larger than the preset value and passing the manual verification as manual corpora into the initial marked text so as to update the initial marked text.
In one embodiment, preferably, the method further comprises:
after each artificial corpus is used for training and learning, the obtained named entity recognition model is used for re-labeling the artificial corpus, and the number of corrected labels is counted;
and when the number of corrected labels shows a diverging trend, discarding the manual corpus and returning the parameters of the named entity recognition model to their pre-learning state.
According to a second aspect of embodiments of the present invention, there is provided a legal document named entity recognition apparatus, the apparatus comprising:
the system comprises an acquisition module, a storage module and a display module, wherein the acquisition module is used for acquiring an initial marked text and an unmarked text of the legal document, wherein the named entity is marked in the initial marked text, and the named entity is not marked in the unmarked text;
the first labeling module is used for labeling named entities of the label-free text through a preset rule base to obtain a first labeled text;
the data enhancement module is used for enhancing the data of the initial marked text to obtain a marked text with enhanced data;
the preprocessing module is used for preprocessing the label text after the data enhancement and the first label text to obtain a processed label text;
the training module is used for carrying out iterative training by utilizing the processed labeling text and the BERT model to obtain a named entity recognition model;
and the identification module is used for performing named entity recognition on an unmarked legal document to be recognized using the named entity recognition model to obtain a named entity recognition result.
In one embodiment, preferably, the data enhancement module is configured to:
and performing data enhancement on the initial marked text by using a random splicing method, a random exchange method and/or a random erasing method.
In one embodiment, preferably, the random splicing method includes: randomly extracting a single sequence from each of two texts having a context relationship, and splicing the sequences into a new labeled text;
the random exchange method includes randomly exchanging named entities in two texts having a context relationship to obtain a new labeled text;
the random erasing method includes erasing characters other than named entities in each text with a preset probability, and taking the remaining text content after erasing as a new labeled sample.
In one embodiment, preferably, the preprocessing includes performing simplified/traditional Chinese conversion, word segmentation, stop-word removal and removal of uncommon punctuation on the data-enhanced labeled text and the first labeled text.
In one embodiment, preferably, the apparatus further comprises:
the second labeling module is used for labeling the named entities of the unlabeled text by using the named entity recognition model to obtain a second labeled text;
the first post-processing module is used for performing confidence calculation and manual verification on the first annotation text and the second annotation text;
and the first updating module is used for adding the labeled text with the confidence coefficient larger than the preset value and passing the manual verification into the initial labeled text as the manual corpus according to the preset proportion so as to update the initial labeled text.
In one embodiment, preferably, the apparatus further comprises:
the third labeling module is used for labeling the legal document to be identified without the label according to the named entity identification result to obtain a labeled legal document;
the second post-processing module is used for performing confidence calculation and manual verification on the marked legal document;
and the second updating module is used for adding the legal documents with the confidence degrees larger than the preset value and passing the manual verification as manual corpora into the initial marked text so as to update the initial marked text.
In one embodiment, preferably, the apparatus further comprises:
the statistical module is used for re-labeling the artificial corpus by using the obtained named entity recognition model after training and learning by using one artificial corpus each time, and counting the number of corrected labels;
and the processing module is used for discarding the manual corpus when the number of corrected labels shows a diverging trend, and returning the parameters of the named entity recognition model to their pre-learning state.
According to a third aspect of embodiments of the present invention, there is provided a legal document named entity recognition device, the device comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to:
acquiring an initial marked text and an unmarked text of a legal document, wherein the named entity is marked in the initial marked text, and the named entity is not marked in the unmarked text;
carrying out named entity labeling on the label-free text through a preset rule base to obtain a first labeled text;
performing data enhancement on the initial marked text to obtain a marked text after data enhancement;
preprocessing the label text after the data enhancement and the first label text to obtain a processed label text;
performing iterative training by using the processed labeling text and the BERT model to obtain a named entity recognition model;
and performing named entity recognition on an unmarked legal document to be recognized using the named entity recognition model to obtain a named entity recognition result.
According to a fourth aspect of embodiments of the present invention, there is provided a computer readable storage medium having stored thereon computer instructions which, when executed by a processor, implement the steps of the method according to any one of the embodiments of the first aspect.
The technical scheme provided by the embodiment of the invention can have the following beneficial effects:
1) Compared with the prior art, which mostly relies on general-domain BERT pre-trained models, the invention trains a proprietary legal-domain BERT pre-trained model on massive domain data, giving the word vectors stronger domain adaptability.
2) The method combines rule recognition with model recognition, incorporates schemes such as new-word discovery, lexicon construction and manually built rules into the rule recognition, and adopts a domain pre-trained model, so that the model better fits fine-grained application scenarios.
3) The method uses a chapter-level data enhancement technique that suits the long texts of legal documents and, compared with prior methods, reduces the data acquisition cost of supervised deep learning.
4) By adding an incremental learning mechanism to the model, the system can improve continuously and its recognition performance can increase without extra overhead.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
FIG. 1A is a schematic diagram of a neural network model shown in accordance with an exemplary embodiment.
FIG. 1B is a block diagram illustrating a BERT model according to an exemplary embodiment.
FIG. 1C is a flow chart illustrating a legal document named entity identification method in accordance with an exemplary embodiment.
FIG. 2 is a flow chart illustrating another legal document named entity identification method in accordance with an exemplary embodiment.
FIG. 3 is a flow chart illustrating another legal document named entity identification method in accordance with an exemplary embodiment.
FIG. 4 is a flow chart illustrating another legal document named entity identification method in accordance with an exemplary embodiment.
FIG. 5 is a block diagram illustrating a legal document named entity recognition device, according to an example embodiment.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the following exemplary examples do not represent all implementations consistent with the present invention. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the invention, as detailed in the appended claims.
The invention constructs a neural network model for entity classification by combining a bidirectional long short-term memory network with a conditional random field (BiLSTM-CRF); the model structure is shown in FIG. 1A. The method combines the contextual information of words, introduces distributed representations of words into feature extraction, and makes maximal use of the relationship between words and labels, thereby fully improving the recognition effect.
The BiLSTM-CRF is fine-tuned on the basis of a pre-trained language model: the language model is first pre-trained, unsupervised, on massive corpora so as to represent sentence semantics, and is then fine-tuned on the labeled corpora.
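The CRF layer of such a BiLSTM-CRF model selects the output label sequence by Viterbi decoding over the emission scores produced by the BiLSTM and learned transition scores. The following is a minimal pure-Python sketch of only the decoding step; the labels, scores and the toy transition table are illustrative assumptions, not values from the patent:

```python
def viterbi_decode(emissions, transitions, labels):
    """Find the highest-scoring label sequence.

    emissions: list of dicts, one per token, mapping label -> emission score.
    transitions: dict mapping (prev_label, label) -> transition score.
    labels: list of all label names.
    """
    # Scores for the first token come from emissions alone.
    score = {lab: emissions[0][lab] for lab in labels}
    backpointers = []
    for emit in emissions[1:]:
        new_score, bp = {}, {}
        for lab in labels:
            # Best previous label for this label at this position.
            best_prev = max(labels, key=lambda p: score[p] + transitions[(p, lab)])
            new_score[lab] = score[best_prev] + transitions[(best_prev, lab)] + emit[lab]
            bp[lab] = best_prev
        score, backpointers = new_score, backpointers + [bp]
    # Trace back from the best final label.
    best = max(labels, key=lambda lab: score[lab])
    path = [best]
    for bp in reversed(backpointers):
        path.append(bp[path[-1]])
    return list(reversed(path))

# Toy 3-token example with B/I/O labels: the transition table forbids O -> I.
labels = ["B", "I", "O"]
transitions = {(p, c): 0.0 for p in labels for c in labels}
transitions[("O", "I")] = -100.0  # an I tag must not follow O
emissions = [
    {"B": 2.0, "I": 0.0, "O": 1.0},
    {"B": 0.0, "I": 1.5, "O": 1.0},
    {"B": 0.0, "I": 0.0, "O": 2.0},
]
print(viterbi_decode(emissions, transitions, labels))
```

The transition scores are what lets the CRF layer enforce label-sequence constraints (such as "I never follows O") that a softmax over per-token emissions alone cannot.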
The embedding layer converts a Chinese text sequence into dense vector (distributed) representations of characters or words. The BERT model is a pre-trained language model that captures character-level and sentence-level features; to capture context information, BERT uses a bidirectional Transformer as its encoder and models text through an attention mechanism. The input to the BERT model is the concatenation of character embeddings, position embeddings and sentence embeddings; feature extraction is performed by the stacked Transformer blocks, and the resulting output sequence vectors serve as the character representations. The structure of the BERT model is shown in FIG. 1B.
Wherein [CLS] and [SEP] are BERT's sequence markers: [CLS] marks the start of the sequence and [SEP] marks sentence boundaries; E denotes the distributed representation of each character; Trm denotes a Transformer block stacked in the BERT model; and T denotes a sequence vector output by the BERT model.
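The packing of one or two sentences into BERT's input format with these markers can be sketched in a few lines. This is a minimal illustration, assuming character-level tokenization (as in Chinese BERT models); the example sentences are invented:

```python
def build_bert_input(sentence_a, sentence_b=None):
    """Pack one or two sentences into BERT's input format.

    Returns (tokens, segment_ids): [CLS] marks the start of the
    sequence, [SEP] marks each sentence boundary, and the segment ids
    (sentence embeddings) distinguish the two sentences.
    Chinese BERT tokenizes at the character level.
    """
    tokens = ["[CLS]"] + list(sentence_a) + ["[SEP]"]
    segment_ids = [0] * len(tokens)
    if sentence_b is not None:
        tokens += list(sentence_b) + ["[SEP]"]
        segment_ids += [1] * (len(sentence_b) + 1)
    return tokens, segment_ids

tokens, segments = build_bert_input("原告张三", "被告李四")
print(tokens)    # character-level tokens with [CLS]/[SEP] markers
print(segments)  # 0 for the first sentence span, 1 for the second
```

In a real pipeline these tokens and segment ids would then be mapped to vocabulary indices and fed, together with position ids, into the stacked Transformer encoder.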
FIG. 1C is a flow chart illustrating a legal document named entity identification method in accordance with an exemplary embodiment.
As shown in fig. 1C, according to a first aspect of the embodiments of the present invention, there is provided a legal document named entity identification method, including:
step S101, acquiring an initial marked text and an unmarked text of a legal document, wherein the named entities are marked in the initial marked text, and the named entities are not marked in the unmarked text;
the initial marked text, also called initial marked linguistic data, is usually provided by a service side of an application scene, the linguistic data already marks a named entity, and a label is marked and corrected by a domain expert, so that the accuracy rate and the quality are high. In the proprietary domain, such data is often difficult to obtain and requires a very high labor cost, and therefore, this part of the corpus is often used to train the initial baseline model.
The unmarked text, by contrast with the initial labeled corpus, carries no named entity labels and is unstructured plain text. Such text is abundant in every field, so at the start of a named entity recognition project a large amount of it is usually obtained directly, rather than standard annotated corpora. In some scenarios, however, to come closer to the semantics of the actual application, a crawler may need to be built to crawl relevant documents from the network, preparing data for later labeling.
Step S102, carrying out named entity labeling on the label-free text through a preset rule base to obtain a first labeled text;
the preset rule base can be a rule base or a dictionary base, and the sources of the dictionary base are as follows: the method comprises the steps of network crawling of various entity word banks, manual expert induction and summary, unsupervised learning of the existing unmarked linguistic data and extraction of the existing marked linguistic data. In order to obtain a dictionary with high quality and sufficient content, the dictionaries obtained by the methods are integrated and deduplicated and then handed to a human for checking once.
The rule base is obtained mainly by having human experts induce and summarize the business rules of the application scenario. In addition, some new rules can be generated by learning from the existing labeled corpora. These rules are also manually collated.
After the dictionary base and rule base are obtained, the unlabeled corpus is labeled by matching against the dictionary and the rules. Because of its characteristically low recall, the corpus labeled by dictionary and rules can serve as labels for weakly supervised learning, or be fused with model-predicted labels to obtain labeled corpora with higher recall; after manual proofreading it becomes manually corrected corpus and can be added to the initial labeled text for model training.
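Dictionary-based labeling of this kind can be sketched as a longest-match scan that emits BIO tags. The patent does not fix a matching algorithm; longest match is one common choice, and the lexicon entries and category names below are invented for illustration:

```python
def dictionary_label(text, entity_dict):
    """Label text with BIO tags by longest-match lookup in a dictionary.

    entity_dict maps entity string -> category, e.g. a lexicon built by
    crawling, expert curation, or extraction from labeled corpora.
    """
    tags = ["O"] * len(text)
    i = 0
    while i < len(text):
        # Try the longest dictionary entry starting at position i.
        match = None
        for j in range(len(text), i, -1):
            if text[i:j] in entity_dict:
                match = (j, entity_dict[text[i:j]])
                break
        if match:
            end, category = match
            tags[i] = "B-" + category
            for k in range(i + 1, end):
                tags[k] = "I-" + category
            i = end
        else:
            i += 1
    return tags

# Illustrative lexicon; the entries and categories are assumptions.
lexicon = {"北京市高级人民法院": "COURT", "张三": "PER"}
text = "张三向北京市高级人民法院提起上诉"
print(list(zip(text, dictionary_label(text, lexicon))))
```

Because such labeling only fires where the lexicon matches, it has the low-recall, high-precision character the description mentions, which is why its output is fused with model predictions rather than used alone.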
Step S103, performing data enhancement on the initial annotated text to obtain a data-enhanced annotated text;
the method aims to solve the problem of how to obtain a large amount of high-quality labeled data through a small amount of manpower when the data amount in the current field is insufficient. In the field of natural language processing, previous work on data enhancement has mainly focused on text classification, emotion analysis, and especially machine translation tasks, while data enhancement on NER tasks has not been fully explored. Existing work utilizes tag-level or sentence-level information for data enhancement, while for some scenes where the text is long and contains multiple sentences, the syntax or entity information of another sentence in the same document paragraph helps identify entity types and boundaries.
The enhancement consists of three sub-methods: random splicing, random exchange and random erasing, each exploiting document-level semantic context from a different angle. All three operations modify only the input stage of the overall architecture, so they are easily applied to various NER models without changing their model structure.
Random splicing samples training word sequences for model optimization without being limited to sentence boundaries. For example, if sentence B is the next sentence of A, a word sequence C can be extracted across the two. By randomly selecting a starting character and a sample length, one or more truncated parts of the sentences are obtained and joined into a new sentence. In general there is a logical relationship between adjacent sentences, particularly between the end of the upper sentence and the beginning of the lower one, so random splicing acquires richer semantic information, improves named entity recognition performance, and further reduces overfitting.
Random exchange randomly swaps existing entities between adjacent sentences in the corpus. For example, if sentence B is the next sentence of A, randomly transposing an entity in B with an entity in A yields new samples C and D. Sentences obtained this way let existing entities appear multiple times in different contexts, further reducing overfitting.
Random erasing randomly deletes non-entity parts of sentences in the corpus. For a sentence A, the entities in the sentence are kept, each remaining character is erased with a certain probability p, and the remaining content is taken as a new sample A'. Sentences obtained this way give the existing entities more attention, further reducing overfitting. Applying these three enhancement means to the existing labeled corpus yields a new labeled corpus.
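The three operations can be sketched over labeled samples represented as lists of (character, BIO-tag) pairs. The representation, function names and the span interface of the swap are assumptions for this sketch, not the patent's implementation:

```python
import random

def random_splice(sent_a, sent_b):
    """Splicing: take a random contiguous span from each of two adjacent
    labeled sentences and join them into one new labeled sample."""
    i = random.randrange(len(sent_a))
    j = random.randrange(len(sent_b))
    return sent_a[i:] + sent_b[:j + 1]

def random_swap(sent_a, sent_b, span_a, span_b):
    """Exchange: swap one entity span between two adjacent sentences.
    span_a/span_b are (start, end) slices of the entities to exchange."""
    ent_a = sent_a[span_a[0]:span_a[1]]
    ent_b = sent_b[span_b[0]:span_b[1]]
    new_a = sent_a[:span_a[0]] + ent_b + sent_a[span_a[1]:]
    new_b = sent_b[:span_b[0]] + ent_a + sent_b[span_b[1]:]
    return new_a, new_b

def random_erase(sent, p=0.3):
    """Erasing: drop each non-entity ('O'-tagged) character with
    probability p, keeping all entity characters."""
    return [(ch, tag) for ch, tag in sent
            if tag != "O" or random.random() >= p]

sent_a = [("原", "O"), ("告", "O"), ("张", "B-PER"), ("三", "I-PER")]
sent_b = [("被", "O"), ("告", "O"), ("李", "B-PER"), ("四", "I-PER")]
new_a, new_b = random_swap(sent_a, sent_b, (2, 4), (2, 4))
print(new_a)  # entity 李四 now appears in sentence A's context
```

Because all three operate purely on the input samples, they slot in front of any NER model without touching its architecture, as the description notes.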
Step S104, preprocessing the label text after the data enhancement and the first label text to obtain a processed label text;
in one embodiment, preferably, the preprocessing includes performing simplified and complicated conversion, word segmentation, word deactivation and uncommon punctuation symbol removal on the data-enhanced labeled text and the first labeled text.
Corpus data from different sources may, for various reasons, have different formats, and such problems, if untreated, also affect experimental performance. These data therefore need preprocessing before being fed into the neural network module. In a Chinese scenario, text generally undergoes simplified/traditional conversion, word segmentation, stop-word removal, removal of unusual punctuation and the like. A conversion tool such as OpenCC can be introduced for simplified/traditional conversion, the jieba tool for word segmentation, and a Chinese stop-word list to filter the segmentation result.
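The shape of such a preprocessing pass is sketched below with pure-Python stand-ins: a production pipeline would call OpenCC for the traditional-to-simplified step and jieba for word-level segmentation, whereas the conversion table and stop-word set here are tiny illustrative samples:

```python
import re

# Illustrative tables; a real pipeline would use OpenCC for
# traditional-to-simplified conversion and jieba for segmentation.
T2S = {"國": "国", "與": "与", "訴": "诉", "訟": "讼"}
STOPWORDS = {"的", "了", "与"}

def preprocess(text):
    # 1) Traditional -> simplified character conversion.
    text = "".join(T2S.get(ch, ch) for ch in text)
    # 2) Remove uncommon punctuation while keeping Chinese characters
    #    and common sentence marks.
    text = re.sub(r"[•◎※§]", "", text)
    # 3) Character-level "segmentation" plus stop-word filtering
    #    (jieba would yield word-level tokens instead).
    return [ch for ch in text if ch not in STOPWORDS]

print("".join(preprocess("原告的國與訴※訟")))
```

Normalizing all corpora through one such pass is what keeps differently sourced texts (business-side corpus, crawled documents, rule-labeled text) in a single consistent format for the neural network module.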
Step S105, performing iterative training by using the processed label text and the BERT model to obtain a named entity recognition model;
and for word vectors, pre-training is carried out through the existing massive unsupervised legal documents, and a domain-specific pre-training BERT model is generated. The whole workflow is divided into three stages of data preprocessing, training data generation, training performance optimization and pretraining effect optimization.
At the model level, the usual baseline models, such as CRF, Bi-LSTM and CNN, are used, with domain adaptation, parameter tuning and the like applied.
After one version of the model is trained, named entity recognition is performed on the unlabeled corpus with the trained model to obtain labeling results, which can then be verified by confidence calculation and manually.
Step S106, performing named entity recognition on an unmarked legal document to be recognized using the named entity recognition model to obtain a named entity recognition result.
In this embodiment, on the basis of existing deep learning built on general-domain BERT, a dedicated legal-domain BERT model is pre-trained, improving the domain adaptability of the word vectors; meanwhile, combining rule recognition with model recognition gives the system better recognition results at fine granularity; and the data enhancement technique reduces the labor cost of acquiring data.
FIG. 2 is a flow chart illustrating another legal document named entity identification method in accordance with an exemplary embodiment.
As shown in fig. 2, in one embodiment, preferably, the method further comprises:
step S201, using the named entity recognition model to label the named entity of the unlabeled text to obtain a second labeled text;
step S202, performing confidence calculation and manual verification on the first annotation text and the second annotation text;
for data obtained by labeling in multiple modes, a confidence calculation mode is needed, and labeled data with a good result is screened out:
E_standard = ((1 + b²) × P × R) / (b² × P + R)
wherein E_standard represents the confidence, P and R are the precision and recall respectively, and b is a weighting factor. Since the recall of recognition in practice tends to be low, a calculation that favors recall is adopted.
These data are then manually corrected and finally added to the corpus, yielding a new high-quality corpus. This greatly reduces the cost of manually annotating corpora from zero.
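The confidence calculation described above can be read as a weighted F-measure over precision P and recall R, where a weight b greater than 1 tilts the score toward recall. A minimal sketch under that reading; the default value of b is an illustrative assumption:

```python
def confidence(p, r, b=2.0):
    """Weighted F-measure used as a labeling-confidence score.

    p: precision, r: recall, b: weighting factor.  b > 1 favors
    recall, matching the observation that recall tends to be low in
    practice.  (b = 2.0 is an illustrative choice, not the patent's.)
    """
    if p == 0 and r == 0:
        return 0.0
    return (1 + b * b) * p * r / (b * b * p + r)

# With b > 1, a recall gain raises the score more than an equal
# precision gain does.
print(round(confidence(0.9, 0.5), 3))
print(round(confidence(0.5, 0.9), 3))
```

Samples scoring above a preset threshold would then go to manual verification before being mixed back into the training corpus.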
Step S203, the marked text with the confidence coefficient larger than the preset value and passing the manual verification is used as the manual corpus according to the preset proportion and added to the initial marked text so as to update the initial marked text.
Unlabeled corpora are labeled through rules and the dictionary; the neural network module, after training on labeled corpora, predicts labels for the unlabeled corpora; and these two kinds of labeled corpora undergo confidence calculation and manual verification to yield new labeled corpora of higher quality. Once obtained, this labeled corpus is added to the initial labeled corpus in a proportion set by the requirements of the application scenario and joins the next iteration of training, forming incremental training and maintenance.
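The filtering-and-mixing step can be sketched as follows. The threshold and proportion values are illustrative, and manual verification is assumed to have happened upstream of this function:

```python
import random

def update_training_set(initial, candidates, confidences,
                        threshold=0.8, proportion=0.5):
    """Add verified candidate samples to the initial labeled corpus.

    Keeps candidates whose confidence exceeds `threshold` and mixes
    only a `proportion` of them back in, mirroring the description's
    addition of new corpora to the initial labeled text in a preset
    ratio.  Threshold and proportion are illustrative assumptions.
    """
    accepted = [s for s, c in zip(candidates, confidences) if c > threshold]
    k = int(len(accepted) * proportion)
    return initial + random.sample(accepted, k)
```

Mixing in only a controlled proportion of machine-labeled samples keeps the expert-checked initial corpus dominant, limiting the drift that low-quality automatic labels could otherwise introduce across iterations.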
FIG. 3 is a flow chart illustrating another legal document named entity identification method in accordance with an exemplary embodiment.
As shown in fig. 3, in one embodiment, preferably, the method further comprises:
step S301, labeling the legal document to be identified without labeling according to the named entity identification result to obtain a labeled legal document;
step S302, performing confidence calculation and manual verification on the marked legal document;
step S303, adding the legal documents with the confidence degrees larger than the preset value and passing the manual verification as manual corpora into the initial labeled text so as to update the initial labeled text.
After the named entity recognition model is obtained through training, when the legal documents to be recognized are recognized through the named entity recognition model, the marked legal documents can be added to the initial marked text participating in model training after confidence calculation and manual verification are carried out on the marked legal documents, and incremental learning is achieved.
FIG. 4 is a flow chart illustrating another legal document named entity identification method in accordance with an exemplary embodiment.
As shown in fig. 4, in one embodiment, preferably, the method further comprises:
step S401, after each round of training and learning on one manually corrected corpus, re-labeling that corpus with the resulting named entity recognition model and counting the number of corrected labels;
step S402, when the number of corrected labels shows a divergent trend, discarding the manually corrected corpus and rolling the parameters of the named entity recognition model back to the state before learning.
Considering that the model may lack cross-domain adaptability, the method accumulates examples of model recognition errors in the incremental learning stage and lets the model learn from them, enhancing its generalization ability so that it suits named entity recognition in more open domains. Specifically, incremental learning with the manually corrected corpus proceeds in three stages: error learning, validity judgment, and incremental accumulation. Each time the system learns from one manually corrected corpus, it re-labels that corpus with the updated model and counts the corrected labels; if, over the iterations, the count shows a divergent trend, the incremental corpus is discarded and the parameters are rolled back to the state before learning. With this incremental learning mechanism, the system can be improved continuously and its recognition performance raised without incurring extra overhead.
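The error-learning / validity-judgment / rollback loop above can be sketched as follows. The "strictly increasing over a window" heuristic for detecting a divergent trend, the window size, and the callback interfaces are all assumptions made for illustration:

```python
import copy

def incremental_learn(model_params, train_fn, relabel_count_fn, corpus,
                      window=3):
    """Sketch of steps S401-S402: train on one manually corrected
    corpus, re-label it with the updated model, and count corrected
    labels each iteration; if the counts keep growing (a divergent
    trend), discard the corpus and roll the parameters back."""
    snapshot = copy.deepcopy(model_params)   # state before learning
    counts = []
    for _ in range(window):
        train_fn(model_params, corpus)       # one learning pass
        counts.append(relabel_count_fn(model_params, corpus))
        # Divergence heuristic (an assumption): strictly increasing
        # correction counts over the observation window.
        if len(counts) >= window and all(a < b for a, b in zip(counts, counts[1:])):
            model_params.clear()
            model_params.update(snapshot)    # roll back to pre-learning state
            return False                     # corpus discarded
    return True                              # corpus accepted
```

A corpus that keeps forcing more corrections on itself is judged invalid and dropped; otherwise it is accumulated and the updated parameters are kept.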
Fig. 5 is a block diagram illustrating a legal document named entity recognition apparatus in accordance with an exemplary embodiment.
As shown in fig. 5, according to a second aspect of the embodiments of the present invention, there is provided a legal document named entity recognition apparatus, including:
an obtaining module 51, configured to obtain an initial labeled text and an unlabeled text of a legal document, where a named entity is labeled in the initial labeled text and a named entity is not labeled in the unlabeled text;
the first labeling module 52 is configured to label named entities of the label-free text through a preset rule base to obtain a first labeled text;
the data enhancement module 53 is configured to perform data enhancement on the initial labeled text to obtain a labeled text after data enhancement;
a preprocessing module 54, configured to preprocess the data-enhanced labeled text and the first labeled text to obtain a processed labeled text;
the training module 55 is configured to perform iterative training by using the processed labeled text and the BERT model to obtain a named entity recognition model;
and the recognition module 56 is used for performing named entity recognition on the unlabeled legal document to be recognized by using the named entity recognition model to obtain a named entity recognition result.
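For illustration, converting the model's tag sequence into entity spans inside such a recognition module might look like the following sketch; the BIO tagging scheme is an assumption, since the patent does not name its labeling scheme:

```python
def decode_bio(tokens, tags):
    """Turn a BIO tag sequence produced by a named entity recognition
    model into (entity_text, entity_type) spans."""
    entities, buf, etype = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):             # a new entity begins
            if buf:
                entities.append(("".join(buf), etype))
            buf, etype = [tok], tag[2:]
        elif tag.startswith("I-") and buf and tag[2:] == etype:
            buf.append(tok)                  # continue the current entity
        else:                                # 'O' or inconsistent tag: flush
            if buf:
                entities.append(("".join(buf), etype))
            buf, etype = [], None
    if buf:                                  # flush a trailing entity
        entities.append(("".join(buf), etype))
    return entities
```

For example, tokens 张/三/在/北/京 with tags B-PER/I-PER/O/B-LOC/I-LOC decode to a person entity 张三 and a location entity 北京.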
In one embodiment, preferably, the data enhancement module is configured to:
performing data enhancement on the initial labeled text by using a random splicing method, a random exchange method, and/or a random erasing method.
In one embodiment, preferably, the random splicing method includes: randomly extracting a single sequence from each of two texts having a context relationship and splicing them into a new labeled text;
the random exchange method includes: randomly exchanging the named entities in two texts having a context relationship to obtain new labeled texts;
the random erasing method includes: erasing, with a preset probability, the characters other than named entities in each text, and taking the text content remaining after erasure as a new labeled sample.
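The three augmentation methods can be sketched roughly as follows. The (token list, label list) representation, the use of 'O' for non-entity positions, and the single-span swap are simplifying assumptions, not details from the patent:

```python
import random

def entity_spans(labels):
    """Start/end indices of contiguous non-'O' label spans."""
    spans, start = [], None
    for i, lab in enumerate(list(labels) + ["O"]):
        if lab != "O" and start is None:
            start = i
        elif lab == "O" and start is not None:
            spans.append((start, i))
            start = None
    return spans

def random_splice(a, b, rng):
    """Splice two context-related labeled texts, in random order,
    into one new labeled text."""
    (ta, la), (tb, lb) = rng.sample([a, b], 2)
    return ta + tb, la + lb

def random_swap(a, b, rng):
    """Swap one randomly chosen named-entity span between two
    context-related labeled texts."""
    (ta, la), (tb, lb) = a, b
    sa, sb = entity_spans(la), entity_spans(lb)
    if not sa or not sb:                     # nothing to swap
        return a, b
    i0, i1 = rng.choice(sa)
    j0, j1 = rng.choice(sb)
    na = (ta[:i0] + tb[j0:j1] + ta[i1:], la[:i0] + lb[j0:j1] + la[i1:])
    nb = (tb[:j0] + ta[i0:i1] + tb[j1:], lb[:j0] + la[i0:i1] + lb[j1:])
    return na, nb

def random_erase(tokens, labels, p, rng):
    """Erase non-entity characters with probability p; the remaining
    content is kept as a new labeled sample."""
    kept = [(t, l) for t, l in zip(tokens, labels)
            if l != "O" or rng.random() >= p]
    return [t for t, _ in kept], [l for _, l in kept]
```

All three keep labels aligned with tokens, so the augmented samples can be fed directly back into training.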
In one embodiment, preferably, the preprocessing includes performing traditional-to-simplified Chinese conversion, word segmentation, stop-word removal, and removal of uncommon punctuation marks on the data-enhanced labeled text and the first labeled text.
In one embodiment, preferably, the apparatus further comprises:
the second labeling module is used for labeling the named entities of the unlabeled text by using the named entity recognition model to obtain a second labeled text;
the first post-processing module is used for performing confidence calculation and manual verification on the first labeling text and the second labeling text;
and the first updating module is used for adding the labeled text whose confidence is greater than the preset value and which passes manual verification, as manual corpus and in a preset proportion, to the initial labeled text so as to update the initial labeled text.
In one embodiment, preferably, the apparatus further comprises:
the third labeling module is used for labeling the legal document to be identified without the label according to the named entity identification result to obtain a labeled legal document;
the second post-processing module is used for carrying out confidence calculation and manual verification on the marked legal documents;
and the second updating module is used for adding the legal documents whose confidence is greater than the preset value and which pass manual verification, as manual corpus, to the initial labeled text so as to update the initial labeled text.
In one embodiment, preferably, the apparatus further comprises:
the statistics module is used for re-labeling the manually corrected corpus with the resulting named entity recognition model after each round of training and learning on that corpus, and counting the number of corrected labels;
and the processing module is used for discarding the manually corrected corpus and rolling the parameters of the named entity recognition model back to the state before learning when the number of corrected labels shows a divergent trend.
According to a third aspect of embodiments of the present invention, there is provided a legal document named entity recognition apparatus, the apparatus comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to:
acquiring an initial marked text and an unmarked text of a legal document, wherein the named entity is marked in the initial marked text, and the named entity is not marked in the unmarked text;
carrying out named entity labeling on the label-free text through a preset rule base to obtain a first labeled text;
performing data enhancement on the initial marked text to obtain a marked text after data enhancement;
preprocessing the label text after the data enhancement and the first label text to obtain a processed label text;
performing iterative training by using the processed label text and the BERT model to obtain a named entity recognition model;
and performing named entity recognition on the unlabeled legal documents to be recognized by using the named entity recognition model to obtain a named entity recognition result.
According to a fourth aspect of embodiments of the present invention, there is provided a computer readable storage medium having stored thereon computer instructions which, when executed by a processor, implement the steps of the method according to any one of the embodiments of the second aspect.
It is further understood that the use of "a plurality" in the present invention means two or more, and other terms are intended to be analogous. "and/or" describes the association relationship of the associated objects, meaning that there may be three relationships, e.g., a and/or B, which may mean: a exists alone, A and B exist simultaneously, and B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. The singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It will be further understood that the terms "first," "second," and the like are used to describe various information and that such information should not be limited by these terms. These terms are only used to distinguish one type of information from another and do not denote a particular order or importance. Indeed, the terms "first," "second," and the like are fully interchangeable. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present invention.
It is further to be understood that while operations are depicted in the drawings in a particular order, this is not to be understood as requiring that such operations be performed in the particular order shown or in serial order, or that all illustrated operations be performed, to achieve desirable results. In certain environments, multitasking and parallel processing may be advantageous.
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.
It will be understood that the invention is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the invention is limited only by the appended claims.

Claims (10)

1. A legal document named entity recognition method, comprising:
acquiring an initial marked text and an unmarked text of a legal document, wherein the named entity is marked in the initial marked text, and the named entity is not marked in the unmarked text;
carrying out named entity labeling on the label-free text through a preset rule base to obtain a first labeled text;
performing data enhancement on the initial marked text to obtain a marked text after data enhancement;
preprocessing the label text after the data enhancement and the first label text to obtain a processed label text;
performing iterative training by using the processed label text and the BERT model to obtain a named entity recognition model;
and performing named entity recognition on the unlabeled legal documents to be recognized by using the named entity recognition model to obtain a named entity recognition result.
2. The method of claim 1, wherein data enhancing the initially annotated text comprises:
and performing data enhancement on the initial marked text by using a random splicing method, a random exchange method and/or a random erasing method.
3. The method of claim 1,
the random splicing method comprises the following steps: randomly extracting a single sequence from the two texts with the context relationship, and splicing the single sequence into a new text with the label;
the random exchange method comprises the steps of randomly exchanging named entities in two texts with context relation to obtain a new labeled text;
the random erasing method comprises the steps of erasing other characters except for named entities in each text according to preset probability, and taking the residual text content after erasing as a new labeled sample.
4. The method of claim 1,
the preprocessing comprises performing traditional-to-simplified Chinese conversion, word segmentation, stop-word removal, and removal of uncommon punctuation marks on the data-enhanced labeled text and the first labeled text.
5. The method of claim 1, further comprising:
carrying out named entity labeling on the label-free text by using the named entity recognition model to obtain a second labeled text;
performing confidence calculation and manual verification on the first labeling text and the second labeling text;
and adding the labeled text with the confidence coefficient larger than the preset value and passing the manual verification into the initial labeled text as a manual corpus according to a preset proportion so as to update the initial labeled text.
6. The method of claim 1, further comprising:
labeling the unlabeled legal document to be recognized according to the named entity recognition result to obtain a labeled legal document;
carrying out confidence calculation and manual verification on the marked legal documents;
and adding the legal documents with the confidence degrees larger than the preset value and passing the manual verification as manual corpora into the initial marked text so as to update the initial marked text.
7. The method of claim 1, further comprising:
after training and learning are carried out by using one artificial corpus each time, re-labeling the artificial corpus by using the obtained named entity recognition model, and counting the number of corrected labels;
and when the number of the correction labels presents a divergence trend, discarding the artificial linguistic data, and returning the parameters of the named entity recognition model to the state before learning.
8. An apparatus for legal document named entity recognition, the apparatus comprising:
the acquisition module is used for acquiring an initial labeled text and an unlabeled text of the legal document, wherein a named entity is labeled in the initial labeled text, and no named entity is labeled in the unlabeled text;
the first labeling module is used for labeling named entities of the label-free text through a preset rule base to obtain a first labeled text;
the data enhancement module is used for carrying out data enhancement on the initial labeled text to obtain a labeled text after the data enhancement;
the preprocessing module is used for preprocessing the label text after the data enhancement and the first label text to obtain a processed label text;
the training module is used for carrying out iterative training by utilizing the processed labeling text and the BERT model to obtain a named entity recognition model;
and the recognition module is used for performing named entity recognition on the unlabeled legal documents to be recognized by using the named entity recognition model to obtain the named entity recognition result.
9. An apparatus for legal document named entity recognition, the apparatus comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to:
acquiring an initial marked text and an unmarked text of a legal document, wherein the named entity is marked in the initial marked text, and the named entity is not marked in the unmarked text;
carrying out named entity labeling on the label-free text through a preset rule base to obtain a first labeled text;
performing data enhancement on the initial marked text to obtain a marked text after data enhancement;
preprocessing the label text after the data enhancement and the first label text to obtain a processed label text;
performing iterative training by using the processed labeling text and the BERT model to obtain a named entity recognition model;
and performing named entity recognition on the unlabeled legal documents to be recognized by using the named entity recognition model to obtain a named entity recognition result.
10. A computer-readable storage medium having stored thereon computer instructions, which, when executed by a processor, carry out the steps of the method according to any one of claims 1 to 7.
CN202211464487.XA 2022-11-22 2022-11-22 Legal document named entity identification method, device and storage medium Pending CN115859979A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211464487.XA CN115859979A (en) 2022-11-22 2022-11-22 Legal document named entity identification method, device and storage medium


Publications (1)

Publication Number Publication Date
CN115859979A 2023-03-28

Family

ID=85664781

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211464487.XA Pending CN115859979A (en) 2022-11-22 2022-11-22 Legal document named entity identification method, device and storage medium

Country Status (1)

Country Link
CN (1) CN115859979A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111738004A (en) * 2020-06-16 2020-10-02 中国科学院计算技术研究所 Training method of named entity recognition model and named entity recognition method
CN112818691A (en) * 2021-02-01 2021-05-18 北京金山数字娱乐科技有限公司 Named entity recognition model training method and device
CN114372465A (en) * 2021-09-29 2022-04-19 武汉工程大学 Legal named entity identification method based on Mixup and BQRNN
US20220129632A1 (en) * 2020-10-22 2022-04-28 Boe Technology Group Co., Ltd. Normalized processing method and apparatus of named entity, and electronic device



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20230328