CN115859979A - Legal document named entity identification method, device and storage medium - Google Patents

Legal document named entity identification method, device and storage medium

Info

Publication number
CN115859979A
CN115859979A CN202211464487.XA
Authority
CN
China
Prior art keywords
text
named entity
marked
label
labeled
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211464487.XA
Other languages
Chinese (zh)
Inventor
肖熊锋
李庆
杜向阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Qingdun Information Technology Co ltd
Original Assignee
Beijing Qingdun Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Qingdun Information Technology Co ltd filed Critical Beijing Qingdun Information Technology Co ltd
Priority to CN202211464487.XA priority Critical patent/CN115859979A/en
Publication of CN115859979A publication Critical patent/CN115859979A/en
Pending legal-status Critical Current


Abstract

The invention relates to the technical field of data processing, and in particular to a legal document named entity identification method, device and storage medium. The method comprises the following steps: acquiring an initial marked text and an unmarked text of a legal document; carrying out named entity labeling on the unmarked text through a preset rule base to obtain a first labeled text; performing data enhancement on the initial marked text to obtain a data-enhanced marked text; preprocessing the data-enhanced marked text and the first labeled text to obtain a processed labeled text; performing iterative training using the processed labeled text and a BERT model to obtain a named entity recognition model; and performing named entity recognition on an unmarked legal document to be recognized using the named entity recognition model to obtain a named entity recognition result. Through this technical scheme, the labor cost of acquiring data is reduced, domain adaptability is improved, and fine-grained application scenarios are better accommodated.

Description

Legal document named entity identification method, device and storage medium
Technical Field
The invention relates to the technical field of data processing, and in particular to a legal document named entity identification method, device and storage medium.
Background
Named entity recognition (NER) is one of the basic tasks of natural language processing. Its goal is to extract named entities from text and classify them into types such as person names, place names, organizations, times, currencies and percentages. It is widely used in tasks such as information extraction, question-answering systems, syntactic analysis, information retrieval and sentiment analysis.
Overall, the general trend of NER research in recent years is as follows. Early methods built NER systems mainly on rules and dictionaries, such as LaSIE-II at the University of Sheffield and NetOwl at IsoQuest. By the early 2000s, probabilistic graphical models such as CRF were widely used. Later, with the rise of deep learning, BiLSTM-CRF became a research focus, and until recently many methods were variants of BiLSTM-CRF. Both LSTM and CRF need to merge context information within a sequence, and the attention mechanism, which adaptively computes weights for different context objects, is therefore a useful complement. In addition, as pre-training technology has developed, NER models based on pre-trained models such as BERT now dominate. Deep learning methods generally need a large amount of data to learn well, but in actual production, missing data sets and scarce labels are common; transfer learning and semi-supervision can alleviate these problems to some extent, but the best-performing existing methods are still based on supervised learning.
The prior art mainly has the following defects:
1) Owing to the semantic complexity and diversity of legal texts, existing methods, which mainly target general domains, generalize poorly to the legal domain;
2) High-quality labeled corpora are scarce and usually depend on domain experts, and even where such corpora exist, their small quantity harms generalization;
3) There are many entity categories to extract and high similarity between them, so category identification errors occur easily; deep learning models handle fine-grained entity categories poorly;
4) Pre-trained language models are mainly built on general-domain corpora, and few pre-trained models specific to the legal field exist.
Disclosure of Invention
In order to overcome the problems in the related art, the invention provides a legal document named entity identification method, device and storage medium, so that the labor cost of acquiring data is reduced, domain adaptability is improved, and fine-grained application scenarios are better accommodated.
According to a first aspect of an embodiment of the present invention, there is provided a legal document named entity identification method, including:
acquiring an initial marked text and an unmarked text of a legal document, wherein the named entity is marked in the initial marked text, and the named entity is not marked in the unmarked text;
carrying out named entity labeling on the label-free text through a preset rule base to obtain a first labeled text;
performing data enhancement on the initial marked text to obtain a marked text after data enhancement;
preprocessing the label text after the data enhancement and the first label text to obtain a processed label text;
performing iterative training by using the processed labeling text and the BERT model to obtain a named entity recognition model;
and performing named entity recognition on an unmarked legal document to be recognized using the named entity recognition model to obtain a named entity recognition result.
In one embodiment, preferably, the data enhancement of the initially annotated text comprises:
and performing data enhancement on the initial marked text by using a random splicing method, a random exchange method and/or a random erasing method.
In one embodiment, preferably, the random splicing method includes: randomly extracting a single sequence from each of two texts having a context relationship, and splicing the sequences into a new labeled text;
the random exchange method includes randomly exchanging named entities in two texts having a context relationship to obtain a new labeled text;
the random erasing method includes erasing characters other than named entities in each text with a preset probability, and taking the remaining text content after erasing as a new labeled sample.
In one embodiment, preferably, the preprocessing includes performing simplified/traditional Chinese conversion, word segmentation, stop-word removal and removal of uncommon punctuation on the data-enhanced labeled text and the first labeled text.
In one embodiment, preferably, the method further comprises:
carrying out named entity labeling on the label-free text by using the named entity recognition model to obtain a second labeled text;
performing confidence calculation and manual verification on the first labeling text and the second labeling text;
and adding labeled text whose confidence is greater than a preset value and which passes manual verification, as a manual corpus, to the initial labeled text in a preset proportion, so as to update the initial labeled text.
In one embodiment, preferably, the method further comprises:
marking the legal document to be identified without marking according to the named entity identification result to obtain a marked legal document;
carrying out confidence calculation and manual verification on the marked legal documents;
and adding the legal documents with the confidence degrees larger than the preset value and passing the manual verification as manual corpora into the initial marked text so as to update the initial marked text.
In one embodiment, preferably, the method further comprises:
after each artificial corpus is used for training and learning, the obtained named entity recognition model is used for re-labeling the artificial corpus, and the number of corrected labels is counted;
and when the number of corrected labels shows a diverging trend, discarding the manual corpus and returning the parameters of the named entity recognition model to their pre-learning state.
According to a second aspect of embodiments of the present invention, there is provided a legal document named entity recognition apparatus, the apparatus comprising:
the system comprises an acquisition module, a storage module and a display module, wherein the acquisition module is used for acquiring an initial marked text and an unmarked text of the legal document, wherein the named entity is marked in the initial marked text, and the named entity is not marked in the unmarked text;
the first labeling module is used for labeling named entities of the label-free text through a preset rule base to obtain a first labeled text;
the data enhancement module is used for enhancing the data of the initial marked text to obtain a marked text with enhanced data;
the preprocessing module is used for preprocessing the label text after the data enhancement and the first label text to obtain a processed label text;
the training module is used for carrying out iterative training by utilizing the processed labeling text and the BERT model to obtain a named entity recognition model;
and the identification module is used for performing named entity recognition on an unmarked legal document to be recognized using the named entity recognition model to obtain a named entity recognition result.
In one embodiment, preferably, the data enhancement module is configured to:
and performing data enhancement on the initial marked text by using a random splicing method, a random exchange method and/or a random erasing method.
In one embodiment, preferably, the random splicing method includes: randomly extracting a single sequence from each of two texts having a context relationship, and splicing the sequences into a new labeled text;
the random exchange method includes randomly exchanging named entities in two texts having a context relationship to obtain a new labeled text;
the random erasing method includes erasing characters other than named entities in each text with a preset probability, and taking the remaining text content after erasing as a new labeled sample.
In one embodiment, preferably, the preprocessing includes performing simplified/traditional Chinese conversion, word segmentation, stop-word removal and removal of uncommon punctuation on the data-enhanced labeled text and the first labeled text.
In one embodiment, preferably, the apparatus further comprises:
the second labeling module is used for labeling the named entities of the unlabeled text by using the named entity recognition model to obtain a second labeled text;
the first post-processing module is used for performing confidence calculation and manual verification on the first annotation text and the second annotation text;
and the first updating module is used for adding the labeled text with the confidence coefficient larger than the preset value and passing the manual verification into the initial labeled text as the manual corpus according to the preset proportion so as to update the initial labeled text.
In one embodiment, preferably, the apparatus further comprises:
the third labeling module is used for labeling the legal document to be identified without the label according to the named entity identification result to obtain a labeled legal document;
the second post-processing module is used for performing confidence calculation and manual verification on the marked legal document;
and the second updating module is used for adding the legal documents with the confidence degrees larger than the preset value and passing the manual verification as manual corpora into the initial marked text so as to update the initial marked text.
In one embodiment, preferably, the apparatus further comprises:
the statistical module is used for re-labeling the artificial corpus by using the obtained named entity recognition model after training and learning by using one artificial corpus each time, and counting the number of corrected labels;
and the processing module is used for discarding the manual corpus when the number of corrected labels shows a diverging trend, and returning the parameters of the named entity recognition model to their pre-learning state.
According to a third aspect of embodiments of the present invention, there is provided a legal document named entity recognition device, the device comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to:
acquiring an initial marked text and an unmarked text of a legal document, wherein the named entity is marked in the initial marked text, and the named entity is not marked in the unmarked text;
carrying out named entity labeling on the label-free text through a preset rule base to obtain a first labeled text;
performing data enhancement on the initial marked text to obtain a marked text after data enhancement;
preprocessing the label text after the data enhancement and the first label text to obtain a processed label text;
performing iterative training by using the processed labeling text and the BERT model to obtain a named entity recognition model;
and performing named entity recognition on an unmarked legal document to be recognized using the named entity recognition model to obtain a named entity recognition result.
According to a fourth aspect of embodiments of the present invention, there is provided a computer readable storage medium having stored thereon computer instructions which, when executed by a processor, implement the steps of the method according to any one of the embodiments of the first aspect.
The technical scheme provided by the embodiment of the invention can have the following beneficial effects:
1) Compared with the prior art, which mostly relies on general-domain BERT pre-trained models, the invention trains a proprietary legal-domain BERT pre-trained model on massive domain data, giving the word vectors stronger domain adaptability.
2) The method combines rule recognition with model recognition, incorporates schemes such as new-word discovery, lexicon construction and manually built rules into the rule recognition, and adopts a domain pre-trained model, so that the model better fits fine-grained application scenarios.
3) The method uses a chapter-level data enhancement technique that suits the long texts of legal documents and, compared with prior methods, reduces the data acquisition cost of supervised deep learning.
4) By adding an incremental learning mechanism to the model, the system can improve continuously and its recognition performance can increase without extra overhead.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
FIG. 1A is a schematic diagram of a neural network model shown in accordance with an exemplary embodiment.
FIG. 1B is a block diagram illustrating a BERT model according to an exemplary embodiment.
FIG. 1C is a flow chart illustrating a legal document named entity identification method in accordance with an exemplary embodiment.
FIG. 2 is a flow chart illustrating another legal document named entity identification method in accordance with an exemplary embodiment.
FIG. 3 is a flow chart illustrating another legal document named entity identification method in accordance with an exemplary embodiment.
FIG. 4 is a flow chart illustrating another legal document named entity identification method in accordance with an exemplary embodiment.
FIG. 5 is a block diagram illustrating a legal document named entity recognition device, according to an example embodiment.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the following exemplary examples do not represent all implementations consistent with the present invention. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the invention, as detailed in the appended claims.
The invention constructs a neural network model for entity classification by combining a bidirectional long short-term memory network with a conditional random field (BiLSTM-CRF); the model structure is shown in FIG. 1A. The method combines the contextual information of words, introduces distributed representations of words into feature extraction, and makes maximal use of the relationship between words and labels, thereby fully improving the recognition effect.
The BiLSTM-CRF is fine-tuned on the basis of a pre-trained language model: the language model is first pre-trained, unsupervised, on massive corpora so as to represent sentence semantics, and is then fine-tuned on the labeled corpora.
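The CRF layer of such a BiLSTM-CRF model selects the output label sequence by Viterbi decoding over the emission scores produced by the BiLSTM and learned transition scores. The following is a minimal pure-Python sketch of only the decoding step; the labels, scores and the toy transition table are illustrative assumptions, not values from the patent:

```python
def viterbi_decode(emissions, transitions, labels):
    """Find the highest-scoring label sequence.

    emissions: list of dicts, one per token, mapping label -> emission score.
    transitions: dict mapping (prev_label, label) -> transition score.
    labels: list of all label names.
    """
    # Scores for the first token come from emissions alone.
    score = {lab: emissions[0][lab] for lab in labels}
    backpointers = []
    for emit in emissions[1:]:
        new_score, bp = {}, {}
        for lab in labels:
            # Best previous label for this label at this position.
            best_prev = max(labels, key=lambda p: score[p] + transitions[(p, lab)])
            new_score[lab] = score[best_prev] + transitions[(best_prev, lab)] + emit[lab]
            bp[lab] = best_prev
        score, backpointers = new_score, backpointers + [bp]
    # Trace back from the best final label.
    best = max(labels, key=lambda lab: score[lab])
    path = [best]
    for bp in reversed(backpointers):
        path.append(bp[path[-1]])
    return list(reversed(path))

# Toy 3-token example with B/I/O labels: the transition table forbids O -> I.
labels = ["B", "I", "O"]
transitions = {(p, c): 0.0 for p in labels for c in labels}
transitions[("O", "I")] = -100.0  # an I tag must not follow O
emissions = [
    {"B": 2.0, "I": 0.0, "O": 1.0},
    {"B": 0.0, "I": 1.5, "O": 1.0},
    {"B": 0.0, "I": 0.0, "O": 2.0},
]
print(viterbi_decode(emissions, transitions, labels))
```

The transition scores are what lets the CRF layer enforce label-sequence constraints (such as "I never follows O") that a softmax over per-token emissions alone cannot.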
The embedding layer converts a Chinese text sequence into dense vector (distributed) representations of characters or words. The BERT model is a pre-trained language model that captures character-level and sentence-level features; to capture context information, BERT uses a bidirectional Transformer as its encoder and models text through an attention mechanism. The input to the BERT model is the concatenation of character embeddings, position embeddings and sentence embeddings; feature extraction is performed by the stacked Transformer blocks, and the resulting output sequence vectors serve as the character representations. The structure of the BERT model is shown in FIG. 1B.
Wherein [CLS] and [SEP] are BERT's sequence markers: [CLS] marks the start of the sequence and [SEP] marks sentence boundaries; E denotes the distributed representation of each character; Trm denotes a Transformer block stacked in the BERT model; and T denotes a sequence vector output by the BERT model.
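The packing of one or two sentences into BERT's input format with these markers can be sketched in a few lines. This is a minimal illustration, assuming character-level tokenization (as in Chinese BERT models); the example sentences are invented:

```python
def build_bert_input(sentence_a, sentence_b=None):
    """Pack one or two sentences into BERT's input format.

    Returns (tokens, segment_ids): [CLS] marks the start of the
    sequence, [SEP] marks each sentence boundary, and the segment ids
    (sentence embeddings) distinguish the two sentences.
    Chinese BERT tokenizes at the character level.
    """
    tokens = ["[CLS]"] + list(sentence_a) + ["[SEP]"]
    segment_ids = [0] * len(tokens)
    if sentence_b is not None:
        tokens += list(sentence_b) + ["[SEP]"]
        segment_ids += [1] * (len(sentence_b) + 1)
    return tokens, segment_ids

tokens, segments = build_bert_input("原告张三", "被告李四")
print(tokens)    # character-level tokens with [CLS]/[SEP] markers
print(segments)  # 0 for the first sentence span, 1 for the second
```

In a real pipeline these tokens and segment ids would then be mapped to vocabulary indices and fed, together with position ids, into the stacked Transformer encoder.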
FIG. 1C is a flow chart illustrating a legal document named entity identification method in accordance with an exemplary embodiment.
As shown in fig. 1C, according to a first aspect of the embodiments of the present invention, there is provided a legal document named entity identification method, including:
step S101, acquiring an initial marked text and an unmarked text of a legal document, wherein the named entities are marked in the initial marked text, and the named entities are not marked in the unmarked text;
the initial marked text, also called initial marked linguistic data, is usually provided by a service side of an application scene, the linguistic data already marks a named entity, and a label is marked and corrected by a domain expert, so that the accuracy rate and the quality are high. In the proprietary domain, such data is often difficult to obtain and requires a very high labor cost, and therefore, this part of the corpus is often used to train the initial baseline model.
The unmarked text, by contrast with the initial labeled corpus, carries no named entity labels and is unstructured plain text. Such text is abundant in every field, so at the start of a named entity recognition project a large amount of it is usually obtained directly, rather than standard annotated corpora. In some scenarios, however, to come closer to the semantics of the actual application, a crawler may need to be built to crawl relevant documents from the network, preparing data for later labeling.
Step S102, carrying out named entity labeling on the label-free text through a preset rule base to obtain a first labeled text;
the preset rule base can be a rule base or a dictionary base, and the sources of the dictionary base are as follows: the method comprises the steps of network crawling of various entity word banks, manual expert induction and summary, unsupervised learning of the existing unmarked linguistic data and extraction of the existing marked linguistic data. In order to obtain a dictionary with high quality and sufficient content, the dictionaries obtained by the methods are integrated and deduplicated and then handed to a human for checking once.
The rule base is obtained mainly by having human experts induce and summarize the business rules of the application scenario. In addition, some new rules can be generated by learning from the existing labeled corpora. These rules are also manually collated.
After the dictionary base and rule base are obtained, the unlabeled corpus is labeled by matching against the dictionary and the rules. Because of its characteristically low recall, the corpus labeled by dictionary and rules can serve as labels for weakly supervised learning, or be fused with model-predicted labels to obtain labeled corpora with higher recall; after manual proofreading it becomes manually corrected corpus and can be added to the initial labeled text for model training.
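Dictionary-based labeling of this kind can be sketched as a longest-match scan that emits BIO tags. The patent does not fix a matching algorithm; longest match is one common choice, and the lexicon entries and category names below are invented for illustration:

```python
def dictionary_label(text, entity_dict):
    """Label text with BIO tags by longest-match lookup in a dictionary.

    entity_dict maps entity string -> category, e.g. a lexicon built by
    crawling, expert curation, or extraction from labeled corpora.
    """
    tags = ["O"] * len(text)
    i = 0
    while i < len(text):
        # Try the longest dictionary entry starting at position i.
        match = None
        for j in range(len(text), i, -1):
            if text[i:j] in entity_dict:
                match = (j, entity_dict[text[i:j]])
                break
        if match:
            end, category = match
            tags[i] = "B-" + category
            for k in range(i + 1, end):
                tags[k] = "I-" + category
            i = end
        else:
            i += 1
    return tags

# Illustrative lexicon; the entries and categories are assumptions.
lexicon = {"北京市高级人民法院": "COURT", "张三": "PER"}
text = "张三向北京市高级人民法院提起上诉"
print(list(zip(text, dictionary_label(text, lexicon))))
```

Because such labeling only fires where the lexicon matches, it has the low-recall, high-precision character the description mentions, which is why its output is fused with model predictions rather than used alone.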
Step S103, performing data enhancement on the initial annotated text to obtain a data-enhanced annotated text;
the method aims to solve the problem of how to obtain a large amount of high-quality labeled data through a small amount of manpower when the data amount in the current field is insufficient. In the field of natural language processing, previous work on data enhancement has mainly focused on text classification, emotion analysis, and especially machine translation tasks, while data enhancement on NER tasks has not been fully explored. Existing work utilizes tag-level or sentence-level information for data enhancement, while for some scenes where the text is long and contains multiple sentences, the syntax or entity information of another sentence in the same document paragraph helps identify entity types and boundaries.
The enhancement consists of three sub-methods: random splicing, random exchange and random erasing, each exploiting document-level semantic context from a different angle. All three operations modify only the input stage of the overall architecture, so they are easily applied to various NER models without changing their model structure.
Random splicing samples training word sequences for model optimization without being limited to sentence boundaries. For example, if sentence B is the next sentence of A, a word sequence C can be extracted across the two. By randomly selecting a starting character and a sample length, one or more truncated parts of the sentences are obtained and joined into a new sentence. In general there is a logical relationship between adjacent sentences, particularly between the end of the upper sentence and the beginning of the lower one, so random splicing acquires richer semantic information, improves named entity recognition performance, and further reduces overfitting.
Random exchange randomly swaps existing entities between adjacent sentences in the corpus. For example, if sentence B is the next sentence of A, randomly transposing an entity in B with an entity in A yields new samples C and D. Sentences obtained this way let existing entities appear multiple times in different contexts, further reducing overfitting.
Random erasing randomly deletes non-entity parts of sentences in the corpus. For a sentence A, the entities in the sentence are kept, each remaining character is erased with a certain probability p, and the remaining content is taken as a new sample A'. Sentences obtained this way give the existing entities more attention, further reducing overfitting. Applying these three enhancement means to the existing labeled corpus yields a new labeled corpus.
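The three operations can be sketched over labeled samples represented as lists of (character, BIO-tag) pairs. The representation, function names and the span interface of the swap are assumptions for this sketch, not the patent's implementation:

```python
import random

def random_splice(sent_a, sent_b):
    """Splicing: take a random contiguous span from each of two adjacent
    labeled sentences and join them into one new labeled sample."""
    i = random.randrange(len(sent_a))
    j = random.randrange(len(sent_b))
    return sent_a[i:] + sent_b[:j + 1]

def random_swap(sent_a, sent_b, span_a, span_b):
    """Exchange: swap one entity span between two adjacent sentences.
    span_a/span_b are (start, end) slices of the entities to exchange."""
    ent_a = sent_a[span_a[0]:span_a[1]]
    ent_b = sent_b[span_b[0]:span_b[1]]
    new_a = sent_a[:span_a[0]] + ent_b + sent_a[span_a[1]:]
    new_b = sent_b[:span_b[0]] + ent_a + sent_b[span_b[1]:]
    return new_a, new_b

def random_erase(sent, p=0.3):
    """Erasing: drop each non-entity ('O'-tagged) character with
    probability p, keeping all entity characters."""
    return [(ch, tag) for ch, tag in sent
            if tag != "O" or random.random() >= p]

sent_a = [("原", "O"), ("告", "O"), ("张", "B-PER"), ("三", "I-PER")]
sent_b = [("被", "O"), ("告", "O"), ("李", "B-PER"), ("四", "I-PER")]
new_a, new_b = random_swap(sent_a, sent_b, (2, 4), (2, 4))
print(new_a)  # entity 李四 now appears in sentence A's context
```

Because all three operate purely on the input samples, they slot in front of any NER model without touching its architecture, as the description notes.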
Step S104, preprocessing the label text after the data enhancement and the first label text to obtain a processed label text;
in one embodiment, preferably, the preprocessing includes performing simplified and complicated conversion, word segmentation, word deactivation and uncommon punctuation symbol removal on the data-enhanced labeled text and the first labeled text.
Corpus data from different sources may, for various reasons, have different formats, and such problems, if untreated, also affect experimental performance. These data therefore need preprocessing before being fed into the neural network module. In a Chinese scenario, text generally undergoes simplified/traditional conversion, word segmentation, stop-word removal, removal of unusual punctuation and the like. A conversion tool such as OpenCC can be introduced for simplified/traditional conversion, the jieba tool for word segmentation, and a Chinese stop-word list to filter the segmentation result.
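The shape of such a preprocessing pass is sketched below with pure-Python stand-ins: a production pipeline would call OpenCC for the traditional-to-simplified step and jieba for word-level segmentation, whereas the conversion table and stop-word set here are tiny illustrative samples:

```python
import re

# Illustrative tables; a real pipeline would use OpenCC for
# traditional-to-simplified conversion and jieba for segmentation.
T2S = {"國": "国", "與": "与", "訴": "诉", "訟": "讼"}
STOPWORDS = {"的", "了", "与"}

def preprocess(text):
    # 1) Traditional -> simplified character conversion.
    text = "".join(T2S.get(ch, ch) for ch in text)
    # 2) Remove uncommon punctuation while keeping Chinese characters
    #    and common sentence marks.
    text = re.sub(r"[•◎※§]", "", text)
    # 3) Character-level "segmentation" plus stop-word filtering
    #    (jieba would yield word-level tokens instead).
    return [ch for ch in text if ch not in STOPWORDS]

print("".join(preprocess("原告的國與訴※訟")))
```

Normalizing all corpora through one such pass is what keeps differently sourced texts (business-side corpus, crawled documents, rule-labeled text) in a single consistent format for the neural network module.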
Step S105, performing iterative training by using the processed label text and the BERT model to obtain a named entity recognition model;
and for word vectors, pre-training is carried out through the existing massive unsupervised legal documents, and a domain-specific pre-training BERT model is generated. The whole workflow is divided into three stages of data preprocessing, training data generation, training performance optimization and pretraining effect optimization.
At the model level, the usual baseline models, such as CRF, Bi-LSTM and CNN, are used, with domain adaptation, parameter tuning and the like applied.
After one version of the model is trained, named entity recognition is performed on the unlabeled corpus with the trained model to obtain labeling results, which can then be verified by confidence calculation and manually.
Step S106, performing named entity recognition on an unmarked legal document to be recognized using the named entity recognition model to obtain a named entity recognition result.
In this embodiment, on the basis of existing deep learning built on general-domain BERT, a dedicated legal-domain BERT model is pre-trained, improving the domain adaptability of the word vectors; meanwhile, combining rule recognition with model recognition gives the system better recognition results at fine granularity; and the data enhancement technique reduces the labor cost of acquiring data.
FIG. 2 is a flow chart illustrating another legal document named entity identification method in accordance with an exemplary embodiment.
As shown in fig. 2, in one embodiment, preferably, the method further comprises:
step S201, using the named entity recognition model to label the named entity of the unlabeled text to obtain a second labeled text;
step S202, performing confidence calculation and manual verification on the first annotation text and the second annotation text;
for data obtained by labeling in multiple modes, a confidence calculation mode is needed, and labeled data with a good result is screened out:
E_standard = ((1 + b²) × P × R) / (b² × P + R)
wherein E_standard represents the confidence, P and R are the precision and recall respectively, and b is a weighting factor. Since the recall of recognition in practice tends to be low, a calculation that favors recall is adopted.
These data are then manually corrected and finally added to the corpus, yielding a new high-quality corpus. This greatly reduces the cost of manually annotating corpora from zero.
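The confidence calculation described above can be read as a weighted F-measure over precision P and recall R, where a weight b greater than 1 tilts the score toward recall. A minimal sketch under that reading; the default value of b is an illustrative assumption:

```python
def confidence(p, r, b=2.0):
    """Weighted F-measure used as a labeling-confidence score.

    p: precision, r: recall, b: weighting factor.  b > 1 favors
    recall, matching the observation that recall tends to be low in
    practice.  (b = 2.0 is an illustrative choice, not the patent's.)
    """
    if p == 0 and r == 0:
        return 0.0
    return (1 + b * b) * p * r / (b * b * p + r)

# With b > 1, a recall gain raises the score more than an equal
# precision gain does.
print(round(confidence(0.9, 0.5), 3))
print(round(confidence(0.5, 0.9), 3))
```

Samples scoring above a preset threshold would then go to manual verification before being mixed back into the training corpus.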
Step S203, the marked text with the confidence coefficient larger than the preset value and passing the manual verification is used as the manual corpus according to the preset proportion and added to the initial marked text so as to update the initial marked text.
Unlabeled corpora are labeled through rules and the dictionary; the neural network module, after training on labeled corpora, predicts labels for the unlabeled corpora; and these two kinds of labeled corpora undergo confidence calculation and manual verification to yield new labeled corpora of higher quality. Once obtained, this labeled corpus is added to the initial labeled corpus in a proportion set by the requirements of the application scenario and joins the next iteration of training, forming incremental training and maintenance.
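The filtering-and-mixing step can be sketched as follows. The threshold and proportion values are illustrative, and manual verification is assumed to have happened upstream of this function:

```python
import random

def update_training_set(initial, candidates, confidences,
                        threshold=0.8, proportion=0.5):
    """Add verified candidate samples to the initial labeled corpus.

    Keeps candidates whose confidence exceeds `threshold` and mixes
    only a `proportion` of them back in, mirroring the description's
    addition of new corpora to the initial labeled text in a preset
    ratio.  Threshold and proportion are illustrative assumptions.
    """
    accepted = [s for s, c in zip(candidates, confidences) if c > threshold]
    k = int(len(accepted) * proportion)
    return initial + random.sample(accepted, k)
```

Mixing in only a controlled proportion of machine-labeled samples keeps the expert-checked initial corpus dominant, limiting the drift that low-quality automatic labels could otherwise introduce across iterations.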
FIG. 3 is a flow chart illustrating another legal document named entity identification method in accordance with an exemplary embodiment.
As shown in fig. 3, in one embodiment, preferably, the method further comprises:
step S301, labeling the legal document to be identified without labeling according to the named entity identification result to obtain a labeled legal document;
step S302, performing confidence calculation and manual verification on the marked legal document;
step S303, adding the legal documents with the confidence degrees larger than the preset value and passing the manual verification as manual corpora into the initial labeled text so as to update the initial labeled text.
After the named entity recognition model is obtained through training, when the legal documents to be recognized are recognized through the named entity recognition model, the marked legal documents can be added to the initial marked text participating in model training after confidence calculation and manual verification are carried out on the marked legal documents, and incremental learning is achieved.
FIG. 4 is a flow chart illustrating another legal document named entity identification method in accordance with an exemplary embodiment.
As shown in fig. 4, in one embodiment, preferably, the method further comprises:
step S401, after each round of training and learning on one manually corrected corpus, re-labeling that corpus with the resulting named entity recognition model and counting the number of corrected labels;
step S402, when the number of corrected labels shows a divergent trend, discarding the manually corrected corpus and rolling the parameters of the named entity recognition model back to the state before learning.
Considering that the model may lack cross-domain adaptability, the method accumulates examples of model recognition errors in the incremental learning stage and lets the model learn from them, enhancing its generalization ability so that it suits named entity recognition in more open domains. Specifically, incremental learning with the manually corrected corpus proceeds in three stages: error learning, validity judgment, and incremental accumulation. Each time the system learns from one manually corrected corpus, it re-labels that corpus with the updated model and counts the corrected labels; if, over the iterations, the count shows a divergent trend, the incremental corpus is discarded and the parameters are rolled back to the state before learning. With this incremental learning mechanism, the system can be improved continuously and its recognition performance raised without incurring extra overhead.
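The error-learning / validity-judgment / rollback loop above can be sketched as follows. The "strictly increasing over a window" heuristic for detecting a divergent trend, the window size, and the callback interfaces are all assumptions made for illustration:

```python
import copy

def incremental_learn(model_params, train_fn, relabel_count_fn, corpus,
                      window=3):
    """Sketch of steps S401-S402: train on one manually corrected
    corpus, re-label it with the updated model, and count corrected
    labels each iteration; if the counts keep growing (a divergent
    trend), discard the corpus and roll the parameters back."""
    snapshot = copy.deepcopy(model_params)   # state before learning
    counts = []
    for _ in range(window):
        train_fn(model_params, corpus)       # one learning pass
        counts.append(relabel_count_fn(model_params, corpus))
        # Divergence heuristic (an assumption): strictly increasing
        # correction counts over the observation window.
        if len(counts) >= window and all(a < b for a, b in zip(counts, counts[1:])):
            model_params.clear()
            model_params.update(snapshot)    # roll back to pre-learning state
            return False                     # corpus discarded
    return True                              # corpus accepted
```

A corpus that keeps forcing more corrections on itself is judged invalid and dropped; otherwise it is accumulated and the updated parameters are kept.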
Fig. 5 is a block diagram illustrating a legal document named entity recognition apparatus in accordance with an exemplary embodiment.
As shown in fig. 5, according to a second aspect of the embodiments of the present invention, there is provided a legal document named entity recognition apparatus, including:
an obtaining module 51, configured to obtain an initial labeled text and an unlabeled text of a legal document, where a named entity is labeled in the initial labeled text and a named entity is not labeled in the unlabeled text;
the first labeling module 52 is configured to label named entities of the label-free text through a preset rule base to obtain a first labeled text;
the data enhancement module 53 is configured to perform data enhancement on the initial labeled text to obtain a labeled text after data enhancement;
a preprocessing module 54, configured to preprocess the data-enhanced labeled text and the first labeled text to obtain a processed labeled text;
the training module 55 is configured to perform iterative training by using the processed labeled text and the BERT model to obtain a named entity recognition model;
and the recognition module 56 is used for performing named entity recognition on the unlabeled legal document to be recognized by using the named entity recognition model to obtain a named entity recognition result.
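For illustration, converting the model's tag sequence into entity spans inside such a recognition module might look like the following sketch; the BIO tagging scheme is an assumption, since the patent does not name its labeling scheme:

```python
def decode_bio(tokens, tags):
    """Turn a BIO tag sequence produced by a named entity recognition
    model into (entity_text, entity_type) spans."""
    entities, buf, etype = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):             # a new entity begins
            if buf:
                entities.append(("".join(buf), etype))
            buf, etype = [tok], tag[2:]
        elif tag.startswith("I-") and buf and tag[2:] == etype:
            buf.append(tok)                  # continue the current entity
        else:                                # 'O' or inconsistent tag: flush
            if buf:
                entities.append(("".join(buf), etype))
            buf, etype = [], None
    if buf:                                  # flush a trailing entity
        entities.append(("".join(buf), etype))
    return entities
```

For example, tokens 张/三/在/北/京 with tags B-PER/I-PER/O/B-LOC/I-LOC decode to a person entity 张三 and a location entity 北京.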
In one embodiment, preferably, the data enhancement module is configured to:
performing data enhancement on the initial labeled text by using a random splicing method, a random exchange method, and/or a random erasing method.
In one embodiment, preferably, the random splicing method includes: randomly extracting a single sequence from each of two texts having a context relationship and splicing them into a new labeled text;
the random exchange method includes: randomly exchanging the named entities in two texts having a context relationship to obtain new labeled texts;
the random erasing method includes: erasing, with a preset probability, the characters other than named entities in each text, and taking the text content remaining after erasure as a new labeled sample.
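The three augmentation methods can be sketched roughly as follows. The (token list, label list) representation, the use of 'O' for non-entity positions, and the single-span swap are simplifying assumptions, not details from the patent:

```python
import random

def entity_spans(labels):
    """Start/end indices of contiguous non-'O' label spans."""
    spans, start = [], None
    for i, lab in enumerate(list(labels) + ["O"]):
        if lab != "O" and start is None:
            start = i
        elif lab == "O" and start is not None:
            spans.append((start, i))
            start = None
    return spans

def random_splice(a, b, rng):
    """Splice two context-related labeled texts, in random order,
    into one new labeled text."""
    (ta, la), (tb, lb) = rng.sample([a, b], 2)
    return ta + tb, la + lb

def random_swap(a, b, rng):
    """Swap one randomly chosen named-entity span between two
    context-related labeled texts."""
    (ta, la), (tb, lb) = a, b
    sa, sb = entity_spans(la), entity_spans(lb)
    if not sa or not sb:                     # nothing to swap
        return a, b
    i0, i1 = rng.choice(sa)
    j0, j1 = rng.choice(sb)
    na = (ta[:i0] + tb[j0:j1] + ta[i1:], la[:i0] + lb[j0:j1] + la[i1:])
    nb = (tb[:j0] + ta[i0:i1] + tb[j1:], lb[:j0] + la[i0:i1] + lb[j1:])
    return na, nb

def random_erase(tokens, labels, p, rng):
    """Erase non-entity characters with probability p; the remaining
    content is kept as a new labeled sample."""
    kept = [(t, l) for t, l in zip(tokens, labels)
            if l != "O" or rng.random() >= p]
    return [t for t, _ in kept], [l for _, l in kept]
```

All three keep labels aligned with tokens, so the augmented samples can be fed directly back into training.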
In one embodiment, preferably, the preprocessing includes performing traditional-to-simplified Chinese conversion, word segmentation, stop-word removal, and removal of uncommon punctuation marks on the data-enhanced labeled text and the first labeled text.
In one embodiment, preferably, the apparatus further comprises:
the second labeling module is used for labeling the named entities of the unlabeled text by using the named entity recognition model to obtain a second labeled text;
the first post-processing module is used for performing confidence calculation and manual verification on the first labeling text and the second labeling text;
and the first updating module is used for adding the labeled text whose confidence is greater than the preset value and which passes manual verification, as manual corpus and in a preset proportion, to the initial labeled text so as to update the initial labeled text.
In one embodiment, preferably, the apparatus further comprises:
the third labeling module is used for labeling the legal document to be identified without the label according to the named entity identification result to obtain a labeled legal document;
the second post-processing module is used for carrying out confidence calculation and manual verification on the marked legal documents;
and the second updating module is used for adding the legal documents whose confidence is greater than the preset value and which pass manual verification, as manual corpus, to the initial labeled text so as to update the initial labeled text.
In one embodiment, preferably, the apparatus further comprises:
the statistics module is used for re-labeling the manually corrected corpus with the resulting named entity recognition model after each round of training and learning on that corpus, and counting the number of corrected labels;
and the processing module is used for discarding the manually corrected corpus and rolling the parameters of the named entity recognition model back to the state before learning when the number of corrected labels shows a divergent trend.
According to a third aspect of embodiments of the present invention, there is provided a legal document named entity recognition apparatus, the apparatus comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to:
acquiring an initial marked text and an unmarked text of a legal document, wherein the named entity is marked in the initial marked text, and the named entity is not marked in the unmarked text;
carrying out named entity labeling on the label-free text through a preset rule base to obtain a first labeled text;
performing data enhancement on the initial marked text to obtain a marked text after data enhancement;
preprocessing the label text after the data enhancement and the first label text to obtain a processed label text;
performing iterative training by using the processed label text and the BERT model to obtain a named entity recognition model;
and performing named entity recognition on the unlabeled legal documents to be recognized by using the named entity recognition model to obtain a named entity recognition result.
According to a fourth aspect of embodiments of the present invention, there is provided a computer readable storage medium having stored thereon computer instructions which, when executed by a processor, implement the steps of the method according to any one of the embodiments of the second aspect.
It is further understood that the use of "a plurality" in the present invention means two or more, and other terms are intended to be analogous. "and/or" describes the association relationship of the associated objects, meaning that there may be three relationships, e.g., a and/or B, which may mean: a exists alone, A and B exist simultaneously, and B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. The singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It will be further understood that the terms "first," "second," and the like are used to describe various information and that such information should not be limited by these terms. These terms are only used to distinguish one type of information from another and do not denote a particular order or importance. Indeed, the terms "first," "second," and the like are fully interchangeable. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present invention.
It is further to be understood that while operations are depicted in the drawings in a particular order, this is not to be understood as requiring that such operations be performed in the particular order shown or in serial order, or that all illustrated operations be performed, to achieve desirable results. In certain environments, multitasking and parallel processing may be advantageous.
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.
It will be understood that the invention is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the invention is limited only by the appended claims.

Claims (10)

1. A legal document named entity recognition method, comprising:
acquiring an initial marked text and an unmarked text of a legal document, wherein the named entity is marked in the initial marked text, and the named entity is not marked in the unmarked text;
carrying out named entity labeling on the label-free text through a preset rule base to obtain a first labeled text;
performing data enhancement on the initial marked text to obtain a marked text after data enhancement;
preprocessing the label text after the data enhancement and the first label text to obtain a processed label text;
performing iterative training by using the processed label text and the BERT model to obtain a named entity recognition model;
and performing named entity recognition on the unlabeled legal documents to be recognized by using the named entity recognition model to obtain a named entity recognition result.
2. The method of claim 1, wherein data enhancing the initially annotated text comprises:
and performing data enhancement on the initial marked text by using a random splicing method, a random exchange method and/or a random erasing method.
3. The method of claim 1,
the random splicing method comprises the following steps: randomly extracting a single sequence from the two texts with the context relationship, and splicing the single sequence into a new text with the label;
the random exchange method comprises the steps of randomly exchanging named entities in two texts with context relation to obtain a new labeled text;
the random erasing method comprises the steps of erasing other characters except for named entities in each text according to preset probability, and taking the residual text content after erasing as a new labeled sample.
4. The method of claim 1,
the preprocessing comprises performing traditional-to-simplified Chinese conversion, word segmentation, stop-word removal, and removal of uncommon punctuation marks on the data-enhanced labeled text and the first labeled text.
5. The method of claim 1, further comprising:
carrying out named entity labeling on the label-free text by using the named entity recognition model to obtain a second labeled text;
performing confidence calculation and manual verification on the first labeling text and the second labeling text;
and adding the labeled text with the confidence coefficient larger than the preset value and passing the manual verification into the initial labeled text as a manual corpus according to a preset proportion so as to update the initial labeled text.
6. The method of claim 1, further comprising:
labeling the unlabeled legal document to be recognized according to the named entity recognition result to obtain a labeled legal document;
carrying out confidence calculation and manual verification on the marked legal documents;
and adding the legal documents with the confidence degrees larger than the preset value and passing the manual verification as manual corpora into the initial marked text so as to update the initial marked text.
7. The method of claim 1, further comprising:
after training and learning are carried out by using one artificial corpus each time, re-labeling the artificial corpus by using the obtained named entity recognition model, and counting the number of corrected labels;
and when the number of the correction labels presents a divergence trend, discarding the artificial linguistic data, and returning the parameters of the named entity recognition model to the state before learning.
8. An apparatus for legal document named entity recognition, the apparatus comprising:
the acquisition module is used for acquiring an initial labeled text and an unlabeled text of the legal document, wherein a named entity is labeled in the initial labeled text, and no named entity is labeled in the unlabeled text;
the first labeling module is used for labeling named entities of the label-free text through a preset rule base to obtain a first labeled text;
the data enhancement module is used for carrying out data enhancement on the initial labeled text to obtain a labeled text after the data enhancement;
the preprocessing module is used for preprocessing the label text after the data enhancement and the first label text to obtain a processed label text;
the training module is used for carrying out iterative training by utilizing the processed labeling text and the BERT model to obtain a named entity recognition model;
and the recognition module is used for performing named entity recognition on the unlabeled legal documents to be recognized by using the named entity recognition model to obtain the named entity recognition result.
9. An apparatus for legal document named entity recognition, the apparatus comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to:
acquiring an initial marked text and an unmarked text of a legal document, wherein the named entity is marked in the initial marked text, and the named entity is not marked in the unmarked text;
carrying out named entity labeling on the label-free text through a preset rule base to obtain a first labeled text;
performing data enhancement on the initial marked text to obtain a marked text after data enhancement;
preprocessing the label text after the data enhancement and the first label text to obtain a processed label text;
performing iterative training by using the processed labeling text and the BERT model to obtain a named entity recognition model;
and performing named entity recognition on the unlabeled legal documents to be recognized by using the named entity recognition model to obtain a named entity recognition result.
10. A computer-readable storage medium having stored thereon computer instructions, which, when executed by a processor, carry out the steps of the method according to any one of claims 1 to 7.
CN202211464487.XA 2022-11-22 2022-11-22 Legal document named entity identification method, device and storage medium Pending CN115859979A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211464487.XA CN115859979A (en) 2022-11-22 2022-11-22 Legal document named entity identification method, device and storage medium


Publications (1)

Publication Number Publication Date
CN115859979A 2023-03-28

Family

ID=85664781

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211464487.XA Pending CN115859979A (en) 2022-11-22 2022-11-22 Legal document named entity identification method, device and storage medium

Country Status (1)

Country Link
CN (1) CN115859979A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111738004A (en) * 2020-06-16 2020-10-02 中国科学院计算技术研究所 Training method of named entity recognition model and named entity recognition method
CN112818691A (en) * 2021-02-01 2021-05-18 北京金山数字娱乐科技有限公司 Named entity recognition model training method and device
CN114372465A (en) * 2021-09-29 2022-04-19 武汉工程大学 Legal named entity identification method based on Mixup and BQRNN
US20220129632A1 (en) * 2020-10-22 2022-04-28 Boe Technology Group Co., Ltd. Normalized processing method and apparatus of named entity, and electronic device



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20230328