CN116579344A - Case main body extraction method - Google Patents

Case main body extraction method

Info

Publication number
CN116579344A
CN116579344A (application CN202310853155.9A)
Authority
CN
China
Prior art keywords
key
entity
node
case
main body
Prior art date
Legal status
Granted
Application number
CN202310853155.9A
Other languages
Chinese (zh)
Other versions
CN116579344B (en)
Inventor
段春先
杨伊态
许继伟
赵舞玲
付卓
王敬佩
李颖
黄亚林
张兆文
陈胜鹏
Current Assignee
Geospace Information Technology Co ltd
Original Assignee
Geospace Information Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Geospace Information Technology Co ltd filed Critical Geospace Information Technology Co ltd
Priority to CN202310853155.9A priority Critical patent/CN116579344B/en
Publication of CN116579344A publication Critical patent/CN116579344A/en
Application granted granted Critical
Publication of CN116579344B publication Critical patent/CN116579344B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/216Parsing using statistical methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02ATECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention provides a case main body extraction method, which comprises the following steps: extracting candidate case main bodies from each case text in a case corpus based on a named entity recognition model and an information extraction model of PaddleNLP respectively, and acquiring the key fragment of each candidate case main body based on a trained key fragment extraction model; carrying out box division (binning) processing on all the key fragments to obtain a plurality of key fragment boxes; clustering all the key fragments in each key fragment box based on a directed graph merging algorithm to obtain key fragments of a plurality of entity categories; and determining the final case main body of each entity category based on the candidate case main bodies corresponding to each key fragment in each entity category. The method can automatically extract case main bodies from case text and classify and unify different descriptions pointing to the same real case main body; it can reduce the manual cost of case information extraction and improve the degree of intelligence and automation of case information extraction.

Description

Case main body extraction method
Technical Field
The invention relates to the field of computer artificial intelligence, in particular to a case main body extraction method.
Background
In the field of urban governance, automatically discovering hot events with artificial intelligence technology can help supervision departments handle related matters in time and reduce the negative social impact of such events. If an event that many citizens complain about can be handled in time by the relevant government departments, the event can be prevented from escalating and the situation from worsening. In real business, how to determine the event body is a very important part of case analysis.
Disclosure of Invention
The invention provides a case main body extraction method aiming at the technical problems in the prior art, which comprises the following steps:
training the key fragment extraction model based on a training sample set to obtain a trained key fragment extraction model, wherein the training sample set comprises a plurality of pieces of sample data, and each piece of sample data comprises an entity text and a key fragment;
extracting a first candidate case main body and a second candidate case main body in each case text in a case corpus based on a named entity recognition model of PaddleNLP and an information extraction model of PaddleNLP respectively, and merging the first candidate case main body and the second candidate case main body to form a candidate case main body set;
Acquiring key fragments in each candidate case main body in the candidate case main body set based on the trained key fragment extraction model, and correspondingly obtaining an entity key fragment tuple corpus by each candidate case main body and the key fragments;
carrying out box division processing on all key fragments in the entity key fragment tuple corpus to obtain a plurality of key fragment boxes;
clustering all key fragments in each key fragment box based on a directed graph merging algorithm to obtain key fragments of a plurality of entity categories;
and determining the final case main body of each entity category based on the candidate case main bodies corresponding to each key fragment in that entity category.
According to the case main body extraction method provided by the invention, the candidate case main bodies in each case text in the case corpus are extracted based on the named entity recognition model of PaddleNLP and the information extraction model of PaddleNLP respectively; extracting the case main bodies automatically with two different models improves the automation efficiency and accuracy of case main body extraction. The key fragments of each candidate case main body are acquired based on the trained key fragment extraction model; box division (binning) processing is carried out on all the key fragments to obtain a plurality of key fragment boxes; and all the key fragments in each key fragment box are clustered based on a directed graph merging algorithm to obtain key fragments of a plurality of entity categories, which improves the clustering efficiency. The method can automatically extract case main bodies from case text and classify and unify different descriptions pointing to the same real case main body; it can reduce the manual cost of case information extraction and improve the degree of intelligence and automation of case information extraction.
Drawings
Fig. 1 is a flowchart of a case main body extraction method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a key segment extraction model;
FIG. 3 is a schematic diagram of extraction of candidate case bodies using a PaddleNLP tool;
fig. 4 is a schematic diagram of a case main body extraction flow provided in another embodiment of the present invention;
FIG. 5 is a schematic diagram of critical segment binning;
FIG. 6 is a schematic diagram of a key segment directed graph merging algorithm;
fig. 7 is a diagram of a unified case body text description.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention. In addition, the technical features of each embodiment or the single embodiment provided by the invention can be combined with each other at will to form a feasible technical scheme, and the combination is not limited by the sequence of steps and/or the structural composition mode, but is necessarily based on the fact that a person of ordinary skill in the art can realize the combination, and when the technical scheme is contradictory or can not realize, the combination of the technical scheme is not considered to exist and is not within the protection scope of the invention claimed.
Fig. 1 is a flowchart of a case main body extraction method provided by the present invention, as shown in fig. 1, the method includes:
s1, training a key segment extraction model based on a training sample set to obtain a trained key segment extraction model, wherein the training sample set comprises a plurality of pieces of sample data, and each piece of sample data comprises a solid text and a key segment.
It can be understood that, first, training a preset key segment extraction model, referring to fig. 2, the key segment extraction model mainly includes a BERT model, a BiLSTM layer and a CRF layer, and a specific process of training the key segment extraction model includes:
(1) A training sample set is obtained, wherein the training sample set comprises a plurality of pieces of sample data, and the format of each piece of sample data is [ entity text, key fragment ].
For example, sample data one is [Haha City Happy Star Dixin Co., Happy Star Dixin]; sample data two is [Haha Open Investment Co., Ltd., Open Investment].
(2) For each sample data, the BERT model is used to convert the Entity text into character encoding vectors entity_tensor, as follows:
each word of the sample data is converted into a corresponding word code by the BERT tokenizer, and special character codes are added at the head and the tail of the code to form the word code vector of the entity text. The BERT model in the invention is the Chinese-BERT-wwm-ext (Bidirectional Encoder Representations from Transformers) pre-trained model.
For example, the entity text "Haha City Happy Star Dixin Co." is converted into the word codes [101, 1506, 1506, 2356, 2571, 727, 3215, 4413, 1765, 928, 1062, 1385, 102], where 101 is the code of the special character [CLS] and 102 is the code of the special character [SEP]. Each entity word-code vector starts with the code 101 and ends with the code 102.
Inputting the word coding vector of the entity text into the BERT model to obtain the entity embedding vector [E_1, E_2, E_3, …, E_N], where N is the number of words.
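The tokenization and embedding steps above can be reproduced, as an illustrative sketch only, with the publicly released Chinese-BERT-wwm-ext weights loaded through the Hugging Face transformers library; the checkpoint name, the example entity text and the variable names below are assumptions, not part of the patent.

```python
import torch
from transformers import BertTokenizer, BertModel

# Assumed Hugging Face checkpoint for the Chinese-BERT-wwm-ext pre-trained model.
tokenizer = BertTokenizer.from_pretrained("hfl/chinese-bert-wwm-ext")
bert = BertModel.from_pretrained("hfl/chinese-bert-wwm-ext")

entity_text = "哈哈市快乐星地信公司"   # hypothetical entity text from the example above

# Character codes with [CLS] (101) prepended and [SEP] (102) appended.
encoding = tokenizer(entity_text, return_tensors="pt")
print(encoding["input_ids"])          # tensor([[101, ..., 102]])

# Entity embedding vectors [E_1, ..., E_N]: one hidden vector per token.
with torch.no_grad():
    entity_tensor = bert(**encoding).last_hidden_state   # shape (1, N, 768)
```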
(3) Inputting the entity embedded vector into a BiLSTM network to obtain a corpus fragment emission probability matrix emit_m, wherein the method comprises the following steps of:
the entity embedding vector is input into a BiLSTM network to obtain the hidden state vectors of the entity, and the hidden state vectors are then input into a fully connected layer to obtain the emission probability matrix Emit_m. The emission probability matrix Emit_m is a tag_num x span_len dimensional matrix, where tag_num is the number of tags, which in the present invention is 5, i.e., the tags [O, B, I, E, S]; span_len is the number of word vectors in the entity, which in the invention is the number of entity text characters + 2, i.e., the sum of the number of entity text characters and the two special characters.
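A minimal sketch of the BiLSTM layer and the fully connected layer that turn the entity embedding vectors into the emission probability matrix Emit_m; the hidden size and the class name are assumptions, and the matrix is produced here in span_len x tag_num layout (the transpose of the tag_num x span_len layout described above).

```python
import torch
import torch.nn as nn

TAG_NUM = 5        # the five tags [O, B, I, E, S]
BERT_DIM = 768     # hidden size of Chinese-BERT-wwm-ext
LSTM_HIDDEN = 256  # assumed BiLSTM hidden size

class EmissionHead(nn.Module):
    """BiLSTM over the BERT embeddings followed by a linear layer that
    produces one emission score per tag for every token of the entity."""
    def __init__(self):
        super().__init__()
        self.bilstm = nn.LSTM(BERT_DIM, LSTM_HIDDEN,
                              batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * LSTM_HIDDEN, TAG_NUM)

    def forward(self, entity_tensor):            # (1, span_len, BERT_DIM)
        hidden, _ = self.bilstm(entity_tensor)   # (1, span_len, 2 * LSTM_HIDDEN)
        return self.fc(hidden)                   # (1, span_len, TAG_NUM)

# e.g. with a dummy embedding of a 12-token span (or the output of the previous sketch):
emit_m = EmissionHead()(torch.randn(1, 12, BERT_DIM))
```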
Inputting the emission probability matrix emit_m into a CRF network, and calculating to obtain the correct mark sequence score and the total score of all possible mark sequences based on the emission probability matrix and the transition matrix, wherein the steps are as follows:
And for each piece of sample data, the correct tag sequence corresponding to the sample is obtained from the entity text and the key fragment in the sample data. Specifically, for each character in the entity text: if the character does not belong to the key fragment, it is marked as O; if the character belongs to the key fragment and the key fragment has more than one character, the first character is marked as B, the last character is marked as E, and the other characters are marked as I; if the character belongs to the key fragment and the key fragment has exactly one character, it is marked as S. The marks of the characters in the entity text are arranged in order into a sequence, and a mark O is added at the head of the sequence and another at the tail, i.e., the marks corresponding to the two special tokens [CLS] and [SEP]. This composition gives the correct tag sequence corresponding to the sample.
For example, for sample data one [Haha City Happy Star Dixin Co., Happy Star Dixin], the resulting tag sequence is [O, O, O, B, I, I, I, E, O, O], and the sequence after adding the head and tail marks is [O, O, O, O, B, I, I, I, E, O, O, O].
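A sketch of how a sample [entity text, key fragment] can be turned into the O/B/I/E/S tag sequence described above, including the two extra O tags for [CLS] and [SEP]; the function name and the example strings are assumptions.

```python
def bioes_tags(entity_text: str, key_fragment: str) -> list[str]:
    """Mark every character of entity_text with O/B/I/E/S according to
    whether and where it falls inside key_fragment."""
    tags = ["O"] * len(entity_text)
    start = entity_text.find(key_fragment)
    if start != -1:
        if len(key_fragment) == 1:
            tags[start] = "S"
        else:
            tags[start] = "B"
            tags[start + len(key_fragment) - 1] = "E"
            for i in range(start + 1, start + len(key_fragment) - 1):
                tags[i] = "I"
    return ["O"] + tags + ["O"]   # extra O tags for [CLS] and [SEP]

# bioes_tags("哈哈市快乐星地信公司", "快乐星地信")   (hypothetical strings)
# -> ['O', 'O', 'O', 'O', 'B', 'I', 'I', 'I', 'E', 'O', 'O', 'O']
```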
The emission probability matrix Emit_m is input into the CRF network, which uses a score formula to obtain the correct tag sequence score and the total score of all possible tag sequences based on the emission probability matrix Emit_m and the transition matrix Trans_m. Here, the correct tag sequence is the sequence identical to the sample's tag sequence described above, and all possible tag sequences are the tag_num^span_len sequences that the model can generate, where tag_num is the number of tags in the tag set and span_len is the number of word vectors in the entity. The transition matrix Trans_m in the CRF is initially assigned random values; the values in Trans_m at the s-th pass over the training data are the values adjusted after the (s-1)-th pass. The score of each tag sequence is calculated as follows:
Score(x, y) = Σ_{i=1..p} Emit_m(i, y_i) + Σ_{i=2..p} Trans_m(y_{i-1}, y_i)
where Score(x, y) denotes the score of marking the input sample x with the tag sequence y, Emit_m(i, y_i) is the emission probability value of the i-th tag in the predicted tag sequence y, p is the length of the predicted tag sequence y, and Trans_m(y_{i-1}, y_i) is the transition probability value of the (i-1)-th tag in the predicted sequence y transitioning to the i-th tag.
The correct tag sequence score and the total score of all possible tag sequences are calculated according to the above formula, and the loss score is then calculated from the two as follows:
Loss(x, y*) = log Σ_{y'} e^{Score(x, y')} − Score(x, y*)
where Score(x, y*) denotes the score of the correct tag sequence y* for the input sample x, Score(x, y') denotes the score of any possible tag sequence y', the summation accumulates the exponential (base e) of the sequence score over all possible tag sequences, and Loss(x, y*) is the loss score of the input sample x whose correct tag sequence is y*.
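The score and loss above can be written as a short sketch; the helper names are assumptions, and the total score over all sequences is computed here by brute-force enumeration of the tag_num^span_len sequences for clarity (a real CRF layer uses the forward algorithm instead).

```python
import itertools
import torch

TAGS = ["O", "B", "I", "E", "S"]

def sequence_score(emit_m, trans_m, tag_ids):
    """Score(x, y) = sum_i Emit_m(i, y_i) + sum_{i>=2} Trans_m(y_{i-1}, y_i)."""
    score = emit_m[0, tag_ids[0]]
    for i in range(1, len(tag_ids)):
        score = score + emit_m[i, tag_ids[i]] + trans_m[tag_ids[i - 1], tag_ids[i]]
    return score

def crf_loss(emit_m, trans_m, gold_ids):
    """Loss(x, y*) = log(sum over all y' of e^Score(x, y')) - Score(x, y*)."""
    span_len = emit_m.size(0)
    all_scores = torch.stack([
        sequence_score(emit_m, trans_m, list(seq))
        for seq in itertools.product(range(len(TAGS)), repeat=span_len)
    ])
    return torch.logsumexp(all_scores, dim=0) - sequence_score(emit_m, trans_m, gold_ids)

# Example with random parameters for a 4-token span whose correct tags are O B E O:
emit_m = torch.randn(4, len(TAGS))            # emission scores, span_len x tag_num
trans_m = torch.randn(len(TAGS), len(TAGS))   # transition matrix, random at first
loss = crf_loss(emit_m, trans_m, [0, 1, 3, 0])
```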
In the process of training the key fragment extraction model on the training sample set, the model traverses the training samples several times, and after each pass over the training samples the parameters of the key fragment extraction model are modified and updated by gradient descent; the parameters of the model include the network structure parameters of the model and the transition matrix. The accuracy of the model is then tested with the verification samples, and the training stage of the key fragment extraction model selects the parameter version with the highest verification accuracy, i.e., the lowest loss score, as the final trained key fragment extraction model.
S2, extracting a first candidate case main body and a second candidate case main body in each case text in the case corpus based on a named entity recognition model of the PaddleNLP and an information extraction model of the PaddleNLP respectively, and merging the first candidate case main body and the second candidate case main body to form a candidate case main body set.
As an embodiment, the extracting, by using the named entity recognition model based on the PaddleNLP and the information extraction model based on the PaddleNLP, a first candidate case body and a second candidate case body in each case text in the case corpus includes: acquiring a case corpus, wherein the case corpus comprises a plurality of case texts, each case text describes a case, and one case comprises one or more case main bodies; inputting each case text in the case corpus into a named entity recognition model based on PaddleNLP, obtaining a plurality of first output results, wherein the structure of each first output result is [ entity text fragment, mark ], and obtaining a first output result marked as a first appointed mark value as a first candidate case main body; inputting each case text in the case corpus into an information extraction model based on PaddleNLP, obtaining a plurality of second output results, wherein the structure of each second output result is [ entity text fragment, mark ], and obtaining a second output result marked as a second appointed mark value as a second candidate case main body.
It can be understood that fig. 3 is a schematic flow chart of extracting candidate case bodies according to a case text, and the specific extraction process is that a case corpus is input, where the case corpus is composed of a plurality of case texts, each case text describes a case, and the case may include one or more case bodies or no case bodies.
Such as: case 1, "I want to complain about the Happy Star Company on Happy Street, which has withheld my 5 months' wages"; the extracted case bodies may include "Happy Star", "Happy Street" and "Happy Star Company on Happy Street".
Case 2, "My 5 months' wages have been withheld, please have the relevant department deal with it", may have no case body.
Processing each case text by using a PaddleNLP natural language processing tool, extracting a case main body corresponding to the case text, and extracting the case main body corresponding to each case text based on two different models respectively, wherein the specific steps are as follows:
for each case text, a named entity recognition model of PaddleNLP is used, with the model parameter mode value set to fast. The invention only screens the output results marked as ORG and LOC as candidate subjects. Where labeled ORG denotes the organization name and LOC denotes the place name.
For example, for the text "Sanya is a beautiful city", 24 kinds of part-of-speech and proper-noun category labels are built into the named entity recognition model, and one possible extraction result is [('Sanya', 'LOC'), ('is', 'v'), ('a', 'm'), ('beautiful', 'a'), ('de', 'u'), ('city', 'n')]; the invention only screens ('Sanya', 'LOC') as the extraction result.
For each case text, entity information is extracted using the information extraction model of PaddleNLP, in which the related information keyword of the model is set to "company". The invention only screens, as candidate bodies, the output results whose mark contains the text "organization class" but does not contain the text "concept".
Such as: for case 1, "I complain that the Happy Star Co., Ltd. on Happy Street has withheld my 5 months' wages", 93 kinds of part-of-speech and proper-noun category labels are built into the information extraction model, and one possible extraction result is [('Happy Star Co., Ltd.', 'organization class'), ('Co., Ltd.', 'organization class_concept'), ('company', 'organization class_concept')]; the invention only screens ('Happy Star Co., Ltd.', 'organization class') as the extraction result.
And taking the candidate subjects extracted by the two models as candidate case subjects, wherein the entity text fragments of all the candidate case subjects jointly form a candidate entity corpus.
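A sketch of the two PaddleNLP extractions, assuming the Taskflow interface is used; the task names, the schema keyword, the label filtering and the exact output formats shown here are assumptions based on the description above, not a definitive use of the library.

```python
from paddlenlp import Taskflow

# Candidate source 1: named entity recognition in fast mode; keep ORG and LOC segments.
ner = Taskflow("ner", mode="fast")

# Candidate source 2: information extraction with an organization-related schema (assumed).
ie = Taskflow("information_extraction", schema=["公司"])

def candidate_bodies(case_text: str) -> set[str]:
    # First candidate case bodies: segments tagged ORG or LOC.
    first = {segment for segment, tag in ner(case_text) if tag in ("ORG", "LOC")}
    # Second candidate case bodies: spans returned for the schema
    # (the patent additionally drops "concept"-type labels).
    second = {span["text"]
              for result in ie(case_text)
              for spans in result.values()
              for span in spans}
    return first | second   # merged candidate case body set
```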
And S3, acquiring key fragments in each candidate case main body in the candidate case main body set based on the trained key fragment extraction model, and correspondingly obtaining an entity key fragment tuple corpus by each candidate case main body and the key fragments.
It can be appreciated that, referring to fig. 4, for each candidate case body extracted in S2, the key fragment in the candidate case body is extracted using the trained key fragment extraction model. Specifically, for each candidate case main body, its tag sequence is obtained; from inputting the candidate case main body to obtaining the corresponding predicted tag sequence, the extraction stage is consistent with the training stage. The first tag and the last tag of the tag sequence are removed, and the characters at all non-O-tagged positions are then taken out as the key fragment corresponding to the candidate case body.
For example, the entity text "Haha City Happy Star Dixin Co." is input into the model to obtain the tag sequence [O, O, O, O, B, I, I, I, E, O, O, O]; removing the head and tail marks gives [O, O, O, B, I, I, I, E, O, O], and taking out the characters at the non-O positions gives the key fragment "Happy Star Dixin".
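A sketch of this decoding step: drop the two special-token tags and keep the characters at non-O positions (the helper name and example strings are assumptions).

```python
def decode_key_fragment(entity_text: str, tag_sequence: list[str]) -> str:
    """Remove the head/tail tags ([CLS]/[SEP]) and keep the characters
    whose remaining tag is not O."""
    inner = tag_sequence[1:-1]
    return "".join(ch for ch, tag in zip(entity_text, inner) if tag != "O")

# decode_key_fragment("哈哈市快乐星地信公司",
#                     ["O", "O", "O", "O", "B", "I", "I", "I", "E", "O", "O", "O"])
# -> "快乐星地信"   (hypothetical strings)
```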
And (3) inputting a candidate case main body into the trained key fragment extraction model, if the obtained key fragment is inconsistent with the key fragment of the sample data, indicating miss, and if the obtained key fragment is consistent with the key fragment of the sample data, indicating hit, wherein the accuracy of the model is obtained by dividing the total number of hits by the total number of verification samples.
Deleting the candidate case text if the key fragment inferred by the key fragment extraction model is empty; if the key fragment inferred by the key fragment extraction model is not empty, combining the obtained key fragment with the candidate case main body to obtain entity key fragment tuples, wherein each entity key fragment tuple consists of [ candidate case text, key fragment ].
For example, the candidate case body text "Haha City Happy Star Dixin Co.", with the key fragment "Happy Star Dixin", forms the entity key fragment tuple [Haha City Happy Star Dixin Co., Happy Star Dixin]; if the key fragment extracted by the model for an entity text such as "Haha City" is empty (i.e., the predicted sequence contains only O tags), the entity text "Haha City" is deleted.
All the obtained key fragments form the key fragment corpus Key_Corpus, and de-duplication is performed on Key_Corpus. For example, the key fragments of the entity text "Haha City Happy Star Dixin Co." and the entity text "Haha Happy Star Dixin Co." may both be "Happy Star Dixin", so the two identical key fragments are de-duplicated. All entity key fragment tuples together constitute the entity key fragment tuple corpus.
And for all Key fragments in the entity Key fragment tuple Corpus, removing the Key fragments containing the stop words in Key_Corpus by using the stop word dictionary.
Since some organizations are mediation or processing organizations that appear across different case fields and in a large number of cases, such organizations are removed as stop words, which reduces the noise of case main body extraction.
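A sketch of the de-duplication and stop-word filtering over the entity key fragment tuples; the stop-word examples are placeholders, not the patent's dictionary.

```python
STOP_WORDS = {"信访局", "居委会"}   # hypothetical mediation/processing organizations

def clean_tuples(tuples: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Drop duplicate (candidate case body, key fragment) tuples and tuples
    whose key fragment contains a stop word."""
    seen, cleaned = set(), []
    for candidate, key in tuples:
        if (candidate, key) in seen or any(sw in key for sw in STOP_WORDS):
            continue
        seen.add((candidate, key))
        cleaned.append((candidate, key))
    return cleaned
```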
S4, carrying out box division processing on all the key fragments in the entity key fragment tuple corpus to obtain a plurality of key fragment boxes.
As an embodiment, the step of performing a binning process on all the key segments in the entity key segment tuple corpus to obtain a plurality of key segment bins includes: creating an empty second-order index table, wherein a root node of the second-order index table is a root; each word in each key segment in the entity key segment tuple corpus is arranged and combined to generate a plurality of bytes consisting of 2 words; and generating a second-order index table based on the generated bytes, wherein the second-order index table comprises a first-order index and a second-order index, and a plurality of key fragments under each second-order index in the second-order index table form a key fragment box.
It is understood that the binning process is performed based on the second order index table for all the key segments extracted in S3. Specifically, the process of carrying out the box division processing on the key fragments comprises the following steps: an empty second-order index table is generated, and the root node of the index table is root. Each Key segment in the segment Corpus Key_Corpus is subjected to the following operation:
and arranging and combining the words in the key fragments to generate a plurality of bytes consisting of 2 words, wherein the sequence of the two words in the bytes is consistent with the sequence in the key fragments. When the key fragment is less than two words, the key fragment is discarded directly.
For example, the key fragment "Qianke Development" (the four characters Qian, Ke, Fa, Zhan) generates the bytes: [QianKe, QianFa, QianZhan, KeFa, KeZhan, FaZhan]. The key fragment "Qianke" generates the bytes: [QianKe]. A key fragment of fewer than two characters, such as "Ke", is discarded directly.
And generating relevant information in a second-order index table according to the generated bytes, wherein the generation rule is as follows:
searching a first word of a byte in a first-order index in a second-order index table, if the first word is not found, generating a first-order index taking the first word as a key under a root node, taking a second word as a second index of the key, and hanging a key fragment under the second-order index. If the second word is found, the second word of the byte is searched in the second order index table, if the second word is not found, the second order index using the second word as a key is generated under the first order index, and the key fragment is hung under the second order index. If found, the key fragment is directly hung under the second order index.
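A sketch of the second-order index table as a nested dictionary keyed by the first and the second character of each two-character combination, including the max_num cap described further below; all names are assumptions.

```python
from collections import defaultdict
from itertools import combinations

def build_second_order_index(key_fragments: list[str], max_num: int = 100):
    """index[first_char][second_char] is the key fragment bin for that character pair."""
    index = defaultdict(lambda: defaultdict(list))
    for fragment in key_fragments:
        if len(fragment) < 2:
            continue   # fragments shorter than two words are discarded
        # ordered two-character combinations ("bytes") of the fragment
        for first, second in combinations(fragment, 2):
            bin_ = index[first][second]
            if fragment not in bin_:
                bin_.append(fragment)
    # bins that exceed max_num are emptied to bound the matching cost
    for first in index:
        for second in index[first]:
            if len(index[first][second]) > max_num:
                index[first][second] = []
    return index
```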
The flow of the critical segment binning process of the present invention is illustrated by specific example, as in fig. 5, comprising the steps of:
firstly, the key fragment "KeFa" is input, generating the byte list [KeFa]. For the byte KeFa, "Ke" is looked up in the first-order index of the second-order index table and is not found; a first-order index "Ke" is generated, a second-order index "Fa" is generated under it, and the key fragment is hung under that second-order index.
Next, the key fragment "Qianke Development" is input, generating the byte list [QianKe, QianFa, QianZhan, KeFa, KeZhan, FaZhan]. For the byte QianKe, "Qian" is looked up in the first-order index of the second-order index table and is not found; a first-order index "Qian" is generated, a second-order index "Ke" is generated below it, and the key fragment is hung under the second-order index "Ke".
For the byte QianFa, "Qian" is looked up in the first-order index and found; the second-order index "Fa" is looked up under it and not found, so a second-order index "Fa" is generated under the first-order index "Qian" and the key fragment is hung under the second-order index "Fa".
For the byte QianZhan, "Qian" is looked up in the first-order index and found; the second-order index "Zhan" is not found, so it is generated under the first-order index "Qian" and the key fragment is hung under the second-order index "Zhan".
For the byte KeFa, "Ke" is looked up in the first-order index and found; the second-order index "Fa" is looked up under it and found, so the key fragment is hung under that second-order index "Fa".
For the byte KeZhan, "Ke" is looked up in the first-order index and found; the second-order index "Zhan" is not found, so it is generated under the first-order index "Ke" and the key fragment is hung under the second-order index "Zhan".
For the byte FaZhan, "Fa" is looked up in the first-order index and not found; a first-order index "Fa" is generated, a second-order index "Zhan" is generated under it, and the key fragment is hung under the second-order index "Zhan".
Finally, the key fragment "Qianke" is input, generating the byte list [QianKe]. For the byte QianKe, "Qian" is looked up in the first-order index of the second-order index table and found, and the second-order index "Ke" is found under it, so the key fragment is hung under that second-order index "Ke".
The binning process then ends, and the key fragments under each second-order index in the second-order index table form one key fragment bin. For example, in the second-order index table of fig. 5, the key fragment bin under the second-order index "Fa" of the first-order index "Ke" contains ["KeFa", "Qianke Development Co., Ltd.", "KeFa Hall", "Science and Technology Development Hall"]; and the key fragment bin under the first-order index "limited" and the second-order index "public" contains ["Qianke Development Co., Ltd."].
It should be noted that, for the key segment bins with the number of key segments exceeding max_num, all the key segments are directly discarded, and the key segment bins are set to be empty, where max_num can be set autonomously.
For example, in fig. 5, assuming that max_num = 4, the key fragment bin under the second-order index "Fa" of the first-order index "Ke" in the second-order index table is directly emptied. The purpose of this is to increase the operational efficiency of the algorithm, because the efficiency of the binned directed graph merging algorithm depends critically on the maximum bin size.
Assume that the total number of key fragments is 100,000.
When max_num = 10,000, the maximum number of matches is: 10000 x 10000 / 2 x 10 = 5 x 10^8.
When max_num = 1,000, the maximum number of matches is: 1000 x 1000 / 2 x 100 = 5 x 10^7.
When max_num = 100, the maximum number of matches is: 100 x 100 / 2 x 1000 = 5 x 10^6.
When a bin holds too many key fragments, its index is too common; for example, under the index with first-order index "Wu" and second-order index "Han", key fragments such as "Wuhan A-B University", "Wuhan C-three Court" and "Wuhan Haha Tiandi" would all fall, which adds many meaningless comparisons and reduces the matching efficiency.
S5, clustering all key fragments in each key fragment box based on a directed graph merging algorithm to obtain key fragments of a plurality of entity categories.
It will be appreciated that for each key segment bin, clustering is performed by entity class. As an embodiment, the clustering, based on the directed graph merging algorithm, of all the key segments in each key segment box to obtain key segments of multiple entity categories includes: creating an empty matched dictionary and an empty directed graph; generating directed graph information based on each key segment bin in the second order index table; based on the generated directed graph information, clustering all the key fragments in each key fragment box according to different entity categories to obtain key fragments of a plurality of entity categories.
Specifically, clustering the key fragment boxes by using a directed graph merging algorithm comprises the following steps:
an empty matched dictionary and an empty directed graph key_map are generated.
Generating directed graph information for each key segment bin in the second order index table as follows:
each key segment is used as a Node and endowed with entity category attribute, and the initial entity category of each Node is the key segment of the Node; traversing all nodes, comparing each Node with other nodes one and only once, if Node i Comprises Node j All words of Node i And Node j Connect one edge, and Node i Degree of entry +1, node j The output of (2) is +1; wherein Node i For Node j Arc head, node j For Node i Arc tails, i.e. Node j Pointing to Node i I, j are different node numbers.
For example, assume that the corpus of key fragments is [KeFa, Qianke Development, Qianke Company, Qianke Development Co., Ltd., KeFa Hall, Science and Technology Development Hall, Qianke], where node 1 represents KeFa, node 2 represents Qianke Development, and the other nodes are as shown in FIG. 6.
Taking node 1 as an example, nodes 4,2,5, and 6 all contain all words of node 1, so that the outbound degree of node 1 is 4 and the inbound degree of node 4,2,5,6 is 1. Taking node 7 as an example, node 2,4,3 contains all words of node 7, so node 7 has an out-degree of 3. The resulting directed graph is shown in fig. 6.
When each node is compared with other nodes, a key a and a key b are generated according to the two compared nodes, wherein the structural form of the key a is [ the key segment of the comparison node 1, the key segment of the comparison node 2 ], and the structural form of the key b is [ the key segment of the comparison node 2, the key segment of the comparison node 1 ]. Searching for a key a and a key b from the matched dictionary, and if at least one is found, not performing node comparison; if neither key is found, a node comparison is made and key value pairs are added to the matched dictionary.
For example, assume that matching of some other key fragment bin in the second-order index table has already been completed, and the information in the matched dictionary at this time is {(Qianke Development, Qianke Development Co., Ltd.): 1, (Qianke Development, Science and Technology Development Hall): 1}.
When matching of the key fragment bin under the first-order index "Ke" in the second-order index table starts, the key fragments [Qianke Development, Qianke Development Co., Ltd., Science and Technology Development Hall, KeFa Hall] would need to be compared one to one; however, because the matched dictionary already records part of the comparison information, in practice not all key fragments need to be compared one by one. For example, "Qianke Development" has already been compared with "Science and Technology Development Hall".
Based on the generated directed graph, clustering all Key fragments, and performing the following operations on the directed graph Key_Map:
traversing all nodes, and setting the entity class value of every node with an out-degree of 0 to its own key fragment. Then, for each node whose out-degree is 1, assigning it the entity class value of its arc head node and decreasing its out-degree by 1; for example, if node_i has an out-degree of 1 and points to node_j, i.e., node_j is the arc head of node_i, and the entity class of node_j at this time is node_k, then node_i also assigns its own entity class as node_k, and its out-degree decreases by 1. Then traversing all nodes whose out-degree is greater than 0, and checking each arc head node of such a node: if the entity class value of an arc head node is itself another arc head of the node, deleting that arc head and decreasing the node's out-degree by 1. Repeating the above steps until a complete traversal of the directed graph produces no change and then stopping; at that point each key fragment corresponds to an entity class value.
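A sketch of the directed graph merging, with the graph stored as a mapping from each fragment to the set of its arc heads; the containment test (every character of one fragment appearing in the other) and the iteration order are assumptions inferred from the FIG. 6 example, so this is an illustrative fixpoint loop rather than the patent's exact procedure.

```python
def merge_key_fragments(fragments: list[str]) -> dict[str, str]:
    """Cluster the key fragments of one bin; return fragment -> entity class value."""
    # Edge rule: if frag_i contains all the characters of frag_j, then frag_j
    # points to frag_i, i.e. frag_i is an arc head of frag_j.
    arc_heads = {f: set() for f in fragments}
    for fi in fragments:
        for fj in fragments:
            if fi != fj and all(ch in fi for ch in fj):
                arc_heads[fj].add(fi)

    entity_class = {f: f for f in fragments}   # initial class is the fragment itself
    changed = True
    while changed:
        changed = False
        for f in fragments:
            heads = arc_heads[f]
            # prune an arc head whose class is itself another arc head of f
            for h in list(heads):
                if entity_class[h] != h and entity_class[h] in heads:
                    heads.discard(h)
                    changed = True
            # out-degree 1: inherit the class of the single remaining arc head
            if len(heads) == 1:
                head = next(iter(heads))
                if entity_class[f] != entity_class[head]:
                    entity_class[f] = entity_class[head]
                    changed = True
    return entity_class

# With the seven fragments of FIG. 6 (KeFa, Qianke Development, Qianke Company,
# Qianke Development Co., Ltd., KeFa Hall, Science and Technology Development Hall,
# Qianke), this maps nodes 2, 3, 4 and 7 to "Qianke Development Co., Ltd.",
# nodes 5 and 6 to "Science and Technology Development Hall", and node 1 to "KeFa".
```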
The binning and index structure can improve the efficiency of case main body matching, which is specifically expressed as follows: for key lookups in the matched dictionary and index lookups in the second-order index table, the time complexity is O(1), i.e., finding the corresponding key in the dictionary needs only 1 operation, because the dictionary keys and the index structure of the index table are hashed; the second-order index table needs one more operation to look up the first-order index, so looking up a particular key fragment bin needs only 2 operations.
String matching, in contrast, often requires many operations; for example, comparing "Qianke" with "Science and Technology Development" requires at least 5 operations, and comparing "Qianke" with "Qianke Development Co., Ltd." requires at least 8 operations. String comparisons take many times more operations than index lookups, and the longer the strings, the more comparisons, whereas an index lookup is a fixed 1 or 2 operations.
Without the binned matching, the 7 entities in fig. 5 would need to be compared one by one, requiring 7*6/2 = 21 comparisons. With the binned matching, the key fragments under the index "KeFa" in fig. 5 need 5*4/2 = 10 matches, the key fragments under the index "KeZhan" need 0 matches, those under the index "KeGong" need 1 match, those under the index "KeSi" need 0 matches, …, for a total of 16 matches, which is 21 − 16 = 5 fewer comparisons than without the binned matching.
Assume that the total number of entities is 100,000. Without binning, 10^5 x 10^5 / 2 = 5 x 10^9 string comparisons would generally be needed. With binned matching, assuming the maximum capacity of each bin in the second-order index table is 1000 (bins of more than a thousand are dropped directly), the number of matches per bin is at most 1000 x 1000 / 2 = 5 x 10^5, and there are at most 100000 / 1000 = 100 bins, so the total number of matches is at most 5 x 10^5 x 100 = 5 x 10^7. Compared with the case without binning, the efficiency is improved by 100 times.
As shown in fig. 6, the step of clustering all the key segments in each key segment bin includes:
in the first step, the 2 nodes with an out-degree of 0, node 4 and node 6, are found, and their entity categories are assigned as themselves: node 4 is assigned as "Qianke Development Co., Ltd." and node 6 is assigned as "Science and Technology Development Hall".
Secondly, the nodes with an out-degree of 1 are found, wherein:
the out-degree of node 3 is 1, so the entity class of node 3 is assigned as "Qianke Development Co., Ltd." and its out-degree decreases by 1; the out-degree of node 2 is 1, so the entity class of node 2 is assigned as "Qianke Development Co., Ltd." and its out-degree decreases by 1; and the out-degree of node 5 is 1, so the entity class of node 5 is assigned as "Science and Technology Development Hall" and its out-degree decreases by 1.
Thirdly, node 7 has three arc head nodes, namely nodes 2, 3 and 4. Because the entity class of node 2 is node 4's key fragment and node 4 also belongs to the arc heads of node 7, the arc head node 2 is deleted and the out-degree of node 7 decreases by 1; node 3 is similar to node 2, so the arc head node 3 is deleted and the out-degree decreases by 1 again. At this time node 7 has only 1 arc head node left and its out-degree is 1.
Node 1 has four arc head nodes; handled similarly to node 7, arc head node 2 and arc head node 5 can be deleted, and its out-degree changes from 4 to 2.
At this time, as shown in stage 2 in fig. 6.
And fourthly, since the out-degree of node 7 is now 1, the entity category of node 7 is assigned as "Qianke Development Co., Ltd.".
And fifthly, no node changes.
And sixthly, no node changes.
Seventh, no node changes.
At this point the directed graph has not changed during the fifth, sixth and seventh steps, as shown in stage 3 of fig. 6, so the algorithm stops.
Finally, the entity categories of nodes 2, 3, 4 and 7 are all "Qianke Development Co., Ltd.", the entity categories of nodes 5 and 6 are both "Science and Technology Development Hall", and the entity category of node 1 is "KeFa".
For each entity key fragment tuple in the entity key fragment tuple corpus, searching a node corresponding to the key fragment in the directed graph, and replacing the key fragment with an entity class value of the corresponding node.
For example, for the entity key fragment tuple [Qianke Company, Qianke], the entity category value of the node where the key fragment "Qianke" is located is "Qianke Development Co., Ltd.", so the key fragment is changed into "Qianke Development Co., Ltd.". The entity key fragment tuple after the modification is: [Qianke Company, Qianke Development Co., Ltd.].
Traversing the entity key fragment tuple corpus, gathering the candidate case texts of entity key fragment tuples that share the same key fragment into one class, and marking the class with that key fragment; a candidate case text whose key fragment is unique forms a class by itself, also marked with its key fragment.
And outputting a plurality of finally clustered entity classes. For example, the entity key fragment tuples [Haha City Qianke Company, Qianke Development Co., Ltd.], [Haha City Haha Korea, Korea] and [Haha City Qianke Development Co., Ltd., Qianke Development Co., Ltd.] form, after this operation, one class [Haha City Qianke Company, Haha City Qianke Development Co., Ltd.] and another class [Haha City Haha Korea]. The final output result is: [[Haha City Qianke Company, Haha City Qianke Development Co., Ltd.], [Haha City Haha Korea]], i.e., a plurality of entity classes are output.
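A sketch of this final grouping: replace each tuple's key fragment with its entity class value and gather the candidate case texts by class (function and variable names assumed).

```python
from collections import defaultdict

def group_by_entity_class(tuples: list[tuple[str, str]],
                          entity_class: dict[str, str]) -> dict[str, list[str]]:
    """tuples: (candidate case text, key fragment) pairs;
    entity_class: key fragment -> class value from the directed graph merging."""
    classes = defaultdict(list)
    for candidate, key in tuples:
        classes[entity_class.get(key, key)].append(candidate)
    return dict(classes)
```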
S6, determining the final case main body of each entity category based on the candidate case main bodies corresponding to each key fragment in that entity category.
As an embodiment, the determining the final case body of each entity class based on the candidate case bodies corresponding to each key segment in each entity class includes: for a plurality of key fragments in any entity category, acquiring candidate case main bodies corresponding to each key fragment; and selecting the candidate case main body which has no prefix and has the largest character number from the plurality of candidate case main bodies as the final case main body under any entity category.
It can be understood that, based on the entity classes output in S5, the entity key fragment tuple of each candidate case main body in each class is found and the third candidate case main bodies are screened out. Specifically, for each entity class, the corresponding entity key fragment tuples are found, and the key fragment corresponding to each candidate case main body is taken, forming pairs of [candidate case main body, key fragment], as shown in fig. 7.
For example, the class whose key fragments all correspond to "Qianke Industrial Development" contains the candidates [Happy Street Qianke Company, Qianke Industrial Company, Happy Street Qianke Industrial Development], which are replaced with [(Happy Street Qianke Company, Qianke Company), (Qianke Industrial Company, Qianke Industrial), (Happy Street Qianke Industrial Development, Qianke Industrial)].
For all candidate case main bodies in each entity category, the candidate case main bodies without a prefix before the key fragment are selected as third candidate case main bodies. "Without prefix" is defined as: the first N characters of the candidate case main body, starting from its first character, coincide with the key fragment.
Such as: (Happy Street Qianke Company, Qianke Company) is prefixed, by "Happy Street";
(Qianke Industrial Company, Qianke Industrial) has no prefix.
And selecting the candidate case main body with the most characters from the third candidate case main bodies as a final case main body. For each case text, replacing a plurality of case bodies extracted from the case text with corresponding case bodies of the final class, and eliminating repeated case bodies of the class.
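A sketch of this selection rule, assuming each class is given as (candidate case body, key fragment) pairs; the names and example values are assumptions.

```python
def final_case_body(pairs: list[tuple[str, str]]) -> str | None:
    """Among a class's candidates, keep those with no prefix (the candidate
    starts with its key fragment) and return the one with the most characters."""
    no_prefix = [cand for cand, key in pairs if cand.startswith(key)]
    return max(no_prefix, key=len) if no_prefix else None

# final_case_body([("Happy Street Qianke Company", "Qianke Company"),
#                  ("Qianke Industrial Company", "Qianke Industrial"),
#                  ("Happy Street Qianke Industrial Development", "Qianke Industrial")])
# -> "Qianke Industrial Company"
```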
For example, for the case text "I complain that the Qianke Company on Happy Street has withheld my 5 months' wages, and the registered company is Qianke Industrial Company", the extracted case bodies are [Happy Street Qianke Company, Qianke Industrial Company]; after replacing them with the final case body of their class and removing duplicates, the case body is "Qianke Industrial Development Company".
If a case text has no corresponding class final case body, [case text, null value] is output; if the case has only one final case body, [case text, final case body] is output; and if the case has a plurality of corresponding final case bodies, [case text, final case body] is output one by one for each of the final case bodies.
For example, for the case text "I complain about the Qianke Company on Happy Street; I am renovating at Qianke Garden; my 5 months' wages have been withheld", the extracted entities are [Happy Street Qianke Company, Qianke Garden], which are converted into the class final case bodies [Qianke Industrial Development Company, Qianke Garden].
The final output is ["I complain about the Qianke Company on Happy Street; I am renovating at Qianke Garden; my 5 months' wages have been withheld", Qianke Industrial Development Company] and ["I complain about the Qianke Company on Happy Street; I am renovating at Qianke Garden; my 5 months' wages have been withheld", Qianke Garden].
All the case texts are extracted in the mode, the final case main body is determined, and the output result forms the final case main body extraction result.
The invention provides a case main body extraction method, which uses PaddleNLP natural language processing tools to extract case main body texts in the case main body extraction stage, thereby improving the recall rate of the case main body. And in the main body alignment stage, extracting key fragments, matching the key fragments by using a box-division matching method, and finally clustering and aligning the case main bodies, thereby improving the accuracy and the matching efficiency of main body alignment and matching. In summary, the method of the invention has higher extraction accuracy and recall rate of the case main body and higher matching efficiency under the big data scene.
In the foregoing embodiments, the descriptions of the embodiments are focused on, and for those portions of one embodiment that are not described in detail, reference may be made to the related descriptions of other embodiments.
It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the invention.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present invention without departing from the spirit or scope of the invention. Thus, it is intended that the present invention also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims (10)

1. A case main body extraction method, comprising:
training the key fragment extraction model based on a training sample set to obtain a trained key fragment extraction model, wherein the training sample set comprises a plurality of pieces of sample data, and each piece of sample data comprises an entity text and a key fragment;
extracting a first candidate case main body and a second candidate case main body in each case text in a case corpus based on a named entity recognition model of PaddleNLP and an information extraction model of PaddleNLP respectively, and merging the first candidate case main body and the second candidate case main body to form a candidate case main body set;
acquiring key fragments in each candidate case main body in the candidate case main body set based on the trained key fragment extraction model, and correspondingly obtaining an entity key fragment tuple corpus by each candidate case main body and the key fragments;
Carrying out box division processing on all key fragments in the entity key fragment tuple corpus to obtain a plurality of key fragment boxes;
clustering all key fragments in each key fragment box based on a directed graph merging algorithm to obtain key fragments of a plurality of entity categories;
and determining the final case main body of each entity category based on the candidate case main bodies corresponding to each key fragment in each entity category.
2. The case body extraction method according to claim 1, wherein the extracting the first candidate case body and the second candidate case body in each case text in the case corpus based on the named entity recognition model of the PaddleNLP and the information extraction model of the PaddleNLP, respectively, includes:
acquiring a case corpus, wherein the case corpus comprises a plurality of case texts, each case text describes a case, and one case comprises one or more case main bodies;
inputting each case text in the case corpus into a named entity recognition model based on PaddleNLP, obtaining a plurality of first output results, wherein the structure of each first output result is [ entity text fragment, mark ], and obtaining a first output result marked as a first appointed mark value as a first candidate case main body;
Inputting each case text in the case corpus into an information extraction model based on PaddleNLP, obtaining a plurality of second output results, wherein the structure of each second output result is [ entity text fragment, mark ], and obtaining a second output result marked as a second appointed mark value as a second candidate case main body.
3. The case body extraction method according to claim 1, wherein the step of obtaining a physical key segment tuple corpus by associating each candidate case body and a key segment includes:
if the key fragment output by the key fragment extraction model is empty, deleting the corresponding candidate case main body;
if the key fragment output by the key fragment extraction model is not null, combining the key fragment and the corresponding candidate case main body to form an entity key fragment tuple, wherein the structural form of the entity key fragment tuple is [ the candidate case main body, the key fragment ], and all entity key fragment tuples form an entity key fragment tuple corpus;
and performing de-duplication processing on all entity key fragment tuples in the entity key fragment tuple corpus, and deleting key fragments comprising stop words.
4. The case body extraction method according to claim 3, wherein the step of performing a binning process on all the key segments in the entity key segment tuple corpus to obtain a plurality of key segment bins includes:
Creating an empty second-order index table, wherein a root node of the second-order index table is a root;
each word in each key segment in the entity key segment tuple corpus is arranged and combined to generate a plurality of bytes consisting of 2 words;
and generating a second-order index table based on the generated bytes, wherein the second-order index table comprises a first-order index and a second-order index, and a plurality of key fragments under each second-order index in the second-order index table form a key fragment box.
5. The case body extraction method of claim 4, wherein the order of the two words in each of the bytes is identical to the order in the key segment, and the key segment is discarded when the key segment is less than two words.
6. The case main body extraction method according to claim 4, wherein the generating of the second-order index table from the generated word pairs comprises:
for any word pair, searching the first-order indexes of the second-order index table for the first word of the pair; if the first word is not found, generating under the root node a first-order index keyed by the first word, and hanging the key segment under a second-order index keyed by the second word;
if the first word is found, searching the second-order indexes for the second word of the pair; if the second word is not found, generating under that first-order index a second-order index keyed by the second word and hanging the key segment under it; if the second word is found, hanging the key segment directly under that second-order index.
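As a concrete illustration of claim 6, the second-order index table can be modelled as a nested dictionary, root -> first word -> second word -> bin; treating each character of a key segment as one "word" is an assumption of this sketch, not something the claims state.

```python
# Hedged sketch of claim 6's second-order index table.
from collections import defaultdict
from itertools import combinations

def build_index(tuple_corpus):
    root = defaultdict(lambda: defaultdict(list))        # root node of the index table
    for _body, segment in tuple_corpus:
        words = list(segment)                             # assumption: one character per word
        if len(words) < 2:
            continue                                      # claim 5: discard short segments
        for first, second in combinations(words, 2):
            bin_ = root[first][second]                    # first-/second-order indexes created on demand
            if segment not in bin_:
                bin_.append(segment)                      # hang the key segment under the second-order index
    return root       # each root[first][second] list is one key segment bin
```

Each leaf list then plays the role of one key segment bin for the clustering of claims 7 to 9.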
7. The case main body extraction method according to claim 6, wherein the clustering of all key segments in each key segment bin based on the directed graph merging algorithm to obtain the key segments of the plurality of entity categories comprises:
creating an empty matched dictionary and an empty directed graph;
generating directed graph information from each key segment bin in the second-order index table;
and clustering, based on the generated directed graph information, all key segments in each key segment bin by entity category to obtain the key segments of the plurality of entity categories.
8. The case main body extraction method according to claim 7, wherein the generating of the directed graph information from each key segment bin in the second-order index table comprises:
taking each key segment as a node and assigning it an entity category attribute, the initial entity category of each node being its own key segment;
traversing all nodes and comparing each node with every other node once and only once; if Node_i contains all the words of Node_j, connecting an edge between Node_i and Node_j, increasing the in-degree of Node_i by 1 and the out-degree of Node_j by 1;
wherein Node_i is the arc head of Node_j and Node_j is the arc tail of Node_i, i.e., Node_j points to Node_i, and i, j are different node numbers;
when a node is compared with another node, generating a key a and a key b from the two compared nodes, the form of key a being [key segment of comparison node 1, key segment of comparison node 2] and the form of key b being [key segment of comparison node 2, key segment of comparison node 1];
and searching the matched dictionary for key a and key b; if at least one of them is found, not performing the node comparison; if neither is found, performing the node comparison and adding the corresponding key-value pairs to the matched dictionary.
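A possible rendering of claim 8's graph construction is sketched below; nodes are keyed by their key segment text, the matched dictionary is modelled as a set of ordered pairs (key a / key b), and "contains all words" is tested character-wise, which is an assumption.

```python
# Hedged sketch of claim 8: build directed-graph information over all bins.
def build_graph(bins):
    category, in_deg, out_deg, arc_head = {}, {}, {}, {}
    matched = set()                      # the "matched dictionary": records compared pairs
    for bin_segments in bins:
        segs = list(bin_segments)
        for s in segs:
            category.setdefault(s, s)    # initial entity category: the node's own key segment
            in_deg.setdefault(s, 0)
            out_deg.setdefault(s, 0)
            arc_head.setdefault(s, [])
        for i, si in enumerate(segs):
            for sj in segs[i + 1:]:
                if (si, sj) in matched or (sj, si) in matched:
                    continue             # key a or key b already present: skip this comparison
                matched.add((si, sj))    # key a
                matched.add((sj, si))    # key b
                for head, tail in ((si, sj), (sj, si)):
                    if all(w in head for w in tail):   # head contains all words of the tail
                        arc_head[tail].append(head)    # edge: tail points to head (arc head)
                        in_deg[head] += 1
                        out_deg[tail] += 1
    return category, in_deg, out_deg, arc_head
```

With the index table of the previous sketch, the bins could be passed as, for example, `build_graph(b for first in root.values() for b in first.values())`.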
9. The case main body extraction method according to claim 8, wherein the clustering, based on the generated directed graph information, of all key segments in each key segment bin by entity category to obtain the key segments of the plurality of entity categories comprises:
traversing all nodes in the directed graph, and for every node whose degree is 0, setting its entity category value to its own key segment;
if the in-degree of a node is greater than 0 and its out-degree is 1, assigning its entity category value to the entity category value of its arc head node, and setting its out-degree to -1;
for a node whose in-degree is greater than 0, finding its arc head node in the directed graph; if the entity category value of the arc head node is consistent with that of the node, deleting the arc head node and setting the out-degree of the node to -1;
traversing all nodes in the directed graph repeatedly until the entity category values of all nodes no longer change, and taking the entity category value of each node as the entity category value of the corresponding key segment;
and gathering all key segments with the same entity category value into one class, thereby obtaining the key segments of the plurality of entity categories.
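The degree bookkeeping of claim 9 (setting out-degrees to -1, deleting arc heads) is simplified in the sketch below to a plain fixed-point pass in which each node repeatedly adopts the entity category of one of its arc heads until nothing changes; it consumes the `category` and `arc_head` mappings produced by the sketch after claim 8, and when a node has several arc heads the first one is taken, which is an assumption.

```python
# Hedged, simplified sketch of claim 9's clustering pass.
from collections import defaultdict

def cluster(category, arc_head):
    changed = True
    while changed:                                  # traverse until no entity category value changes
        changed = False
        for node, heads in arc_head.items():
            if heads:
                new_cat = category[heads[0]]        # adopt the arc head node's category value
                if category[node] != new_cat:
                    category[node] = new_cat
                    changed = True
    clusters = defaultdict(list)                    # group key segments sharing a category value
    for node, cat in category.items():
        clusters[cat].append(node)
    return clusters
```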
10. The case main body extraction method according to claim 1, 3 or 8, wherein the determining of the final case main body of each entity category based on the candidate case main bodies corresponding to the key segments of each entity category comprises:
for the plurality of key segments in any entity category, acquiring the candidate case main body corresponding to each key segment;
and selecting, from the obtained candidate case main bodies, the candidate case main body that has no prefix and has the largest number of characters as the final case main body of that entity category.
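Claim 10 does not define "has no prefix"; the sketch below reads it as "is not a proper prefix of any other candidate in the category", which is only one possible interpretation, and assumes a mapping from key segments back to their candidate case main bodies (it can be derived from the tuple corpus of claim 3).

```python
# Hedged sketch of claim 10's final selection for one entity category.
def final_case_body(cluster_segments, segment_to_bodies):
    # gather the candidate case main bodies behind every key segment in the category
    candidates = {b for seg in cluster_segments for b in segment_to_bodies.get(seg, ())}
    # keep candidates that are not a proper prefix of another candidate (assumed reading)
    no_prefix = [b for b in candidates
                 if not any(o != b and o.startswith(b) for o in candidates)]
    pool = no_prefix or list(candidates)
    return max(pool, key=len) if pool else None     # largest number of characters wins
```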
CN202310853155.9A 2023-07-12 2023-07-12 Case main body extraction method Active CN116579344B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310853155.9A CN116579344B (en) 2023-07-12 2023-07-12 Case main body extraction method

Publications (2)

Publication Number Publication Date
CN116579344A (en) 2023-08-11
CN116579344B (en) 2023-10-20

Family

ID=87534437

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310853155.9A Active CN116579344B (en) 2023-07-12 2023-07-12 Case main body extraction method

Country Status (1)

Country Link
CN (1) CN116579344B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110750993A (en) * 2019-10-15 2020-02-04 成都数联铭品科技有限公司 Word segmentation method, word segmentation device, named entity identification method and system
CN111104801A (en) * 2019-12-26 2020-05-05 济南大学 Text word segmentation method, system, device and medium based on website domain name
CN112800023A (en) * 2020-12-11 2021-05-14 北京计算机技术及应用研究所 Multi-model data distributed storage and hierarchical query method based on semantic classification
US20230034011A1 (en) * 2021-07-29 2023-02-02 Hewlett Packard Enterprise Development Lp Natural language processing workflow
CN115146635A (en) * 2022-09-05 2022-10-04 吉奥时空信息技术股份有限公司 Address segmentation method based on domain knowledge enhancement
CN115795060A (en) * 2023-02-06 2023-03-14 吉奥时空信息技术股份有限公司 Entity alignment method based on knowledge enhancement

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
杨进才; 陈忠忠; 谢芳; 胡金柱: "Hybrid word segmentation algorithm based on Chinese Pinyin initial-letter indexing" (基于汉语拼音首字母索引的混合分词算法), 计算机系统应用 (Computer Systems &amp; Applications), No. 04, pages 223-227 *

Also Published As

Publication number Publication date
CN116579344B (en) 2023-10-20

Similar Documents

Publication Publication Date Title
CN110705294B (en) Named entity recognition model training method, named entity recognition method and named entity recognition device
US9672205B2 (en) Methods and systems related to information extraction
CA2750609C (en) Methods and systems for matching records and normalizing names
CN107315738B (en) A kind of innovation degree appraisal procedure of text information
CN107729314B (en) Chinese time identification method and device, storage medium and program product
CN110851596A (en) Text classification method and device and computer readable storage medium
CN107609052A (en) A kind of generation method and device of the domain knowledge collection of illustrative plates based on semantic triangle
CN106776538A (en) The information extracting method of enterprise's noncanonical format document
CN108415902A (en) A kind of name entity link method based on search engine
CN112084381A (en) Event extraction method, system, storage medium and equipment
CN116151132B (en) Intelligent code completion method, system and storage medium for programming learning scene
CN111723569A (en) Event extraction method and device and computer readable storage medium
CN110837568A (en) Entity alignment method and device, electronic equipment and storage medium
CN113821605A (en) Event extraction method
CN106951565A (en) File classification method and the text classifier of acquisition
CN110598787B (en) Software bug classification method based on self-defined step length learning
CN116737922A (en) Tourist online comment fine granularity emotion analysis method and system
CN110287495A (en) A kind of power marketing profession word recognition method and system
CN113360654B (en) Text classification method, apparatus, electronic device and readable storage medium
CN114764566A (en) Knowledge element extraction method for aviation field
CN115795060B (en) Entity alignment method based on knowledge enhancement
CN115270774B (en) Big data keyword dictionary construction method for semi-supervised learning
KR20190110174A (en) A core sentence extraction method based on a deep learning algorithm
CN110362828B (en) Network information risk identification method and system
CN116579344A (en) Case main body extraction method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant