CN112528036A - Knowledge graph automatic construction method for evidence correlation analysis - Google Patents
- Publication number
- CN112528036A (CN202011372006.3A)
- Authority
- CN
- China
- Prior art keywords
- evidence
- model
- entity
- data
- text
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06F16/367: Ontology (creation of semantic tools, e.g. ontology or thesauri)
- G06F16/3341: Query execution using boolean model
- G06F16/3347: Query execution using vector based model
- G06F16/9024: Graphs; Linked lists
- G06F40/295: Named entity recognition
- G06Q50/18: Legal services
Abstract
The invention relates to automatic knowledge graph construction, and in particular to an automatic knowledge graph construction method for evidence correlation analysis, comprising the following steps: step 1, constructing an ontology to describe the knowledge graph; step 2, extracting the evidence involved in a case; step 3, extracting the structural elements of the case; step 4, establishing proving relations between the evidence and the case structural elements; step 5, fusing the knowledge of highly similar entities; and step 6, storing the knowledge graph. The evidence field currently lacks a knowledge graph for storing and representing evidence; the method is simple to operate, can construct a high-quality knowledge graph at low labor cost, and improves the efficiency of evidence analysis.
Description
Technical Field
The invention relates to a knowledge graph automatic construction method, in particular to a knowledge graph automatic construction method for evidence correlation analysis.
Background
At present China has no dedicated evidence law; a unified and explicit provision on evidence relevance is lacking, and legislation offers only a few scattered rules. The ambiguous definition of evidence relevance sometimes makes it hard for a judge to distinguish relevant from irrelevant evidence, which lowers litigation efficiency and makes cases harder to resolve. In judicial practice, litigants may, in order to win, present all kinds of evidence that interferes with the judgment; such evidence may be relevant, yet admitting it can cause bias, confusion and similar problems. If the relevance of evidence is not explained, citizens' trust in the courts, the law and government agencies may be compromised.
Criminal law and technological development have always been closely connected, and investigation and physical evidence cannot do without technical support. Existing work has dissected the concept of legal artificial intelligence, analyzed the feasibility of using artificial intelligence for evidence examination in criminal trials together with the difficulties and limits of that application, and proposed reasonable application strategies; it has used artificial intelligence to help refine criminal evidence standards, giving full play to intelligent technology in verifying evidence and in finding and filling gaps, and laying a foundation for using big data and cloud computing to check the integrity of an evidence chain; and, relying on deep learning, the research and development team of the Shanghai High People's Court formulated evidence standards and evidence rules for the problems identified in judicial practice as arising easily, frequently and commonly in evidence collection, and provided an intelligent case-handling assistance system for criminal cases in Shanghai. These methods focus on evidence standards and lack any judgment of evidence relevance, so it is important to propose a model that fills this technical gap.
Disclosure of Invention
To remedy the shortcomings of the prior art, the invention aims to provide an automatic knowledge graph construction method for evidence correlation analysis. The method can extract information from massive numbers of legal documents, fill the information slots defined by the designed ontology, and automatically construct a high-quality knowledge graph. The graph provides an electronic database of evidence from historical cases to assist judicial personnel with evidence-related work.
In order to achieve the purpose of the invention and solve the problems in the prior art, the invention adopts the technical scheme that: an automatic knowledge graph construction method for evidence correlation analysis comprises the following steps:
step 1, constructing an ontology to describe the knowledge graph: a high-quality ontology structure is built to organize and express the relevant knowledge by surveying the literature, consulting reference material, and manually designing concepts, attributes and constraints, specifically comprising the following substeps:
(a) analyzing the evidence descriptions in the evidence presentation and cross-examination outline, and dividing the evidence concept into eight subclasses according to the provisions on evidence in the Criminal Procedure Law of the People's Republic of China (2018 Amendment): physical evidence; documentary evidence; witness testimony; expert opinions; victim statements; statements and defences of criminal suspects and defendants; records of inquest, examination, identification and investigative experiments; and audio-visual materials and electronic data; the attributes are defined at the same time, and the evidence information in the outline is mapped onto the evidence concept;
(b) analyzing the structure of the indictment and dividing the indictment concept into four sub-concepts: suspect, criminal fact, evidence set and procuratorate opinion, of which the first two are the main objects that evidence is used to prove; in addition, to make the proof finer-grained, a natural-condition concept is defined to decompose the suspect concept;
(c) defining the relation between the natural-condition concept and the suspect concept: because the natural condition is used to describe the suspect, the relation is defined following the wording of judicial texts, its head entity is constrained to be a natural condition, and its tail entity is constrained to be a suspect;
(d) defining the relation between the evidence concept and the indictment concept: once both concepts have been constructed, the proving relation is defined, its head entity is constrained to be evidence and its tail entity is constrained to be the indictment, which links the two concepts and completes the construction of the knowledge graph ontology;
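The ontology of step 1 can be pictured as a small concept hierarchy plus typed relations. The following is a minimal sketch under stated assumptions: the English identifiers, the relation names `describes` and `proves`, and the helper `is_valid_triple` are hypothetical and are not taken from the patent, which does not prescribe any data structure.

```python
from dataclasses import dataclass

# Concept hierarchy (child -> parent). All identifiers are illustrative.
EVIDENCE_SUBCLASSES = (
    "PhysicalEvidence", "DocumentaryEvidence", "WitnessTestimony",
    "ExpertOpinion", "VictimStatement", "SuspectOrDefendantStatement",
    "InvestigationRecord", "AudioVisualOrElectronicData",
)
INDICTMENT_PARTS = ("Suspect", "CriminalFact", "EvidenceSet", "ProcuratorateOpinion")

PARENT = {name: "Evidence" for name in EVIDENCE_SUBCLASSES}
PARENT.update({name: "Indictment" for name in INDICTMENT_PARTS})


@dataclass(frozen=True)
class RelationType:
    name: str
    head: str  # concept the head entity must belong to
    tail: str  # concept the tail entity must belong to


# Substeps (c) and (d): head/tail constraints of the two relations.
RELATIONS = {
    "describes": RelationType("describes", head="NaturalCondition", tail="Suspect"),
    "proves": RelationType("proves", head="Evidence", tail="Indictment"),
}


def is_a(concept, ancestor):
    """Walk up the hierarchy until the ancestor is found or the root is passed."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = PARENT.get(concept)
    return False


def is_valid_triple(head_concept, relation, tail_concept):
    """Check a candidate triple against the relation's head/tail constraints."""
    rel = RELATIONS[relation]
    return is_a(head_concept, rel.head) and is_a(tail_concept, rel.tail)
```

With these definitions, a documentary-evidence entity may prove a criminal fact, while the reverse direction is rejected because the head of `proves` must be evidence.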
step 2, extracting the evidence involved in the case: named entity recognition is used to extract the evidence presented in the evidence presentation and cross-examination outline, and rules automatically determine what each evidence entity is directed to prove, specifically comprising the following substeps:
(a) constructing an evidence entity recognition data set: the evidence presentation and cross-examination outline contains descriptions of the relevant evidence, and the evidence entities in the outline are annotated by a combination of manual labelling and rules to build the training data set of the model;
(b) building a neural network for named entity recognition: a classic encoder-decoder framework is adopted, in which the encoder is a pre-trained model with strong language representation capability and the decoder is a feedforward neural network; the computation is described by formula (1) and formula (2),
$$h_t = \mathrm{PLM}(x_t) \tag{1}$$
where PLM denotes the pre-trained language model adopted, which is trained and then open-sourced by research institutions, $x_t$ denotes the input at time step $t$, and $h_t$ denotes the encoded intermediate vector of the input at time step $t$,
$$y_t = \mathrm{FFN}(h_t) \tag{2}$$
where FFN denotes the feedforward neural network, whose structure is chosen according to the input, and $y_t$ denotes the entity tag at the corresponding position of the input sequence;
(c) training the neural network model on the annotated data: the data set is first split proportionally into a training set, a validation set and a test set; the training data are then fed into the model, and the precision, recall and F value of the model are computed; the number of training epochs, the learning rate and the network-structure hyperparameters are adjusted according to the test results to find the parameter combination with which the model performs best, and the parameters are recorded and the model saved;
(d) packaging the best model from training: new input text is preprocessed with the same pre-trained word vectors, the text is serialized into a vector representation that the model can compute on, the corresponding tag sequence is obtained by model prediction, the tag sequence is post-processed by rules to determine entity boundaries and obtain the evidence entities, and the entity type information obtained at the same time determines the evidence type of each entity;
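The rule-based post-processing in substep (d) of step 2 can be illustrated by decoding a tag sequence into entities. This is only a sketch: the patent does not name a tagging scheme, so the BIO scheme, the `DOC` type label and the function `decode_bio` are assumptions made for illustration.

```python
def decode_bio(tokens, tags):
    """Turn a BIO tag sequence into (entity_text, evidence_type) pairs.

    Tokens are joined without a separator, which suits character-level
    tokens of Chinese text. An I- tag without a preceding B- tag of the
    same type is ignored.
    """
    entities, current, current_type = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:  # close the entity that was open
                entities.append(("".join(current), current_type))
            current, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == current_type:
            current.append(token)
        else:
            if current:
                entities.append(("".join(current), current_type))
            current, current_type = [], None
    if current:  # entity that runs to the end of the sequence
        entities.append(("".join(current), current_type))
    return entities


tokens = list("被告人户籍证明一份")
tags = ["O", "O", "O", "B-DOC", "I-DOC", "I-DOC", "I-DOC", "O", "O"]
print(decode_bio(tokens, tags))  # [('户籍证明', 'DOC')]
```

The type suffix of each tag gives the evidence subclass, so boundary detection and evidence typing come out of the same pass.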
step 3, extracting the case structural elements: the case structure in the indictment is analyzed by combining a neural network with rules and is divided into different structural elements, specifically comprising the following substeps:
(a) analyzing the indictment texts in the data set: the document structure is divided according to the designed ontology, the paragraphs that separate sections and the keywords in those paragraphs are located, and the text is cut roughly by Boolean operations over keyword matches, achieving a coarse-grained division of the text;
(b) for texts that the Boolean operations cannot segment or segment poorly, building a neural network model to reach the same goal: each paragraph of the document is first serialized into a word vector by a neural network method, and a logistic regression model is then built to predict whether the paragraph corresponding to each word vector is a boundary paragraph; the computation is described by formula (3),
$$\mathrm{lab}_i = \mathrm{LR}(\mathrm{NN}(\mathrm{par}_i)) \tag{3}$$
where $\mathrm{par}_i$ denotes the text sequence of the $i$-th paragraph of the document, NN denotes the neural network method that serializes a paragraph of text into a word vector, LR denotes the logistic regression model that decides whether the paragraph is a boundary paragraph, and $\mathrm{lab}_i$ denotes the label of the $i$-th paragraph, where 1 indicates a boundary paragraph and 0 indicates a non-boundary paragraph;
(c) training the model and predicting on new text data: documents that the Boolean operations segmented correctly are fed into the model as labelled data, the model is trained iteratively for multiple rounds, and the number of network layers, the learning rate and the optimizer parameters are adjusted until the model performs best; the model is then applied to the documents whose boundaries the Boolean operations could not locate, to obtain the correct boundaries;
(d) obtaining the indices of the boundary paragraphs in the indictment through the above process, processing the indictment with rules to divide it into four parts, namely suspect, criminal fact, evidence set and procuratorate opinion, mapping the content of the indictment onto the indictment ontology, and instantiating the indictment ontology;
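The coarse-grained Boolean segmentation of substep (a) and the final split of substep (d) in step 3 might look as follows. This is a sketch under assumptions: the keywords in `SECTION_RULES` are hypothetical English stand-ins for the fixed phrases of real indictments, and returning `None` stands for handing the document over to the neural model of substep (b).

```python
# Hypothetical keyword rules: each one is a Boolean expression over
# keyword presence in a lower-cased paragraph.
SECTION_RULES = {
    "criminal_fact": lambda p: "ascertained" in p,
    "evidence_set": lambda p: "evidence" in p and "as follows" in p,
    "procuratorate_opinion": lambda p: "procuratorate holds" in p,
}


def find_boundaries(paragraphs):
    """Map each section to the index of its first paragraph."""
    starts = {"suspect": 0}  # the suspect section opens the document
    for index, paragraph in enumerate(paragraphs):
        text = paragraph.lower()
        for section, rule in SECTION_RULES.items():
            if section not in starts and rule(text):
                starts[section] = index
    return starts


def split_indictment(paragraphs):
    """Split into the four parts, or return None if a boundary is missing."""
    starts = find_boundaries(paragraphs)
    if len(starts) < len(SECTION_RULES) + 1:
        return None  # Boolean matching failed; fall back to the model
    ordered = sorted(starts.items(), key=lambda item: item[1])
    ends = [start for _, start in ordered[1:]] + [len(paragraphs)]
    return {name: paragraphs[start:end] for (name, start), end in zip(ordered, ends)}


indictment = [
    "Defendant Zhang, male, born in 1990.",
    "Upon review it is ascertained that the defendant stole a mobile phone.",
    "The evidence for the above facts is as follows: surveillance video.",
    "This procuratorate holds that the defendant committed theft.",
]
sections = split_indictment(indictment)
```

Documents for which `split_indictment` succeeds are the ones that can serve as labelled training data for the boundary classifier.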
step 4, establishing proving relations between the evidence and the case structural elements: text matching is used to analyze the similarity between the description of what a piece of evidence proves and each structural element, and to judge whether a proving relation exists, specifically comprising the following substeps:
(a) analyzing, for each piece of evidence in the evidence presentation and cross-examination outline, the text describing what it proves, analyzing the text descriptions of the four structural elements in the corresponding indictment, and judging whether the evidence has a proving relation with each element; annotation rules and an annotation framework are designed manually, a small amount of data is annotated by hand, and a third party then verifies the annotations manually to ensure their correctness;
(b) building a neural network model to predict the proving relation between evidence and case structural elements: the network computes the similarity between the text describing what the evidence proves and the text description of a case structural element, and the relative magnitude of the similarity is used as the basis for judging whether a proving relation exists;
(c) training the model with distant supervision: starting from the small amount of high-quality data annotated in substep (a) of step 4, the data are augmented by distant supervision so that the model can be trained on a large data set, and the parameters are adjusted continuously during training until the best model is obtained and saved;
(d) using the trained model to predict the relation between the evidence set and the case structural elements of each text group: an evidence list and a list of case structural elements are first extracted from the text group of a specific case, the Cartesian product of the two sets is taken, the model computes an evidence-chain label for each pair of evidence entity and structural element, and the pairs that have a proving relation are finally added to the set of triples;
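Substep (d) of step 4 pairs every evidence entity with every structural element and keeps the pairs judged to be related. The sketch below substitutes a token-overlap (Jaccard) score and a fixed threshold for the neural similarity model, purely so that the example runs without a trained network; `jaccard`, `build_triples`, the threshold value and the sample texts are all assumptions.

```python
from itertools import product


def jaccard(a, b):
    """Token-overlap similarity; a stand-in for the neural matching model."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


def build_triples(evidence, elements, score=jaccard, threshold=0.2):
    """evidence / elements: {name: descriptive text}. Returns proving triples."""
    triples = []
    # Cartesian product of the evidence list and the structural-element list.
    for (ev_name, ev_text), (el_name, el_text) in product(evidence.items(), elements.items()):
        if score(ev_text, el_text) >= threshold:
            triples.append((ev_name, "proves", el_name))
    return triples


evidence = {
    "household registration certificate": "identity and age of the defendant",
    "surveillance video": "the defendant took the phone from the shop",
}
elements = {
    "suspect": "defendant Zhang male age 30 identity card number",
    "criminal_fact": "the defendant took a phone from the shop counter",
}
triples = build_triples(evidence, elements)
```

Passing a different `score` callable is the intended extension point: a trained matching model can replace `jaccard` without changing the pairing logic.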
step 5, fusing the knowledge of highly similar entities: a neural network computes the semantic mapping relations between instances from different judicial texts, and their knowledge is fused, specifically comprising the following substeps:
(a) steps 1 to 4 yield a preliminary knowledge graph, but it contains highly similar entities, such as "household registration information" and "household registration certificate"; the knowledge of each entity is expanded through distant supervision and then combined with the attribute information of the entity and the information of its related entities, and the three kinds of information are concatenated as the vector representation of the entity;
(b) building a model to compute the similarity between the vector representations of entities and associating entities horizontally so that instance data complement one another: if the similarity of two entities is above a threshold, they are considered to describe the same information and are linked; if the similarity is low, they are not linked and each entity describes its own information independently; the computation is described by formula (4),
$$\mathrm{sim} = f(x_{\mathrm{exp}};\, x_{\mathrm{attr}};\, x_{\mathrm{adj}}) \tag{4}$$
where $x_{\mathrm{exp}}$ denotes the knowledge representation of the entity in a third-party knowledge base, $x_{\mathrm{attr}}$ denotes the attribute representation of the entity, $x_{\mathrm{adj}}$ denotes the vector representation of the related entities, $f$ denotes the similarity computation model, and $\mathrm{sim}$ denotes the similarity value computed by the model;
(c) fusing knowledge according to the computed similarity values: a central entity is first determined within each set of mutually linked entities, the relations and attribute values of the non-central entities are then merged into the central entity, and if a relation or attribute conflict is detected during fusion, it is resolved by a voting-based method;
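The linking and fusion of step 5 can be sketched as follows. Several choices here are assumptions rather than the patent's method: cosine similarity stands in for the model $f$ of formula (4), the central entity is taken to be the one with the most attributes, and majority voting is implemented with a plain counter. All function names and sample values are hypothetical.

```python
import math
from collections import Counter


def entity_vector(x_exp, x_attr, x_adj):
    """Concatenate the three representations into one entity vector."""
    return list(x_exp) + list(x_attr) + list(x_adj)


def cosine(u, v):
    """Cosine similarity; a stand-in for the similarity model f."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def fuse(linked_entities):
    """Merge mutually linked entities into one central entity.

    linked_entities: list of {"name": str, "attrs": dict}. Conflicting
    attribute values are resolved by majority vote.
    """
    center = max(linked_entities, key=lambda e: len(e["attrs"]))
    votes = {}
    for entity in linked_entities:
        for key, value in entity["attrs"].items():
            votes.setdefault(key, Counter())[value] += 1
    fused = {key: counter.most_common(1)[0][0] for key, counter in votes.items()}
    return {"name": center["name"], "attrs": fused}


linked = [
    {"name": "household registration information",
     "attrs": {"holder": "Zhang", "issuer": "Station A"}},
    {"name": "household registration certificate",
     "attrs": {"holder": "Zhang", "issuer": "Station B", "issued": "2019"}},
    {"name": "household register proof",
     "attrs": {"issuer": "Station B"}},
]
merged = fuse(linked)
```

In this example the `issuer` conflict is settled two votes to one, and the entity with the richest attribute set supplies the name of the fused node.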
step 6, storing the knowledge graph: the knowledge graph is stored in a graph database to improve query efficiency, specifically comprising the following substeps:
(a) the entities in the knowledge graph are regarded as nodes and the relations as labelled edges, so the data of the knowledge graph naturally fit a graph model; the graph-structured storage method models the knowledge graph data as a directed graph and represents and stores the data through nodes, edges and attributes;
(b) importing the automatically extracted relational data into the graph database in batches: the data are saved in csv format, with a node file and a relation file defined separately, and are imported with the import command that ships with the graph database, completing the automatic construction of the knowledge graph.
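The node file and relation file of substep (b) of step 6 can be produced with the standard `csv` module. The patent does not name a graph database; the header fields below (`:ID`, `:LABEL`, `:START_ID`, `:END_ID`, `:TYPE`) follow the convention of Neo4j's bulk importer and are an assumption, as are the function name `to_csv` and the sample triple.

```python
import csv
import io


def to_csv(triples, labels):
    """Serialize triples into a node file and a relation file (as strings).

    triples: iterable of (head, relation, tail); labels: {node: concept}.
    """
    nodes_buf, rels_buf = io.StringIO(), io.StringIO()
    node_writer = csv.writer(nodes_buf, lineterminator="\n")
    rel_writer = csv.writer(rels_buf, lineterminator="\n")
    node_writer.writerow(["name:ID", ":LABEL"])          # assumed Neo4j-style header
    rel_writer.writerow([":START_ID", ":END_ID", ":TYPE"])
    seen = []
    for head, relation, tail in triples:
        for node in (head, tail):
            if node not in seen:  # write each node only once
                seen.append(node)
                node_writer.writerow([node, labels[node]])
        rel_writer.writerow([head, tail, relation])
    return nodes_buf.getvalue(), rels_buf.getvalue()


nodes_csv, rels_csv = to_csv(
    [("surveillance video", "proves", "criminal fact")],
    {"surveillance video": "Evidence", "criminal fact": "CriminalFact"},
)
```

Writing the two strings to files gives the pair of inputs that a batch import command expects; the exact command depends on the database chosen.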
The beneficial effects of the invention are as follows. An automatic knowledge graph construction method for evidence correlation analysis comprises the following steps: step 1, constructing an ontology to describe the knowledge graph; step 2, extracting the evidence involved in a case; step 3, extracting the structural elements of the case; step 4, establishing proving relations between the evidence and the case structural elements; step 5, fusing the knowledge of highly similar entities; and step 6, storing the knowledge graph. The evidence field currently lacks a knowledge graph for storing and representing evidence; compared with the prior art, the method is simple to operate, can construct a high-quality knowledge graph at low labor cost, and improves the efficiency of evidence analysis.
Drawings
FIG. 1 is a flow chart of the method steps of the present invention.
FIG. 2 is a diagram of an evidence ontology constructed by the present invention.
FIG. 3 is a representation of an evidence entity identification process of the present invention.
Detailed Description
The invention will be further explained with reference to the drawings.
As shown in fig. 1, an automatic knowledge graph construction method for evidence correlation analysis includes the following steps:
step 1, constructing an ontology to describe the knowledge graph: a high-quality ontology structure is built to organize and express the relevant knowledge by surveying the literature, consulting reference material, and manually designing concepts, attributes and constraints, specifically comprising the following substeps:
(a) analyzing the evidence descriptions in the evidence presentation and cross-examination outline, and dividing the evidence concept into eight subclasses according to the provisions on evidence in the Criminal Procedure Law of the People's Republic of China (2018 Amendment): physical evidence; documentary evidence; witness testimony; expert opinions; victim statements; statements and defences of criminal suspects and defendants; records of inquest, examination, identification and investigative experiments; and audio-visual materials and electronic data; the attributes are defined at the same time, and the evidence information in the outline is mapped onto the evidence concept;
(b) analyzing the structure of the indictment and dividing the indictment concept into four sub-concepts: suspect, criminal fact, evidence set and procuratorate opinion, of which the first two are the main objects that evidence is used to prove; in addition, to make the proof finer-grained, a natural-condition concept is defined to decompose the suspect concept;
(c) defining the relation between the natural-condition concept and the suspect concept: because the natural condition is used to describe the suspect, the relation is defined following the wording of judicial texts, its head entity is constrained to be a natural condition, and its tail entity is constrained to be a suspect;
(d) defining the relation between the evidence concept and the indictment concept: once both concepts have been constructed, the proving relation is defined, its head entity is constrained to be evidence and its tail entity is constrained to be the indictment, which links the two concepts and completes the construction of the knowledge graph ontology;
step 2, extracting the evidence involved in the case: named entity recognition is used to extract the evidence presented in the evidence presentation and cross-examination outline, and rules automatically determine what each evidence entity is directed to prove, specifically comprising the following substeps:
(a) constructing an evidence entity recognition data set: the evidence presentation and cross-examination outline contains descriptions of the relevant evidence, and the evidence entities in the outline are annotated by a combination of manual labelling and rules to build the training data set of the model;
(b) building a neural network for named entity recognition: a classic encoder-decoder framework is adopted, in which the encoder is a pre-trained model with strong language representation capability and the decoder is a feedforward neural network; the computation is described by formula (1) and formula (2),
$$h_t = \mathrm{PLM}(x_t) \tag{1}$$
where PLM denotes the pre-trained language model adopted, which is trained and then open-sourced by research institutions, $x_t$ denotes the input at time step $t$, and $h_t$ denotes the encoded intermediate vector of the input at time step $t$,
$$y_t = \mathrm{FFN}(h_t) \tag{2}$$
where FFN denotes the feedforward neural network, whose structure is chosen according to the input, and $y_t$ denotes the entity tag at the corresponding position of the input sequence;
(c) training the neural network model on the annotated data: the data set is first split proportionally into a training set, a validation set and a test set; the training data are then fed into the model, and the precision, recall and F value of the model are computed; the number of training epochs, the learning rate and the network-structure hyperparameters are adjusted according to the test results to find the parameter combination with which the model performs best, and the parameters are recorded and the model saved;
(d) packaging the best model from training: new input text is preprocessed with the same pre-trained word vectors, the text is serialized into a vector representation that the model can compute on, the corresponding tag sequence is obtained by model prediction, the tag sequence is post-processed by rules to determine entity boundaries and obtain the evidence entities, and the entity type information obtained at the same time determines the evidence type of each entity;
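The data split and the evaluation measures named in substep (c) of step 2 can be made concrete with a short sketch. The 8:1:1 ratio, the span encoding `(start, end, type)` and the function names are assumptions; the patent only states that the data set is split proportionally and that precision, recall and the F value are computed.

```python
def split_dataset(items, ratios=(0.8, 0.1, 0.1)):
    """Split a list into training, validation and test portions.

    The default ratios are an assumption; the remainder goes to the test set.
    """
    n = len(items)
    train_end = int(n * ratios[0])
    valid_end = train_end + int(n * ratios[1])
    return items[:train_end], items[train_end:valid_end], items[valid_end:]


def precision_recall_f1(gold, predicted):
    """Entity-level scores: a prediction counts only if span and type both match."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


train, valid, test = split_dataset(list(range(10)))

# Entities encoded as (start, end, type); one of two predictions is correct.
gold = [(3, 7, "DOC"), (10, 12, "PHY")]
predicted = [(3, 7, "DOC"), (8, 9, "PHY")]
scores = precision_recall_f1(gold, predicted)
```

Scoring at the entity level rather than per tag penalizes boundary errors, which matters here because the extracted span becomes a node in the graph.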
step 3, extracting the case structural elements: the case structure in the indictment is analyzed by combining a neural network with rules and is divided into different structural elements, specifically comprising the following substeps:
(a) analyzing the indictment texts in the data set: the document structure is divided according to the designed ontology, the paragraphs that separate sections and the keywords in those paragraphs are located, and the text is cut roughly by Boolean operations over keyword matches, achieving a coarse-grained division of the text;
(b) for texts that the Boolean operations cannot segment or segment poorly, building a neural network model to reach the same goal: each paragraph of the document is first serialized into a word vector by a neural network method, and a logistic regression model is then built to predict whether the paragraph corresponding to each word vector is a boundary paragraph; the computation is described by formula (3),
$$\mathrm{lab}_i = \mathrm{LR}(\mathrm{NN}(\mathrm{par}_i)) \tag{3}$$
where $\mathrm{par}_i$ denotes the text sequence of the $i$-th paragraph of the document, NN denotes the neural network method that serializes a paragraph of text into a word vector, LR denotes the logistic regression model that decides whether the paragraph is a boundary paragraph, and $\mathrm{lab}_i$ denotes the label of the $i$-th paragraph, where 1 indicates a boundary paragraph and 0 indicates a non-boundary paragraph;
(c) training the model and predicting on new text data: documents that the Boolean operations segmented correctly are fed into the model as labelled data, the model is trained iteratively for multiple rounds, and the number of network layers, the learning rate and the optimizer parameters are adjusted until the model performs best; the model is then applied to the documents whose boundaries the Boolean operations could not locate, to obtain the correct boundaries;
(d) obtaining the indices of the boundary paragraphs in the indictment through the above process, processing the indictment with rules to divide it into four parts, namely suspect, criminal fact, evidence set and procuratorate opinion, mapping the content of the indictment onto the indictment ontology, and instantiating the indictment ontology;
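Formula (3) composes a paragraph encoder with a logistic regression classifier. The sketch below covers only the LR half, trained by batch gradient descent in plain Python. The two-dimensional vectors are hand-written placeholders standing in for the output of NN, and the learning rate and epoch count are assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_lr(vectors, labels, epochs=200, learning_rate=0.5):
    """Fit logistic regression weights by batch gradient descent."""
    weights, bias = [0.0] * len(vectors[0]), 0.0
    n = len(vectors)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(weights), 0.0
        for x, y in zip(vectors, labels):
            error = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for j, xj in enumerate(x):
                grad_w[j] += error * xj
            grad_b += error
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias


def predict_boundary(weights, bias, vector):
    """Return 1 for a boundary paragraph and 0 otherwise."""
    z = sum(w * xi for w, xi in zip(weights, vector)) + bias
    return 1 if sigmoid(z) >= 0.5 else 0


# Placeholder paragraph vectors (assumed NN output) with boundary labels.
paragraph_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
boundary_labels = [1, 1, 0, 0]
weights, bias = train_lr(paragraph_vectors, boundary_labels)
```

In the pipeline described above, the labels would come from the documents that keyword matching already segmented correctly, so no additional manual annotation is needed for this classifier.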
step 4, establishing the proving relationship between the evidence and the case structural elements: a text matching technique is used to analyze the similarity between the description of the object of proof and the structural elements and to judge whether a proving relationship exists, which specifically comprises the following substeps:
(a) analyzing the text description of the object of proof for each piece of evidence in the outline for presenting and cross-examining evidence, analyzing the text descriptions of the four structural elements in the corresponding indictment, judging whether the evidence has a proving relationship with a structural element, manually designing annotation rules and an annotation framework, performing a small amount of manual annotation, and then having a third party manually verify the annotations to ensure their correctness;
(b) building a neural network model to predict the proving relationship between the evidence and the case structural elements: the neural network calculates the similarity between the text description of the evidence's object of proof and the text description of each case structural element, and the relative magnitude of the similarity is used as the basis for judging whether a proving relationship exists;
(c) training the model using a distant supervision method: a small amount of high-quality data is annotated in substep (a) of step 4, the data is then augmented by distant supervision so that the model can be trained on a large data set, and the parameters are adjusted continuously during training until the optimal model structure is obtained and saved;
(d) predicting the relationship between each group of textual evidence sets and the case structural elements using the trained model: an evidence list and a list of case structural elements are first extracted from the group of texts related to a specific case, the Cartesian product of the two sets is then taken, the model calculates an evidence chain label for each pair of evidence entity and structural element, and finally the pairs that have a proving relationship are added to the triple set;
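By way of non-limiting illustration, the pairing logic of substep (d) of step 4 can be sketched in Python as follows. The trained neural matching model is replaced here by a simple token-overlap (Jaccard) score purely as a stand-in; the function names `jaccard` and `build_triples`, the relation name `proves` and the threshold value 0.2 are assumptions made for the example and are not specified by the method.

```python
from itertools import product
from typing import Callable, Dict, List, Tuple


def jaccard(a: str, b: str) -> float:
    """Stand-in similarity: token overlap between two descriptions."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def build_triples(
    evidence: Dict[str, str],
    elements: Dict[str, str],
    score: Callable[[str, str], float] = jaccard,
    threshold: float = 0.2,  # assumed cut-off, not given in the source
) -> List[Tuple[str, str, str]]:
    """Score every (evidence, element) pair and keep the proving ones."""
    triples = []
    # Cartesian product of the evidence list and the structural element list.
    for (ev_name, ev_desc), (el_name, el_desc) in product(
        evidence.items(), elements.items()
    ):
        if score(ev_desc, el_desc) >= threshold:
            triples.append((ev_name, "proves", el_name))
    return triples
```

Passing the scoring function as a parameter keeps the pairing logic unchanged when the stand-in is swapped for a trained text matching model.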
step 5, fusing the knowledge of highly similar entities: a neural network is used to calculate the semantic mapping relationships among different judicial text instances and the knowledge is fused, which specifically comprises the following substeps:
(a) steps 1 to 4 build a preliminary knowledge graph, but it contains highly similar entities such as "household registration information" and "household registration certificate"; the knowledge of each entity is expanded through distant supervision and then combined with the attribute information of the entity and the information of its related entities, and the three kinds of information are concatenated as the vector representation of the entity;
(b) building a model to calculate the similarity between the vector representations of the entities and performing horizontal entity association so that instance data complement each other: if the similarity of two entities is higher than a threshold, the two entities are considered to describe the same information and are linked; if the similarity is low, no entity linking is performed and the two entities each describe their own information independently; the calculation is described by formula (4),
$$sim = f(x_{exp}; x_{attr}; x_{adj}) \tag{4}$$
wherein $x_{exp}$ represents the knowledge representation of the entity in a third-party knowledge base, $x_{attr}$ represents the attribute representation of the entity, $x_{adj}$ represents the vector representation of the related entities, f represents the similarity calculation model, and sim represents the similarity value calculated by the model;
(c) performing knowledge fusion according to the calculated similarity values: a central entity is first determined in each set of mutually linked entities, the relations and attribute values of the non-central entities are then merged into the central entity, and if a relation or attribute conflict is detected during fusion, the conflict is resolved by a voting-based method;
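By way of non-limiting illustration, the linking and fusion logic of step 5 can be sketched in Python as follows. Cosine similarity is used only as a stand-in for the similarity model f of formula (4), the threshold 0.9 is an assumed value, and the helper names `entity_vector`, `cosine`, `same_entity` and `fuse_by_vote` are hypothetical. The choice of the central entity is not shown because the text does not state how it is made.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence


def entity_vector(x_exp: Sequence[float], x_attr: Sequence[float],
                  x_adj: Sequence[float]) -> List[float]:
    """Concatenate the three kinds of information into one entity vector."""
    return list(x_exp) + list(x_attr) + list(x_adj)


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Stand-in for the similarity model f: cosine of two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def same_entity(vec_a: Sequence[float], vec_b: Sequence[float],
                threshold: float = 0.9) -> bool:
    """Link two entities when their similarity reaches the assumed threshold."""
    return cosine(vec_a, vec_b) >= threshold


def fuse_by_vote(linked: List[Dict[str, str]]) -> Dict[str, str]:
    """Merge attribute dictionaries; conflicting values are settled by majority."""
    votes: Dict[str, Counter] = {}
    for attrs in linked:
        for key, value in attrs.items():
            votes.setdefault(key, Counter())[value] += 1
    return {key: counter.most_common(1)[0][0] for key, counter in votes.items()}
```

In this sketch a tie between two attribute values is resolved in favour of the value seen first, which is a simplification; a real system would need an explicit tie-breaking rule.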
step 6, storing the knowledge graph: the knowledge graph is stored in a graph database to improve query efficiency, which specifically comprises the following substeps:
(a) the entities in the knowledge graph are regarded as nodes and the relations as labeled edges, so the knowledge graph data clearly fits the graph model; using a graph-structured storage method, the knowledge graph data is modeled as a directed graph and is represented and stored through nodes, edges and attributes;
(b) importing the automatically extracted relational data into the graph database in batches: the data is saved in CSV format, with a node file and a relationship file defined separately, and is imported using the import command provided by the graph database, completing the automatic construction of the knowledge graph.
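By way of non-limiting illustration, the preparation of the node file and the relationship file in substep (b) of step 6 can be sketched in Python as follows. The text does not name a particular graph database, so the column headers `id`, `name`, `start_id`, `type` and `end_id` and the function name `write_graph_csv` are assumptions; an actual import command will prescribe its own header format.

```python
import csv
from typing import Dict, Iterable, TextIO, Tuple

Triple = Tuple[str, str, str]  # (head entity, relation, tail entity)


def write_graph_csv(triples: Iterable[Triple], node_out: TextIO,
                    rel_out: TextIO) -> Dict[str, int]:
    """Write one CSV of nodes and one CSV of directed, labeled edges."""
    triples = list(triples)
    # Assign a numeric id to each distinct entity in order of first appearance.
    node_ids: Dict[str, int] = {}
    for head, _, tail in triples:
        for name in (head, tail):
            node_ids.setdefault(name, len(node_ids))

    node_writer = csv.writer(node_out, lineterminator="\n")
    node_writer.writerow(["id", "name"])  # assumed header
    for name, node_id in node_ids.items():
        node_writer.writerow([node_id, name])

    rel_writer = csv.writer(rel_out, lineterminator="\n")
    rel_writer.writerow(["start_id", "type", "end_id"])  # assumed header
    for head, relation, tail in triples:
        rel_writer.writerow([node_ids[head], relation, node_ids[tail]])
    return node_ids
```

The function accepts any text stream, so it can write to files opened with `open(path, "w", newline="")` or to in-memory buffers during testing.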
Claims (1)
1. An automatic knowledge graph construction method for evidence correlation analysis, characterized by comprising the following steps:
step 1, constructing an ontology to describe the knowledge graph: through literature research, consultation of reference materials, and manual design of concepts, attributes and constraints, a high-quality ontology structure is built to organize and express the relevant knowledge, which specifically comprises the following substeps:
(a) analyzing the evidence descriptions in the outline for presenting and cross-examining evidence, and dividing the evidence concept into eight subclasses according to the provisions on evidence in the Criminal Procedure Law of the People's Republic of China (2018 Amendment), namely physical evidence, documentary evidence, witness testimony, appraisal opinions, statements of the victim, statements and defenses of the criminal suspect or defendant, records of inquests, examinations, identifications and investigative experiments, and audio-visual materials and electronic data; at the same time completing the definition of attributes, and mapping the evidence information in the outline for presenting and cross-examining evidence to the evidence concept;
(b) analyzing the structure of the indictment and dividing the indictment concept into four sub-concepts, namely the suspect, the criminal facts, the evidence set and the procuratorate's opinion, of which the first two serve as the main objects of proof of the evidence; in addition, to make the proof more precise, a natural-condition concept is additionally defined to analyze the suspect concept;
(c) defining the relation between the natural-condition concept and the suspect concept: since the natural condition is used to describe the suspect, the relation is defined according to the wording of the judicial text, with its head entity constrained to be a natural condition and its tail entity constrained to be a suspect;
(d) defining the relation between the evidence concept and the indictment concept: after the evidence and indictment concepts are constructed, the proving relation is defined, with its head entity constrained to be evidence and its tail entity constrained to be the indictment, thereby linking the two concepts and completing the construction of the knowledge graph ontology;
step 2, extracting the evidence involved in the case: the evidence presented in the outline for presenting and cross-examining evidence is extracted using a named entity recognition technique, and the proof direction of each evidence entity is determined automatically by rules, which specifically comprises the following substeps:
(a) constructing an evidence entity recognition data set: the outline for presenting and cross-examining evidence contains descriptions of the relevant evidence; the evidence entities in the outline are annotated by a combination of manual labeling and rules to build the training data set for the model;
(b) building a neural network for named entity recognition: a classic encoder-decoder framework is adopted, in which the encoder uses a pre-trained model with strong language representation capability and the decoder uses a feedforward neural network; the calculation is described by formula (1) and formula (2),
$$h_t = \mathrm{PLM}(x_t) \tag{1}$$
wherein PLM represents the pre-trained language model adopted, which is trained and then open-sourced by research institutions, $x_t$ represents the input data at time t, and $h_t$ represents the encoded intermediate vector of the input at time t,
$$y_t = \mathrm{FFN}(h_t) \tag{2}$$
where FFN represents the feedforward neural network, for which different network structures are selected according to different inputs, and $y_t$ represents the entity tag at the corresponding position of the input sequence;
(c) training the neural network model with the annotated data: the data set is first split proportionally into a training set, a validation set and a test set; the training set is then input into the model, the accuracy, recall and F-score of the model are calculated, and the number of training epochs, the learning rate and the network-structure hyperparameters are adjusted according to the model's test results to obtain the parameter combination with which the model performs best; these parameters are recorded and the model is saved;
(d) packaging the best model from the training process: a new input text is preprocessed using the same pre-trained word vectors, the text is serialized into a vector representation that the model can compute on, the corresponding tag sequence is obtained through model prediction, the tag sequence is then post-processed by specific rules to determine the entity boundaries and obtain the evidence entities, and the entity type information is obtained at the same time to determine the evidence type of each entity;
step 3, extracting case structural elements: the case structure in the indictment is analyzed using a method combining a neural network and rules and is divided into different structural elements, which specifically comprises the following substeps:
(a) analyzing the indictment texts in the data set, dividing the document structure according to the designed ontology, locating the paragraphs and the keywords within them, and roughly cutting the text using Boolean operations on keyword matches, thereby achieving a coarse-grained division of the text;
(b) for texts that Boolean operations cannot segment, or segment poorly, building a neural network model to achieve the segmentation: each paragraph in the document is first serialized into a word vector by a neural network method, and a logistic regression model is then built to predict whether the paragraph corresponding to each word vector is a boundary paragraph; the calculation is described by formula (3),
$$lab_i = \mathrm{LR}(\mathrm{NN}(par_i)) \tag{3}$$
wherein $par_i$ represents the text sequence of the i-th paragraph in the document, NN represents the neural network method that serializes a paragraph of text into a word vector, LR represents the logistic regression model that determines whether the paragraph is a boundary paragraph, and $lab_i$ represents the label of the i-th paragraph, where a result of 1 indicates a boundary paragraph and a result of 0 indicates a non-boundary paragraph;
(c) training the model and predicting on new text data: documents for which the Boolean operations give correct results are input into the model as labeled data, the model is trained iteratively for multiple rounds, and the number of network layers, the learning rate and the optimizer parameters are adjusted until the model achieves its best performance; the model is then applied to documents whose boundaries cannot be located by Boolean operations to obtain the correct boundaries;
(d) obtaining the paragraph numbers of the boundary paragraphs in the indictment through the above process, processing the indictment with rules to divide it into four parts, namely the suspect, the criminal facts, the evidence set and the procuratorate's opinion, mapping the content of the indictment to the indictment ontology, and instantiating the indictment ontology;
step 4, establishing the proving relationship between the evidence and the case structural elements: a text matching technique is used to analyze the similarity between the description of the object of proof and the structural elements and to judge whether a proving relationship exists, which specifically comprises the following substeps:
(a) analyzing the text description of the object of proof for each piece of evidence in the outline for presenting and cross-examining evidence, analyzing the text descriptions of the four structural elements in the corresponding indictment, judging whether the evidence has a proving relationship with a structural element, manually designing annotation rules and an annotation framework, performing a small amount of manual annotation, and then having a third party manually verify the annotations to ensure their correctness;
(b) building a neural network model to predict the proving relationship between the evidence and the case structural elements: the neural network calculates the similarity between the text description of the evidence's object of proof and the text description of each case structural element, and the relative magnitude of the similarity is used as the basis for judging whether a proving relationship exists;
(c) training the model using a distant supervision method: a small amount of high-quality data is annotated in substep (a) of step 4, the data is then augmented by distant supervision so that the model can be trained on a large data set, and the parameters are adjusted continuously during training until the optimal model structure is obtained and saved;
(d) predicting the relationship between each group of textual evidence sets and the case structural elements using the trained model: an evidence list and a list of case structural elements are first extracted from the group of texts related to a specific case, the Cartesian product of the two sets is then taken, the model calculates an evidence chain label for each pair of evidence entity and structural element, and finally the pairs that have a proving relationship are added to the triple set;
step 5, fusing the knowledge of highly similar entities: a neural network is used to calculate the semantic mapping relationships among different judicial text instances and the knowledge is fused, which specifically comprises the following substeps:
(a) steps 1 to 4 build a preliminary knowledge graph, but it contains highly similar entities such as "household registration information" and "household registration certificate"; the knowledge of each entity is expanded through distant supervision and then combined with the attribute information of the entity and the information of its related entities, and the three kinds of information are concatenated as the vector representation of the entity;
(b) building a model to calculate the similarity between the vector representations of the entities and performing horizontal entity association so that instance data complement each other: if the similarity of two entities is higher than a threshold, the two entities are considered to describe the same information and are linked; if the similarity is low, no entity linking is performed and the two entities each describe their own information independently; the calculation is described by formula (4),
$$sim = f(x_{exp}; x_{attr}; x_{adj}) \tag{4}$$
wherein $x_{exp}$ represents the knowledge representation of the entity in a third-party knowledge base, $x_{attr}$ represents the attribute representation of the entity, $x_{adj}$ represents the vector representation of the related entities, f represents the similarity calculation model, and sim represents the similarity value calculated by the model;
(c) performing knowledge fusion according to the calculated similarity values: a central entity is first determined in each set of mutually linked entities, the relations and attribute values of the non-central entities are then merged into the central entity, and if a relation or attribute conflict is detected during fusion, the conflict is resolved by a voting-based method;
step 6, storing the knowledge graph: the knowledge graph is stored in a graph database to improve query efficiency, which specifically comprises the following substeps:
(a) the entities in the knowledge graph are regarded as nodes and the relations as labeled edges, so the knowledge graph data clearly fits the graph model; using a graph-structured storage method, the knowledge graph data is modeled as a directed graph and is represented and stored through nodes, edges and attributes;
(b) importing the automatically extracted relational data into the graph database in batches: the data is saved in CSV format, with a node file and a relationship file defined separately, and is imported using the import command provided by the graph database, completing the automatic construction of the knowledge graph.
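By way of non-limiting illustration of substep (d) of step 2 in claim 1, the rule-based post-processing that turns a predicted tag sequence into evidence entities can be sketched in Python as follows. The claim does not name a tagging scheme; the sketch assumes the common BIO scheme (`B-TYPE` opens an entity, `I-TYPE` continues it, `O` is outside any entity), and the function name `decode_bio` and the type labels used with it are hypothetical.

```python
from typing import List, Tuple


def decode_bio(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Recover (entity text, entity type) pairs from an assumed BIO tag sequence."""
    entities: List[Tuple[str, str]] = []
    current: List[str] = []
    current_type = None

    def flush() -> None:
        # Close the entity being built, if any, and record it.
        nonlocal current, current_type
        if current:
            entities.append((" ".join(current), current_type))
        current, current_type = [], None

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            flush()
            current, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current.append(token)
        else:
            # "O" tags and stray "I-" tags end the current entity;
            # a stray "I-" token itself is discarded.
            flush()
    flush()
    return entities
```

The entity type returned with each span corresponds to the evidence type of the entity, so boundary recovery and type assignment happen in the same pass.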
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011372006.3A CN112528036B (en) | 2020-11-30 | 2020-11-30 | Knowledge graph automatic construction method for evidence correlation analysis |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011372006.3A CN112528036B (en) | 2020-11-30 | 2020-11-30 | Knowledge graph automatic construction method for evidence correlation analysis |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112528036A (en) | 2021-03-19 |
CN112528036B (en) | 2021-09-07 |
Family
ID=74996482
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011372006.3A Active CN112528036B (en) | 2020-11-30 | 2020-11-30 | Knowledge graph automatic construction method for evidence correlation analysis |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112528036B (en) |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113407688A (en) * | 2021-06-15 | 2021-09-17 | 西安理工大学 | Method for establishing knowledge graph-based survey standard intelligent question-answering system |
CN113407678A (en) * | 2021-06-30 | 2021-09-17 | 竹间智能科技(上海)有限公司 | Knowledge graph construction method, device and equipment |
CN114969384A (en) * | 2022-08-02 | 2022-08-30 | 联通(四川)产业互联网有限公司 | High-value judicial evidence chain acquisition and storage method and device and readable storage medium |
CN115238688A (en) * | 2022-08-15 | 2022-10-25 | 广州市刑事科学技术研究所 | Electronic information data association relation analysis method, device, equipment and storage medium |
CN116307566A (en) * | 2023-03-12 | 2023-06-23 | 武汉大学 | Dynamic design system for large-scale building construction project construction organization scheme |
CN116431835A (en) * | 2023-06-06 | 2023-07-14 | 中汽数据(天津)有限公司 | Automatic knowledge graph construction method, equipment and medium in automobile authentication field |
CN116542252A (en) * | 2023-07-07 | 2023-08-04 | 北京营加品牌管理有限公司 | Financial text checking method and system |
CN116720786A (en) * | 2023-08-01 | 2023-09-08 | 中国科学院工程热物理研究所 | KG and PLM fusion assembly quality stability prediction method, system and medium |
CN116737967A (en) * | 2023-08-15 | 2023-09-12 | 中国标准化研究院 | Knowledge graph construction and perfecting system and method based on natural language |
CN117830060A (en) * | 2024-03-04 | 2024-04-05 | 天津财经大学 | Injury crime law enforcement supervision and auxiliary decision-making system based on knowledge graph |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108009299A (en) * | 2017-12-28 | 2018-05-08 | 北京市律典通科技有限公司 | Law tries method and device for business processing |
CN110457479A (en) * | 2019-08-12 | 2019-11-15 | 贵州大学 | A kind of judgement document's analysis method based on criminal offence chain |
CN110837563A (en) * | 2018-08-17 | 2020-02-25 | 阿里巴巴集团控股有限公司 | Case judgment method, device and system |
EP3620997A1 (en) * | 2018-09-04 | 2020-03-11 | Siemens Aktiengesellschaft | Transfer learning of machine-learning models using knowledge graph database |
CN111241837A (en) * | 2020-01-04 | 2020-06-05 | 大连理工大学 | Theft case legal document named entity identification method based on anti-migration learning |
CN111651557A (en) * | 2020-05-09 | 2020-09-11 | 清华大学深圳国际研究生院 | Automatic text generation method and device and computer readable storage medium |
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108009299A (en) * | 2017-12-28 | 2018-05-08 | 北京市律典通科技有限公司 | Law tries method and device for business processing |
CN110837563A (en) * | 2018-08-17 | 2020-02-25 | 阿里巴巴集团控股有限公司 | Case judgment method, device and system |
EP3620997A1 (en) * | 2018-09-04 | 2020-03-11 | Siemens Aktiengesellschaft | Transfer learning of machine-learning models using knowledge graph database |
CN110457479A (en) * | 2019-08-12 | 2019-11-15 | 贵州大学 | A kind of judgement document's analysis method based on criminal offence chain |
CN111241837A (en) * | 2020-01-04 | 2020-06-05 | 大连理工大学 | Theft case legal document named entity identification method based on anti-migration learning |
CN111651557A (en) * | 2020-05-09 | 2020-09-11 | 清华大学深圳国际研究生院 | Automatic text generation method and device and computer readable storage medium |
Non-Patent Citations (4)
Title |
---|
ERWIN FILTZ: "Building and Processing a Knowledge-Graph for Legal Data", 《EUROPEAN SEMANTIC WEB CONFERENCE》 * |
洪文兴等: "面向司法案件的案情知识图谱自动构建", 《中文信息学报》 * |
邹爱玲: "基于法律的知识图谱构建", 《万方数据》 * |
陈彦光等: "基于刑事案例的知识图谱构建技术", 《郑州大学学报(理学版)》 * |
Cited By (19)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113407688A (en) * | 2021-06-15 | 2021-09-17 | 西安理工大学 | Method for establishing knowledge graph-based survey standard intelligent question-answering system |
CN113407688B (en) * | 2021-06-15 | 2022-09-16 | 西安理工大学 | Method for establishing knowledge graph-based survey standard intelligent question-answering system |
CN113407678A (en) * | 2021-06-30 | 2021-09-17 | 竹间智能科技(上海)有限公司 | Knowledge graph construction method, device and equipment |
CN114969384A (en) * | 2022-08-02 | 2022-08-30 | 联通(四川)产业互联网有限公司 | High-value judicial evidence chain acquisition and storage method and device and readable storage medium |
CN114969384B (en) * | 2022-08-02 | 2022-10-21 | 联通(四川)产业互联网有限公司 | High-value judicial evidence chain acquisition and storage method and device and readable storage medium |
CN115238688B (en) * | 2022-08-15 | 2023-08-01 | 广州市刑事科学技术研究所 | Method, device, equipment and storage medium for analyzing association relation of electronic information data |
CN115238688A (en) * | 2022-08-15 | 2022-10-25 | 广州市刑事科学技术研究所 | Electronic information data association relation analysis method, device, equipment and storage medium |
CN116307566A (en) * | 2023-03-12 | 2023-06-23 | 武汉大学 | Dynamic design system for large-scale building construction project construction organization scheme |
CN116307566B (en) * | 2023-03-12 | 2024-05-10 | 武汉大学 | Dynamic design system for large-scale building construction project construction organization scheme |
CN116431835B (en) * | 2023-06-06 | 2023-09-15 | 中汽数据(天津)有限公司 | Automatic knowledge graph construction method, equipment and medium in automobile authentication field |
CN116431835A (en) * | 2023-06-06 | 2023-07-14 | 中汽数据(天津)有限公司 | Automatic knowledge graph construction method, equipment and medium in automobile authentication field |
CN116542252A (en) * | 2023-07-07 | 2023-08-04 | 北京营加品牌管理有限公司 | Financial text checking method and system |
CN116542252B (en) * | 2023-07-07 | 2023-09-29 | 北京营加品牌管理有限公司 | Financial text checking method and system |
CN116720786B (en) * | 2023-08-01 | 2023-10-03 | 中国科学院工程热物理研究所 | KG and PLM fusion assembly quality stability prediction method, system and medium |
CN116720786A (en) * | 2023-08-01 | 2023-09-08 | 中国科学院工程热物理研究所 | KG and PLM fusion assembly quality stability prediction method, system and medium |
CN116737967A (en) * | 2023-08-15 | 2023-09-12 | 中国标准化研究院 | Knowledge graph construction and perfecting system and method based on natural language |
CN116737967B (en) * | 2023-08-15 | 2023-11-21 | 中国标准化研究院 | Knowledge graph construction and perfecting system and method based on natural language |
CN117830060A (en) * | 2024-03-04 | 2024-04-05 | 天津财经大学 | Injury crime law enforcement supervision and auxiliary decision-making system based on knowledge graph |
CN117830060B (en) * | 2024-03-04 | 2024-05-28 | 天津财经大学 | Injury crime law enforcement supervision and auxiliary decision-making system based on knowledge graph |
Also Published As
Publication number | Publication date |
---|---|
CN112528036B (en) | 2021-09-07 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112528036B (en) | Knowledge graph automatic construction method for evidence correlation analysis | |
WO2021103492A1 (en) | Risk prediction method and system for business operations | |
CN111737495B (en) | Middle-high-end talent intelligent recommendation system and method based on domain self-classification | |
WO2021031383A1 (en) | Intelligent auxiliary judgment method and apparatus, and computer device and storage medium | |
CN110674840B (en) | Multi-party evidence association model construction method and evidence chain extraction method and device | |
CN112612902A (en) | Knowledge graph construction method and device for power grid main device | |
CN111967761B (en) | Knowledge graph-based monitoring and early warning method and device and electronic equipment | |
CN110675023B (en) | Litigation request rationality prediction model training method based on neural network, and litigation request rationality prediction method and device based on neural network | |
CN113779272B (en) | Knowledge graph-based data processing method, device, equipment and storage medium | |
CN103207855A (en) | Fine-grained sentiment analysis system and method specific to product comment information | |
CN106991161A (en) | A kind of method for automatically generating open-ended question answer | |
WO2020010834A1 (en) | Faq question and answer library generalization method, apparatus, and device | |
CN111899089A (en) | Enterprise risk early warning method and system based on knowledge graph | |
CN116992005B (en) | Intelligent dialogue method, system and equipment based on large model and local knowledge base | |
CN111026880B (en) | Joint learning-based judicial knowledge graph construction method | |
CN113239208A (en) | Mark training model based on knowledge graph | |
CN114331122A (en) | Key person risk level assessment method and related equipment | |
CN109241199A (en) | A method of it is found towards financial knowledge mapping | |
CN116257759A (en) | Structured data intelligent classification grading system of deep neural network model | |
Lai et al. | Large language models in law: A survey | |
Liu et al. | Research and citation analysis of data mining technology based on Bayes algorithm | |
CN116561264A (en) | Knowledge graph-based intelligent question-answering system construction method | |
CN112613611A (en) | Tax knowledge base system based on knowledge graph | |
Zhong et al. | Construction project risk prediction model based on EW-FAHP and one dimensional convolution neural network | |
CN117252255B (en) | Disaster emergency knowledge graph construction method oriented to auxiliary decision |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||