CN118228818B - Knowledge extraction method and system for injury crime interrogation transcripts - Google Patents

Publication number
CN118228818B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202410642135.1A
Other languages
Chinese (zh)
Other versions
CN118228818A (en)
Inventor
华斌
李宣毅
吴诺
孙博文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University of Finance and Economics
Original Assignee
Tianjin University of Finance and Economics
Filing date
Publication date
Application filed by Tianjin University of Finance and Economics
Priority to CN202410642135.1A
Publication of CN118228818A
Application granted
Publication of CN118228818B
Active (current legal status)
Anticipated expiration

Abstract

The invention provides a knowledge extraction method and system for injury crime interrogation transcripts. It relates to the technical fields of natural language processing and knowledge engineering and addresses the problem that processing non-standard transcript information during law enforcement supervision and case handling still consumes large amounts of manpower, material resources, and time. First, the answers corresponding to interrogation questions related to the ontology are extracted from the original transcript data; because these answers are relevant to the case, the amount of information to be reviewed in later case auditing is reduced. Reference resolution, sentence splitting, sentence denoising, sentence completion, and triple extraction are then performed to automatically extract entity-relation-entity triples. The method covers the complete transcript processing flow, and each step yields a clear, well-defined result. The processing produces triple knowledge for law enforcement, so that electronic transcripts written in non-standard language can be processed by machine, greatly improving the efficiency and objectivity of law enforcement supervision.

Description

Knowledge extraction method and system for injury crime interrogation transcripts
Technical Field
The invention relates to the technical fields of natural language processing and knowledge engineering, and in particular to a knowledge extraction method and system for injury crime interrogation transcripts.
Background
Traditional workflow-based electronic law enforcement and case handling systems cannot cope with the non-standard writing produced during case handling. Applying knowledge engineering based on a causal-logic paradigm together with natural language processing is the basic technical approach to this kind of problem.
In the prior art, knowledge extraction is carried out with machine learning methods, which have the following defects:
(1) Traditional machine learning methods mainly target standard text data, on which they achieve fairly high accuracy. In an interrogation, however, the psychological state and expressive ability of the person being questioned often make the transcript text non-standard or poorly worded. When traditional methods are applied to transcript data, the accuracy of the results is therefore low and little knowledge can be extracted;
(2) Existing language models are black boxes: they are trained on large amounts of text and then predict results for the text to be processed. The results are limited by the format, quality, and quantity of the training data and have low interpretability. Little training data is available for interrogation transcripts, and because serious public security decisions are involved, which demand sound reasoning and logic, conventional machine learning methods are ill-suited to this scenario.
Therefore, knowledge extraction from injury crime interrogation transcripts, and in particular the handling of non-standard transcripts, still requires further research.
Disclosure of Invention
To solve the problem that existing language models and law enforcement case handling platforms cannot process non-standard transcript information, so that law enforcement supervision and case handling still consume large amounts of manpower, material resources, and time, the invention provides a knowledge extraction method and system for injury crime interrogation transcripts.
A knowledge extraction method for injury crime interrogation transcripts specifically comprises the following steps:
Step S0, acquiring the case file information of the injury crime case under examination according to the case filing information; the case file information includes: the names of the persons involved in the case, the characteristics of the persons involved, the time of the case, and the place of the case;
Step S1, acquiring the original interrogation transcript of the injury crime case under examination, inputting it into a trained natural language processing model, and outputting the relevant questions; the relevant questions are questions that involve the injury crime ontology; the injury crime ontology comprises: crime subject, object of harm, result of harm, criminal act, and crime tool;
Step S2, extracting the answers corresponding to the relevant questions; identifying the referring expressions in the answers, determining the person involved in the case that each referring expression denotes according to the names and characteristics of the persons involved, and replacing each referring expression with that person's name to obtain the replaced answers;
Step S3, splitting the replaced answers into sentences at commas, semicolons, periods, exclamation marks, and question marks;
Step S4, denoising the sentences obtained in step S3 and retaining those that contain the injury crime ontology;
Step S5, completing any retained sentence that lacks a subject, predicate, or object, so that every sentence contains all three, to obtain the completed sentences;
Step S6, extracting the subject, predicate, and object of each completed sentence to obtain triple phrases containing the injury crime ontology; a triple phrase is a phrase comprising two entities and the relation between them;
Step S7, performing an entity check on the extraction results, deleting triple phrases that contain no entity and retaining those that contain entities and an entity relation; the entities include person names, instruments, property, and limbs, and the entity relations include: criminal act, belongs-to, and holds.
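The output of steps S6 and S7 can be pictured with a small data structure. The following is a minimal sketch only: the `Triple` class, the `ENTITY_TYPES` lexicon, the relation labels, and the rule that both ends of a triple must be known entities are illustrative assumptions, not details given in the patent text.

```python
from dataclasses import dataclass

# Hypothetical entity lexicon (assumption): in practice this would come from
# the case file and the expanded ontology vocabulary.
ENTITY_TYPES = {
    "Zhang San": "name",
    "Li Si": "name",
    "knife": "instrument",
    "wallet": "property",
    "left arm": "limb",
}
# Hypothetical labels for the three relation kinds named in step S7.
RELATION_TYPES = {"criminal_act", "belongs_to", "holds"}


@dataclass(frozen=True)
class Triple:
    subject: str
    relation_type: str  # one of RELATION_TYPES
    predicate: str      # surface verb, e.g. "stabbed"
    obj: str


def entity_check(triples):
    """Keep triples whose subject and object are known entities and whose
    relation type is recognised (a simplified, assumed reading of step S7)."""
    return [
        t for t in triples
        if t.subject in ENTITY_TYPES
        and t.obj in ENTITY_TYPES
        and t.relation_type in RELATION_TYPES
    ]


candidates = [
    Triple("Zhang San", "holds", "held", "knife"),
    Triple("Zhang San", "criminal_act", "stabbed", "Li Si"),
    Triple("it", "criminal_act", "hurt", "a lot"),  # no entity on either side
]
kept = entity_check(candidates)
```

In this sketch the third candidate is dropped because neither end is a known entity, which is the kind of filtering step S7 describes.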
Further, in step S1, the natural language processing model is a trained BERT pre-trained model, and the training steps are as follows:
Step S11, data acquisition: acquiring historical interrogation transcripts of injury crimes and extracting the questions and answers in them;
Step S12, data classification: labeling answers that contain the injury crime ontology as relevant answers and answers that do not contain the ontology as irrelevant answers;
Step S13, training the BERT pre-trained model with the historical interrogation transcripts as input and the questions corresponding to the relevant answers as output, to obtain the trained BERT model.
Further, in step S12, answers containing the injury crime ontology are labeled as follows: if an answer contains at least one concept word of the ontology, the answer is labeled; the concept vocabulary of the ontology is the word set obtained by expanding each item of the ontology.
Further, the concept vocabulary of the ontology is obtained as follows:
Step S121, constructing the ontology: determining the ontology and the values each ontology item can take; the ontology comprises crime subject, criminal act, crime tool, object of harm, and result of harm;
Step S122, expanding the ontology concept vocabulary: expanding the concept words contained in the ontology, specifically expanding criminal acts by constructing a criminal act feature word bag and a public place feature word bag, and expanding objects of harm by constructing a human body part concept tree;
The criminal act feature word bag is obtained by taking the criminal act words identified in historical cases and expanding them with synonym and near-synonym word clouds;
The public place feature word bag is obtained by taking the public place words in historical cases and expanding them with synonym and near-synonym word clouds;
The human body part concept tree is obtained by dividing the human body according to the forensic definition of body parts, which yields standardized body part concept words and their hierarchical relations;
Step S123, taking all words contained in the expanded ontology as the ontology concept vocabulary.
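The vocabulary construction in steps S121 to S123 and the labeling rule of step S12 can be sketched as follows. This is an illustration under stated assumptions: the seed words, the synonym table, and the body part tree are invented English stand-ins for the Chinese vocabulary, and substring matching is a simplification of whatever matching the real system uses.

```python
# Hypothetical seed data (assumption): English stand-ins for words that would
# be collected from historical cases and forensic definitions.
SEED_CRIMINAL_ACTS = {"hit", "stab"}
SYNONYMS = {"hit": {"punch", "strike"}, "stab": {"pierce"}}
BODY_PART_TREE = {
    "upper limb": {"arm": {}, "hand": {"finger": {}}},
    "head": {"face": {}},
}


def expand(seeds, synonyms):
    """Build a feature word bag: the seed words plus their listed synonyms."""
    bag = set(seeds)
    for word in seeds:
        bag |= synonyms.get(word, set())
    return bag


def flatten(tree):
    """Collect every concept word in a nested concept tree."""
    words = set()
    for node, children in tree.items():
        words.add(node)
        words |= flatten(children)
    return words


# Step S123: the concept vocabulary is the union of all expanded word sets.
CONCEPT_VOCAB = expand(SEED_CRIMINAL_ACTS, SYNONYMS) | flatten(BODY_PART_TREE)


def is_relevant(answer):
    """Step S12 rule: an answer is relevant if it contains at least one concept word."""
    text = answer.lower()
    return any(word in text for word in CONCEPT_VOCAB)


relevant = is_relevant("He punched me in the face")
irrelevant = is_relevant("I was at home watching TV")
```

Substring matching suits unsegmented Chinese text but can over-match in English, so a production system would likely match on segmented words instead.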
Further, in step S2, the referring expressions include feature-based references and personal pronouns; determining the person involved in the case that each referring expression denotes according to the names and characteristics of the persons involved, and replacing each referring expression with that person's name, specifically comprises the following steps:
Step S21, processing feature-based references: according to the characteristics of the persons involved, associating each description of a person's characteristics in the answer with the name of a person involved in the case, and replacing the feature-based reference with the associated name;
Step S22, processing personal pronouns: replacing "you" in the answer with the interrogator, and replacing "I" in the answer with the name of the person being questioned; replacing "he" in the answer with the name of the person involved in the case, other than the referent of "you", that appears closest to the pronoun in the answer.
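The personal pronoun rule of step S22 can be sketched on pre-tokenized text. This is a minimal sketch under assumptions: English tokens stand in for the Chinese pronouns, names are single tokens, "closest" is measured in token distance, and the function name and parameters are hypothetical.

```python
def resolve_personal_pronouns(tokens, respondent, interrogator, involved_names):
    """Replace personal pronouns in one answer, following the step S22 rule.

    tokens:         the answer as a list of word tokens
    respondent:     name of the person being questioned (replaces "I"/"me")
    interrogator:   label or name of the questioner (replaces "you")
    involved_names: names of the persons involved in the case
    """
    # Positions of explicit name mentions in the original answer.
    name_positions = [i for i, t in enumerate(tokens) if t in involved_names]
    resolved = []
    for i, tok in enumerate(tokens):
        low = tok.lower()
        if low == "you":
            resolved.append(interrogator)
        elif low in {"i", "me"}:
            resolved.append(respondent)
        elif low in {"he", "him"} and name_positions:
            # Nearest explicit name by token distance (assumed reading of "closest").
            nearest = min(name_positions, key=lambda p: abs(p - i))
            resolved.append(tokens[nearest])
        else:
            resolved.append(tok)
    return resolved


answer = ["Li", "came", "over", "and", "he", "pushed", "me"]
resolved = resolve_personal_pronouns(
    answer, respondent="Zhang", interrogator="Officer", involved_names={"Li", "Zhang"}
)
```

Here "he" resolves to "Li", the only name mentioned, and "me" resolves to the respondent "Zhang". A pronoun with no name in the answer is left unchanged, which is where manual review would step in.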
Further, in step S4, the sentences obtained in step S3 are denoised and those containing the injury crime ontology are retained, specifically: a sentence is retained if it contains a concept word of the injury crime ontology and deleted if it does not.
Further, in step S5, sentences lacking a subject, predicate, or object are completed so that each sentence contains all three, specifically: dependency parsing is performed on the sentence to obtain its constituent analysis result, and the subject, predicate, and object of the sentence are filled in according to that result; the constituent analysis result includes: subject-predicate relation, core (head) relation, verb-object relation, right-adjunct relation, and attributive-head relation.
Further, in step S6, the subject, predicate, and object of each sentence are extracted to obtain triple phrases containing the injury crime ontology, which specifically comprises:
Step S61, classifying each sentence as a normal sentence, a special sentence, a complex sentence, or an ill-formed sentence; special sentences are ba-sentences (把) and bei-sentences (被), complex sentences are sentences containing more than two ontology items, and ill-formed sentences are sentences that belong to none of the other three classes;
Step S62, processing normal sentences: performing dependency parsing on the normal sentence to obtain its constituent analysis result, and determining its subject, predicate, and object from that result to obtain a triple phrase containing the injury crime ontology;
Step S63, processing special sentences:
For a ba-sentence, the subject, predicate, and object are extracted as follows: locate the word 把; take as subject the word before it that is a noun and a person name in the ontology; take as predicate the verb that matches a criminal act; and take as object the word after it that is a property item or person name in the ontology, to obtain a triple phrase containing the injury crime ontology;
For a bei-sentence, the subject, predicate, and object are extracted as follows: locate the passive marker 被; take as subject the word after it that is a noun and a property item or person name in the ontology; take as predicate the verb that matches a criminal act; and take as object the word before it that is a person name in the ontology, to obtain a triple phrase containing the injury crime ontology;
Step S64, processing complex sentences: inputting the complex sentence into a trained natural language processing model to obtain triple phrases containing the injury crime ontology;
Step S65, processing ill-formed sentences: taking the instruments, property, and limbs in the sentence as objects; taking as the subject of each the person involved in the case who is mentioned closest to it; and taking the criminal act, belongs-to, or holds relation between that person and the instrument, property, or limb as the predicate, to obtain triple phrases containing the injury crime ontology.
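The positional rules of step S63 can be sketched on a pre-segmented sentence. The sketch assumes that the two special sentence types are marked by 把 and 被, that word segmentation has already been done, and that the entity set and criminal act word bag shown here are hypothetical examples; part-of-speech checks are omitted for brevity.

```python
# Hypothetical ontology entries (assumption): persons and property from the
# case file, and a small criminal act word bag.
ENTITIES = {"张三", "李四", "钱包"}
CRIMINAL_VERBS = {"打", "抢", "捅"}


def extract_special(tokens):
    """Extract (subject, predicate, object) from a 把- or 被-sentence.

    Returns None when the sentence has no marker or a component is missing.
    """
    if "把" in tokens:
        marker = "把"
    elif "被" in tokens:
        marker = "被"
    else:
        return None
    k = tokens.index(marker)
    before = [t for t in tokens[:k] if t in ENTITIES]
    after = [t for t in tokens[k + 1:] if t in ENTITIES]
    verbs = [t for t in tokens[k + 1:] if t in CRIMINAL_VERBS]
    if not (before and after and verbs):
        return None
    if marker == "把":
        # Agent precedes 把; the affected entity follows it.
        subject, obj = before[-1], after[0]
    else:
        # Agent follows 被; the affected entity precedes it.
        subject, obj = after[0], before[-1]
    return (subject, verbs[0], obj)


ba_triple = extract_special(["张三", "把", "李四", "打", "了"])
bei_triple = extract_special(["李四", "被", "张三", "打", "了"])
```

Both example sentences describe the same event, so the sketch maps them to the same triple with 张三 as subject and 李四 as object.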
Further, in step S64, the natural language processing model is a BERT-BiLSTM-CRF model, and the training steps are as follows:
Step S641, acquiring historical interrogation transcripts of injury crimes and extracting the answers corresponding to the relevant questions in them;
Step S642, extracting the complex sentences in the answers and the triple phrases extracted from those complex sentences;
Step S643, constructing a BERT-BiLSTM-CRF model;
Step S644, training the BERT-BiLSTM-CRF model with the complex sentences as input and the triple phrases extracted from them as output.
A knowledge extraction system for injury crime interrogation transcripts, which uses the knowledge extraction method described above, comprises the following modules:
The case file creation module is used for acquiring the case file information of the injury crime case under examination according to the case filing information; the case file information includes: the names of the persons involved in the case, the characteristics of the persons involved, the time of the case, and the place of the case;
The relevant question acquisition module is used for acquiring the original interrogation transcript of the injury crime case under examination, inputting it into a trained natural language processing model, and outputting the relevant questions; the relevant questions are questions that involve the injury crime ontology; the injury crime ontology comprises: crime subject, object of harm, result of harm, criminal act, and crime tool;
The reference resolution module is used for extracting the answers corresponding to the relevant questions; identifying the referring expressions in the answers, determining the person involved in the case that each referring expression denotes according to the names and characteristics of the persons involved, and replacing each referring expression with that person's name to obtain the replaced answers;
The sentence splitting module is used for splitting the replaced answers into sentences at commas, semicolons, periods, exclamation marks, and question marks;
The sentence denoising module is used for denoising the sentences and retaining those that contain the injury crime ontology;
The sentence completion module is used for completing sentences that lack a subject, predicate, or object, so that every sentence contains all three;
The triple extraction module is used for extracting the subject, predicate, and object of each sentence to obtain triple phrases containing the injury crime ontology; a triple phrase is a phrase comprising two entities and the relation between them;
The entity check module is used for performing an entity check on the extraction results, deleting triple phrases that contain no entity and retaining those that contain entities and an entity relation; the entities include person names, instruments, property, and limbs, and the entity relations include: criminal act, belongs-to, and holds.
Compared with the prior art, the invention has the following beneficial effects:
First, the invention extracts from the original transcript data the answers corresponding to interrogation questions related to the ontology; because these answers are relevant to the case, the amount of information to be reviewed in later case auditing is reduced. Reference resolution, sentence splitting, sentence denoising, sentence completion, and triple extraction are then performed to automatically extract entity-relation-entity triples. The method covers the complete transcript processing flow, and each step yields a clear, well-defined result. The processing removes the interference of invalid information, so the triples finally extracted are more accurate and comprehensive, which solves the problem of automatically processing non-standard transcripts during case handling;
Second, the invention extracts the ontology from injury crime interrogation transcripts and then applies different logical processing rules to special sentences, complex sentences, and ill-formed sentences to extract the knowledge contained in non-standard text data, which improves the accuracy of knowledge extraction.
Drawings
To illustrate the embodiments of the invention or the technical solutions of the prior art more clearly, the drawings needed to describe them are briefly introduced below. The drawings described below show only some embodiments of the invention, and a person skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a schematic flow diagram of the knowledge extraction method for injury crime interrogation transcripts;
FIG. 2 is a schematic diagram of the knowledge extraction system for injury crime interrogation transcripts.
Detailed Description
To make the objects, technical solutions, and advantages of the invention clearer, the technical solutions of the invention are described clearly and completely below. The described embodiments are only some, not all, of the embodiments of the invention. All other embodiments that a person of ordinary skill in the art can obtain from them without inventive effort fall within the scope of the invention.
Specific embodiments of the application are described below with reference to the drawings (tables).
The invention addresses the problem that processing non-standard transcript information during law enforcement supervision and case handling still consumes large amounts of manpower, material resources, and time. First, the method extracts from the original transcript data the answers corresponding to interrogation questions related to the ontology; because these answers are relevant to the case, the amount of information to be reviewed in later case auditing is reduced. Reference resolution, sentence splitting, sentence denoising, sentence completion, and triple extraction are then performed to automatically extract entity-relation-entity triples. The method covers the complete transcript processing flow, and each step yields a clear, well-defined result. The processing removes the interference of invalid information, so the triples finally extracted are more accurate and comprehensive, which solves the problem of automatically processing non-standard transcripts during case handling.
Example 1
As shown in FIG. 1, the invention provides a knowledge extraction method for injury crime interrogation transcripts, which specifically comprises the following steps:
Step S0, acquiring the case file information of the injury crime case under examination according to the case filing information; the case file information includes: the names of the persons involved in the case, the characteristics of the persons involved, the time of the case, and the place of the case;
A lookup table is established that maps each person's name to that person's characteristics, such as clothing characteristics, physical characteristics, and occupational characteristics. The table directly provides the basis for replacing referring expressions during the later reference resolution. Clothing characteristics include the type and color of clothing; physical characteristics include gender, tattoos, age, and build.
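One way to picture this table and its later use is a dictionary keyed by person name. The following is a minimal sketch, assuming invented names and features, English stand-in text, and a rule (not stated in the patent) that a description is resolved only when exactly one person matches.

```python
# Hypothetical lookup table built in step S0 (assumption): each person
# involved in the case mapped to recorded characteristics.
FEATURES_BY_NAME = {
    "Zhang San": {"red jacket", "tattoo", "tall"},
    "Li Si": {"black cap", "driver"},
}


def resolve_feature_reference(description):
    """Return the name whose recorded features appear in the description.

    Returns None when no person or more than one person matches, so that
    ambiguous cases can be passed to manual review.
    """
    text = description.lower()
    matches = [
        name
        for name, features in FEATURES_BY_NAME.items()
        if any(feature in text for feature in features)
    ]
    return matches[0] if len(matches) == 1 else None


who = resolve_feature_reference("the man in the red jacket")
```

Returning `None` for ambiguous descriptions keeps the sketch conservative, which fits the point made later in the text that reference resolution may need manual confirmation.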
Step S1, acquiring the original interrogation transcript of the injury crime case under examination, inputting it into a trained natural language processing model, and outputting the relevant questions; the relevant questions are questions that involve the injury crime ontology;
Injury crime transcripts often contain specific legal terms, complex descriptions, personal statements, and the like, which may be uncommon in the training data sets of ordinary text recognition models. If the quality of the training data is low, for example blurred images, heavy noise, or low resolution, or if the data set lacks diversity, the model can hardly learn enough features to recognize this special content accurately. The acquired injury crime transcripts therefore need targeted screening that retains the useful information, which reduces the amount of data to be processed and improves processing efficiency.
The prior art offers various data preprocessing methods that can extract data for specific purposes, but none of them handles injury crime transcripts effectively. Moreover, because injury crime transcripts are produced and used in a special setting, conventional preprocessing methods cannot extract their information accurately.
The invention first extracts the questions that involve the injury crime ontology and then treats the answers to those questions as the content that contains the ontology. This avoids the incomplete coverage that results from selecting content by matching related words alone.
The natural language processing model is a trained BERT pre-trained model. BERT (Bidirectional Encoder Representations from Transformers) is a deep-learning-based pre-trained language model that acquires a large amount of language knowledge through pre-training and is then fine-tuned on the text classification task.
First, the interrogation questions and answers are extracted from the original interrogation transcript, and the answers are divided into a "relevant class" and an "irrelevant class" according to whether their content involves the ontology. The BERT pre-trained model is then trained on the interrogation questions. The trained model removes irrelevant questions and their answers, which denoises the transcript, and the relevant questions are retained for subsequent processing.
Judging whether an answer is related to the ontology yields a label for its question, that is, whether the question belongs to the relevant class; the BERT pre-trained model is trained on these labels to screen for relevant questions.
The training steps of the BERT pre-trained model are as follows:
Step S11, data acquisition: acquiring historical interrogation transcripts of injury crimes and extracting the questions and answers in them;
Step S12, data classification: labeling answers that contain the injury crime ontology as relevant answers and answers that do not contain the ontology as irrelevant answers;
In step S12, answers containing the injury crime ontology are labeled as follows: if an answer contains at least one concept word of the ontology, the answer is labeled. The concept vocabulary is obtained as follows:
Step S121, constructing the ontology: determining the ontology and the values each ontology item can take; the ontology comprises crime subject, criminal act, crime tool, object of harm, and result of harm;
Each ontology item has specific values. For example, crime subjects include suspects and criminal organizations; crime tools include limbs, instruments, and incidental objects; criminal acts include violent harm, verbal harm, participation in a crime, and organizing a crime; objects of harm include victims, locations of harm, and property; results of harm include personal injury, death, and property damage.
Step S122, expanding the ontology concept vocabulary: expanding the concept words contained in the ontology, specifically expanding criminal acts by constructing a criminal act feature word bag and a public place feature word bag, and expanding objects of harm by constructing a human body part concept tree;
The criminal act feature word bag is obtained by taking the criminal act words identified in historical cases and expanding them with synonym and near-synonym word clouds; injury criminal acts can be broken down into acts committed with the limbs, verbal harm, acts committed while holding an instrument, and so on, and the words are drawn from the large body of criminal act vocabulary in past cases;
The public place feature word bag is obtained by taking the public place words in historical cases and expanding them with synonym and near-synonym word clouds;
The human body part concept tree is obtained by dividing the human body according to the forensic definition of body parts, which yields standardized body part concept words and their hierarchical relations;
In addition, an instrument concept tree can be constructed: it is expanded from expert knowledge and past transcript data and covers firearms, ammunition, controlled knives, sharp instruments, and blunt instruments. The purpose of the instrument word bag is to let the machine locate the instruments that appear in the transcript data, so that the belongs-to and holds relations can be filled in;
An incidental object word bag can also be constructed: it is expanded from expert knowledge and past transcript data and covers incidental objects. Its purpose is to let the machine locate the incidental objects that appear in the transcript data, so that the belongs-to and holds relations can be filled in.
Step S123, taking all words contained in the expanded ontology as the ontology concept vocabulary.
Step S13, training the BERT pre-trained model with the historical interrogation transcripts as input and the questions corresponding to the relevant answers as output, to obtain the trained BERT model.
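Steps S11 to S13 amount to deriving a label for each question from its answer and then training a classifier on the questions. The sketch below covers only the label derivation; the function name, the sample vocabulary, and the 0/1 label encoding are assumptions, and the fine-tuning of the BERT classifier itself is deliberately left out.

```python
def build_training_pairs(qa_pairs, concept_vocab):
    """Turn (question, answer) pairs into (question, label) training examples.

    The label is 1 when the answer contains at least one ontology concept
    word (a relevant answer) and 0 otherwise; the label is attached to the
    question, since the classifier is trained to screen questions.
    """
    pairs = []
    for question, answer in qa_pairs:
        text = answer.lower()
        label = int(any(word in text for word in concept_vocab))
        pairs.append((question, label))
    return pairs


# Hypothetical sample data (assumption), in English for readability.
vocab = {"knife", "hit"}
history = [
    ("What did he use?", "He took out a knife."),
    ("Where do you live?", "In the city centre."),
]
training_pairs = build_training_pairs(history, vocab)
```

The resulting list of question and label pairs is the kind of input a text classification fine-tuning run would consume.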
Step S2, extracting the answers corresponding to the relevant questions; identifying the referring expressions in the answers, determining the person involved in the case that each referring expression denotes according to the names and characteristics of the persons involved, and replacing each referring expression with that person's name to obtain the replaced answers;
The referring expressions include feature-based references and personal pronouns; determining the person involved in the case that each referring expression denotes according to the names and characteristics of the persons involved, and replacing each referring expression with that person's name, specifically comprises the following steps:
Step S21, processing feature-based references: according to the characteristics of the persons involved, associating each description of a person's characteristics in the answer with the name of a person involved in the case, and replacing the feature-based reference with the associated name;
Step S22, processing personal pronouns: replacing "you" in the answer with the interrogator, and replacing "I" in the answer with the name of the person being questioned; replacing "he" in the answer with the name of the person involved in the case, other than the referent of "you", that appears closest to the pronoun in the answer.
For "they" in an answer, the interrogator will usually follow up with a question such as "who do they include" to determine whom it refers to, so the referent can be obtained from the context.
Because answers vary in how clearly they are expressed, this step may include manual review to confirm that the reference resolution results are correct.
Step S3, splitting the replaced answer content into a plurality of sentences at the commas, semicolons, periods, exclamation marks and question marks it contains;
Step S4, performing data denoising on each sentence obtained by the splitting in step S3, and retaining the sentences that contain the injury crime ontology;
In step S4, the data denoising and retention are specifically: if a sentence contains concept vocabulary of the injury crime ontology, the sentence is retained; if it does not, the sentence is deleted.
Through this step, only sentences that contain or relate to the injury crime ontology remain, which greatly reduces the amount of information to be processed during case review and improves the efficiency and accuracy of case analysis.
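Steps S3 and S4 amount to punctuation-based splitting followed by a vocabulary filter. The following Python sketch shows one way to do this; the function names and the sample vocabulary are hypothetical, and plain substring matching is used as a simplifying assumption in place of proper word segmentation.

```python
import re

# Chinese and ASCII commas, semicolons, exclamation and question marks, plus the Chinese period.
SPLIT_PATTERN = re.compile(r"[，,；;。！!？?]")

def split_sentences(answer):
    """Step S3: split the answer content at the listed punctuation marks."""
    return [part.strip() for part in SPLIT_PATTERN.split(answer) if part.strip()]

def denoise(sentences, concept_vocabulary):
    """Step S4: keep only sentences containing at least one ontology concept word."""
    return [s for s in sentences if any(word in s for word in concept_vocabulary)]

sentences = split_sentences("李四拿着木棍，打了张三的头部。今天天气不错！")
kept = denoise(sentences, {"木棍", "打", "头部"})
```

Here the third sentence contains no concept word and is dropped, while the first two are retained.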
Step S5, among the sentences containing the injury crime ontology, supplementing those that lack a subject, predicate or object so that each sentence contains a subject, predicate and object, to obtain the completed sentences;
In step S5, supplementing the sentences that lack a subject, predicate or object is specifically: performing dependency parsing on the sentence to obtain its constituent analysis result, and completing the subject, predicate and object of the sentence according to that result; the constituent analysis result includes: the subject-verb relation, the head (core) relation, the verb-object relation, the right-adjunct relation and the attribute-head relation.
Dependency parsing is a key technique in natural language processing that reveals syntactic structure by analyzing the dependencies between the components of a language unit.
The sentence completion step comprises:
Step S51, the long sentence is divided into segments, called short sentences, using the commas, semicolons, periods, exclamation marks and question marks among the Chinese punctuation marks as boundary points (excluding quotation marks and the punctuation inside them);
Step S52, dependency parsing is performed on each short sentence in turn; if the short sentence contains a word whose dependency relation is the head (core) relation and whose part of speech is verb ("v"), the boundary point after that short sentence is taken as a splitting point;
Step S53, if the short sentence contains no word whose dependency relation is the subject-verb relation and whose part of speech is noun, the subject of the nearest preceding short sentence that contains a subject is placed at the beginning of the short sentence;
Step S54, if the short sentence contains no noun whose dependency relation is an object relation (including the verb-object relation), the object of the previous short sentence is placed after the predicate of the short sentence;
Step S55, after the long sentence has been fully completed, it is split at the splitting points;
Step S56, dependency parsing is performed on each split sentence; if the sentence contains a word that is coordinated with the core verb and appears in the criminal behavior bag of words, the sentence is copied once for each of the core verb and its coordinated verbs, so that each copy has only one predicate and carries the subject and object that correspond to that predicate in the original sentence.
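The subject and object carry-over of steps S53 and S54 can be sketched as follows. This is an illustrative sketch only: it assumes the short sentences have already been parsed by some dependency parser into tokens, and the `Token` class, the label strings "SBV", "HED" and "VOB", and the part-of-speech tags are assumptions rather than details given in the text.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Token:
    word: str
    pos: str  # part of speech, e.g. "n" (noun), "nh" (person name), "v" (verb)
    rel: str  # dependency label, e.g. "SBV" (subject), "HED" (head), "VOB" (object)

NOUN_TAGS = {"n", "nh"}

def find(clause, rel, pos_tags):
    """Index of the first token with the given relation and part of speech, else None."""
    for i, tok in enumerate(clause):
        if tok.rel == rel and tok.pos in pos_tags:
            return i
    return None

def complete_clauses(clauses):
    """Fill in missing subjects (S53) and objects (S54) from earlier short sentences."""
    completed, last_subject, prev_object = [], None, None
    for clause in clauses:
        clause = list(clause)
        s = find(clause, "SBV", NOUN_TAGS)
        if s is not None:
            last_subject = clause[s]
        elif last_subject is not None:
            clause.insert(0, last_subject)          # S53: subject goes to the front
        if find(clause, "VOB", NOUN_TAGS) is None and prev_object is not None:
            h = find(clause, "HED", {"v"})
            if h is not None:
                clause.insert(h + 1, prev_object)   # S54: object goes after the predicate
        o = find(clause, "VOB", NOUN_TAGS)
        prev_object = clause[o] if o is not None else None
        completed.append(clause)
    return completed
```

Each completed short sentence then carries an explicit subject, predicate and object, which is what the later triplet extraction relies on. Step S56 (copying a sentence once per coordinated verb) is not covered by this sketch.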
Step S6, extracting the subject, predicate and object of each completed sentence to obtain triplet phrases containing the injury crime ontology; a triplet phrase is a phrase comprising two entities and the entity relationship between them;
In step S6, extracting the subject, predicate and object of each sentence to obtain triplet phrases containing the injury crime ontology specifically includes:
Step S61, classifying each sentence as a normal sentence, a special sentence, a complex sentence or an ill-formed sentence; the special sentences are ba-sentences (disposal constructions marked by the word "ba") and bei-sentences (passive constructions marked by the passive word "bei"), the complex sentences are sentences containing more than two ontology concepts, and the ill-formed sentences are sentences that are not normal, special or complex sentences;
Step S62, processing the normal sentences: performing dependency parsing on the normal sentence to obtain its constituent analysis result, and determining the subject, predicate and object of the normal sentence according to that result, to obtain a triplet phrase containing the injury crime ontology;
Step S63, processing the special sentences:
For a ba-sentence, the subject, predicate and object are extracted as follows: the word "ba" is located; the noun before "ba" that is a person name appearing in the ontology is taken as the subject; the verb that matches a criminal behavior is taken as the predicate; and the noun after "ba" that is a property item or person name in the ontology is taken as the object, to obtain a triplet phrase containing the injury crime ontology;
For a bei-sentence, the subject, predicate and object are extracted as follows: the passive word "bei" is located; the noun after the passive word that is a property item or person name appearing in the ontology is taken as the subject; the verb that matches a criminal behavior is taken as the predicate; and the noun before the passive word that is a person name in the ontology is taken as the object, to obtain a triplet phrase containing the injury crime ontology;
Step S64, processing the complex sentences: inputting the complex sentence into a trained natural language processing model to obtain triplet phrases containing the injury crime ontology;
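The rule for special sentences in step S63 can be sketched in Python as below. The sketch assumes the sentence has already been segmented and part-of-speech tagged by some external tool; the function name, the tag conventions (noun tags starting with "n", verbs tagged "v") and the choice of the noun nearest to the marker on the "before" side are illustrative assumptions.

```python
def extract_special(tokens, person_names, properties, crime_verbs):
    """tokens: list of (word, pos) pairs from a hypothetical segmenter and tagger."""
    marker = next((i for i, (w, _) in enumerate(tokens) if w in ("把", "被")), None)
    if marker is None:
        return None
    before, after = tokens[:marker], tokens[marker + 1:]

    def pick(span, vocabulary):
        # First noun-like token (tag starting with "n") that is in the vocabulary.
        return next((w for w, p in span if p.startswith("n") and w in vocabulary), None)

    predicate = next((w for w, p in after if p == "v" and w in crime_verbs), None)
    if tokens[marker][0] == "把":
        # ba-sentence: person before the marker acts on the noun after it.
        subject = pick(reversed(before), person_names)
        obj = pick(after, person_names | properties)
    else:
        # bei-sentence: the noun after the passive word is the subject of the triplet.
        subject = pick(after, person_names | properties)
        obj = pick(reversed(before), person_names)
    if subject and predicate and obj:
        return (subject, predicate, obj)
    return None
```

With this rule, a ba-sentence and the corresponding bei-sentence describing the same event produce the same triplet, since in both cases the acting person ends up as the subject.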
In step S64, the natural language processing model is a BERT-BiLSTM-CRF model, and the training steps are as follows:
Step S641, obtaining historical inquiry transcripts of injury crimes, and extracting the answer content corresponding to the relevant questions in the historical inquiry transcripts;
Step S642, extracting the complex sentences in the answer content and the triplet phrases extracted from those complex sentences;
Step S643, constructing a BERT-BiLSTM-CRF model;
Step S644, training the BERT-BiLSTM-CRF model with the complex sentences as input and the triplet phrases extracted from them as output.
The BERT-BiLSTM-CRF model is a sequence labeling model for natural language processing tasks that combines three components: BERT, BiLSTM (bidirectional long short-term memory network) and CRF (conditional random field).
In the BERT-BiLSTM-CRF model, BERT serves as the feature extractor and produces a contextual representation of the text; BiLSTM then models this representation and captures the information in the sequence; finally, the CRF jointly models and optimizes the tags in the sequence. In this way, the relation extraction task for complex sentences is completed.
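Because the model is a sequence labeling model, its output is one tag per token, and some post-processing is needed to turn the tags into the parts of a triplet. The text does not specify the tagging scheme, so the following Python sketch assumes a BIO scheme with hypothetical role names SUBJ, PRED and OBJ; it shows only the decoding of a predicted tag sequence, not the model itself.

```python
def decode_spans(tokens, tags):
    """Group BIO-tagged tokens into text spans per role (assumed tag scheme)."""
    spans = {}
    current = None
    for token, tag in zip(tokens, tags):
        prefix, _, role = tag.partition("-")
        if prefix == "B" and role:
            # A "B-" tag starts a new span for its role.
            spans.setdefault(role, []).append(token)
            current = role
        elif prefix == "I" and current is not None and role == current:
            # An "I-" tag extends the span that is currently open.
            spans[role][-1] += token
        else:
            # "O" or an inconsistent tag closes the open span.
            current = None
    return spans

tokens = ["李", "四", "用", "木", "棍", "打", "了", "张", "三"]
tags = ["B-SUBJ", "I-SUBJ", "O", "O", "O", "B-PRED", "O", "B-OBJ", "I-OBJ"]
spans = decode_spans(tokens, tags)
```

When a complex sentence yields several spans per role, pairing them into separate triplets needs additional rules or labels, which this sketch leaves open.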
Step S65, processing the ill-formed sentences: the instruments, property and limbs in the ill-formed sentence are taken as objects; the subject of each instrument, property item or limb is the involved person closest to it; and the criminal behavior, "belongs to" or "holds" that relates the involved person to the instrument, property item or limb is taken as the predicate, to obtain triplet phrases containing the injury crime ontology.
When analyzing ill-formed sentences, the vocabulary of the limb concept tree can be located by regular-expression matching. Taking the crime word as the boundary: if a limb word is located before the crime word and a person name matching the case bag of words also appears before the crime word, the limb belongs to that person, forming the triplet (limb, belongs to, person name). If the limb word is located after the crime word and a person name matching the case bag of words also appears after it, the limb belongs to that person. If no person name is found, the limb is attributed to the victim or the perpetrator according to the injury identification report.
In one embodiment, the crime word is located, and the text before and after it is searched for non-repeating person names, property, limbs and instruments that match the case bag of words. Before the crime word: if a matching person name and limb exist, the triplet (limb-before, belongs to, person-before) is formed; if an instrument or random object matching the instrument concept tree or the random object bag of words exists, the triplet (person-before, holds, thing-before) is formed. After the crime word: if a matching person name and limb exist, the triplet (limb-after, belongs to, person-after) is formed; if a matching instrument or random object exists, the triplet (person-after, holds, thing-after) is formed. Finally, the triplet (person-before, crime word, person-after) is formed.
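The embodiment above can be sketched as a small Python function. It assumes the sentence is already segmented into words and that the lexicons are plain sets; the function names, the English relation labels and the rule of taking only the first match on each side are assumptions made for illustration.

```python
def first_match(tokens, vocabulary):
    """First token that appears in the vocabulary, else None."""
    return next((t for t in tokens if t in vocabulary), None)

def triples_around_crime_word(tokens, crime_words, person_names, limbs, instruments):
    # Locate the crime word, which serves as the boundary.
    idx = next((i for i, t in enumerate(tokens) if t in crime_words), None)
    if idx is None:
        return []
    triples, persons = [], []
    for side in (tokens[:idx], tokens[idx + 1:]):   # before, then after
        person = first_match(side, person_names)
        limb = first_match(side, limbs)
        thing = first_match(side, instruments)
        if person and limb:
            triples.append((limb, "belongs to", person))
        if person and thing:
            triples.append((person, "holds", thing))
        persons.append(person)
    if persons[0] and persons[1]:
        # Finally link the person before and the person after via the crime word.
        triples.append((persons[0], tokens[idx], persons[1]))
    return triples

result = triples_around_crime_word(
    ["李四", "木棍", "打", "张三", "头部"],
    crime_words={"打"}, person_names={"李四", "张三"},
    limbs={"头部"}, instruments={"木棍"},
)
```

The fallback to the injury identification report when no person name is found is not modeled here.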
Step S7, performing an entity check on the extracted results, deleting the triplet phrases that contain no entity, and retaining the triplet phrases that contain entities and an entity relationship; the entities include person names, instruments, property and limbs, and the entity relationships include: criminal behavior, belongs to, and holds.
For an ordinary triplet extraction result, the subject, predicate and object are matched against the criminal behavior feature bag of words, the public place feature bag of words and the human body part knowledge concept tree. If the matching succeeds, the triplet is retained. If the subject is not in a bag of words, it is checked whether bag-of-words content appears before the criminal behavior word; if so, that content replaces the subject; if not, the subject is replaced with the offender. Likewise, if the object is not in a bag of words, it is checked whether bag-of-words content is associated with the criminal behavior word; if so, that content replaces the object; if not, the object is replaced with the victim.
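The entity check of step S7 can be read as a filter over the extracted triplets. The Python sketch below assumes one particular reading, namely that a triplet is kept only if both its subject and its object are known entities and its predicate is an allowed relationship; the function name and the English relation labels are hypothetical, and the subject and object replacement rules are not modeled.

```python
STRUCTURAL_RELATIONS = {"belongs to", "holds"}

def check_entities(triples, entities, crime_words):
    """Keep triplets whose two ends are entities and whose predicate is an allowed relation."""
    allowed = STRUCTURAL_RELATIONS | set(crime_words)
    return [
        (subj, pred, obj)
        for subj, pred, obj in triples
        if subj in entities and obj in entities and pred in allowed
    ]

entities = {"李四", "张三", "木棍", "头部"}
candidates = [
    ("李四", "打", "张三"),
    ("李四", "holds", "木棍"),
    ("李四", "说", "天气"),
]
kept_triples = check_entities(candidates, entities, crime_words={"打"})
```

In this example the third candidate is dropped because its object is not an entity and its predicate is not an allowed relationship.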
The invention forms an independent and complete software component that can serve as an artificial intelligence core component in the law enforcement and case handling platforms commonly used by provincial public security authorities. The invention is generally applicable to law enforcement supervision and assistance for injury crimes, and provides a reference for making supervision and assistance intelligent for other types of cases.
In law enforcement supervision and case handling, an important problem is how to define the entities and entity relationships so that the triplet phrases relevant to injury crimes can be selected from the extracted triplet phrases. After the triplet phrases are obtained, they are screened: those that contain no entity are deleted, and those that contain entities and an entity relationship are retained. How to screen, and which useful information to screen for, is the problem to be solved in this step.
First, screening improves the efficiency of information processing. Second, the screening criteria take the characteristics of injury crime transcript information into account: the entities include person names, instruments, property and limbs, and the entity relationships include criminal behavior, belongs to, and holds. This is another key technical feature of the invention, and it makes the extraction of injury crime transcript information more accurate and efficient.
Example 2
As shown in FIG. 2, the present invention further provides a knowledge extraction system for injury crime inquiry transcripts, which uses the knowledge extraction method for injury crime inquiry transcripts described in Embodiment 1 and comprises the following modules:
The case file establishing module is used for acquiring case file information of the injury crime case to be examined according to the case filing information; the case file information includes: the persons involved, the characteristics of the persons involved, the time of the case and the place of the case;
The relevant question acquisition module is used for acquiring the original inquiry transcript of the injury crime case to be examined, inputting the original inquiry transcript into a trained natural language processing model, and outputting relevant questions; the relevant questions are questions involving the injury crime ontology; the injury crime ontology comprises: crime subject, object of harm, harm result, criminal behavior and crime tool;
The referring pronoun resolution module is used for extracting the answer content corresponding to the relevant questions, acquiring the referring pronouns in the answer content, determining the involved person corresponding to each referring pronoun according to the names and characteristics of the persons involved, and replacing each referring pronoun with that person's name to obtain the replaced answer content;
The sentence splitting module is used for splitting the replaced answer content into a plurality of sentences at the commas, semicolons, periods, exclamation marks and question marks it contains;
The sentence denoising module is used for performing data denoising on each sentence and retaining the sentences that contain the injury crime ontology;
The sentence completion module is used for completing the sentences that lack a subject, predicate or object, so that each sentence contains a subject, predicate and object;
The triplet extraction module is used for extracting the subject, predicate and object of each sentence to obtain triplet phrases containing the injury crime ontology; a triplet phrase is a phrase comprising two entities and the entity relationship between them;
The entity checking module is used for performing an entity check on the extracted results, deleting the triplet phrases that contain no entity, and retaining the triplet phrases that contain entities and an entity relationship; the entities include person names, instruments, property and limbs, and the entity relationships include: criminal behavior, belongs to, and holds.
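The modules form a linear pipeline in which each module consumes the output of the previous one. A minimal Python sketch of such a composition is given below; the `run_pipeline` function and the two stand-in stages are hypothetical and only illustrate the chaining, not the actual module implementations.

```python
def run_pipeline(data, stages):
    """Apply named stages in order and record each intermediate result."""
    trace = []
    for name, stage in stages:
        data = stage(data)
        trace.append((name, data))
    return data, trace

# Stand-in stages for illustration; real modules would replace these lambdas.
stages = [
    ("sentence splitting", lambda text: [s for s in text.split("，") if s]),
    ("sentence denoising", lambda sents: [s for s in sents if "打" in s]),
]
final, trace = run_pipeline("李四来了，李四打了张三", stages)
```

Keeping the intermediate results in `trace` reflects the point that each step produces a clear, inspectable result that can be reviewed manually.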
Example 3
An electronic device, the electronic device comprising:
A processor and a memory;
The processor is configured to perform the steps of the knowledge extraction method for injury crime inquiry transcripts described in Embodiment 1 by invoking a program or instructions stored in the memory.
Example 4
A computer readable storage medium comprising computer program instructions for causing a computer to perform the steps of the knowledge extraction method for injury crime inquiry transcripts described in Embodiment 1.
The computer readable storage medium may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium may include, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include the following: an electrical connection having one or more wires, a portable disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the scope of the present application. As used in this specification, the terms "a," "an," and "the" are not intended to be limiting, but rather are to be construed as covering the singular and the plural, unless the context clearly dictates otherwise. The terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, or apparatus. Without further limitation, an element defined by the phrase "comprising one ..." does not exclude the presence of other like elements in a process, method or apparatus that includes the element.
It should also be noted that the orientation or positional relationships indicated by the terms "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", "outer", etc. are based on the orientation or positional relationships shown in the drawings, are merely for convenience of describing the present invention and simplifying the description, and do not indicate or imply that the apparatus or element in question must have a specific orientation or be constructed and operated in a specific orientation, and thus should not be construed as limiting the present invention. Unless specifically stated or limited otherwise, the terms "mounted," "connected," and the like are to be construed broadly: a connection may be, for example, fixed, detachable, or integral; mechanical or electrical; direct or indirect through an intermediate medium; or a communication between two elements. The specific meaning of the above terms in the present invention will be understood in specific cases by those of ordinary skill in the art.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solution of the present invention, not to limit it. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments can still be modified, and some or all of their technical features can be replaced by equivalents; such modifications and substitutions do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (9)

1. A knowledge extraction method for injury crime inquiry transcripts, comprising the following steps:
Step S0, acquiring case file information of the injury crime case to be examined according to the case filing information; the case file information includes: the names of the persons involved, the characteristics of the persons involved, the time of the case and the place of the case;
Step S1, acquiring the original inquiry transcript of the injury crime case to be examined, inputting the original inquiry transcript into a trained natural language processing model, and outputting relevant questions; the relevant questions are questions involving the injury crime ontology; the injury crime ontology comprises: crime subject, object of harm, harm result, criminal behavior and crime tool; the natural language processing model is a trained BERT pre-trained model, and the training steps are as follows:
Step S11, data acquisition: acquiring historical inquiry transcripts of injury crimes, and extracting the question and answer content in the historical inquiry transcripts;
Step S12, data classification: marking answer content that contains the injury crime ontology as relevant answers, and treating answer content that does not contain the ontology as irrelevant answers;
Step S13, training the BERT pre-trained model with the historical inquiry transcripts as input and the questions corresponding to the relevant answers as output, to obtain the trained BERT pre-trained model;
Step S2, extracting the answer content corresponding to the relevant questions; acquiring the referring pronouns in the answer content, determining the involved person corresponding to each referring pronoun according to the names and characteristics of the persons involved, and replacing each referring pronoun with that person's name to obtain the replaced answer content;
Step S3, splitting the replaced answer content into a plurality of sentences at the commas, semicolons, periods, exclamation marks and question marks it contains;
Step S4, performing data denoising on each sentence obtained by the splitting in step S3, and retaining the sentences that contain the injury crime ontology;
Step S5, among the sentences containing the injury crime ontology, supplementing those that lack a subject, predicate or object so that each sentence contains a subject, predicate and object, to obtain the completed sentences;
Step S6, extracting the subject, predicate and object of each completed sentence to obtain triplet phrases containing the injury crime ontology; a triplet phrase is a phrase comprising two entities and the entity relationship between them;
Step S7, performing an entity check on the extracted results, deleting the triplet phrases that contain no entity, and retaining the triplet phrases that contain entities and an entity relationship; the entities include person names, instruments, property and limbs, and the entity relationships include: criminal behavior, belongs to, and holds.
2. The knowledge extraction method for injury crime inquiry transcripts according to claim 1, wherein in step S12, marking the answer content that contains the injury crime ontology is specifically: if the answer content contains at least one concept word of the ontology, the answer content is marked; the concept vocabulary of the ontology is the vocabulary set obtained by expanding each item of the ontology.
3. The knowledge extraction method for injury crime inquiry transcripts according to claim 2, wherein the concept vocabulary of the ontology is acquired as follows:
Step S121, constructing the ontology: determining the ontology and the values it can take; the ontology comprises crime subjects, criminal behaviors, crime tools, objects of harm and harm results;
Step S122, expanding the ontology concept vocabulary: expanding the concept vocabulary contained in the ontology, specifically by constructing a criminal behavior feature bag of words and a public place feature bag of words to expand the criminal behaviors, and by constructing a human body part knowledge concept tree to expand the objects of harm;
The criminal behavior feature bag of words is obtained as follows: the criminal behavior words identified in historical cases are expanded using synonym and near-synonym word clouds to obtain the criminal behavior feature bag of words;
The public place feature bag of words is obtained as follows: the public place words in historical cases are expanded using synonym and near-synonym word clouds to obtain the public place feature bag of words;
The human body part knowledge concept tree is obtained as follows: the human body is divided according to the definitions of body parts in forensic science to obtain standardized human body part concept words and their hierarchical relationships;
Step S123, taking all the vocabulary contained in the expanded ontology as the ontology concept vocabulary.
4. The knowledge extraction method for injury crime inquiry transcripts according to claim 1, wherein in step S2, the referring pronouns include feature pronouns and personal pronouns; determining the involved person corresponding to each referring pronoun according to the names and characteristics of the persons involved, and replacing each referring pronoun with that person's name, specifically comprises the following steps:
Step S21, processing the feature pronouns: according to the characteristics of the persons involved, each description of a person's characteristics in the answer content is associated with the name of the corresponding involved person, and the feature pronoun is replaced with the associated name;
Step S22, processing the personal pronouns: "you" in the answer content is replaced with the interrogator, and "I" in the answer content is replaced with the name of the person giving the answer; "he" in the answer content is replaced with the name of the involved person, other than the one referred to by "you", that is closest to the personal pronoun.
5. The knowledge extraction method for injury crime inquiry transcripts according to claim 1, wherein in step S4, performing data denoising on each sentence obtained by the splitting in step S3 and retaining the sentences that contain the injury crime ontology is specifically: if a sentence contains concept vocabulary of the injury crime ontology, the sentence is retained; if it does not, the sentence is deleted.
6. The knowledge extraction method for injury crime inquiry transcripts according to claim 1, wherein in step S5, supplementing the sentences that lack a subject, predicate or object so that each sentence contains a subject, predicate and object is specifically: performing dependency parsing on the sentence to obtain its constituent analysis result, and completing the subject, predicate and object of the sentence according to that result; the constituent analysis result includes: the subject-verb relation, the head (core) relation, the verb-object relation, the right-adjunct relation and the attribute-head relation.
7. The knowledge extraction method for injury crime inquiry transcripts according to claim 1, wherein in step S6, extracting the subject, predicate and object of each sentence to obtain triplet phrases containing the injury crime ontology is specifically:
Step S61, classifying each sentence as a normal sentence, a special sentence, a complex sentence or an ill-formed sentence; the special sentences are ba-sentences (disposal constructions marked by the word "ba") and bei-sentences (passive constructions marked by the passive word "bei"), the complex sentences are sentences containing more than two ontology concepts, and the ill-formed sentences are sentences that are not normal, special or complex sentences;
Step S62, processing the normal sentences: performing dependency parsing on the normal sentence to obtain its constituent analysis result, and determining the subject, predicate and object of the normal sentence according to that result, to obtain a triplet phrase containing the injury crime ontology;
Step S63, processing the special sentences:
For a ba-sentence, the subject, predicate and object are extracted as follows: the word "ba" is located; the noun before "ba" that is a person name appearing in the ontology is taken as the subject; the verb that matches a criminal behavior is taken as the predicate; and the noun after "ba" that is a property item or person name in the ontology is taken as the object, to obtain a triplet phrase containing the injury crime ontology;
For a bei-sentence, the subject, predicate and object are extracted as follows: the passive word "bei" is located; the noun after the passive word that is a property item or person name appearing in the ontology is taken as the subject; the verb that matches a criminal behavior is taken as the predicate; and the noun before the passive word that is a person name in the ontology is taken as the object, to obtain a triplet phrase containing the injury crime ontology;
Step S64, processing the complex sentences: inputting the complex sentence into a trained natural language processing model to obtain triplet phrases containing the injury crime ontology;
Step S65, processing the ill-formed sentences: the instruments, property and limbs in the ill-formed sentence are taken as objects; the subject of each instrument, property item or limb is the involved person closest to it; and the criminal behavior, "belongs to" or "holds" that relates the involved person to the instrument, property item or limb is taken as the predicate, to obtain triplet phrases containing the injury crime ontology.
8. The knowledge extraction method for injury crime inquiry transcripts according to claim 7, wherein in step S64, the natural language processing model is a BERT-BiLSTM-CRF model, and the training steps are as follows:
Step S641, obtaining historical inquiry transcripts of injury crimes, and extracting the answer content corresponding to the relevant questions in the historical inquiry transcripts;
Step S642, extracting the complex sentences in the answer content and the triplet phrases extracted from those complex sentences;
Step S643, constructing a BERT-BiLSTM-CRF model;
Step S644, training the BERT-BiLSTM-CRF model with the complex sentences as input and the triplet phrases extracted from them as output.
9. A knowledge extraction system for injury crime inquiry transcripts, which uses the knowledge extraction method for injury crime inquiry transcripts according to any one of claims 1 to 8 and comprises the following modules:
The case file establishing module is used for acquiring case file information of the injury crime case to be examined according to the case filing information; the case file information includes: the persons involved, the characteristics of the persons involved, the time of the case and the place of the case;
The relevant question acquisition module is used for acquiring the original inquiry transcript of the injury crime case to be examined, inputting the original inquiry transcript into a trained natural language processing model, and outputting relevant questions; the relevant questions are questions involving the injury crime ontology; the injury crime ontology comprises: crime subject, object of harm, harm result, criminal behavior and crime tool;
The referring pronoun resolution module is used for extracting the answer content corresponding to the relevant questions, acquiring the referring pronouns in the answer content, determining the involved person corresponding to each referring pronoun according to the names and characteristics of the persons involved, and replacing each referring pronoun with that person's name to obtain the replaced answer content;
The sentence splitting module is used for splitting the replaced answer content into a plurality of sentences at the commas, semicolons, periods, exclamation marks and question marks it contains;
The sentence denoising module is used for performing data denoising on each sentence and retaining the sentences that contain the injury crime ontology;
The sentence completion module is used for completing the sentences that lack a subject, predicate or object, so that each sentence contains a subject, predicate and object;
The triplet extraction module is used for extracting the subject, predicate and object of each sentence to obtain triplet phrases containing the injury crime ontology; a triplet phrase is a phrase comprising two entities and the entity relationship between them;
The entity checking module is used for performing an entity check on the extracted results, deleting the triplet phrases that contain no entity, and retaining the triplet phrases that contain entities and an entity relationship; the entities include person names, instruments, property and limbs, and the entity relationships include: criminal behavior, belongs to, and holds.
CN202410642135.1A 2024-05-23 Knowledge extraction method and system in injury crime inquiry stroke Active CN118228818B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410642135.1A CN118228818B (en) 2024-05-23 Knowledge extraction method and system in injury crime inquiry stroke

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410642135.1A CN118228818B (en) 2024-05-23 Knowledge extraction method and system in injury crime inquiry stroke

Publications (2)

Publication Number Publication Date
CN118228818A CN118228818A (en) 2024-06-21
CN118228818B CN118228818B (en) 2024-07-16

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111444353A (en) * 2020-04-03 2020-07-24 杭州叙简科技股份有限公司 Construction and use method of warning situation knowledge graph
CN111723564A (en) * 2020-05-27 2020-09-29 西安交通大学 Event extraction and processing method for case-following electronic file

Similar Documents

Publication Publication Date Title
DE60123952T2 (en) GENERATION OF A UNIFORM TASK DEPENDENT LANGUAGE MODEL THROUGH INFORMATION DISCUSSION PROCESS
CN108376151A (en) Question classification method, device, computer equipment and storage medium
US10503830B2 (en) Natural language processing with adaptable rules based on user inputs
CN107943911A (en) Data pick-up method, apparatus, computer equipment and readable storage medium storing program for executing
RU2704531C1 (en) Method and apparatus for analyzing semantic information
CN109947952B (en) Retrieval method, device, equipment and storage medium based on English knowledge graph
CN110337645A (en) The processing component that can be adapted to
CN109635288A (en) A kind of resume abstracting method based on deep neural network
CN112380848B (en) Text generation method, device, equipment and storage medium
CN111723870B (en) Artificial intelligence-based data set acquisition method, apparatus, device and medium
Sheshikala et al. Natural language processing and machine learning classifier used for detecting the author of the sentence
CN113688630A (en) Text content auditing method and device, computer equipment and storage medium
CN114580418B (en) Police physical training knowledge graph system
CN109840255A (en) Reply document creation method, device, equipment and storage medium
CN111866004A (en) Security assessment method, apparatus, computer system, and medium
CN116611449A (en) Abnormality log analysis method, device, equipment and medium
CN114186041A (en) Answer output method
CN114118398A (en) Method and system for detecting target type website, electronic equipment and storage medium
CN118228818B (en) Knowledge extraction method and system in injury crime inquiry stroke
CN112562736A (en) Voice data set quality evaluation method and device
CN111859934A (en) Chinese sentence metaphor recognition system
CN114842982B (en) Knowledge expression method, device and system for medical information system
CN116976321A (en) Text processing method, apparatus, computer device, storage medium, and program product
CN118228818A (en) Knowledge extraction method and system in injury crime inquiry stroke
CN114582449A (en) Electronic medical record named entity standardization method and system based on XLNet-BiGRU-CRF model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant