CN111222305B - Information structuring method and device - Google Patents

Publication number
CN111222305B
Authority
CN
China
Legal status
Active
Application number
CN201911301079.0A
Other languages
Chinese (zh)
Other versions
CN111222305A (en)
Inventor
魏海巍
王伟伟
Current Assignee
Gongdao Network Technology Co ltd
Original Assignee
Gongdao Network Technology Co ltd

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Abstract

The present disclosure provides an information structuring method. The method obtains an event statement and performs entity extraction and event extraction on it; determines the argument role of each entity in the event statement from the extracted trigger word and at least one entity; matches the event trigger word and argument roles of the event statement against a pre-constructed event map; and obtains the structured information corresponding to the event statement from the matched event unit. By combining event extraction with event templates, the method and device convert unstructured information into structured information without manual data review, which improves efficiency.

Description

Information structuring method and device
Technical Field
The disclosure relates to the technical field of information processing, and in particular relates to an information structuring method, device, equipment and system.
Background
In current litigation processes, unstructured data accounts for a large proportion of all data. Such unstructured data includes document photos, complaints, debate records, and the like. Building a case knowledge base requires this unstructured information, and at present the judicial data is extracted and classified manually in order to build and refine the knowledge base.
However, manual information extraction and classification involves a heavy workload, its efficiency is limited by the number and effort of workers, and manual review of judicial information may introduce extraction deviations that lead to errors in the case knowledge base.
Disclosure of Invention
To address the above technical problems, embodiments of the present disclosure provide an information structuring method, with the following technical solutions.
According to a first aspect of embodiments of the present disclosure, there is provided an information structuring method, the method comprising:
acquiring an event statement describing a single event;
performing entity extraction and event extraction on the event statement to obtain an event trigger word and a plurality of entity words of the event statement, and determining the argument roles corresponding to the N entity words, among the plurality of entity words, that have a dependency relationship with the event trigger word, wherein N is greater than 0;
acquiring an event unit matched with the event statement from a pre-constructed event map, wherein the event unit matched with the event statement comprises the event trigger word and a plurality of to-be-filled argument role items corresponding to the event trigger word;
and filling the N entity words into N to-be-filled argument role items of the event unit matched with the event statement to obtain a structured statement corresponding to the event statement.
According to a second aspect of embodiments of the present disclosure, there is provided an information structuring method, the method comprising:
obtaining a text to be processed;
parsing the text to be processed to extract at least one event statement included in the text;
obtaining, according to the information structuring method of the first aspect, the event unit matched with each event statement from a pre-constructed event map;
and combining the event statements according to the event relationships among the matched event units to obtain the structured event information corresponding to the text to be processed.
According to a third aspect of embodiments of the present disclosure, there is provided an information structuring apparatus, the apparatus comprising:
a statement acquisition module, configured to obtain an event statement describing a single event;
an event extraction module, configured to perform entity extraction and event extraction on the event statement to obtain an event trigger word and a plurality of entity words of the event statement, and to determine the argument roles corresponding to the N entity words, among the plurality of entity words, that have a dependency relationship with the event trigger word, wherein N is greater than 0;
a first event matching module, configured to obtain an event unit matched with the event statement from a pre-constructed event map, wherein the event unit matched with the event statement comprises the event trigger word and a plurality of to-be-filled argument role items corresponding to the event trigger word;
and an information filling module, configured to fill the N entity words into N to-be-filled argument role items of the event unit matched with the event statement to obtain a structured statement corresponding to the event statement.
According to a fourth aspect of embodiments of the present disclosure, there is provided an information structuring apparatus, the apparatus comprising:
a text obtaining module, configured to obtain a text to be processed;
a data analysis module, configured to parse the text to be processed to extract at least one event statement included in the text;
a second event matching module, configured to obtain, by means of the information structuring apparatus of the third aspect, the event unit matched with each event statement from a pre-constructed event map;
and an event combining module, configured to combine the event statements according to the event relationships among the matched event units to obtain the structured event information corresponding to the text to be processed.
According to a fifth aspect of embodiments of the present disclosure, there is provided a computer device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the information structuring method according to the first or second aspect when executing the program.
According to a sixth aspect of embodiments of the present disclosure, there is provided a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the information structuring method as described in the first or second aspect.
The technical solution provided by the embodiments of the present disclosure is an information structuring method that obtains an event statement and performs entity extraction and event extraction on it; determines the argument role of each entity in the event statement from the extracted trigger word and at least one entity; matches the event trigger word and argument roles of the event statement against a pre-constructed event map; and obtains the structured information corresponding to the event statement from the matched event unit. By combining event extraction with event templates, the method and device convert unstructured information into structured information without manual data review, which improves efficiency.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of embodiments of the disclosure.
Moreover, not all of the above-described effects need be achieved by any of the embodiments of the present disclosure.
Drawings
In order to more clearly illustrate the embodiments of the present disclosure or the technical solutions in the prior art, the following description will briefly introduce the drawings that are required to be used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are only some embodiments described in the embodiments of the present disclosure, and other drawings may also be obtained according to these drawings for a person having ordinary skill in the art.
FIG. 1 is a flow chart of a method of structuring information shown in an exemplary embodiment of the present disclosure;
FIG. 2 is a schematic diagram of an entity extraction model shown in an exemplary embodiment of the present disclosure;
FIG. 3 is a schematic diagram of an event extraction model shown in an exemplary embodiment of the present disclosure;
FIG. 4 is a flowchart illustrating a method of event extraction model determination in accordance with an exemplary embodiment of the present disclosure;
FIG. 5 is a flowchart of a method of information structuring as illustrated by an exemplary embodiment of the present disclosure;
FIG. 6 is a schematic diagram of an information structuring apparatus shown in an exemplary embodiment of the present disclosure;
FIG. 7 is a schematic diagram of an information structuring apparatus shown in an exemplary embodiment of the present disclosure;
FIG. 8 is a schematic diagram illustrating a computer device according to an exemplary embodiment of the present disclosure.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present disclosure as detailed in the accompanying claims.
The terminology used in the present disclosure is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. As used in this disclosure and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any or all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used in this disclosure to describe various information, the information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present disclosure. The word "if" as used herein may be interpreted as "when", "upon", or "in response to determining", depending on the context.
In current litigation processes, unstructured data accounts for a large proportion of all data. Such unstructured data includes document photos, complaints, debate records, and the like. When a case knowledge base is built, the unstructured judicial data is extracted and classified manually in order to build and refine the knowledge base.
However, manual information extraction and classification involves a heavy workload, its efficiency is limited by the number and effort of workers, and manual review of judicial information may introduce extraction deviations that lead to errors in the case knowledge base.
In view of the above problems, embodiments of the present disclosure provide an information structuring method and an information structuring apparatus applying the method.
Referring to fig. 1, an embodiment of the present disclosure provides an information structuring method, including:
step S101, acquiring an event statement for describing a single event;
step S102, carrying out entity extraction and event extraction on the event sentences to obtain event trigger words and a plurality of entity words of the event sentences, and determining argument roles corresponding to N entity words with a dependency relationship with the event trigger words in the entity words, wherein N is greater than 0;
First, the entity extraction involved in this step is described. Entity extraction, also known as named entity recognition (Named Entity Recognition, NER), identifies entities in text that have a specific meaning, mainly person names, place names, organization names, proper nouns, and the like. Entity recognition generally consists of two parts: (1) identifying entity boundaries; and (2) determining the entity class (person name, place name, organization name, or other).
Before entity extraction is carried out, the entity types must first be defined. Entities can be divided into general entities and domain entities: general entities are common entity types such as persons, addresses, and amounts, while domain entities are entity types specific to a particular domain. Taking text in the judicial field as the text to be processed, the judicial entity types need to be predefined, and may specifically include statutes, litigation identities, names of crimes, contracts, documents, occupations, and the like.
After the entity types have been predefined, entity extraction is performed based on the defined types to identify each entity contained in the event statement. Commonly used entity recognition models include BiLSTM-CRF, Transformer, BERT, and the like.
Taking the BiLSTM-CRF model as an example of entity extraction: character embeddings are first computed for the event statement, the resulting features are input into a bidirectional LSTM model, and the entity sequence labeling result is obtained through a conditional random field (CRF). Referring to the BiLSTM-CRF entity recognition model of FIG. 2, entity extraction yields: Zhang San, person name (PER); Li Si, person name (PER); 10,000 yuan, amount (MNY).
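The sequence labeling result mentioned above still has to be turned into entity words. The following is a minimal sketch of that decoding step only, assuming the model emits BIO tags (B-PER, I-PER, O, and so on); the tag scheme, the function name decode_bio, and the English example tokens are illustrative assumptions and are not specified by the disclosure.

```python
from typing import List, Optional, Tuple


def decode_bio(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Collect (entity_text, entity_type) pairs from a BIO-tagged token sequence."""
    entities: List[Tuple[str, str]] = []
    current: List[str] = []
    current_type: Optional[str] = None

    def flush() -> None:
        # Emit the entity collected so far, if any.
        if current and current_type is not None:
            entities.append((" ".join(current), current_type))

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            flush()
            current, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current.append(token)
        else:
            # "O" tag or an inconsistent "I-" tag closes the open entity.
            flush()
            current, current_type = [], None
    flush()
    return entities


# Hypothetical tagger output for an English rendering of the example sentence.
tokens = ["Zhang", "San", "borrowed", "10,000", "yuan", "from", "Li", "Si"]
tags = ["B-PER", "I-PER", "O", "B-MNY", "I-MNY", "O", "B-PER", "I-PER"]
print(decode_bio(tokens, tags))
```

Tokens are joined with spaces here because the example is in English; for character-level Chinese input the join string would be empty.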
Entity extraction may serve as the first step of event extraction: entities are extracted first, the event extraction flow then continues on the basis of the extracted entities, and during event extraction each entity is determined to be an argument role of the trigger word. Alternatively, when some sentences obtained by parsing the text to be processed are not event description sentences, only entity extraction may be performed on those sentences.
The event extraction process to which the present disclosure relates is described below:
An event (Event) is a form of information, defined as the objective fact that particular persons and things interact at a particular time and place. In general, event extraction targets sentence-level events, i.e., the event statements of the present disclosure. The elements that make up an event include the trigger word (Trigger), event type (Event Type), argument (Argument), and argument role (Role). The trigger word is the core word indicating that an event has occurred; it identifies the event type and is usually a verb or noun. An argument is a participant in an event, typically an entity (Entity). An argument role is the role an argument plays in the event, such as agent, object, or place. In natural language analysis methods for Chinese, arguments may be divided into multiple roles.
Event extraction is a natural language processing task that takes text information as its object of analysis. Specifically, it can be divided into two subtasks: event type recognition and argument role classification. Event type recognition identifies the event type from the trigger word; argument role classification assigns each argument word to a role of that event type. In other words, the trigger word of the event is extracted, and the argument role of each entity in the event statement is determined from the extracted trigger word and at least one entity.
The event extraction method may be performed by any apparatus, device, platform, or cluster of devices having computing and processing capabilities. Event extraction generally relies on training and applying a neural network, which may be a bidirectional RNN. After event extraction, the trigger word of the event statement and the argument roles corresponding to the trigger word are obtained.
Specifically, during model training, a plurality of memory matrices may be used to store the trigger words, the argument roles, and the dependency relationships between trigger words and argument roles, and the model may be trained on a labeled sentence training set prepared in advance. The labeled training set comprises a plurality of training sentences, and each word in each training sentence is labeled with the trigger word category or argument role category to which it belongs.
Step S103, obtaining an event unit matched with the event statement from a pre-constructed event map, wherein the event unit matched with the event statement comprises the event trigger word and a plurality of to-be-filled argument role items corresponding to the event trigger word;
The event map can be regarded as a set of logical links composed of event units, where each event unit comprises the trigger word and argument roles of an event.
Specifically, the event map includes a plurality of event units, each of which includes one class of event trigger word and the argument roles corresponding to that trigger word. For example, event unit A includes the trigger word "borrow" and the argument roles "borrower", "borrowed amount", and so on corresponding to "borrow".
As the basic units of the event map, events thus form different logical links, and each event unit in the event map can be regarded as a structured event template comprising the trigger word of the corresponding event and a plurality of argument role slots.
After the trigger word and the corresponding argument roles of an event statement are obtained through entity extraction and event extraction, the corresponding event template is looked up by the event trigger word, and the trigger word and argument words are filled into the trigger word and argument role slots of the template, yielding the structured information of the event statement.
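The template-and-slot idea described above can be sketched as a small data structure. This is a minimal illustration, assuming an event unit is represented by a trigger word plus an ordered list of role names; the class name EventUnit, the function fill_event_unit, and the "lender" role are hypothetical and are introduced only for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class EventUnit:
    """A structured event template: one trigger word and its argument role slots."""
    trigger: str
    roles: List[str]


def fill_event_unit(unit: EventUnit, trigger: str,
                    role_to_entity: Dict[str, str]) -> Dict[str, object]:
    """Fill the extracted argument words into the role slots of a matched template."""
    if trigger != unit.trigger:
        raise ValueError(f"trigger {trigger!r} does not match template {unit.trigger!r}")
    # Slots with no extracted entity stay empty (None) rather than being guessed.
    arguments: Dict[str, Optional[str]] = {
        role: role_to_entity.get(role) for role in unit.roles
    }
    return {"trigger": trigger, "arguments": arguments}


# Event unit A from the text, with an extra hypothetical "lender" slot.
borrow_unit = EventUnit(trigger="borrow", roles=["borrower", "lender", "borrowed amount"])
extracted = {"borrower": "Zhang San", "lender": "Li Si", "borrowed amount": "10,000 yuan"}
structured = fill_event_unit(borrow_unit, "borrow", extracted)
print(structured)
```

Leaving unfilled slots as None keeps the template shape stable, which makes it easy to see afterwards which argument roles a statement did not supply.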
In an alternative embodiment of the present disclosure, in performing step S103, the following method may be employed, but is not limited to:
(1-1) in a pre-constructed event map, searching an event unit set matched with the event trigger word;
(1-2) for each event unit in the set, filling the N entity words of the event statement into the to-be-filled argument role items of the event unit, and determining the number of argument role items that can be filled;
(1-3) determining the event unit with the largest number of fillable argument role items as the successfully matched event unit.
Optionally, near-synonym matching may be performed on the extracted event trigger word to obtain a plurality of near-synonyms of the trigger word, and the set of matchable event units is then searched based on the word set formed by the event trigger word and its near-synonyms.
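Steps (1-1) to (1-3), together with the optional near-synonym expansion, can be sketched as follows. This is an illustrative sketch rather than the patented implementation: the dictionary layout of an event unit, the function name match_event_unit, the sample event map, and the tie-breaking rule (the first candidate wins) are all assumptions.

```python
from typing import Dict, List, Optional, Set


def match_event_unit(
    trigger: str,
    role_to_entity: Dict[str, str],
    event_map: List[dict],
    near_synonyms: Optional[Dict[str, Set[str]]] = None,
) -> Optional[dict]:
    """Return the candidate event unit with the most fillable argument role items."""
    # Optional expansion: the trigger word plus its near-synonyms.
    words = {trigger} | (near_synonyms or {}).get(trigger, set())
    # (1-1) event units whose trigger words overlap the expanded word set.
    candidates = [unit for unit in event_map if words & unit["triggers"]]
    best, best_count = None, -1
    for unit in candidates:
        # (1-2) count the role items of this unit that the extracted roles can fill.
        count = sum(1 for role in unit["roles"] if role in role_to_entity)
        if count > best_count:
            best, best_count = unit, count
    # (1-3) the unit with the most fillable items; None if nothing matched.
    return best


# Hypothetical event map with two units sharing the trigger word "borrow".
event_map = [
    {"name": "item_loan", "triggers": {"borrow"}, "roles": ["borrower", "item"]},
    {"name": "money_loan", "triggers": {"borrow"},
     "roles": ["borrower", "lender", "borrowed amount"]},
]
extracted = {"borrower": "Zhang San", "lender": "Li Si", "borrowed amount": "1000 yuan"}
matched = match_event_unit("loan", extracted, event_map, {"loan": {"borrow"}})
print(matched["name"])
```

The disclosure does not say how ties between equally fillable units are resolved, so a real system would need its own rule at that point.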
Step S104, filling the N entity words into N to-be-filled argument role items of the event unit (event template) matched with the event statement to obtain a structured statement corresponding to the event statement.
The event extraction method is described in detail below. The method may be performed by any apparatus, device, platform, or cluster of devices having computing and processing capabilities.
Event extraction on an event statement relies on a predetermined event extraction model. Referring to FIG. 3, the event extraction model may be the neural network shown there, which includes at least an input layer, a vector conversion layer, a trigger word event classification layer, an argument role classification layer, and an output layer.
As shown in fig. 4, the method for determining the event extraction model includes the following steps:
step S401, receiving word sets obtained after word segmentation processing of training corpus sentences at an input layer;
Word segmentation is the process of splitting text into a reasonable sequence of words without cutting through any word, so that each resulting word is linguistically meaningful, i.e., can serve as a semantic unit. Chinese word segmentation may use any algorithm applicable in the art; commonly used algorithms include forward maximum matching, conditional random fields, and recurrent neural networks.
Step S402, marking each word in the word set at a vector conversion layer, and encoding the words into corresponding word vectors;
Step S403, at the trigger word event classification layer, predicting, for the i-th word in the word set, whether the word is an event trigger word, and determining the event trigger word type of the i-th word;
Step S404, at the argument role classification layer, determining the dependency relationships between the entity words in the current event statement and the event trigger word, and determining from these dependency relationships the argument role of each entity word with respect to the event trigger word;
Step S405, at the output layer, updating, according to the prediction result, the three memory matrices that respectively store the event trigger words, the argument roles, and the dependency relationships between event trigger words and argument roles.
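The word segmentation that produces the input of step S401 can be illustrated with the forward longest-match approach named above. The sketch below is a plain dictionary-based version, assuming a small hand-made vocabulary; the function name forward_max_match, the vocabulary, and the sample Chinese sentence (an assumed rendering of "Zhang San borrowed money from Li Si") are illustrative only.

```python
from typing import Iterable, List, Optional


def forward_max_match(text: str, vocabulary: Iterable[str],
                      max_len: Optional[int] = None) -> List[str]:
    """Segment text by repeatedly taking the longest vocabulary word at the cursor."""
    vocab = set(vocabulary)
    if max_len is None:
        max_len = max((len(word) for word in vocab), default=1)
    max_len = max(max_len, 1)
    words: List[str] = []
    i = 0
    while i < len(text):
        # Try the longest window first and shrink it until a word is found.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or candidate in vocab:
                words.append(candidate)
                i += size
                break
    return words


vocabulary = {"张三", "李四", "借款", "向"}
print(forward_max_match("张三向李四借款", vocabulary))
```

Dictionary matching of this kind needs no training data, but it cannot resolve ambiguous boundaries, which is one reason statistical and neural segmenters are also listed as options.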
At the vector conversion layer, each word in the word set needs to be encoded into a word vector formed by connecting a first vector, a second vector and a third vector, wherein the first vector, the second vector and the third vector are respectively determined by the following modes:
(2-1) querying a pre-trained word vector table, obtaining a word vector of a current word, and determining the word vector as a first vector;
(2-2) determining the entity type of the current word, inquiring a preset entity type embedding table according to the entity type, and determining the inquired entity type vector as a second vector of the word;
The entity type vector of a word characterizes which type of entity the word belongs to, for example an organization name, a person name, or a place name. More specifically, if the word is "Payment", its entity type vector characterizes the word as an organization name. It should be noted that a word that is not an entity also has an entity type vector, which characterizes the word as not being an entity.
(2-3) determining a dependency vector of the current word and taking the dependency vector as the third vector of the word, wherein the value of the i-th element of the dependency vector indicates whether a semantic dependency exists between the current word and the i-th word of the event statement, and the length of the dependency vector is the same as the number of dependency relations of the current word in the clause. The third vector is a binary (0/1) vector.
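The three-part encoding in (2-1) to (2-3) amounts to concatenating a word embedding, an entity type embedding, and a binary dependency vector. The sketch below uses toy lookup tables and plain Python lists; the table contents, the "NONE" type for non-entities, and the choice of one dependency element per word in the sentence are assumptions made for illustration, and reflect only one reading of the description.

```python
from typing import Dict, List, Set, Tuple


def encode_word(
    index: int,
    words: List[str],
    entity_types: List[str],
    word_table: Dict[str, List[float]],
    type_table: Dict[str, List[float]],
    dependencies: Set[Tuple[int, int]],
) -> List[float]:
    """Concatenate the first, second, and third vectors for the word at `index`."""
    # (2-1) first vector: pre-trained word vector lookup.
    first = word_table[words[index]]
    # (2-2) second vector: entity type embedding (non-entities use "NONE").
    second = type_table[entity_types[index]]
    # (2-3) third vector: 1.0 where a dependency links this word and word j.
    third = [
        1.0 if (index, j) in dependencies or (j, index) in dependencies else 0.0
        for j in range(len(words))
    ]
    return first + second + third


words = ["Zhang San", "borrowed", "10,000 yuan"]
entity_types = ["PER", "NONE", "MNY"]
word_table = {"Zhang San": [0.1, 0.2], "borrowed": [0.3, 0.4], "10,000 yuan": [0.5, 0.6]}
type_table = {"PER": [1.0, 0.0, 0.0], "MNY": [0.0, 1.0, 0.0], "NONE": [0.0, 0.0, 1.0]}
# Hypothetical dependency arcs as (head index, dependent index) pairs.
dependencies = {(1, 0), (1, 2)}
vector = encode_word(1, words, entity_types, word_table, type_table, dependencies)
print(vector)
```

In a trained model the two lookup tables would be learned or pre-trained embeddings and the arcs would come from a dependency parser; only the concatenation itself is taken from the text.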
When the neural network model is trained, two RNNs may be used to learn the representation of an event statement in the forward and reverse directions respectively. Each word in the word set of the event statement is encoded into a corresponding word vector, and after this encoding step is completed, the encoding result and the sentence embedding (sentence vector) are used for prediction. In the actual model training process, the following operations are performed at the i-th step:
(3-1) predicting whether the i-th word wi is a trigger word and, if so, its trigger word type, and assigning the predicted trigger word type to wi; a type other1 is defined, and if the current word wi is not considered a trigger word, the type other1 is assigned to wi;
(3-2) for all named entities e1, …, ej in the current sentence, predicting the argument role type of each entity with respect to the trigger word wi, and assigning the predicted argument role type to each entity at the i-th prediction step; a type other2 is defined, and if an entity is not considered an argument of the word wi, the type other2 is assigned to that entity;
(3-3) updating the three memory matrices described above (which store the trigger words, the argument roles, and the dependency relationships between argument roles and trigger words) based on the trigger word, the argument roles, and the dependency relationships between the trigger word and the argument roles.
The event map in the present disclosure is a set of logical links with events as units; it is a tuple comprising events, argument sets, logical relationships, and the like. Besides classification relationships such as upper-lower level relationships, events also have non-classification relationships with one another, including composition, causal, concurrency, conditional, and exclusion relationships, which together describe the logical knowledge within real-world dynamic knowledge. The main types of event logic include causal, conditional, reversal, sequential, upper-lower level, composition, and concurrency relationships.
The event statements describing single events are obtained from a text to be processed, and in practical applications of information structuring it is usually the text to be processed that is handled directly. The present disclosure therefore also provides an information structuring method for a text to be processed. Referring to FIG. 5, the method comprises the following steps:
Step S501, obtaining a text to be processed;
The text to be processed is an unstructured text corpus that needs to be structured, and may comprise various types of text. Taking the judicial field as an example, processes such as intelligent adjudication and case knowledge base construction require a large legal corpus, including judicial documents (judgments, complaints, evidence, and the like) as well as IOUs, contracts, and the like.
The legal corpus contains a great deal of unstructured information. An IOU, for example, may contain many meaningless or disorganized descriptive sentences. In intelligent adjudication, unstructured case information cannot provide business knowledge to an artificial intelligence system for legal scenarios; it must first be structured so that the case knowledge base can be built from the structured information. In this step, the unstructured text information that needs to be structured is the text to be processed.
Step S502, data analysis is carried out on the text to be processed so as to extract at least one event statement included in the text;
Parsing the text to be processed can be divided into two steps: chapter analysis and sentence analysis.
Chapter analysis divides the text to be processed into a plurality of basic modules, each of which represents a different type of information contained in the text.
Based on the text format of the case text to be analyzed, a classification algorithm divides the case text into different modules. Taking a complaint as an example, the complaint is segmented according to its format, yielding the basic information, the complaint content, and the facts.
After the basic modules of the text to be processed have been obtained, sentence analysis is performed on a basic module to segment out at least one event statement contained in the text. Specifically, segmentation may follow the descriptive sentences of the text according to the business scenario of the module; for example, descriptive sentences may be segmented based on borrowing descriptions, repayment descriptions, and the like. Each segmented descriptive sentence is an event statement describing a single event.
To illustrate the parsing flow of the text to be processed: chapter analysis is performed first, yielding the basic module "facts and reasons", which requires further structuring; sentence analysis is then performed on this module, which contains the borrowing description "Zhang San borrowed 1000 yuan from Li Si", and this borrowing description is determined to be an event statement.
It should be noted that event statements are segmented on the basis of event descriptions rather than punctuation, i.e., each event statement describes one event. Sentence-level analysis may segment event statements by sequence labeling, for example with a model such as an RNN; the present disclosure does not limit this.
In an alternative embodiment of the disclosure, the text to be processed may be preprocessed before it is parsed, filtering out noise such as redundant spaces and the garbled symbols commonly found in PDF documents.
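The preprocessing described above can be approximated with a few lines of standard-library Python. This is a minimal sketch under stated assumptions: "garbled symbols" are taken to mean control and format characters plus the Unicode replacement character, and the function name preprocess is hypothetical; a production filter would likely need rules tuned to the actual PDF extraction output.

```python
import re
import unicodedata


def preprocess(text: str) -> str:
    """Remove common extraction noise and collapse redundant whitespace."""
    # Drop control/format characters and U+FFFD, but keep newlines and tabs for now.
    kept = "".join(
        ch for ch in text
        if ch in "\n\t"
        or (unicodedata.category(ch) not in ("Cc", "Cf") and ch != "\ufffd")
    )
    # Collapse runs of spaces, tabs, and full-width spaces into one space.
    kept = re.sub(r"[ \t\u3000]+", " ", kept)
    # Trim spaces around line breaks, then trim the whole text.
    kept = re.sub(r" *\n *", "\n", kept)
    return kept.strip()


raw = "Zhang  San\ufffd borrowed\x0c   1000 yuan \n"
print(repr(preprocess(raw)))
```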
Step S503, entity extraction and event extraction are carried out on each event sentence, event trigger words and a plurality of entity words corresponding to each event sentence are obtained, and event units matched with each event sentence are obtained from a pre-constructed event map;
specific entity extraction and event extraction may refer to the above embodiments, and are not described herein.
Step S504, according to the matched event relationship among the event units, combining the event sentences to obtain the structured event information corresponding to the text to be processed.
Parsing the text to be processed may yield a plurality of event sentences. After each event sentence has been matched with, and filled into, its corresponding event unit in the event map, the event sentences can be combined according to the event relationships among the event units in the event map, so as to obtain the structured event information corresponding to the text to be processed.
Optionally, a priori reasoning can be performed according to the event relationships among the event units in the event map, and the filling results of the argument role slots in each matched event unit can be supplemented or corrected according to the reasoning result.
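The combination in step S504 can be pictured with the following sketch. It assumes, purely for illustration, that the event map exposes its relations as (source unit, relation label, target unit) triples and that at most one event sentence has been filled into each unit; neither representation is specified by the disclosure.

```python
from typing import Dict, List, Tuple

# Hypothetical relation edge between event units in the event map.
Relation = Tuple[str, str, str]  # (source unit, relation label, target unit)


def combine_events(
    filled_units: Dict[str, Dict[str, str]],
    relations: List[Relation],
) -> List[dict]:
    """Link filled event units whose unit types are related in the event map."""
    combined = []
    for source, label, target in relations:
        # Only relations whose two ends were both matched in the text are kept.
        if source in filled_units and target in filled_units:
            combined.append(
                {
                    "relation": label,
                    "source": {"unit": source, "roles": filled_units[source]},
                    "target": {"unit": target, "roles": filled_units[target]},
                }
            )
    return combined
```

Relations whose target was never matched are simply skipped here; the optional reasoning step described above could instead use them to supplement or correct slot values.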
It can be seen that the above embodiments combine event extraction and event templates to convert unstructured information into structured information, so that manual data review is not required in the information structuring process, and the efficiency is high.
Corresponding to the above method embodiments, the present disclosure further provides an information structuring apparatus. As shown in fig. 6, the apparatus may include: a statement acquisition module 610, an event extraction module 620, a first event matching module 630, and an information filling module 640.
Statement acquisition module 610: configured to obtain an event sentence describing a single event;
Event extraction module 620: configured to perform entity extraction and event extraction on the event sentence to obtain an event trigger word and a plurality of entity words of the event sentence, and to determine the argument roles corresponding to N entity words, among the plurality of entity words, that have a dependency relationship with the event trigger word, wherein N is greater than 0;
First event matching module 630: configured to obtain an event unit matching the event sentence from a pre-constructed event map; the event unit matching the event sentence comprises the event trigger word and a plurality of argument role items to be filled corresponding to the event trigger word;
Information filling module 640: configured to fill the N entity words into N argument role items to be filled in the event unit matching the event sentence, to obtain a structured statement corresponding to the event sentence.
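The filling step performed by the last module can be sketched as follows. The `EventUnit` class, its field names, and the role names in the usage note are hypothetical; they only illustrate an event unit as a trigger word plus a set of argument role items to be filled.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class EventUnit:
    """An event template: a trigger word plus argument role slots to fill."""
    trigger: str
    roles: List[str]
    slots: Dict[str, Optional[str]] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Every role starts as an empty slot.
        for role in self.roles:
            self.slots.setdefault(role, None)


def fill_unit(unit: EventUnit, entity_roles: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Fill slots from an {entity word: argument role} mapping.

    Entities whose role is not a slot of this unit are ignored, and a slot
    that is already filled is not overwritten.
    """
    for entity, role in entity_roles.items():
        if role in unit.slots and unit.slots[role] is None:
            unit.slots[role] = entity
    return {"trigger": unit.trigger, **unit.slots}
```

For example, a unit `EventUnit("borrow", ["borrower", "lender", "amount", "date"])` filled with borrower, lender and amount entities yields a structured statement in which the `date` slot stays empty.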
Optionally, the event extraction module is configured to perform event extraction on the event sentence based on a predetermined event extraction model;
the apparatus further includes a model determination module configured to:
perform word segmentation on the training corpus sentence, and label the segmented words and encode them into word vectors;
input the word vectors in sequence into a pre-trained bidirectional RNN model to predict whether the ith word vector of the event sentence is an event trigger word, and determine the event trigger word type of the ith word vector;
in the case that the ith word vector is a trigger word, determine the dependency relationships between the entity words included in the current event sentence and the event trigger word, and determine, through the dependency relationships, the argument role of each entity word with respect to the event trigger word;
and update, according to the prediction results, three memory matrices that respectively store the event trigger words, the argument roles, and the dependency relationships between the event trigger words and the argument roles.
Optionally, the model determining module is configured to:
for each word, encode the word into a word vector formed by concatenating a first vector, a second vector and a third vector, wherein:
a pre-trained word vector table is looked up, and the lookup result for the current word in the word vector table is determined as the first vector;
the entity type of the current word is determined, a preset entity type embedding table is looked up according to the entity type, and the entity type vector found is determined as the second vector of the word;
a dependency vector of the current word is determined and taken as the third vector of the word, wherein the value of the ith element of the dependency vector represents whether a semantic dependency exists between the current word and the ith word of the event sentence, and the length of the dependency vector is the same as the number of dependency relations of the current word in the event sentence.
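The three-part encoding can be illustrated with the sketch below. It rests on several assumptions that are not in the disclosure: tables are plain dictionaries, unknown words and words without an entity type map to zero vectors, dependencies are given as index pairs, and the dependency vector has one element per word of the event sentence (one possible reading of the length condition above).

```python
from typing import Dict, List, Set, Tuple


def encode_word(
    index: int,
    words: List[str],
    word_table: Dict[str, List[float]],
    entity_types: Dict[str, str],
    type_table: Dict[str, List[float]],
    dependencies: Set[Tuple[int, int]],
    word_dim: int,
    type_dim: int,
) -> List[float]:
    """Concatenate the word, entity-type and dependency vectors of one word."""
    word = words[index]
    # First vector: lookup in the pre-trained word vector table (zeros if unknown).
    first = word_table.get(word, [0.0] * word_dim)
    # Second vector: lookup of the word's entity type (zeros if it has none).
    second = type_table.get(entity_types.get(word, ""), [0.0] * type_dim)
    # Third vector: element i is 1.0 if a dependency links this word and word i.
    third = [
        1.0 if (index, i) in dependencies or (i, index) in dependencies else 0.0
        for i in range(len(words))
    ]
    return first + second + third
```

The resulting vector has length `word_dim + type_dim + len(words)`, so under this reading sentences would need padding to a common length before being batched into a recurrent model.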
Optionally, the event matching module is configured to:
search a pre-constructed event map for the set of event units matching the event trigger word;
for each event unit in the set, fill the N entity words of the event sentence into the argument role items to be filled of that event unit, and determine the number of argument role items that can be filled;
and determine the event unit with the largest number of fillable argument role items as the successfully matched event unit.
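The selection rule above can be sketched as follows, assuming a hypothetical event map stored as a dictionary from trigger word to candidate units, each candidate being a list of role names. Ties are resolved in favor of the first candidate, which is a choice made here and not stated in the disclosure.

```python
from typing import Dict, List, Optional


def match_event_unit(
    trigger: str,
    entity_roles: Dict[str, str],
    event_map: Dict[str, List[List[str]]],
) -> Optional[List[str]]:
    """Return the candidate unit (list of role names) with the most fillable roles."""
    candidates = event_map.get(trigger, [])
    if not candidates:
        return None
    available = set(entity_roles.values())
    # Count, for each candidate, how many of its role items an entity can fill.
    return max(candidates, key=lambda roles: len(available & set(roles)))
```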
Corresponding to the above method embodiments, the present disclosure further provides an information structuring apparatus, as shown in fig. 7, where the apparatus may include: a text determination module 710, a data parsing module 720, a second event matching module 730, and a fact combination module 740.
Text determination module 710: configured to obtain the text to be processed;
Data parsing module 720: configured to parse the text to be processed to extract at least one event sentence included in the text;
Second event matching module 730: configured to obtain, by means of the information structuring apparatus shown in fig. 6, the event unit matching each event sentence from a pre-constructed event map;
Fact combination module 740: configured to combine the event sentences according to the event relationships among the matched event units, so as to obtain the structured event information corresponding to the text to be processed.
Optionally, the data parsing module is configured to:
divide the text to be processed into different basic modules based on the text format of the text to be processed, each basic module representing a different type of information contained in the text;
and perform sequence labeling on the text in each basic module so as to extract the at least one event sentence contained in the text.
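A format-based division into basic modules might look like the sketch below. The heading convention (a short title on its own line ending in a colon) is an assumption made for illustration; the disclosure only says that the division is based on the text format.

```python
import re
from typing import Dict, List, Optional

# Hypothetical convention: a module starts at a line made up of a short
# title followed by a colon, e.g. "Facts and reasons:".
_HEADING = re.compile(r"^(?P<title>[^:\n]{1,40}):\s*$")


def split_modules(text: str) -> Dict[str, str]:
    """Split a document into {module title: module body} using heading lines."""
    modules: Dict[str, str] = {}
    title: Optional[str] = None
    body: List[str] = []
    for line in text.splitlines():
        match = _HEADING.match(line.strip())
        if match:
            # A new heading closes the module that was open, if any.
            if title is not None:
                modules[title] = "\n".join(body).strip()
            title, body = match.group("title").strip(), []
        elif title is not None:
            body.append(line)
    if title is not None:
        modules[title] = "\n".join(body).strip()
    return modules
```

Text that appears before the first heading is discarded in this sketch, and each module body would then be passed to the sequence labeling step.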
The embodiment of the disclosure further provides a computer device, referring to fig. 8, where the computer device at least includes a memory, a processor, and a computer program stored on the memory and executable on the processor, and the processor implements the following information structuring method when executing the program:
Acquiring an event statement for describing a single event;
performing entity extraction and event extraction on the event sentences to obtain event trigger words and a plurality of entity words of the event sentences, and determining argument roles corresponding to N entity words with dependency relations with the event trigger words in the plurality of entity words, wherein N is greater than 0;
acquiring an event unit matched with the event sentence from a pre-constructed event map; the event unit matched with the event statement comprises the event trigger word and a plurality of argument role items to be filled corresponding to the event trigger word;
and filling the N entity words into N argument role items to be filled in an event unit matched with the event statement to obtain a structured statement corresponding to the event statement.
The embodiment of the disclosure also provides a computer device, which at least comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor executes the program to realize the following information structuring method:
obtaining a text to be processed;
analyzing the data of the text to be processed to extract at least one event sentence included in the text;
obtaining, according to the above information structuring method, the event units matched with each event sentence from a pre-constructed event map;
and combining the event sentences according to the matched event relationship among the event units to obtain the structured event information corresponding to the text to be processed.
The disclosed embodiments also provide a computer readable storage medium having stored thereon a computer program which when executed by a processor implements the following information structuring method:
acquiring an event statement for describing a single event;
performing entity extraction and event extraction on the event sentences to obtain event trigger words and a plurality of entity words of the event sentences, and determining argument roles corresponding to N entity words with dependency relations with the event trigger words in the plurality of entity words, wherein N is greater than 0;
acquiring an event unit matched with the event sentence from a pre-constructed event map; the event unit matched with the event statement comprises the event trigger word and a plurality of argument role items to be filled corresponding to the event trigger word;
and filling the N entity words into N argument role items to be filled in an event unit matched with the event statement to obtain a structured statement corresponding to the event statement.
The disclosed embodiments also provide a computer readable storage medium having stored thereon a computer program which when executed by a processor implements the following information structuring method:
obtaining a text to be processed;
analyzing the data of the text to be processed to extract at least one event sentence included in the text;
obtaining, according to the above information structuring method, the event units matched with each event sentence from a pre-constructed event map;
and combining the event sentences according to the matched event relationship among the event units to obtain the structured event information corresponding to the text to be processed.
Since the apparatus embodiments essentially correspond to the method embodiments, reference may be made to the description of the method embodiments for the relevant details. The apparatus embodiments described above are merely illustrative: the modules described as separate components may or may not be physically separate, and the components shown as modules may or may not be physical modules; they may be located in one place or distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the objectives of the disclosed solution. Those of ordinary skill in the art can understand and implement the solution without creative effort.
The system, apparatus, module or unit set forth in the above embodiments may be implemented in particular by a computer chip or entity, or by a product having a certain function. A typical implementation device is a computer, which may be in the form of a personal computer, laptop computer, cellular telephone, camera phone, smart phone, personal digital assistant, media player, navigation device, email device, game console, tablet computer, wearable device, or a combination of any of these devices.
The foregoing is merely a specific implementation of the embodiments of this disclosure, and it should be noted that, for a person skilled in the art, several improvements and modifications may be made without departing from the principles of the embodiments of this disclosure, which should also be considered as the protection scope of the embodiments of this disclosure.

Claims (10)

1. A method of structuring information, the method comprising:
acquiring an event statement for describing a single event;
performing entity extraction and event extraction on the event sentences to obtain event trigger words and a plurality of entity words of the event sentences, and determining argument roles corresponding to N entity words with dependency relations with the event trigger words in the plurality of entity words, wherein N is greater than 0;
acquiring an event unit matched with the event sentence from a pre-constructed event map; the event unit matched with the event statement comprises the event trigger word and a plurality of argument role items to be filled corresponding to the event trigger word;
filling the N entity words into N argument role items to be filled in an event unit matched with the event statement to obtain a structured statement corresponding to the event statement;
the event extracting for the event statement comprises the following steps: carrying out event extraction on the event statement based on a predetermined event extraction model;
the method for determining the event extraction model comprises the following steps:
word segmentation processing is carried out on the training corpus sentence, and the segmented words are marked and encoded into word vectors;
inputting the word vectors into a pre-trained bidirectional RNN model in sequence to predict whether the ith word vector of the event sentence is an event trigger word or not, and determining the event trigger word type of the ith word vector;
under the condition that the ith word vector is a trigger word, determining the dependency relationship between entity words included in the current event sentence and the event trigger word, and determining the argument roles of each entity word corresponding to the event trigger word through the dependency relationship;
updating, according to the prediction results, three memory matrices that respectively store the event trigger words, the argument roles, and the dependency relationships between the event trigger words and the argument roles;
the encoding of the split words into word vectors comprises:
for each word, encoding the word into a word vector formed by concatenating a first vector, a second vector and a third vector, wherein,
inquiring a pre-trained word vector table, and determining an inquiring result of a current word in the word vector table as a first vector;
determining the entity type of the current word, inquiring a preset entity type embedding table according to the entity type, and determining the inquired entity type vector as a second vector of the word;
determining a dependency vector of a current word, determining the dependency vector as a third vector of the word, wherein the value of an ith element in the dependency vector represents whether a semantic dependency exists between the current word and an ith word of an event sentence, and the length of the dependency vector is the same as the number of the dependency of the current word in the event sentence.
2. The method of claim 1, wherein the obtaining the event units matching the event statement from the pre-constructed event map comprises:
Searching an event unit set matched with the event trigger word in a pre-constructed event database;
for one event unit, filling N entity words of the event statement into argument role items to be filled of the event unit, and determining the number of the argument role items which can be filled;
and determining the event unit with the largest number of fillable argument role items as the successfully matched event unit.
3. A method of structuring information, the method comprising:
obtaining a text to be processed;
analyzing the data of the text to be processed to extract at least one event sentence included in the text;
obtaining, according to the information structuring method of claim 1, the event units matched with the event sentences from a pre-constructed event map;
and combining the event sentences according to the matched event relationship among the event units to obtain the structured event information corresponding to the text to be processed.
4. A method according to claim 3, wherein said data parsing of said text to be processed comprises:
dividing the text to be processed into different basic modules based on the text format of the text to be processed, wherein each basic module is used for representing different information types contained in the basic modules;
And carrying out sequence labeling processing on the text in the basic module so as to extract at least one event sentence contained in the text.
5. An information structuring apparatus, the apparatus comprising:
a statement acquisition module, configured to obtain an event statement describing a single event;
an event extraction module, configured to perform entity extraction and event extraction on the event statement to obtain an event trigger word and a plurality of entity words of the event statement, and to determine the argument roles corresponding to N entity words, among the plurality of entity words, that have a dependency relationship with the event trigger word, wherein N is greater than 0;
a first event matching module, configured to obtain an event unit matched with the event statement from a pre-constructed event map; the event unit matched with the event statement comprises the event trigger word and a plurality of argument role items to be filled corresponding to the event trigger word;
an information filling module, configured to fill the N entity words into N argument role items to be filled in the event unit matched with the event statement, to obtain a structured statement corresponding to the event statement;
The event extraction module is configured to perform event extraction on the event statement based on a predetermined event extraction model;
the apparatus further includes a model determination module configured to:
word segmentation processing is carried out on the training corpus sentence, and the segmented words are marked and encoded into word vectors;
inputting the word vectors into a pre-trained bidirectional RNN model in sequence to predict whether the ith word vector of the event sentence is an event trigger word or not, and determining the event trigger word type of the ith word vector;
under the condition that the ith word vector is a trigger word, determining the dependency relationship between entity words included in the current event sentence and the event trigger word, and determining the argument roles of each entity word corresponding to the event trigger word through the dependency relationship;
updating, according to the prediction results, three memory matrices that respectively store the event trigger words, the argument roles, and the dependency relationships between the event trigger words and the argument roles;
the model determination module is configured to:
for each word, encoding the word into a word vector formed by concatenating a first vector, a second vector and a third vector, wherein,
Inquiring a pre-trained word vector table, and determining an inquiring result of a current word in the word vector table as a first vector;
determining the entity type of the current word, inquiring a preset entity type embedding table according to the entity type, and determining the inquired entity type vector as a second vector of the word;
determining a dependency vector of a current word, determining the dependency vector as a third vector of the word, wherein the value of an ith element in the dependency vector represents whether a semantic dependency exists between the current word and an ith word of an event sentence, and the length of the dependency vector is the same as the number of the dependency of the current word in the event sentence.
6. The apparatus of claim 5, wherein the event matching module is configured to:
searching an event unit set matched with the event trigger words in a pre-constructed event map;
for one event unit, filling N entity words of the event statement into argument role items to be filled of the event unit, and determining the number of the argument role items which can be filled;
and determining the event unit with the largest number of fillable argument role items as the successfully matched event unit.
7. An information structuring apparatus, the apparatus comprising:
a text acquisition module, configured to obtain a text to be processed;
a data parsing module, configured to parse the text to be processed to extract at least one event sentence included in the text;
a second event matching module, configured to obtain, by means of the information structuring apparatus according to claim 5, the event units matching the respective event sentences from a pre-constructed event map;
a fact combination module, configured to combine the event sentences according to the event relationships among the matched event units, so as to obtain the structured event information corresponding to the text to be processed.
8. The apparatus of claim 7, wherein the data parsing module is configured to:
dividing the text to be processed into different basic modules based on the text format of the text to be processed, wherein each basic module is used for representing different information types contained in the basic modules;
and carrying out sequence labeling processing on the text in the basic module so as to extract at least one event sentence contained in the text.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the method of any one of claims 1 to 2 or implements the method of any one of claims 3 to 4 when the program is executed by the processor.
10. A computer readable storage medium, characterized in that a computer program is stored thereon, which program, when being executed by a processor, implements the method according to any of claims 1 to 2 or implements the method according to any of claims 3 to 4.
CN201911301079.0A 2019-12-17 2019-12-17 Information structuring method and device Active CN111222305B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911301079.0A CN111222305B (en) 2019-12-17 2019-12-17 Information structuring method and device

Publications (2)

Publication Number Publication Date
CN111222305A CN111222305A (en) 2020-06-02
CN111222305B true CN111222305B (en) 2024-03-22

Family

ID=70829775

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911301079.0A Active CN111222305B (en) 2019-12-17 2019-12-17 Information structuring method and device

Country Status (1)

Country Link
CN (1) CN111222305B (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111797241B (en) * 2020-06-17 2023-08-22 北京北大软件工程股份有限公司 Event Argument Extraction Method and Device Based on Reinforcement Learning
CN111967601B (en) * 2020-06-30 2024-02-20 北京百度网讯科技有限公司 Event relation generation method, event relation rule generation method and device
CN111782800B (en) * 2020-06-30 2023-11-21 上海仪电(集团)有限公司中央研究院 Intelligent conference analysis method for event tracing
CN112001265B (en) * 2020-07-29 2024-01-23 北京百度网讯科技有限公司 Video event identification method and device, electronic equipment and storage medium
CN111985221B (en) * 2020-08-12 2024-03-26 北京百度网讯科技有限公司 Text event relationship identification method, device, equipment and storage medium
CN112084381A (en) * 2020-09-11 2020-12-15 广东电网有限责任公司 Event extraction method, system, storage medium and equipment
CN112860852B (en) * 2021-01-26 2024-03-08 北京金堤科技有限公司 Information analysis method and device, electronic equipment and computer readable storage medium
CN112817561B (en) * 2021-02-02 2023-08-18 山东省计算中心(国家超级计算济南中心) Transaction type functional point structured extraction method and system for software demand document
CN114154495A (en) * 2021-12-03 2022-03-08 海南港航控股有限公司 Entity extraction method and system based on keyword matching
CN114330354B (en) * 2022-03-02 2022-12-23 杭州海康威视数字技术股份有限公司 Event extraction method and device based on vocabulary enhancement and storage medium
CN114757189B (en) * 2022-06-13 2022-10-18 粤港澳大湾区数字经济研究院(福田) Event extraction method and device, intelligent terminal and storage medium
CN115358896B (en) * 2022-10-20 2023-02-03 四川大学华西医院 Method, device, equipment and medium for constructing criminal evolution network by using massive documents

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20030068856A (en) * 2002-02-18 2003-08-25 한국전자통신연구원 Apparatus for extracting information desired by users from unstructured documents and method thereof
CN104156352A (en) * 2014-08-15 2014-11-19 苏州大学 Method and system for handling Chinese event
CN107562772A (en) * 2017-07-03 2018-01-09 南京柯基数据科技有限公司 Event extraction method, apparatus, system and storage medium
CN109582949A (en) * 2018-09-14 2019-04-05 阿里巴巴集团控股有限公司 Event element abstracting method, calculates equipment and storage medium at device
CN110032641A (en) * 2019-02-14 2019-07-19 阿里巴巴集团控股有限公司 Method and device that computer executes, that event extraction is carried out using neural network
CN110134757A (en) * 2019-04-19 2019-08-16 杭州电子科技大学 A kind of event argument roles abstracting method based on bull attention mechanism
CN110135457A (en) * 2019-04-11 2019-08-16 中国科学院计算技术研究所 Event trigger word abstracting method and system based on self-encoding encoder fusion document information
CN110489520A (en) * 2019-07-08 2019-11-22 平安科技(深圳)有限公司 Event-handling method, device, equipment and the storage medium of knowledge based map

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170075904A1 (en) * 2015-09-16 2017-03-16 Edgetide Llc System and method of extracting linked node graph data structures from unstructured content

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
朱少华 ; 李培峰 ; 朱巧明 ; .基于MLN的中文事件论元推理方法.计算机科学.2016,43(03),252-255、261. *
王学锋 ; 杨若鹏 ; 李雯 ; .基于深度学习的作战文书事件抽取方法.信息工程大学学报.2019,20(05),635-640. *
邱盈盈 ; 洪宇 ; 周文 ; 姚建民 ; 朱巧明 ; .面向事件抽取的深度与主动联合学习方法.中文信息学报.2018,32(06),98-106. *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant