CN111310461B - Event element extraction method, device, equipment and storage medium - Google Patents
- Publication number
- CN111310461B (application number CN202010041054.8A)
- Authority
- CN
- China
- Prior art keywords
- word
- trigger
- event
- entity
- corpus information
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Landscapes
- Machine Translation (AREA)
Abstract
The embodiment of the application discloses an event element extraction method, an event element extraction device, event element extraction equipment and a storage medium, and belongs to the technical field of computers. The method comprises the following steps: extracting a first trigger element of a target event from the first corpus information; obtaining a first entity element in the first corpus information based on the entity element extraction model; and determining the first trigger element and the first entity element as a first event element of the target event. The entity element extraction model is obtained by training according to a second event element extracted from the second corpus information, and the second event element comprises a second trigger element and a second entity element of the target event. Therefore, the second event element is used as a sample, the entity element extraction model is trained, the event types corresponding to various event elements do not need to be marked, the operation is simple and convenient, manual marking is not needed, the waste of manpower and time is avoided, and the efficiency is improved. And the second event element is used as a sample, so that the sample coverage rate is high, and the recall ratio of the entity element extraction model can be improved.
Description
Technical Field
The embodiment of the application relates to the technical field of computers, in particular to an event element extraction method, an event element extraction device, event element extraction equipment and a storage medium.
Background
Event element extraction refers to extracting information of an event, such as an event type, event participants and other words capable of representing the subject of the event, from a natural language text, and presenting the information in a structured form. Therefore, the event element extraction can be used for mining the events in which the user is interested from a large amount of corpus information, and can be applied to various fields such as news information analysis, article content extraction, knowledge base construction, information retrieval and the like.
In the related art, an event element may include a trigger element, an event-applying element and an event-accepting element. Event elements are extracted from sample corpus information, synonyms of each event element are obtained and used as further event elements to construct an event element dictionary, and the type of each event element in the event element dictionary is also labeled. Subsequently, the event element dictionary is matched against given corpus information to extract the event elements in that corpus information.
However, the above scheme requires that the type of each event element in the event element dictionary is labeled in advance, and the operation is complicated. Moreover, due to the complexity of the corpus information, many event elements are not fixed, and the scheme can only extract preset event elements, so that the recall ratio is low and the flexibility is poor.
Disclosure of Invention
The embodiment of the application provides an event element extraction method, an event element extraction device, event element extraction equipment and a storage medium, and can improve recall rate and flexibility of event element extraction. The technical scheme is as follows:
in one aspect, an event element extraction method is provided, and the method includes:
extracting a first trigger element of a target event from the first corpus information;
processing the first corpus information based on an entity element extraction model to obtain a first entity element in the first corpus information;
determining the first trigger element and the first entity element as a first event element of the target event;
the entity element extraction model is obtained by training according to a second event element extracted from second corpus information, the second event element comprises a second trigger element and a second entity element of the target event, and the entity element of the target event comprises at least one of an action element or a subject element of the target event.
Optionally, the method further comprises:
segmenting words of the event type name of the target event to obtain trigger words belonging to verbs;
adding the trigger words and the corresponding synonyms to the set of trigger words.
Optionally, processing, based on the entity element extraction model, the character-word mixed vectors corresponding to the plurality of characters to determine a prefix (first character) and a suffix (last character) of the first entity element includes:
processing the character-word mixed vectors corresponding to the plurality of characters based on the entity element extraction model to obtain identification information corresponding to the plurality of characters, wherein the identification information indicates whether the corresponding character is the prefix or the suffix of the first entity element;
and determining the prefix and the suffix of the first entity element according to the identification information.
Optionally, the determining, according to the identification information, the prefix and the suffix of the first entity element includes:
determining at least two characters corresponding to the first identifier in the first corpus information;
and sequentially determining the prefix and the suffix of the first entity element according to the order in which the at least two characters corresponding to the first identifier appear in the first corpus information.
In another aspect, there is provided an event element extraction apparatus, the apparatus including:
the element extraction module is used for extracting a first trigger element of the target event from the first corpus information;
the information processing module is used for processing the first corpus information based on an entity element extraction model to obtain a first entity element in the first corpus information;
an event element determination module, configured to determine the first trigger element and the first entity element as a first event element of the target event;
the entity element extraction model is obtained by training according to a second event element extracted from second corpus information, the second event element comprises a second trigger element and a second entity element of the target event, and the entity element of the target event comprises at least one of an action element or a subject element of the target event.
Optionally, the apparatus further comprises:
the element extraction module is further configured to extract a second event element of the target event from second corpus information;
and the model training module is used for training an entity element extraction model according to the second event element, and the entity element extraction model is used for extracting entity elements from any corpus information.
Optionally, the trigger element of the target event includes a trigger word, and the element extraction module includes:
the first matching unit is used for matching the trigger word set of the target event with the second corpus information;
an extracting unit, configured to extract trigger terms included in the trigger term set from the second corpus information, as the second trigger element;
and a first determining unit, configured to determine, according to a syntax component of the second trigger element in the second corpus information, a second entity element corresponding to the second trigger element.
Optionally, the apparatus further comprises:
the first word segmentation module is used for segmenting the event type name of the target event to obtain a trigger word belonging to a verb;
a first adding module, configured to add the trigger word and the corresponding synonym to the trigger word set.
Optionally, the trigger element of the target event includes at least one of a trigger word or a trigger phrase, the trigger phrase includes at least two words, and the element extraction module includes:
a second matching unit, configured to match a trigger word pattern set of the target event with the second corpus information, where the trigger word pattern set includes at least one trigger word pattern, and the trigger word pattern includes a trigger word of the target event and an auxiliary word used for combining with the trigger word;
the combining unit is used for combining the trigger words and the auxiliary words to obtain trigger phrases when any trigger word and the auxiliary words corresponding to the trigger words are included in the second corpus information and negative words do not exist between the trigger words and the auxiliary words;
a second determining unit, configured to determine, according to a syntactic component of the trigger word in the second corpus information, an entity element corresponding to the trigger word as a second entity element;
the second determining unit is further configured to determine, according to the syntactic components of the trigger word and the auxiliary word in the second corpus information, an entity element corresponding to the trigger phrase as the second entity element.
Optionally, the combining unit is further configured to, when the second corpus information includes the trigger word and an auxiliary word corresponding to the trigger word, no negative word exists between the trigger word and the auxiliary word, and a distance between the trigger word and the auxiliary word is not greater than a first preset distance, combine the trigger word and the auxiliary word to obtain a trigger phrase.
Optionally, the apparatus further comprises:
the second word segmentation module is used for segmenting the event type name of the target event to obtain a plurality of target words;
the second adding module is used for adding the trigger words belonging to the verbs in the target words and the corresponding synonyms into the trigger word set;
the combination module is used for combining each target word in the target words and the corresponding synonym into a word set to obtain a plurality of word sets;
and the third adding module is used for selecting any word from each word set, combining the selected multiple words to obtain the trigger word mode, and adding the trigger word mode into the trigger word mode set.
Optionally, the second event element satisfies at least one of the following conditions:
the part of speech of the second trigger element in the second corpus information is a verb;
the syntactic component of the second trigger element in the second corpus information is a predicate;
the length of the second event element is not less than a preset value;
the part of speech of the second entity element in the second corpus information is a noun;
the second entity element does not belong to a preset type word, the preset type word representing a specific type but not representing a specific object belonging to the specific type.
Optionally, the model training module includes:
a labeling unit, configured to label, according to the second entity element, a plurality of characters in the second corpus information to obtain sample identification information corresponding to the plurality of characters, where the plurality of characters at least include the characters in the second entity element, and the sample identification information indicates whether the corresponding character is the prefix (first character) or the suffix (last character) of the second entity element;
the first splicing unit is used for concatenating the character vectors corresponding to the plurality of characters with the corresponding word vectors to obtain character-word mixed vectors corresponding to the plurality of characters;
the first processing unit is used for processing the character-word mixed vectors corresponding to the plurality of characters based on the current entity element extraction model to obtain training identification information corresponding to the plurality of characters, wherein the training identification information indicates whether the corresponding character is the prefix or the suffix of an entity element;
and the training unit is used for training the entity element extraction model according to the training identification information and the sample identification information corresponding to the plurality of characters to obtain the trained entity element extraction model.
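As an illustration only, the labeling performed by the labeling unit can be sketched as follows. The sketch assumes a binary scheme in which 1 marks a character that is the prefix or suffix of the entity element and 0 marks every other character; the function name `label_characters` and the sample sentence are hypothetical and are not taken from the application.

```python
def label_characters(text, entity):
    """Return one label per character of `text`.

    Assumed scheme: 1 = the character is the first or last character
    of the entity element, 0 = any other character.
    """
    labels = [0] * len(text)
    start = text.find(entity)  # first occurrence only, for simplicity
    if entity and start != -1:
        labels[start] = 1                     # prefix (first character)
        labels[start + len(entity) - 1] = 1   # suffix (last character)
    return labels


# Hypothetical sample: the entity element "ACME" inside a short sentence.
sample_text = "ACME profit rose"
sample_labels = label_characters(sample_text, "ACME")
```

A single-character entity receives one label, because its prefix and suffix coincide.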
Optionally, the information processing module includes:
the second splicing unit is used for concatenating the character vectors corresponding to a plurality of characters in the first corpus information with the corresponding word vectors to obtain character-word mixed vectors corresponding to the plurality of characters;
the second processing unit is used for processing the character-word mixed vectors corresponding to the plurality of characters based on the entity element extraction model, and determining a prefix and a suffix of the first entity element;
and a merging unit, configured to merge the prefix and the suffix of the first entity element and the characters between the prefix and the suffix of the first entity element according to the arrangement order in the first corpus information, so as to obtain the first entity element.
Optionally, the first entity element includes an event-applying element of the target event, and the information processing module includes:
the word acquisition unit is used for acquiring at least one word in a second preset distance before the first trigger element;
the second processing unit is further configured to process, based on the entity element extraction model, the character-word mixed vector corresponding to each character in the at least one word and the first trigger element, and determine a prefix and a suffix of the event-applying element;
the merging unit is further configured to merge the prefix and the suffix of the event-applying element and the characters between the prefix and the suffix of the event-applying element according to the arrangement order in the first corpus information to obtain the event-applying element.
Optionally, the first entity element includes a subject element of the target event, and the information processing module includes:
the word acquisition unit is used for acquiring at least one word within a third preset distance behind the first trigger element;
the second processing unit is further configured to process, based on the entity element extraction model, the character-word mixed vector corresponding to each character in the at least one word and the first trigger element, and determine a prefix and a suffix of the subject element;
the merging unit is further configured to merge the prefix and the suffix of the subject element and the characters between the prefix and the suffix of the subject element according to the arrangement order in the first corpus information, so as to obtain the subject element.
Optionally, the second splicing unit is further configured to divide the first corpus information into characters to obtain a character vector of each character in the first corpus information;
the second splicing unit is further configured to perform word segmentation on the first corpus information to obtain a word vector of each word in the first corpus information;
the second splicing unit is further configured to concatenate the character vector of each character with the word vector of the word to which the character belongs, obtaining the character-word mixed vector corresponding to each character.
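The concatenation described for the second splicing unit can be sketched as below. This is a minimal sketch under stated assumptions: the text is already segmented into words, vectors are plain Python lists, and the names `mixed_vectors`, `char_vectors` and `word_vectors` as well as the toy values are hypothetical.

```python
def mixed_vectors(words, char_vectors, word_vectors):
    """Build one mixed vector per character.

    Each character's vector is concatenated with the vector of the word
    that contains the character.
    """
    mixed = []
    for word in words:
        for ch in word:
            # List concatenation: [character dims] + [word dims]
            mixed.append(char_vectors[ch] + word_vectors[word])
    return mixed


# Toy lookup tables with made-up values, for illustration only.
toy_char_vectors = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
toy_word_vectors = {"ab": [0.5], "c": [0.2]}
toy_mixed = mixed_vectors(["ab", "c"], toy_char_vectors, toy_word_vectors)
```

Characters of the same word share the same word part, so the mixed vector carries both character-level and word-level information.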
Optionally, the second processing unit is further configured to process, based on the entity element extraction model, the character-word mixed vectors corresponding to the plurality of characters to obtain identification information corresponding to the plurality of characters, where the identification information indicates whether the corresponding character is a prefix or a suffix of the first entity element;
the second processing unit is further configured to determine a prefix and a suffix of the first entity element according to the identification information.
Optionally, the identification information includes a first identifier or a second identifier, the first identifier indicates that the character is a prefix or a suffix of the first entity element, the second identifier indicates that the character is neither a prefix nor a suffix of the first entity element, and the second processing unit is further configured to determine at least two characters corresponding to the first identifier in the first corpus information;
the second processing unit is further configured to sequentially determine a prefix and a suffix of the first entity element according to the order in which the at least two characters corresponding to the first identifier appear in the first corpus information.
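One possible reading of this sequential determination, combined with the merging performed by the merging unit, is sketched below. It assumes that the first identifier is encoded as 1 and that flagged characters are paired in order of appearance as (prefix, suffix); both the encoding and the pairing rule are assumptions, and the name `decode_entities` is hypothetical.

```python
def decode_entities(text, labels):
    """Pair flagged characters in order and merge each span.

    Assumption: labels[i] == 1 means character i is a prefix or a suffix.
    Flagged positions are consumed two at a time as (prefix, suffix).
    """
    flagged = [i for i, label in enumerate(labels) if label == 1]
    entities = []
    for k in range(0, len(flagged) - 1, 2):
        start, end = flagged[k], flagged[k + 1]
        # Merge prefix, suffix and everything between them, in text order.
        entities.append(text[start:end + 1])
    return entities


# Hypothetical model output for a short sentence.
example_text = "ACME profit rose"
example_labels = [0] * len(example_text)
for position in (0, 3, 5, 10):
    example_labels[position] = 1
example_entities = decode_entities(example_text, example_labels)
```

An unpaired trailing flag is ignored in this sketch; a real system would need a policy for that case.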
In another aspect, an apparatus is provided that includes a processor and a memory having at least one program code stored therein, the at least one program code being loaded and executed by the processor to perform operations as performed in the event element extraction method.
In still another aspect, a computer-readable storage medium having at least one program code stored therein is provided, the at least one program code being loaded and executed by a processor to implement the operations as performed in the event element extraction method.
According to the method, the device, the equipment and the storage medium provided by the embodiment of the application, the first trigger element of the target event is extracted from the first corpus information; processing the first corpus information based on the entity element extraction model to obtain a first entity element in the first corpus information; and determining the first trigger element and the first entity element as a first event element of the target event. The entity element extraction model is obtained by training according to a second event element extracted from the second corpus information, the second event element comprises a second trigger element and a second entity element of the target event, and the entity element of the target event comprises at least one of an action element or a subject element of the target event. Therefore, the second event element of the extracted target event is used as a sample, the entity element extraction model is trained, the event types corresponding to various event elements do not need to be marked in sequence, the operation is simple and convenient, manual marking is not needed, the waste of manpower and time is avoided, and the efficiency is improved. And the second event element extracted from the second corpus information is used as a sample, so that the sample coverage rate is high, the recall ratio of the trained entity element extraction model can be improved, and the flexibility of event element extraction is improved.
Drawings
In order to illustrate the technical solutions in the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of an event element extraction method according to an embodiment of the present application.
Fig. 2 is a flowchart of another event element extraction method provided in an embodiment of the present application.
Fig. 3 is a flowchart of another event element extraction method provided in an embodiment of the present application.
Fig. 4 is a flowchart of training an entity element extraction model according to an embodiment of the present application.
Fig. 5 is a schematic diagram of a concatenated word vector according to an embodiment of the present application.
Fig. 6 is a schematic diagram of an entity element extraction model according to an embodiment of the present application.
Fig. 7 is a flowchart of another event element extraction method provided in an embodiment of the present application.
Fig. 8 is a schematic structural diagram of an event element extraction apparatus according to an embodiment of the present application.
Fig. 9 is a schematic structural diagram of another event element extraction apparatus according to an embodiment of the present application.
Fig. 10 is a schematic structural diagram of a terminal according to an embodiment of the present application.
Fig. 11 is a schematic structural diagram of a server according to an embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present application more clear, the embodiments of the present application will be further described in detail with reference to the accompanying drawings.
It will be understood that the terms "first," "second," and the like used herein may describe various concepts, but these concepts are not limited by the terms unless otherwise specified. The terms are only used to distinguish one concept from another. For example, the first corpus information may be referred to as second corpus information, and similarly, the second corpus information may be referred to as first corpus information, without departing from the scope of the present application.
As used herein, "a plurality of" means two or more, and "each" refers to every one of the corresponding plurality. For example, the plurality of target words may be any integer number of target words greater than or equal to two, such as two target words or three target words. Likewise, each word set refers to every word set in a plurality of word sets; if the plurality of word sets is 3 word sets, each word set refers to every one of the 3 word sets.
The embodiment of the application provides an event element extraction method that is executed by a computer device. The computer device obtains a corpus comprising a plurality of pieces of corpus information. The computer device extracts a second trigger element and a second entity element of the target event from the second corpus information, determines the second trigger element and the second entity element as a second event element of the target event, and trains, according to the second event element, an entity element extraction model for extracting entity elements. The computer device then extracts a first trigger element of the target event from the first corpus information, processes the first corpus information based on the entity element extraction model to obtain a first entity element in the first corpus information, and determines the first trigger element and the first entity element as a first event element of the target event. Both the second event element and the first event element are event elements of the target event, so the extraction of event elements is realized.
In one possible implementation manner, the computer device is a terminal, and the terminal may be a mobile phone, a computer, a tablet computer, a smart television, or other types of devices.
In another possible implementation manner, the computer device is a server, and the server may be one server, a server cluster composed of several servers, or one cloud computing service center.
The method provided by the embodiment of the application can be applied to any scene of extracting the event elements.
For example, in the financial field, the method provided by the embodiment of the application can be used to mine financial event elements such as financial information and opinions. These event elements can be used to analyze stock fluctuations, construct a financial knowledge graph, and enrich the construction of the financial system, which is significant for investment and financing, stock fluctuation prediction, financial enterprise profile analysis, and supporting the healthy operation of financial enterprises.
For example, in the medical field, the method provided by the embodiment of the present application is adopted to extract biological event elements of one or more biomedical entities (such as proteins, cells, chemical substances, etc.), and a biological network composed of such event elements can be constructed in an information retrieval and question-answering system, so as to have important effects on knowledge discovery, exploration of new associations between the biomedical entities, understanding of physiological and pathogenesis mechanisms, and the like.
Fig. 1 is a flowchart of an event element extraction method according to an embodiment of the present application. The execution subject of the embodiment of the present application is a computer device, and referring to fig. 1, the method includes:
101. The computer device performs word segmentation on the event type name of the target event to obtain a plurality of target words.
The method provided by the embodiment of the application is used for extracting event elements. An event element includes an entity element and a trigger element. The entity element includes at least one of an event-applying element and an event-accepting element: the event-applying element (also referred to herein as the action element) is the agent of the event, namely the entity that initiates the event, and the event-accepting element (also referred to herein as the subject element) is the patient of the event, namely the object that the event acts on. The trigger element describes the relationship between the agent and the patient, or the behavior state of the agent or the patient, or the like. Extracting event elements means extracting the entity elements and the trigger element of an event, and an event is determined from the extracted entity elements and trigger element.
Before performing event element extraction, the computer device presets a target event from which an event element is to be extracted. The target event may be various events occurring in a specific field, the specific field may include various fields such as a financial field, a news field, a medical field, an internet field, and an education field, and the target event may include one or more events such as a financial event, a news event, a medical event, an internet event, and an education event.
The target event corresponds to at least one event type, wherein the event type name is a phrase comprising a plurality of words, and the event type name can be used for describing the event type. For example, the event type names of financial events may be performance growth, equity transfer, sustained price drop, expansion of business, difficulties in financing of a civil enterprise, etc. The event type name may be set by default by the computer device.
In addition to the phrase as the event type name, there are other phrases or words that can be used to describe the event type, such as phrases or words having the same or similar semantics as the event type name. Therefore, other words or phrases capable of describing the event type can be extended based on the event type name.
The computer device obtains the event type name of the target event, and performs word segmentation on the event type name to obtain a plurality of target words included in the event type name.
Word segmentation of the event type name refers to the process of separating words with independent meaning from the event type name. The event type name may be segmented using a preset word segmentation algorithm, including but not limited to maximum matching segmentation, semantic segmentation, and statistical segmentation. For example, segmenting the event type name "performance growth" yields the target words "performance" and "growth".
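As one illustration of the maximum matching option, the sketch below implements greedy forward maximum matching over a small vocabulary. The vocabulary, the `max_len` parameter, and the Chinese string "业绩增长" (an assumed original-language rendering of "performance growth") are hypothetical and chosen only for demonstration.

```python
def forward_max_match(text, vocabulary, max_len=4):
    """Greedy forward maximum matching word segmentation."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until the vocabulary
        # contains it; a single character is always accepted as a fallback.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in vocabulary:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical vocabulary and event type name.
vocab = {"业绩", "增长", "股价", "下跌"}
target_words = forward_max_match("业绩增长", vocab)
```

Semantic or statistical segmenters would replace the dictionary lookup with a learned model; this sketch covers only the dictionary-based case.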
102. The computer device combines each of the plurality of target words with its corresponding synonyms into a word set, obtaining a plurality of word sets.
After obtaining the plurality of target words corresponding to the event type, the computer device obtains the synonyms of each target word and combines each target word with its synonyms into a word set, obtaining a plurality of word sets corresponding to the event type. Each target word corresponds to one word set, and each word set comprises a plurality of words with the same or similar semantics.
Optionally, the computer device obtains the word vector of the target word via Word2Vec (a group of related models used to generate word vectors). Word vectors carry semantic information and can be used to represent word features. Therefore, by calculating the similarity between word vectors, the computer device obtains the words whose word vector similarity to the target word is greater than a preset threshold and uses them as synonyms of the target word. Alternatively, the computer device may obtain synonyms of the target word from a synonym thesaurus (the "synonym forest").
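The similarity-based synonym lookup can be sketched as follows. This is a minimal sketch assuming cosine similarity as the similarity measure and treating a word as a synonym when its similarity to the target word reaches the threshold; the toy vectors, the threshold value and the function names are hypothetical and do not come from a trained Word2Vec model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def find_synonyms(target, vectors, threshold=0.9):
    """Return words whose similarity to `target` reaches the threshold."""
    target_vec = vectors[target]
    return [
        word
        for word, vec in vectors.items()
        if word != target and cosine_similarity(target_vec, vec) >= threshold
    ]


# Made-up 3-dimensional vectors standing in for trained word vectors.
toy_vectors = {
    "growth": [0.9, 0.1, 0.0],
    "increase": [0.85, 0.15, 0.05],
    "rise": [0.8, 0.2, 0.1],
    "drop": [-0.7, 0.3, 0.1],
}
synonyms = find_synonyms("growth", toy_vectors)
```

In practice the vectors would come from a trained embedding model, and the threshold would be tuned on the target corpus.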
For example, for an event type named "performance growth," the target words obtained are "performance" and "growth," and the synonyms for "performance" and "growth" are expanded, respectively, resulting in the following two sets of words:
1. The word set corresponding to "performance" is: "performance", "business", "run" and "Ying Hui";
2. The word set corresponding to "growth" is: "increase", "rise", "increase", "raise", "enhance", "double", "lift" and "profit".
103. The computer device selects any word from each word set, combines the selected words to obtain a trigger word pattern, and adds the trigger word pattern to the trigger word pattern set.
After obtaining the plurality of word sets, the computer device selects any one word from each word set to obtain a plurality of words, and combines the selected words to obtain a trigger word pattern corresponding to the event type. Any word in one word set can be combined with any word in the other word sets, so the computer device can obtain a plurality of trigger word patterns, all of which are added to the trigger word pattern set.
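Selecting one word from each word set in every possible way amounts to taking the Cartesian product of the word sets. The sketch below shows this with Python's `itertools.product`; the function name `build_trigger_patterns` and the shortened word sets are hypothetical.

```python
from itertools import product


def build_trigger_patterns(word_sets):
    """Return every combination that takes exactly one word from each set."""
    return list(product(*word_sets))


# Shortened, illustrative word sets for the event type "performance growth".
set_a = ["performance", "business"]
set_b = ["increase", "rise", "improve"]
patterns = build_trigger_patterns([set_a, set_b])
```

With 2 words in the first set and 3 in the second, the sketch yields 2 x 3 = 6 patterns, each a tuple such as ("performance", "increase").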
The trigger word pattern is a pattern for determining whether the corpus information includes an event element of the event type. When the corpus information satisfies the trigger word pattern, the corpus information is determined to include an event element of the event type; when it does not satisfy the trigger word pattern, the corpus information does not include an event element of the event type.
In this embodiment of the application, the trigger word pattern includes a plurality of words. A word whose part of speech is a verb is a trigger word of the target event, and the other words are auxiliary words used in combination with the trigger word. For example, if the trigger word pattern is "performance & growth", then "performance" is the auxiliary word and "growth" is the trigger word; if the trigger word pattern is "stock price & continuous & drop", then "stock price" and "continuous" are auxiliary words and "drop" is the trigger word. The corpus information satisfies the trigger word pattern when it includes the trigger word and the auxiliary words of that pattern.
For example, for the event type named "performance growth", the computer device obtains two word sets for the event type. Word set A, corresponding to the target word "performance", includes "achievement", "performance", "business" and "business"; word set B, corresponding to the target word "growth", includes "increase", "rise", "promote" and "improve". The trigger word patterns in the trigger word pattern set corresponding to the event type are listed in Table 1 below.
TABLE 1
In a possible implementation manner, the trigger word pattern further includes a preset distance, where the preset distance refers to a maximum distance between words, for example, the preset distance is 5 words or 10 words. The condition that the trigger word pattern is satisfied is: the corpus information comprises trigger words and auxiliary words in the trigger word mode, and the distance between the trigger words and the auxiliary words is not more than a preset distance.
In another possible implementation, the trigger word patterns include a forward trigger word pattern and a reverse trigger word pattern corresponding to the forward trigger word pattern. The forward trigger word pattern consists of a trigger word and auxiliary words, and the corpus information satisfies the forward trigger word pattern when it includes the trigger word and the auxiliary words of that pattern. The reverse trigger word pattern consists of at least one negative word together with the trigger word and the auxiliary words of the corresponding forward trigger word pattern. The at least one negative word may be a negative word in a negative dictionary, such as "no", "limited", "difficult", "impossible", "failed", etc. The corpus information satisfies the reverse trigger word pattern when it includes the trigger word and the auxiliary words of that pattern and any negative word of that pattern appears between the trigger word and an auxiliary word. The corpus information satisfies the trigger word pattern when it satisfies the forward trigger word pattern and does not satisfy the reverse trigger word pattern.
104. The computer device matches the trigger word pattern set of the target event against the second corpus information.
The computer device obtains a corpus from which event elements are to be extracted. The corpus includes at least one piece of corpus information. The corpus information may include text information taken from news, current affairs commentary, special reports, academic papers and other sources, such as the title, abstract or body text of a news item, and may be a word, a sentence, an article and the like. The corpus information may also include picture information taken from news, current affairs commentary, special reports, academic papers and the like. If the corpus information is text information, the computer device extracts the event elements directly from the text information; if the corpus information is picture information, the computer device first performs character recognition on the picture information to obtain the corresponding text information and then extracts the event elements from that text information.
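The conditions described above (co-occurrence, an optional preset distance, and rejection when a negative word lies between the two words) can be sketched as a single check. This is only an illustration under stated assumptions: the corpus information is already split into tokens, the pattern has one auxiliary word, distance is measured as the difference of token positions, and the name `satisfies_pattern` is hypothetical.

```python
def satisfies_pattern(tokens, auxiliary, trigger, negatives=(), max_distance=None):
    """Check tokenized corpus information against one trigger word pattern.

    Forward pattern: the auxiliary word and the trigger word both occur,
    optionally no more than max_distance token positions apart.
    Reverse pattern: a negative word occurs between the two words.
    The pattern is satisfied when the forward pattern holds and the
    reverse pattern does not.
    """
    aux_positions = [i for i, tok in enumerate(tokens) if tok == auxiliary]
    trig_positions = [i for i, tok in enumerate(tokens) if tok == trigger]
    for a in aux_positions:
        for t in trig_positions:
            lo, hi = min(a, t), max(a, t)
            if max_distance is not None and hi - lo > max_distance:
                continue  # the two words are too far apart
            if any(tok in negatives for tok in tokens[lo + 1:hi]):
                continue  # reverse pattern matched for this pair
            return True
    return False


negatives = {"no", "difficult", "impossible", "failed"}
print(satisfies_pattern(["performance", "shows", "strong", "growth"],
                        "performance", "growth", negatives))
print(satisfies_pattern(["performance", "failed", "to", "show", "growth"],
                        "performance", "growth", negatives))
```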
The corpus may be downloaded by the computer device from another device, uploaded to the computer device by one or more other devices, stored by an operator in the computer device, or may be a corpus from another source.
Taking the second corpus information in the corpus as an example, after the computer device obtains the trigger word mode set of the target event, when the event element extraction is performed on the second corpus information, the trigger word mode set is matched with the second corpus information. That is, the computer device compares at least one trigger word mode in the trigger word mode set with the second corpus information, and determines whether the second corpus information includes a trigger word in a certain trigger word mode and an auxiliary word corresponding to the trigger word.
105. When the second corpus information comprises any trigger word and an auxiliary word corresponding to the trigger word, and no negative word exists between the trigger word and the auxiliary word, the computer equipment combines the trigger word and the auxiliary word to obtain a trigger phrase.
When the computer equipment determines that the second corpus information comprises any trigger word in the trigger word mode set and an auxiliary word corresponding to the trigger word, and no negative word exists between the trigger word and the auxiliary word, the computer equipment determines that the second corpus information comprises an event element corresponding to the event type, and then the computer equipment combines the trigger word and the corresponding auxiliary word according to the sequence in the second corpus information to obtain the trigger phrase. The trigger phrase includes at least two words, the trigger phrase may be used to describe the second corpus information, and the trigger phrase may also be used to describe an event type corresponding to the trigger word mode, that is, the event in the second corpus information belongs to the event type.
For example, if the second corpus information is "the performance report shows 36% progress", the second corpus information includes the trigger word "progress" and the corresponding auxiliary word "performance", and no negative word exists between "performance" and "progress". The trigger word pattern matched by the second corpus information is therefore "performance & progress", and the event type corresponding to that trigger word pattern is named "performance growth". The computer device combines the trigger word "progress" with the auxiliary word "performance" in their order of appearance in the second corpus information to obtain the trigger phrase "performance progress", which can be used to describe a target event of the event type named "performance growth".
For example, the second corpus information is "performance is difficult to greatly increase this year when the price of the raw material is increased". The second corpus information includes the trigger word "increase" and the corresponding auxiliary word "performance", but the negative word "difficult" exists between "performance" and "increase". Therefore, the computer device determines that the second corpus information does not include an event element corresponding to the event type "performance growth".
Even when the second corpus information includes the trigger word and the auxiliary word and no negative word exists between them, the trigger word and the auxiliary word may be far apart in the second corpus information. To avoid errors in the event element extraction process, the distance between the trigger word and the auxiliary word must not be too large. Therefore, in a possible implementation, when the second corpus information includes the trigger word and the auxiliary word corresponding to the trigger word, no negative word exists between the trigger word and the auxiliary word, and the distance between the trigger word and the auxiliary word is not greater than a first preset distance, the computer device combines the trigger word and the auxiliary word to obtain the trigger phrase. The first preset distance may be set by default by the computer device; for example, the first preset distance is 5 words or 10 words.
In another possible implementation, the trigger word patterns include a forward trigger word pattern and a reverse trigger word pattern corresponding to the forward trigger word pattern, as described in step 103 above. The computer device first matches the forward trigger word pattern against the second corpus information and, when the second corpus information satisfies the forward trigger word pattern, then matches the reverse trigger word pattern against the second corpus information. When the second corpus information does not satisfy the reverse trigger word pattern, the computer device determines that the second corpus information includes an event element corresponding to the event type, and combines the trigger word and the auxiliary word in their order of appearance to obtain the trigger phrase corresponding to the event type. When the second corpus information satisfies the reverse trigger word pattern, the computer device determines that the second corpus information does not include an event element corresponding to the event type.
106. The computer device determines the entity element corresponding to the trigger word according to the syntactic component of the trigger word in the second corpus information, takes the trigger word as a second trigger element, and takes the entity element corresponding to the trigger word as a second entity element.
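Steps 104 and 105 together can be sketched as a loop over the trigger word pattern set that returns a trigger phrase built in corpus order. This is a simplified illustration, not the embodiment's implementation: patterns are assumed to be (auxiliary word, trigger word) pairs, only the first occurrence of each word is considered, and the function name and the default distance of 5 tokens are assumptions.

```python
def extract_trigger_phrase(tokens, patterns, negatives, max_distance=5):
    """Return the first trigger phrase found in tokens, or None.

    patterns is an iterable of (auxiliary_word, trigger_word) pairs.
    """
    for auxiliary, trigger in patterns:
        if auxiliary not in tokens or trigger not in tokens:
            continue  # forward pattern not satisfied
        lo, hi = sorted((tokens.index(auxiliary), tokens.index(trigger)))
        if hi - lo > max_distance:
            continue  # words exceed the preset distance
        if any(tok in negatives for tok in tokens[lo + 1:hi]):
            continue  # a negative word lies between the two words
        # Keep the two words in the order they appear in the corpus text.
        return f"{tokens[lo]} {tokens[hi]}"
    return None


pattern_set = [("performance", "progress"), ("performance", "increase")]
negative_words = {"no", "difficult", "impossible", "failed"}

print(extract_trigger_phrase(
    ["performance", "report", "shows", "36%", "progress"],
    pattern_set, negative_words))
print(extract_trigger_phrase(
    ["performance", "is", "difficult", "to", "increase"],
    pattern_set, negative_words))
```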
When the computer equipment determines that the second corpus information comprises any trigger word and an auxiliary word corresponding to the trigger word, the computer equipment performs syntactic analysis on the second corpus information, determines syntactic components of the trigger word in the second corpus information, and determines an entity element corresponding to the trigger word according to the syntactic components of the trigger word. The computer device takes the trigger word as a second trigger element of the target event, and takes the entity element as a second entity element of the target event, wherein the second trigger element corresponds to the second entity element.
The syntactic component is a relation item that bears a structural relationship in a syntactic structure, and the syntactic structure is the internal structure of a sentence that takes words as its basic units. Syntactic components may include subjects, predicates, objects, verbal heads, attributives, adverbials, complements, head words, and the like. For example, the second corpus information is "the first company invests in the second company", and the second corpus information includes the trigger word "invest". The computer device performs syntactic analysis on the second corpus information and determines that the syntactic component of the trigger word in the second corpus information is a predicate; it may further determine that the syntactic component of "the first company" is a subject and the syntactic component of "the second company" is an object.
In a possible implementation, because the trigger word is a verb, the computer device performs syntactic analysis on the second corpus information and determines that the syntactic component of the trigger word in the second corpus information is a predicate. The computer device then obtains the word that forms a subject-predicate relationship with the trigger word, that is, the word whose syntactic component is the subject, and takes it as the agent element of the target event (the entity that performs the action). Likewise, the computer device obtains the word that forms a verb-object relationship with the trigger word, that is, the word whose syntactic component is the object, and takes it as the patient element of the target event (the entity that receives the action). The computer device takes the agent element and the patient element as the second entity elements corresponding to the trigger word.
Optionally, when the second corpus information includes only an agent element, the computer device takes the agent element as the second entity element corresponding to the trigger word; when the second corpus information includes only a patient element, the computer device takes the patient element as the second entity element corresponding to the trigger word.
In one possible implementation, the computer device parses the second corpus information based on the dependency syntax rules. Or, the computer device may also perform syntactic analysis on the second corpus information by using other methods, which is not limited in the embodiment of the present application.
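Given the output of a dependency parser, selecting the subject and the object of the trigger word reduces to a lookup over dependency arcs. The sketch below assumes the parse is already available as (head, relation, dependent) triples; the relation labels "SBV" (subject-verb) and "VOB" (verb-object) and the function name are assumptions chosen for illustration, and a real parser may use different labels. Here "agent" denotes the element performing the action and "patient" the element receiving it.

```python
def extract_entities(arcs, trigger):
    """Find the agent and patient of a trigger word from dependency arcs.

    arcs is a list of (head, relation, dependent) triples.
    Returns (agent, patient); either may be None when absent.
    """
    agent = patient = None
    for head, relation, dependent in arcs:
        if head != trigger:
            continue  # only arcs governed by the trigger word matter
        if relation == "SBV":
            agent = dependent    # subject of the trigger word
        elif relation == "VOB":
            patient = dependent  # object of the trigger word
    return agent, patient


# Hand-written parse of "the first company invests in the second company".
arcs = [
    ("invests", "SBV", "first company"),
    ("invests", "VOB", "second company"),
]
print(extract_entities(arcs, "invests"))
```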
107. The computer device determines the entity element corresponding to the trigger phrase according to the syntactic components of the trigger word and the auxiliary word in the second corpus information, takes the trigger phrase obtained by combining the trigger word and the auxiliary word as a second trigger element, and takes the entity element corresponding to the trigger phrase as a second entity element.
When the computer device determines that the second corpus information includes any trigger word and the auxiliary word corresponding to the trigger word, the computer device performs syntactic analysis on the second corpus information, determines the syntactic components of the trigger word and the auxiliary word in the second corpus information, and determines, according to those syntactic components, the entity element corresponding to the trigger phrase formed by the trigger word and the auxiliary word. The computer device takes the trigger phrase as a second trigger element of the target event and takes the entity element as a second entity element of the target event, where the second trigger element corresponds to the second entity element.
The specific process in step 107 is similar to that in step 106, and is not described herein again.
It should be noted that, in the same corpus information, the entity elements corresponding to the trigger word and to the trigger phrase may differ. The trigger word and its corresponding entity element may be used as event elements of the target event, and the trigger phrase and its corresponding entity element may also be used as event elements of the target event; the trigger element of the target event includes at least one of the trigger word or the trigger phrase. The trigger word is therefore taken as a second trigger element in step 106, and the trigger phrase is taken as a second trigger element in step 107.
It should be noted that the embodiment of the present application is described by taking the case where step 106 is executed before step 107 as an example. In other embodiments, step 107 may be executed before step 106, or only step 106 may be executed, or only step 107 may be executed.
For example, the second corpus information is "with the price of raw materials reduced, the performance of XX company increased significantly this year", the trigger word is "increase", and the trigger phrase is "performance increase". By performing step 106, the entity element "performance" corresponding to the trigger word "increase" is obtained, and the computer device takes "increase" as the second trigger element and "performance" as the corresponding second entity element. By performing step 107, the entity element "XX company" corresponding to the trigger phrase "performance increase" is obtained, and the computer device takes "performance increase" as the second trigger element and "XX company" as the corresponding second entity element.
108. The computer device determines the second trigger element and the second entity element as a second event element of the target event.
The second trigger element and the second entity element obtained by the computer device are the trigger element and the entity element of the target event, and the computer device may determine the second trigger element and the second entity element as the second event element of the target event.
The second event element includes a second trigger element and a second entity element, and the second entity element includes at least one of an agent element (the entity that performs the action) or a patient element (the entity that receives the action) of the target event.
When the second entity element includes only the agent element, the second trigger element and the agent element form a binary tag, and the binary tag is the extracted event element of the target event. For example, if the binary tag is <XX corporation, performance growth>, then "XX corporation" is the agent element and "performance growth" is the trigger element.
When the second entity element includes only the patient element, the second trigger element and the patient element form a binary tag, and the binary tag is the extracted event element of the target event. For example, if the binary tag is <invest, XX corporation>, then "invest" is the trigger element and "XX corporation" is the patient element.
When the second entity element includes both an agent element and a patient element, the agent element, the second trigger element and the patient element form a triple tag, and the triple tag is the extracted event element of the target event. For example, if the triple tag is <company A, invest, company B>, then "company A" is the agent element, "invest" is the trigger element, and "company B" is the patient element.
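The three tag shapes above can be summarized in a few lines. The sketch below is illustrative only: the function name `build_event_tag` and the use of Python tuples to represent tags are assumptions, with "agent" standing for the element that performs the action and "patient" for the element that receives it.

```python
def build_event_tag(trigger, agent=None, patient=None):
    """Assemble an event element tag from a trigger and its entity elements."""
    if agent and patient:
        return (agent, trigger, patient)  # triple tag
    if agent:
        return (agent, trigger)           # binary tag, agent first
    if patient:
        return (trigger, patient)         # binary tag, trigger first
    return None                           # no entity element was found


print(build_event_tag("invest", agent="company A", patient="company B"))
print(build_event_tag("performance growth", agent="XX corporation"))
print(build_event_tag("invest", patient="XX corporation"))
```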
In one possible implementation, the second event element is required to satisfy at least one of the following conditions:
First, the syntactic component of the second trigger element in the second corpus information is a predicate.
Second, the part of speech of the second trigger element in the second corpus information is a verb.
Thirdly, the length of the second event element is not less than a preset value. For example, the preset value is 2 words, the length of the second event element is not less than 2 words, and when the length of the second event element is less than 2 words, the second event element is filtered out and is not taken as the result of the extraction of the target event element.
Fourthly, the part of speech of the second entity element in the second corpus information is a noun. Since the second entity element represents a participant of the target event, the part of speech of the second entity element in the second corpus information should be a noun. If the part of speech of the second entity element in the second corpus information is not a noun, filtering the second entity element, and not taking the second entity element as the result of the extraction of the target event element.
Fifth, the second entity element does not belong to the preset type words, where a type word represents a specific type but cannot represent a specific object belonging to that type. When the second entity element belongs to the preset type words, it can represent only a type and not a specific object, so it carries little information and has little value as an event element of the target event. Therefore, when the second entity element belongs to the preset type words, the second entity element is filtered out and is not used as an extraction result of the target event element.
For example, for the extraction of financial event elements, the type word is set as stock, company, enterprise, group, organization, department, etc., and when the second entity element is "company", the second entity element does not represent a specific company, and thus the second entity element is filtered out.
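The five conditions can be expressed as a simple filter. The embodiment requires at least one of the conditions; the sketch below applies all five together for illustration. The dictionary keys, the function name, and the choice to measure length in tokens are assumptions; only the type word list comes from the example above.

```python
TYPE_WORDS = {"stock", "company", "enterprise", "group", "organization", "department"}


def keep_event_element(candidate, min_length=2, type_words=TYPE_WORDS):
    """Return True when a candidate event element passes all five checks."""
    checks = [
        candidate["trigger_role"] == "predicate",  # first condition
        candidate["trigger_pos"] == "verb",        # second condition
        len(candidate["tokens"]) >= min_length,    # third condition
        candidate["entity_pos"] == "noun",         # fourth condition
        candidate["entity"] not in type_words,     # fifth condition
    ]
    return all(checks)


good = {
    "trigger_role": "predicate", "trigger_pos": "verb",
    "tokens": ["XX corporation", "invest"],
    "entity": "XX corporation", "entity_pos": "noun",
}
generic = dict(good, entity="company", tokens=["company", "invest"])

print(keep_event_element(good))
print(keep_event_element(generic))
```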
According to the method provided by the embodiment of the application, the computer device determines the trigger word pattern set corresponding to the target event according to the event type name of the target event, matches the second corpus information against the trigger word pattern set, determines the second trigger element in the second corpus information, obtains the second entity element corresponding to the second trigger element through syntactic analysis, and thereby extracts the event element of the target event. The event elements can therefore be extracted automatically using only the trigger words and trigger phrases derived from the trigger word pattern set, so the method is easy to operate and the process is simple.
Moreover, second event elements that do not satisfy the preset conditions are filtered out, which improves the quality of the extracted event elements.
Fig. 2 is a flowchart of another event element extraction method provided in an embodiment of the present application. The execution subject of the embodiment of the present application is a computer device, and referring to fig. 2, the method includes:
201. The computer device determines a trigger word set of the target event.
The trigger word set is used to determine whether the corpus information includes an event element of the target event. The trigger word set includes at least one trigger word, and each trigger word is a verb. Before extracting event elements, the computer device needs to determine, according to the trigger word set corresponding to the target event, whether the corpus information includes an event element of the target event, and performs the subsequent extraction operation only when it does. Thus, the computer device first determines the trigger word set of the target event.
In one possible implementation, the computer device performs word segmentation on the event type name of the target event to obtain a plurality of target words, and adds a trigger word belonging to a verb in the plurality of target words and a corresponding synonym to the trigger word set.
The computer device performs word segmentation on the event type name to obtain the plurality of target words included in the event type name, performs part-of-speech analysis on the target words to determine the part of speech of each word, and obtains the trigger word, that is, the word that is a verb. The computer device then obtains the synonyms of the trigger word, for example by computing word-vector similarity with Word2Vec or by consulting a synonym thesaurus, and adds the trigger word and its synonyms to the trigger word set.
In addition, a synonym of the trigger word can also be used to describe the event type. Therefore, when the corpus information includes the trigger word or a synonym of the trigger word, the event described by the corpus information may be considered to belong to the event type, that is, the corpus information includes an event element corresponding to the event type of the target event. For this reason, the computer device adds both the trigger word and its synonyms to the trigger word set.
In another possible implementation manner, the computer device performs word segmentation on the event type name of the target event to obtain a trigger word belonging to the verb, and adds the trigger word and the corresponding synonym to the trigger word set.
The computer device performs word segmentation on the event type name, directly obtains the trigger word that is a verb in the event type name, obtains the synonyms of the trigger word, for example by computing word-vector similarity with Word2Vec or by consulting a synonym thesaurus, and adds the trigger word and its synonyms to the trigger word set.
202. The computer device matches the trigger word set of the target event against the second corpus information.
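The construction of the trigger word set in step 201 can be sketched as follows. To stay self-contained, the sketch replaces the word segmenter, the part-of-speech tagger and the Word2Vec or thesaurus lookup with plain inputs: `target_words` is the already segmented event type name, and `pos_tags` and `synonyms` are hypothetical lookup tables invented for illustration.

```python
def build_trigger_word_set(target_words, pos_tags, synonyms):
    """Collect the verbs of an event type name together with their synonyms."""
    trigger_set = set()
    for word in target_words:
        if pos_tags.get(word) == "verb":
            trigger_set.add(word)
            trigger_set.update(synonyms.get(word, []))
    return trigger_set


# Hypothetical segmentation and lookup tables for an event type "company invest".
target_words = ["company", "invest"]
pos_tags = {"company": "noun", "invest": "verb"}
synonyms = {"invest": ["fund", "finance"]}

trigger_words = build_trigger_word_set(target_words, pos_tags, synonyms)
print(sorted(trigger_words))
```

Matching in steps 202 and 203 then reduces to checking which tokens of the second corpus information are members of `trigger_words`.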
The second corpus information is similar to the second corpus information in step 104, and is not described herein again.
After obtaining the trigger word set of the target event, the computer device matches the trigger word set against the second corpus information. That is, the computer device compares at least one trigger word in the trigger word set with the second corpus information and determines whether the second corpus information includes that trigger word.
203. The computer device extracts the trigger words included in the trigger word set from the second corpus information as second trigger elements.
When it is determined that the second corpus information includes any trigger term in the trigger term set, it is determined that the event described in the second corpus information belongs to the event type, and the trigger term included in the second corpus information can be used for describing the event type, the computer device extracts the trigger term as a second trigger element of the event type, that is, a second trigger element of the target event.
204. The computer device determines the second entity element corresponding to the second trigger element according to the syntactic component of the second trigger element in the second corpus information.
Step 204 is similar to step 106, and will not be described herein again.
205. The computer device determines the second trigger element and the second entity element as a second event element of the target event.
The second event element is similar to the second event element in step 108, and is not described in detail here.
According to the method provided by the embodiment of the application, the computer device determines the trigger word set corresponding to the target event according to the event type name of the target event, matches the second corpus information against the trigger word set, determines the second trigger element in the second corpus information, obtains the second entity element corresponding to the second trigger element through syntactic analysis, and thereby extracts the event element of the target event. The event elements can therefore be extracted automatically using only the trigger words in the trigger word set, so the method is easy to operate and the process is simple.
In the embodiments provided in fig. 1 or fig. 2, when the trigger element of the target event is extracted, a syntax analysis method is adopted to extract the corresponding entity element, but due to the complexity of the corpus information, the syntax of some corpus information is more standard, the entity element can be extracted by the syntax analysis method, and the syntax of some corpus information is not standard enough, and the entity element cannot be extracted by the syntax analysis method. For example, the corpus includes first corpus information and second corpus information, the first corpus information is corpus information with a syntax that is not sufficiently normalized, and the second corpus information is corpus information with a syntax that is more normalized, according to the embodiment provided in fig. 1 or fig. 2, only the second entity element corresponding to the second trigger element of the second corpus information may be extracted, and the first entity element corresponding to the first trigger element of the first corpus information may not be extracted. For the processing of the first corpus information, see the embodiment provided in fig. 3 below.
Fig. 3 is a flowchart of another event element extraction method provided in an embodiment of the present application. The execution subject of the embodiment of the application is computer equipment, and referring to fig. 3, the method includes:
301. The computer device trains the entity element extraction model according to the second event element.
The computer equipment acquires a second event element of the target event, takes the second event element as a training sample, and trains an entity element extraction model corresponding to the target event, wherein the entity element extraction model is used for extracting the entity element of the target event from any corpus information. The process of extracting the second event element may refer to the corresponding embodiment of fig. 1 or fig. 2.
In one possible implementation, the process of training the entity element extraction model by the computer device, as shown in fig. 4, includes the following steps:
3011. The computer device labels a plurality of characters in the second corpus information according to the second entity element in the second event element to obtain sample identification information corresponding to the plurality of characters.
The plurality of characters includes at least the characters of the second entity element.
In a first possible implementation, the entity element extraction model trained by the computer device is used to extract both the agent element and the patient element among the entity elements. The plurality of characters includes the characters of the agent element, the trigger element and the patient element in the second event element.
The character-word mixed vector corresponding to each character is then obtained and input into the entity element extraction model. Because the dimension of the input vector of the entity element extraction model is fixed and the dimension of the character-word mixed vector of each character is also fixed, the number of characters obtained must equal a first preset number, which is the quotient of the dimension of the input vector and the dimension of the character-word mixed vector. The first preset number may be determined from the total number of characters that the agent element, the trigger element and the patient element contain in the general case. When the characters in the second corpus information are labeled, the number of labeled characters equals the first preset number.
In a second possible implementation, the entity element extraction model trained by the computer device is used only to extract the agent element among the entity elements. The plurality of characters includes the characters of the agent element and the second trigger element in the second event element.
The character-word mixed vector corresponding to each character is then obtained and input into the entity element extraction model. Because the dimension of the input vector of the entity element extraction model is fixed and the dimension of the character-word mixed vector of each character is also fixed, the number of characters obtained must equal a second preset number, which is the quotient of the dimension of the input vector and the dimension of the character-word mixed vector. The second preset number may be determined from the total number of characters that the agent element and the trigger element contain in the general case. When the characters in the second corpus information are labeled, the number of labeled characters must equal the second preset number.
In a third possible implementation, the entity element extraction model trained by the computer device is used only to extract the patient element among the entity elements. The plurality of characters includes the characters of the second trigger element and the patient element in the second event element.
The character-word mixed vector corresponding to each character is then obtained and input into the entity element extraction model. Because the dimension of the input vector of the entity element extraction model is fixed and the dimension of the character-word mixed vector of each character is also fixed, the number of characters obtained must equal a third preset number, which is the quotient of the dimension of the input vector and the dimension of the character-word mixed vector. The third preset number may be determined from the total number of characters that the trigger element and the patient element contain in the general case. When the characters in the second corpus information are labeled, the number of labeled characters equals the third preset number.
In the first preset quantity, the second preset quantity and the third preset quantity, the first preset quantity is greater than the second preset quantity and the third preset quantity, and the second preset quantity and the third preset quantity may be equal or unequal.
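The relationship between the fixed input dimension and the preset number of characters is a simple division. The sketch below computes that quotient and then forces a character sequence to the preset length; padding and truncation are one assumed way of meeting the fixed-length requirement and are not stated in the embodiment, and the function names, the "<pad>" placeholder and the example dimensions are hypothetical.

```python
def preset_char_count(input_dim, mixed_dim):
    """Number of characters the model accepts: input dimension / per-character dimension."""
    if input_dim % mixed_dim != 0:
        raise ValueError("input_dim must be a multiple of mixed_dim")
    return input_dim // mixed_dim


def fit_to_count(chars, count, pad="<pad>"):
    """Pad or truncate a character sequence to exactly count items (an assumption)."""
    return (list(chars) + [pad] * count)[:count]


count = preset_char_count(input_dim=1200, mixed_dim=200)
print(count)
print(fit_to_count("abcd", count))
print(fit_to_count("abcdefgh", count))
```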
The sample identification information indicates whether the corresponding character is the prefix (first character) or the suffix (last character) of the second entity element. Optionally, the sample identification information includes a first identifier or a second identifier: the first identifier indicates that the character is the prefix or the suffix of the second entity element, and the second identifier indicates that the character is neither the prefix nor the suffix of the second entity element. According to the second entity element, the computer device judges whether each of the plurality of characters in the second corpus information is the prefix or the suffix of the second entity element, labels the characters that are a prefix or a suffix with the first identifier, and labels the remaining characters with the second identifier.
For example, the first identifier is "1" and the second identifier is "0". For the second corpus information "the first company invests in the second company", the second entity element includes an action element and a victim element, where "the first company" is the action element and "the second company" is the victim element. The characters of each second entity element in the second corpus information are labeled twice, first for the prefix and then for the suffix; the results are shown in Tables 2 and 3.
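TABLE 2
Character of "the first company" | 1st | 2nd | 3rd | 4th |
Prefix label | 1 | 0 | 0 | 0 |
Suffix label | 0 | 0 | 0 | 1 |
TABLE 3
Character of "the second company" | 1st | 2nd | 3rd | 4th |
Prefix label | 1 | 0 | 0 | 0 |
Suffix label | 0 | 0 | 0 | 1 |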
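The labeling described above can be sketched in a few lines. This is a minimal illustration, not the patented procedure itself: the function name `label_prefix_suffix`, the placeholder tokens standing in for characters, and the choice to label only the first occurrence of each entity are assumptions.

```python
def label_prefix_suffix(text_chars, entities, first_id=1, second_id=0):
    """Return (prefix_labels, suffix_labels) for the given entity elements.

    text_chars: list of characters of the corpus text.
    entities:   list of entity elements, each a list of characters.
    """
    prefix = [second_id] * len(text_chars)
    suffix = [second_id] * len(text_chars)
    for ent in entities:
        n = len(ent)
        for start in range(len(text_chars) - n + 1):
            if text_chars[start:start + n] == ent:
                prefix[start] = first_id          # first character of the entity
                suffix[start + n - 1] = first_id  # last character of the entity
                break  # label the first occurrence only (a simplification)
    return prefix, suffix


# Placeholder tokens: A1..A4 = action element, T1..T2 = trigger, B1..B4 = victim element.
chars = ["A1", "A2", "A3", "A4", "T1", "T2", "B1", "B2", "B3", "B4"]
action, victim = chars[0:4], chars[6:10]
prefix_labels, suffix_labels = label_prefix_suffix(chars, [action, victim])
print(prefix_labels)
print(suffix_labels)
```

Restricted to the four positions of each entity, the two label rows have the same shape as the rows of Tables 2 and 3.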
3012. The computer device splices the character vectors corresponding to the plurality of characters with the corresponding word vectors to obtain the character-word mixed vectors corresponding to the plurality of characters.
The computer device obtains the character vector corresponding to each of the plurality of characters, determines the word to which each character belongs in the second corpus information, obtains the word vector of that word, and splices each character vector with the corresponding word vector to obtain the character-word mixed vectors corresponding to the plurality of characters.
The process of obtaining the character vectors, the word vectors, and the character-word mixed vectors is detailed in step 303 below and is not repeated here.
3013. The computer device processes the character-word mixed vectors corresponding to the plurality of characters based on the current entity element extraction model to obtain training identification information corresponding to the plurality of characters.
The computer device obtains the current entity element extraction model, which may be an untrained initial model or a model that has been trained one or more times, and processes the character-word mixed vectors corresponding to the plurality of characters based on this model to obtain the training identification information corresponding to the plurality of characters.
The training identification information indicates whether the corresponding character, as predicted by the current entity element extraction model, is the prefix or the suffix of an entity element. Optionally, the training identification information includes a first identifier or a second identifier: the first identifier indicates that the character is the prefix or the suffix of the second entity element, and the second identifier indicates that it is neither.
3014. The computer device trains the entity element extraction model according to the training identification information and the sample identification information corresponding to the plurality of characters to obtain the trained entity element extraction model.
After obtaining the training identification information and the sample identification information corresponding to the plurality of characters, the computer device compares the two to obtain the error between them and adjusts the parameters of the entity element extraction model according to the error to obtain an adjusted entity element extraction model. Training then continues on the adjusted model using the event elements of other corpus information in the corpus until the model satisfies the convergence condition, at which point training is determined to be complete and the trained entity element extraction model is obtained.
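As an illustration of comparing the training identification information with the sample identification information, the sketch below computes an error value and checks a simple convergence condition. The source does not name the error function or the convergence condition; binary cross-entropy, the tolerance value, and the names `identification_error` and `has_converged` are assumptions made for this example.

```python
import math


def identification_error(predicted_probs, sample_ids):
    """Mean binary cross-entropy between predictions and sample identifiers.

    predicted_probs: predicted probability that each character carries the
                     first identifier ("1").
    sample_ids:      labeled identifiers, 1 (first) or 0 (second).
    """
    eps = 1e-12
    total = 0.0
    for p, y in zip(predicted_probs, sample_ids):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(sample_ids)


def has_converged(errors, tolerance=1e-4):
    """Assumed convergence condition: the error has stopped changing."""
    return len(errors) >= 2 and abs(errors[-2] - errors[-1]) < tolerance


sample_ids = [1, 0, 0, 1]
print(identification_error([0.9, 0.1, 0.2, 0.8], sample_ids))
```

A real training loop would feed this error to an optimizer that updates the model parameters; that part depends on the chosen framework and is omitted here.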
302. The computer device extracts a first trigger element of the target event from the first corpus information.
When the computer device acquires the first corpus information, it extracts the first trigger element of the target event according to the embodiment corresponding to fig. 1 or fig. 2.
303. The computer device splices the character vectors corresponding to the plurality of characters in the first corpus information with the corresponding word vectors to obtain the character-word mixed vectors corresponding to the plurality of characters.
In a possible implementation manner, word segmentation is performed on the first corpus information to obtain the word vector of each word in the first corpus information, and the character vector of each character is spliced with the word vector of the word to which the character belongs to obtain the character-word mixed vector corresponding to each character.
That is, the computer device divides the first corpus information character by character to obtain the plurality of characters included in the first corpus information, and processes these characters to obtain the character vector corresponding to each character. The computer device also performs word segmentation on the first corpus information to obtain the plurality of words included in the first corpus information, and processes these words to obtain the word vector corresponding to each word. After obtaining the character vectors and the word vectors, the computer device determines the word to which each character belongs in the first corpus information, obtains the word vector of that word, and splices each character vector with the corresponding word vector to obtain the character-word mixed vectors corresponding to the plurality of characters.
Optionally, the character vectors all have the same dimension, and after the word vectors are obtained, a matrix transformation is applied to them so that each word vector has the same dimension as the corresponding character vector. Splicing the character vectors with the corresponding word vectors therefore yields character-word mixed vectors that all have the same dimension.
The computer device may perform word segmentation on the first corpus information by using a preset word segmentation algorithm, including but not limited to a maximum matching word segmentation method, a semantic word segmentation method, and a statistical word segmentation method. The computer device may use Word2Vec to obtain the character vector corresponding to each character or the word vector corresponding to each word.
As shown in fig. 5, the first corpus information "investor funding" is divided both character by character and word by word to obtain the vector of each character and of each word. Each word vector is repeated so that the number of word vectors equals the number of character vectors; the number of repetitions is the number of characters forming the word. Referring to fig. 5, the word vector 501 of the first word ("investor") is spliced with the character vectors 511, 521, and 531 of its three characters, the word vector 502 of the second word ("incoming") is spliced with the character vector 512 of its single character, and the word vector 503 of the third word ("funding") is spliced with the character vectors 513 and 523 of its two characters, yielding the character-word mixed vector corresponding to each character in the first corpus information.
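The splicing shown in fig. 5 can be sketched as follows. This is a minimal illustration under assumptions: the function name `mixed_vectors`, the toy two-dimensional vectors, the placeholder characters, and the concatenation order (character vector first, word vector second) are not specified by the source.

```python
def mixed_vectors(words, word_vecs, char_vecs):
    """Splice each character vector with the vector of the word containing it.

    words:     segmented words, each a sequence of characters.
    word_vecs: one vector per word, in the same order as `words`.
    char_vecs: one vector per character, in text order.
    """
    mixed = []
    i = 0  # index of the current character in the text
    for word, w_vec in zip(words, word_vecs):
        for _ in word:  # the word vector is repeated once per character
            mixed.append(list(char_vecs[i]) + list(w_vec))
            i += 1
    if i != len(char_vecs):
        raise ValueError("character count does not match the segmentation")
    return mixed


# Three words of 3, 1, and 2 characters, mirroring the layout of fig. 5.
words = [["a", "b", "c"], ["d"], ["e", "f"]]
word_vecs = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
char_vecs = [[0.1, 0.0], [0.2, 0.0], [0.3, 0.0], [0.4, 0.0], [0.5, 0.0], [0.6, 0.0]]
result = mixed_vectors(words, word_vecs, char_vecs)
print(len(result), len(result[0]))
```

Every character receives a vector of the same dimension, which is what allows a fixed number of characters to fill the fixed-size model input.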
304. The computer device processes the character-word mixed vectors corresponding to the plurality of characters based on the entity element extraction model, and determines the prefix and the suffix of the first entity element.
After obtaining the character-word mixed vectors corresponding to the characters in the first corpus information, the computer device inputs them into the entity element extraction model and determines the prefix and the suffix of the first entity element according to the output result.
In a possible implementation manner, the computer device processes the character-word mixed vectors corresponding to the plurality of characters based on the entity element extraction model to obtain identification information corresponding to the plurality of characters, and determines the prefix and the suffix of the first entity element according to the identification information. The identification information indicates whether the corresponding character is the prefix or the suffix of the first entity element.
Optionally, the identification information includes a first identifier or a second identifier: the first identifier indicates that the character is the prefix or the suffix of the first entity element, and the second identifier indicates that it is neither. The computer device determines at least two characters corresponding to the first identifier in the first corpus information, and determines the prefix and the suffix of the first entity element in turn according to the order in which these characters appear in the first corpus information.
Following the order in which the characters corresponding to the first identifier appear in the first corpus information, the computer device determines the first such character as a prefix of the first entity element, the next as a suffix, the next as a prefix, and so on, alternating between prefix and suffix, so that the last character corresponding to the first identifier is determined as a suffix. Moreover, since the action element is located before the victim element, the first prefix-suffix pair can be determined as the prefix and the suffix of the action element, and the second pair as the prefix and the suffix of the victim element.
In one possible implementation, the entity element extraction model may be a deep learning model composed of LSTM (Long Short-Term Memory), self-attention, and CNN (Convolutional Neural Network) components. As shown in fig. 6, the computer device inputs the character-word mixed vectors corresponding to the plurality of characters into the LSTM network 601 for encoding; the LSTM network 601 extracts the relevant features of the characters in the first corpus information and outputs a feature coding sequence. The feature coding sequence is input into the self-attention mechanism 602, which mines the prefixes and suffixes, and the result is then input into the CNN 603 to obtain an extraction result. The extraction result is input into the fully connected layer 604, which outputs the features most relevant to the action relationship or the victim relationship with the trigger element; a Softmax layer 605 then performs classification and outputs the identification information corresponding to the plurality of characters, which includes the first identifier or the second identifier.
For example, the first identifier is "1" and the second identifier is "0". For the first corpus information "the first company invested", the character-word mixed vectors corresponding to the plurality of characters in the first corpus information are input into the entity element extraction model. Referring to fig. 6, the prefix of the action element of the trigger element "invested" is the first character of "the first company", and the suffix of the action element is its last character.
305. The computer device combines the prefix and the suffix of the first entity element and the characters between them, in the order in which they appear in the first corpus information, to obtain the first entity element.
The computer device determines the prefix and the suffix of the first entity element and the characters between them, and combines them in the order in which they appear in the first corpus information to obtain the first entity element.
When the computer device determines a plurality of prefixes and a plurality of suffixes, it groups them in order of appearance, each group consisting of the prefix and the suffix of the same entity element: the first prefix and the first suffix form one group, the second prefix and the second suffix form another group, and so on. The computer device determines the characters between the prefix and the suffix of each group and combines each group's prefix, suffix, and intervening characters to obtain a plurality of entity elements, all of which are taken as first entity elements.
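Steps 304 and 305 can be illustrated with a short decoding routine that pairs the flagged characters alternately as prefix and suffix and merges each span. This is a sketch under assumptions: the function name `decode_entities`, the placeholder text, and the decision to ignore an unpaired trailing prefix are hypothetical details not given in the source.

```python
def decode_entities(chars, ids, first_id=1):
    """Pair flagged positions alternately as prefix, suffix, prefix, suffix, ...

    chars: characters of the corpus text, in order.
    ids:   identification information per character (first or second identifier).
    Returns the merged entity elements in order of appearance.
    """
    flagged = [i for i, v in enumerate(ids) if v == first_id]
    entities = []
    # Step through the flagged positions two at a time; an unpaired
    # trailing prefix is ignored (an assumption).
    for k in range(0, len(flagged) - 1, 2):
        start, end = flagged[k], flagged[k + 1]
        entities.append("".join(chars[start:end + 1]))
    return entities


chars = list("ABCDxyEFGH")            # placeholder text: ABCD, trigger xy, EFGH
ids = [1, 0, 0, 1, 0, 0, 1, 0, 0, 1]  # hypothetical model output
entities = decode_entities(chars, ids)
print(entities)
```

With the action element preceding the victim element, the first decoded span would be taken as the action element and the second as the victim element.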
Steps 304-305 process the first corpus information based on the entity element extraction model to obtain the first entity element in the first corpus information, where the first entity element includes at least one of an action element or a victim element of the target event.
In a first possible implementation manner, the entity element extraction model is used to extract both the action element and the victim element corresponding to the trigger element. Because the dimension of the input vector of the entity element extraction model is fixed, and the dimension of the character-word mixed vector corresponding to each character is also fixed, the number of acquired characters must equal a first preset number, which is the quotient of the dimension of the input vector and the dimension of the character-word mixed vector. Therefore, a first preset number of characters can be selected from the first corpus information, ensuring that the selected characters include the first trigger element.
In a second possible implementation manner, the entity element extraction model is only used for extracting the action element corresponding to the trigger element, and steps 304 to 305 include: the computer device obtains at least one character within a second preset distance before the first trigger element, processes the character-word mixed vectors corresponding to the at least one character and to each character in the first trigger element based on the entity element extraction model, and determines the prefix and the suffix of the action element. The computer device then combines the prefix and the suffix of the action element and the characters between them, in the order in which they appear in the first corpus information, to obtain the action element.
The action element is the subject of the first trigger element and is located before the first trigger element in the first corpus information, so the action element is extracted from the at least one character within the second preset distance before the first trigger element. The second preset distance may be expressed as a number of words or characters, for example 8 words or 15 characters.
Because the dimension of the input vector of the entity element extraction model is fixed, and the dimension of the character-word mixed vector corresponding to each character is also fixed, the number of acquired characters must equal a second preset number, which is the quotient of the dimension of the input vector and the dimension of the character-word mixed vector. Therefore, a second preset number of characters can be selected from the first corpus information, ensuring that the selected characters include the first trigger element.
In a third possible implementation manner, the entity element extraction model is only used for extracting the victim element corresponding to the trigger element, and steps 304 to 305 include: the computer device obtains at least one character within a third preset distance after the first trigger element, processes the character-word mixed vectors corresponding to the at least one character and to each character in the first trigger element based on the entity element extraction model, and determines the prefix and the suffix of the victim element. The computer device then combines the prefix and the suffix of the victim element and the characters between them, in the order in which they appear in the first corpus information, to obtain the victim element.
The victim element is the object of the first trigger element and is located after the first trigger element in the first corpus information, so the victim element is extracted from the at least one character within the third preset distance after the first trigger element. The third preset distance may be expressed as a number of words or characters, for example 8 words or 15 characters.
Because the dimension of the input vector of the entity element extraction model is fixed, and the dimension of the character-word mixed vector corresponding to each character is also fixed, the number of acquired characters must equal a third preset number, which is the quotient of the dimension of the input vector and the dimension of the character-word mixed vector. Therefore, a third preset number of characters can be selected from the first corpus information, ensuring that the selected characters include the first trigger element.
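The selection of characters around the trigger element in the second and third implementation manners can be sketched as follows. The helper names, the placeholder text, the distances, and the order in which the window and the trigger characters are joined are assumptions made for illustration; distances are counted in characters here.

```python
def chars_before_trigger(chars, trigger_start, distance):
    """Characters within `distance` positions before the trigger element."""
    return chars[max(0, trigger_start - distance):trigger_start]


def chars_after_trigger(chars, trigger_end, distance):
    """Characters within `distance` positions after the trigger element.

    `trigger_end` is the index just past the last trigger character.
    """
    return chars[trigger_end:trigger_end + distance]


chars = list("ABCDxyEFGH")  # placeholder text; the trigger element is "xy"
trigger_start, trigger_end = 4, 6
trigger = chars[trigger_start:trigger_end]

# Candidate input for an action-only model: preceding window plus trigger.
action_input = chars_before_trigger(chars, trigger_start, 3) + trigger
# Candidate input for a victim-only model: trigger plus following window.
victim_input = trigger + chars_after_trigger(chars, trigger_end, 2)
print(action_input, victim_input)
```

Both candidate inputs contain the trigger element, as the text requires; padding them to the preset number of characters is a separate step.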
306. The computer device determines the first trigger element and the first entity element as a first event element of the target event.
The first trigger element and the first entity element obtained by the computer device are the trigger element and the entity element of the target event, and then the computer device may determine the first trigger element and the first entity element as the first event element of the target event.
It should be noted that, in the embodiment of the present application, only a process of extracting a first entity element based on an entity element extraction model to obtain a first event element of a target event is described in detail. In another embodiment, the embodiment corresponding to fig. 1 or fig. 2 is combined with the embodiment corresponding to fig. 3 to extract the event element of the target event.
Fig. 7 is a flowchart of another event element extraction method provided in an embodiment of the present application, and referring to fig. 7, the method includes:
701. The computer device determines a trigger word pattern set or a trigger word set corresponding to the event type according to the event type name of the target event.
702. The computer device matches the second corpus information against the trigger word pattern set or the trigger word set to obtain a second trigger element corresponding to the second corpus information.
703. The computer device performs syntactic analysis on the second corpus information to obtain a second entity element corresponding to the second trigger element; the second trigger element and the corresponding second entity element are the second event element of the target event.
704. The computer device trains an entity element extraction model according to the second event element.
705. The computer device matches the first corpus information against the trigger word pattern set or the trigger word set to obtain a first trigger element corresponding to the first corpus information.
706. The computer device inputs the character-word mixed vectors corresponding to the plurality of characters in the first corpus information into the entity element extraction model.
707. The computer device obtains a first entity element corresponding to the first trigger element according to the output result of the entity element extraction model. The first trigger element and the corresponding first entity element are the first event element of the target event.
The first event element and the second event element are both event elements of the target event.
It should be noted that step 301 describes the process of training the entity element extraction model, and steps 302 to 306 describe the process of extracting the first entity element with the trained entity element extraction model. After step 301 is executed, steps 302-306 may be executed at any time; that is, the present application does not limit the timing for executing steps 302-306.
According to the method provided by the embodiment of the application, a first trigger element of a target event is extracted from first corpus information; the first corpus information is processed based on the entity element extraction model to obtain a first entity element in the first corpus information; and the first trigger element and the first entity element are determined as a first event element of the target event. The entity element extraction model is trained on a second event element extracted from second corpus information, the second event element includes a second trigger element and a second entity element of the target event, and the entity element of the target event includes at least one of an action element or a victim element of the target event. Because the extracted second event element of the target event is used as the sample for training the entity element extraction model, the event types corresponding to the various event elements do not need to be labeled one by one. The operation is simple, no manual labeling is needed, manpower and time are saved, and efficiency is improved. Moreover, because the second event element extracted from the second corpus information is used as the sample, sample coverage is high, which can improve the recall of the trained entity element extraction model and the flexibility of event element extraction.
The computer device extracts the second event element by syntactic analysis and extracts the first event element with the entity element extraction model, thereby extracting the event elements of the target event.
Fig. 8 is a schematic structural diagram of an event element extraction apparatus according to an embodiment of the present application. Referring to fig. 8, the apparatus includes:
an element extraction module 801, configured to extract a first trigger element of a target event from the first corpus information;
the information processing module 802 is configured to process the first corpus information based on the entity element extraction model to obtain a first entity element in the first corpus information;
an event element determining module 803, configured to determine the first trigger element and the first entity element as a first event element of the target event;
the entity element extraction model is trained on a second event element extracted from the second corpus information, the second event element includes a second trigger element and a second entity element of the target event, and the entity element of the target event includes at least one of an action element or a victim element of the target event.
According to the device provided by the embodiment of the application, a first trigger element of a target event is extracted from first corpus information; the first corpus information is processed based on the entity element extraction model to obtain a first entity element in the first corpus information; and the first trigger element and the first entity element are determined as a first event element of the target event. The entity element extraction model is trained on a second event element extracted from second corpus information, the second event element includes a second trigger element and a second entity element of the target event, and the entity element of the target event includes at least one of an action element or a victim element of the target event. Because the extracted second event element of the target event is used as the sample for training the entity element extraction model, the event types corresponding to the various event elements do not need to be labeled one by one. The operation is simple, no manual labeling is needed, manpower and time are saved, and efficiency is improved. Moreover, because the second event element extracted from the second corpus information is used as the sample, sample coverage is high, which can improve the recall of the trained entity element extraction model and the flexibility of event element extraction.
Optionally, referring to fig. 9, the apparatus further comprises:
the element extraction module 801 is further configured to extract a second event element of the target event from the second corpus information;
and the model training module 804 is configured to train an entity element extraction model according to the second event element, where the entity element extraction model is configured to extract an entity element from any corpus information.
Optionally, referring to fig. 9, the trigger element of the target event includes a trigger word, and the element extraction module 801 includes:
the first matching unit 8011 is configured to match the trigger word set of the target event with the second corpus information;
an extracting unit 8012, configured to extract, from the second corpus information, a trigger word included in the trigger word set as the second trigger element;
the first determining unit 8013 is configured to determine, according to the syntax component of the second trigger element in the second corpus information, a second entity element corresponding to the second trigger element.
Optionally, referring to fig. 9, the apparatus further comprises:
a first word segmentation module 805, configured to perform word segmentation on the event type name of the target event to obtain a trigger word belonging to a verb;
a first adding module 806, configured to add the trigger word and its corresponding synonyms to the trigger word set.
Optionally, referring to fig. 9, the trigger element of the target event includes at least one of a trigger word or a trigger phrase, the trigger phrase includes at least two words, and the element extraction module 801 includes:
a second matching unit 8014, configured to match a trigger word pattern set of the target event with the second corpus information, where the trigger word pattern set includes at least one trigger word pattern, and the trigger word pattern includes a trigger word of the target event and an auxiliary word used for combining with the trigger word;
a combining unit 8015, configured to, when the second corpus information includes any trigger word and an auxiliary word corresponding to the trigger word, and no negative word exists between the trigger word and the auxiliary word, combine the trigger word and the auxiliary word to obtain a trigger phrase;
a second determining unit 8016, configured to determine, according to a syntax component of the trigger word in the second corpus information, an entity element corresponding to the trigger word as a second entity element;
the second determining unit 8016 is further configured to determine, as the second entity element, an entity element corresponding to the trigger phrase according to the syntactic components of the trigger word and the auxiliary word in the second corpus information.
Optionally, referring to fig. 9, the combining unit 8015 is further configured to, when the second corpus information includes a trigger word and an auxiliary word corresponding to the trigger word, no negative word exists between the trigger word and the auxiliary word, and a distance between the trigger word and the auxiliary word is not greater than a first preset distance, combine the trigger word and the auxiliary word to obtain the trigger phrase.
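The conditions applied by the combining unit 8015 can be sketched as a small predicate. This is an illustration only: the function name `combine_trigger_phrase`, the sample words, the negative word list, and measuring the distance as the number of words between the trigger word and the auxiliary word are assumptions.

```python
def combine_trigger_phrase(words, trigger, auxiliary, negative_words, max_distance):
    """Return (trigger, auxiliary) as a trigger phrase, or None if a check fails.

    words:          segmented words of the corpus text.
    negative_words: words treated as negations (hypothetical list).
    max_distance:   first preset distance, counted in intervening words.
    """
    if trigger not in words or auxiliary not in words:
        return None
    t, a = words.index(trigger), words.index(auxiliary)
    lo, hi = min(t, a), max(t, a)
    between = words[lo + 1:hi]
    if any(w in negative_words for w in between):
        return None  # a negative word separates the two
    if len(between) > max_distance:
        return None  # trigger word and auxiliary word are too far apart
    return (trigger, auxiliary)


negatives = {"not", "no"}
ok = combine_trigger_phrase(["A", "acquires", "B", "equity"],
                            "acquires", "equity", negatives, 3)
blocked = combine_trigger_phrase(["A", "acquires", "no", "equity"],
                                 "acquires", "equity", negatives, 3)
print(ok, blocked)
```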
Optionally, referring to fig. 9, the apparatus further comprises:
a second word segmentation module 807, configured to perform word segmentation on the event type name of the target event to obtain a plurality of target words;
a second adding module 808, configured to add a trigger word belonging to a verb in the multiple target words and a corresponding synonym to the trigger word set;
the combination module 809 is configured to combine each target word of the plurality of target words with its corresponding synonyms into a word set, so as to obtain a plurality of word sets;
the third adding module 810 is configured to select one word from each word set, combine the selected words to obtain a trigger word pattern, and add the trigger word pattern to the trigger word pattern set.
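Selecting one word from each word set and combining the selections amounts to a Cartesian product over the word sets. The sketch below shows this with `itertools.product`; the function name `build_trigger_patterns`, the sample target words, and the synonym table are hypothetical and stand in for the output of word segmentation and a synonym lookup.

```python
from itertools import product


def build_trigger_patterns(target_words, synonyms):
    """Build trigger word patterns by picking one word from each word set.

    target_words: words obtained by segmenting the event type name.
    synonyms:     mapping from a target word to its synonyms (hypothetical data).
    """
    word_sets = [[w] + list(synonyms.get(w, [])) for w in target_words]
    return [tuple(combo) for combo in product(*word_sets)]


patterns = build_trigger_patterns(
    ["acquire", "equity"],
    {"acquire": ["purchase"], "equity": ["shares"]},
)
print(patterns)
```

Enumerating every combination is the simplest reading of "select any word from each word set"; a real system might prune combinations that never occur in the corpus.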
Optionally, referring to fig. 9, the second event element satisfies at least one of the following conditions:
the part of speech of the second trigger element in the second corpus information is a verb;
the syntactic component of the second trigger element in the second corpus information is a predicate;
the length of the second event element is not less than a preset value;
the part of speech of the second entity element in the second corpus information is a noun;
the second entity element does not belong to a preset type word, and the preset type word represents a specific type but cannot represent a specific object belonging to the specific type.
Optionally, referring to fig. 9, the model training module 804 includes:
a labeling unit 8041, configured to label, according to the second entity element, multiple characters in the second corpus information to obtain sample identification information corresponding to the multiple characters, where the multiple characters at least include the characters in the second entity element, and the sample identification information is used to indicate whether the corresponding character is a prefix or a suffix of the second entity element;
a first splicing unit 8042, configured to splice the character vectors corresponding to the multiple characters with the corresponding word vectors to obtain character-word mixed vectors corresponding to the multiple characters;
a first processing unit 8043, configured to process, based on the current entity element extraction model, the character-word mixed vectors corresponding to the multiple characters to obtain training identification information corresponding to the multiple characters, where the training identification information is used to indicate whether the corresponding character is a prefix or a suffix of an entity element;
a training unit 8044, configured to train the entity element extraction model according to the training identification information and the sample identification information corresponding to the multiple characters, so as to obtain a trained entity element extraction model.
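The labeling step can be sketched as follows, treating each character of the corpus as one labeled unit. The sketch assumes a binary label (1 for the first or last character of the entity element, 0 otherwise) and labels only the first occurrence of the entity; the function name `label_boundaries` is hypothetical, and the model training itself is not shown.

```python
def label_boundaries(text, entity):
    """Return one label per character of `text`.

    A label is 1 for the first or last character of `entity` in `text`
    and 0 for every other character.
    """
    labels = [0] * len(text)
    start = text.find(entity)
    if entity and start != -1:
        labels[start] = 1                      # prefix of the entity
        labels[start + len(entity) - 1] = 1    # suffix of the entity
    return labels


labels = label_boundaries("ABC buys XYZ", "XYZ")
```

Labels produced this way play the role of the sample identification information; the training identification information would be the model's prediction of the same sequence.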
Optionally, referring to fig. 9, the information processing module 802 includes:
a second splicing unit 8021, configured to splice the character vectors corresponding to multiple characters in the first corpus information with the corresponding word vectors to obtain character-word mixed vectors corresponding to the multiple characters;
a second processing unit 8022, configured to process, based on the entity element extraction model, the character-word mixed vectors corresponding to the multiple characters, and determine a prefix and a suffix of the first entity element;
a merging unit 8023, configured to merge the prefix and the suffix of the first entity element and the characters between them, in the order in which they are arranged in the first corpus information, to obtain the first entity element.
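The merging step amounts to taking the span from the prefix to the suffix in corpus order. A minimal sketch, assuming the prefix and suffix have already been located as character positions (the function name `merge_span` is hypothetical):

```python
def merge_span(text, prefix_index, suffix_index):
    """Join the prefix, the suffix and every character between them."""
    if not 0 <= prefix_index <= suffix_index < len(text):
        raise ValueError("prefix must not come after suffix")
    return text[prefix_index:suffix_index + 1]


entity = merge_span("ABC buys XYZ", 9, 11)
```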
Optionally, referring to fig. 9, the first entity element includes an event-applying element of the target event, and the information processing module 802 includes:
a word obtaining unit 8024, configured to obtain at least one character within a second preset distance before the first trigger element;
the second processing unit 8022 is further configured to process, based on the entity element extraction model, the character-word mixed vector corresponding to each character in the at least one character and the first trigger element, and determine a prefix and a suffix of the event-applying element;
the merging unit 8023 is further configured to merge the prefix and the suffix of the event-applying element and the characters between them, in the order in which they are arranged in the first corpus information, to obtain the event-applying element.
Optionally, referring to fig. 9, the first entity element includes a subject element of the target event, and the information processing module 802 includes:
a word obtaining unit 8024, configured to obtain at least one character within a third preset distance after the first trigger element;
the second processing unit 8022 is further configured to process, based on the entity element extraction model, the character-word mixed vector corresponding to each character in the at least one character and the first trigger element, and determine a prefix and a suffix of the subject element;
the merging unit 8023 is further configured to merge the prefix and the suffix of the subject element and the characters between them, in the order in which they are arranged in the first corpus information, to obtain the subject element.
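The two context windows described above (text before the trigger for the event-applying element, text after the trigger for the subject element) can be sketched as one helper. The function name `context_window`, the `side` argument and the distance of 4 characters are assumptions; the sketch measures distance in characters and uses the first occurrence of the trigger.

```python
def context_window(text, trigger, distance, side):
    """Return up to `distance` characters before or after the trigger."""
    start = text.find(trigger)
    if start == -1:
        return ""
    end = start + len(trigger)
    if side == "before":
        return text[max(0, start - distance):start]
    if side == "after":
        return text[end:end + distance]
    raise ValueError("side must be 'before' or 'after'")


sentence = "ABC buys XYZ"
before = context_window(sentence, "buys", 4, "before")  # window for the event-applying element
after = context_window(sentence, "buys", 4, "after")    # window for the subject element
```

In the described apparatus, the window and the trigger element would then be passed together to the entity element extraction model.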
Optionally, referring to fig. 9, the second splicing unit 8021 is further configured to split the first corpus information into characters to obtain a character vector of each character in the first corpus information;
the second splicing unit 8021 is further configured to perform word segmentation on the first corpus information to obtain a word vector of each word in the first corpus information;
the second splicing unit 8021 is further configured to splice the character vector of each character with the word vector of the corresponding word, so as to obtain a character-word mixed vector corresponding to each character.
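The splicing step can be sketched with plain Python lists. The sketch assumes that the unit being vectorized is a single character, that its "corresponding word" is the segmented word containing it, and that vectors come from lookup tables; the tables `char_vectors` and `word_vectors` and their values are hypothetical toy data.

```python
def mixed_vectors(words, char_vectors, word_vectors):
    """Concatenate each character's vector with the vector of its word.

    `words` is the segmented text; the result has one entry per character.
    """
    mixed = []
    for word in words:
        for ch in word:
            # List concatenation appends the word vector to the character vector.
            mixed.append((ch, char_vectors[ch] + word_vectors[word]))
    return mixed


# Hypothetical toy embeddings: 1-dimensional characters, 2-dimensional words.
words = ["ab", "c"]
char_vectors = {"a": [1.0], "b": [2.0], "c": [3.0]}
word_vectors = {"ab": [0.1, 0.2], "c": [0.3, 0.4]}
result = mixed_vectors(words, char_vectors, word_vectors)
```

Every character of the same word shares that word's vector, so the mixed vector carries both character-level and word-level information.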
Optionally, referring to fig. 9, the second processing unit 8022 is further configured to process, based on the entity element extraction model, the character-word mixed vectors corresponding to the multiple characters to obtain identification information corresponding to the multiple characters, where the identification information is used to indicate whether the corresponding character is a prefix or a suffix of the first entity element;
the second processing unit 8022 is further configured to determine a prefix and a suffix of the first entity element according to the identification information.
Optionally, referring to fig. 9, the identification information includes a first identifier or a second identifier, where the first identifier indicates that the character is a prefix or a suffix of the first entity element, and the second identifier indicates that the character is neither a prefix nor a suffix of the first entity element, and the second processing unit 8022 is further configured to determine at least two characters corresponding to the first identifier in the first corpus information;
the second processing unit 8022 is further configured to determine, in sequence, a prefix and a suffix of the first entity element according to the arrangement order, in the first corpus information, of the at least two characters corresponding to the first identifier.
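Reading the prefix and suffix off the identification information in order can be sketched as pairing up the flagged positions. The sketch assumes the first identifier is encoded as 1 and the second as 0, and that a trailing unpaired flag is ignored; the function name `pair_boundaries` is hypothetical.

```python
def pair_boundaries(labels):
    """Pair flagged positions, in order, as (prefix, suffix) index pairs."""
    flagged = [i for i, label in enumerate(labels) if label == 1]
    # The 1st flagged position is a prefix, the 2nd its suffix, the 3rd the
    # next prefix, and so on; an unpaired trailing position is dropped.
    return list(zip(flagged[0::2], flagged[1::2]))


pairs = pair_boundaries([1, 0, 1, 0, 0, 1, 1])
```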
It should be noted that when the event element extraction apparatus provided in the above embodiment extracts event elements, the division into the functional modules described above is used only as an example. In practical applications, the functions may be allocated to different functional modules as needed, that is, the internal structure of the computer device may be divided into different functional modules to perform all or part of the functions described above. In addition, the event element extraction apparatus and the event element extraction method provided by the above embodiments belong to the same concept; the specific implementation process is described in the method embodiments and is not repeated here.
Fig. 10 is a schematic structural diagram of a terminal 1000 according to an exemplary embodiment of the present application. Terminal 1000 can be operative to perform steps performed by a computer device in the event element extraction methodology described above.
In general, terminal 1000 can include: a processor 1001 and a memory 1002.
In some embodiments, terminal 1000 may further optionally include: a peripheral interface 1003 and at least one peripheral. The processor 1001, the memory 1002 and the peripheral interface 1003 may be connected by a bus or signal line. Each peripheral may be connected to the peripheral interface 1003 via a bus, signal line, or circuit board. Specifically, the peripheral includes at least one of: a radio frequency circuit 1004, a display screen 1005, a camera assembly 1006, an audio circuit 1007, and a power supply 1009.
The peripheral interface 1003 may be used to connect at least one peripheral related to I/O (Input/Output) to the processor 1001 and the memory 1002. In some embodiments, processor 1001, memory 1002, and peripheral interface 1003 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 1001, the memory 1002, and the peripheral interface 1003 may be implemented on separate chips or circuit boards, which are not limited by this embodiment.
The radio frequency circuit 1004 is used for receiving and transmitting RF (Radio Frequency) signals, also called electromagnetic signals. The radio frequency circuit 1004 communicates with communication networks and other communication devices via electromagnetic signals. The radio frequency circuit 1004 converts an electrical signal into an electromagnetic signal for transmission, or converts a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 1004 includes: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so forth. The radio frequency circuit 1004 may communicate with other devices via at least one wireless communication protocol. The wireless communication protocols include, but are not limited to: metropolitan area networks, various generations of mobile communication networks (2G, 3G, 4G, and 5G), wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the radio frequency circuit 1004 may further include NFC (Near Field Communication) related circuits, which is not limited in this application.
The display screen 1005 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display screen 1005 is a touch display screen, the display screen 1005 also has the ability to capture touch signals on or over the surface of the display screen 1005. The touch signal may be input to the processor 1001 as a control signal for processing. In this case, the display screen 1005 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, there may be one display screen 1005, disposed on the front panel of terminal 1000; in other embodiments, there may be at least two display screens 1005, respectively disposed on different surfaces of terminal 1000 or in a folded design; in still other embodiments, the display screen 1005 may be a flexible display disposed on a curved surface or a folded surface of terminal 1000. Furthermore, the display screen 1005 may be arranged in a non-rectangular irregular shape, that is, a shaped screen. The display screen 1005 may be made of materials such as LCD (Liquid Crystal Display) or OLED (Organic Light-Emitting Diode).
The camera assembly 1006 is used to capture images or video. Optionally, the camera assembly 1006 includes a front camera and a rear camera. Typically, the front camera is disposed on the front panel of terminal 1000 and the rear camera is disposed on the back of terminal 1000. In some embodiments, there are at least two rear cameras, each being any one of a main camera, a depth-of-field camera, a wide-angle camera and a telephoto camera, so that the main camera and the depth-of-field camera can be fused to realize a background blurring function, and the main camera and the wide-angle camera can be fused to realize panoramic shooting and VR (Virtual Reality) shooting functions or other fused shooting functions. In some embodiments, the camera assembly 1006 may also include a flash. The flash may be a single-color-temperature flash or a dual-color-temperature flash. A dual-color-temperature flash is a combination of a warm-light flash and a cold-light flash, and can be used for light compensation at different color temperatures.
The audio circuit 1007 may include a microphone and a speaker. The microphone is used for collecting sound waves from the user and the environment, converting the sound waves into electrical signals, and inputting the electrical signals to the processor 1001 for processing or to the radio frequency circuit 1004 for voice communication. For stereo sound collection or noise reduction, multiple microphones may be provided, each at a different location of terminal 1000. The microphone may also be an array microphone or an omnidirectional pickup microphone. The speaker is used to convert electrical signals from the processor 1001 or the radio frequency circuit 1004 into sound waves. The speaker may be a conventional thin-film speaker or a piezoelectric ceramic speaker. When the speaker is a piezoelectric ceramic speaker, it can convert an electrical signal not only into sound waves audible to humans, but also into sound waves inaudible to humans for purposes such as distance measurement. In some embodiments, the audio circuit 1007 may also include a headphone jack.
Those skilled in the art will appreciate that the configuration shown in FIG. 10 is not intended to be limiting and that terminal 1000 can include more or fewer components than shown, or some components can be combined, or a different arrangement of components can be employed.
Fig. 11 is a schematic structural diagram of a server according to an embodiment of the present application. The server 1100 may vary considerably depending on its configuration or performance, and may include one or more processors (Central Processing Units, CPUs) 1101 and one or more memories 1102, where the memory 1102 stores at least one program code that is loaded and executed by the processor 1101 to implement the methods provided by the above method embodiments. Of course, the server may also have components such as a wired or wireless network interface, a keyboard, and an input/output interface for input and output, and may further include other components for implementing device functions, which are not described here again.
The server 1100 may be used to perform the steps performed by the computer device in the event element extraction method described above.
The embodiment of the present application further provides a computer device for extracting an event element, where the computer device includes a processor and a memory, and the memory stores at least one program code, and the at least one program code is loaded and executed by the processor, so as to implement the operations in the event element extraction method of the foregoing embodiment.
The embodiment of the present application further provides a computer-readable storage medium, where at least one program code is stored in the computer-readable storage medium, and the at least one program code is loaded and executed by a processor to implement the operations in the event element extraction method of the foregoing embodiment.
The embodiment of the present application further provides a computer program, where the computer program includes at least one program code, and the at least one program code is loaded and executed by a processor to implement the operations in the event element extraction method of the foregoing embodiment.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The above description is only an optional embodiment of the present application and should not be construed as limiting the present application. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present application shall fall within the protection scope of the present application.
Claims (22)
1. An event element extraction method, characterized in that the method comprises:
matching a trigger word pattern set of a target event with second corpus information, wherein the trigger word pattern set comprises at least one trigger word pattern, the trigger word pattern comprises a trigger word of the target event and an auxiliary word used for being combined with the trigger word, a trigger element of the target event comprises a trigger phrase, and the trigger phrase comprises at least two words;
when the second corpus information comprises any trigger word and an auxiliary word corresponding to the trigger word and no negative word exists between the trigger word and the auxiliary word, combining the trigger word and the auxiliary word to obtain a trigger phrase serving as a second trigger element;
determining an entity element corresponding to the trigger word as a second entity element according to the syntactic component of the trigger word in the second corpus information;
determining an entity element corresponding to the trigger phrase as the second entity element according to the syntactic components of the trigger word and the auxiliary word in the second corpus information;
training an entity element extraction model according to a second event element, wherein the entity element extraction model is used for extracting entity elements from any corpus information, and the second event element comprises the second trigger element and the second entity element of the target event;
extracting a first trigger element of the target event from first corpus information;
processing the first corpus information based on the entity element extraction model to obtain a first entity element in the first corpus information;
determining the first trigger element and the first entity element as a first event element of the target event;
wherein the entity element of the target event comprises at least one of an event-applying element or a subject element of the target event.
2. The method of claim 1, wherein the trigger element of the target event comprises a trigger word, the method further comprising:
matching the trigger word set of the target event with the second corpus information;
extracting trigger words included in the trigger word set from the second corpus information to serve as second trigger elements;
and determining a second entity element corresponding to the second trigger element according to the syntactic component of the second trigger element in the second corpus information.
3. The method according to claim 1, wherein when any trigger word and an auxiliary word corresponding to the trigger word are included in the second corpus information, and no negative word exists between the trigger word and the auxiliary word, combining the trigger word and the auxiliary word to obtain a trigger phrase comprises:
when the second corpus information includes the trigger word and an auxiliary word corresponding to the trigger word, no negative word exists between the trigger word and the auxiliary word, and the distance between the trigger word and the auxiliary word is not greater than a first preset distance, combining the trigger word and the auxiliary word to obtain the trigger phrase.
4. The method according to claim 1 or 2, characterized in that the method further comprises:
segmenting the event type name of the target event to obtain a plurality of target words;
adding trigger words belonging to verbs in the target words and corresponding synonyms into a trigger word set;
combining each target word in the target words and the corresponding synonym into a word set to obtain a plurality of word sets;
selecting any word from each word set, combining the selected words to obtain a trigger word pattern, and adding the trigger word pattern to the trigger word pattern set.
5. The method of claim 1, wherein the second event element satisfies at least one of the following conditions:
the part of speech of the second trigger element in the second corpus information is a verb;
the syntactic component of the second trigger element in the second corpus information is a predicate;
the length of the second event element is not less than a preset value;
the part of speech of the second entity element in the second corpus information is a noun;
the second entity element does not belong to a preset type word that represents a specific type but cannot represent a specific object belonging to the specific type.
6. The method of claim 1, wherein training the entity element extraction model based on the second event element comprises:
labeling a plurality of characters in the second corpus information according to the second entity element to obtain sample identification information corresponding to the plurality of characters, wherein the plurality of characters at least comprise the characters in the second entity element, and the sample identification information is used for indicating whether the corresponding character is a prefix or a suffix of the second entity element;
splicing the character vectors corresponding to the plurality of characters with the corresponding word vectors to obtain character-word mixed vectors corresponding to the plurality of characters;
processing the character-word mixed vectors corresponding to the plurality of characters based on the current entity element extraction model to obtain training identification information corresponding to the plurality of characters, wherein the training identification information is used for indicating whether the corresponding character is a prefix or a suffix of an entity element;
and training the entity element extraction model according to the training identification information and the sample identification information corresponding to the plurality of characters to obtain the trained entity element extraction model.
7. The method according to claim 1, wherein said processing the first corpus information based on the entity element extraction model to obtain a first entity element in the first corpus information comprises:
splicing character vectors corresponding to a plurality of characters in the first corpus information with corresponding word vectors to obtain character-word mixed vectors corresponding to the plurality of characters;
processing the character-word mixed vectors corresponding to the plurality of characters based on the entity element extraction model, and determining a prefix and a suffix of the first entity element;
and combining the prefix and the suffix of the first entity element and the characters between the prefix and the suffix of the first entity element according to the arrangement sequence in the first corpus information to obtain the first entity element.
8. The method according to claim 7, wherein the first entity element includes an event-applying element of the target event, and the processing the first corpus information based on the entity element extraction model to obtain the first entity element in the first corpus information comprises:
acquiring at least one character within a second preset distance before the first trigger element;
processing, based on the entity element extraction model, the character-word mixed vector corresponding to each character in the at least one character and the first trigger element, and determining a prefix and a suffix of the event-applying element;
and combining the prefix and the suffix of the event-applying element and the characters between the prefix and the suffix of the event-applying element according to the arrangement sequence in the first corpus information to obtain the event-applying element.
9. The method according to claim 7, wherein the first entity element includes a subject element of the target event, and the processing the first corpus information based on the entity element extraction model to obtain the first entity element in the first corpus information includes:
acquiring at least one character within a third preset distance after the first trigger element;
processing, based on the entity element extraction model, the character-word mixed vector corresponding to each character in the at least one character and the first trigger element, and determining a prefix and a suffix of the subject element;
and combining the prefix and the suffix of the subject element and the characters between the prefix and the suffix of the subject element according to the arrangement sequence in the first corpus information to obtain the subject element.
10. The method according to claim 7, wherein the splicing character vectors corresponding to a plurality of characters in the first corpus information with corresponding word vectors to obtain character-word mixed vectors corresponding to the plurality of characters comprises:
splitting the first corpus information into characters to obtain a character vector of each character in the first corpus information;
performing word segmentation on the first corpus information to obtain a word vector of each word in the first corpus information;
and splicing the character vector of each character with the word vector of the corresponding word to obtain a character-word mixed vector corresponding to each character.
11. An event element extraction apparatus, characterized in that the apparatus comprises:
an element extraction module, configured to match a trigger word pattern set of a target event with second corpus information, where the trigger word pattern set includes at least one trigger word pattern, the trigger word pattern includes a trigger word of the target event and an auxiliary word used for combining with the trigger word, a trigger element of the target event includes a trigger phrase, and the trigger phrase includes at least two words;
the element extraction module is further configured to, when the second corpus information includes any trigger word and an auxiliary word corresponding to the trigger word, and no negative word exists between the trigger word and the auxiliary word, combine the trigger word and the auxiliary word to obtain a trigger phrase, which is used as a second trigger element;
the element extraction module is further configured to determine, according to a syntactic component of the trigger word in the second corpus information, an entity element corresponding to the trigger word as a second entity element;
the element extraction module is further configured to determine, according to syntactic components of the trigger word and the auxiliary word in the second corpus information, an entity element corresponding to the trigger phrase as the second entity element;
a model training module, configured to train an entity element extraction model according to a second event element, where the entity element extraction model is used to extract an entity element from any corpus information, and the second event element includes the second trigger element and the second entity element of the target event;
the element extraction module is further used for extracting a first trigger element of the target event from the first corpus information;
the information processing module is used for processing the first corpus information based on the entity element extraction model to obtain a first entity element in the first corpus information;
an event element determination module, configured to determine the first trigger element and the first entity element as a first event element of the target event;
wherein the entity element of the target event comprises at least one of an event-applying element or a subject element of the target event.
12. The apparatus of claim 11, wherein the trigger element of the target event comprises a trigger word, and wherein the element extraction module is further configured to:
matching the trigger word set of the target event with the second corpus information;
extracting trigger words included in the trigger word set from the second corpus information to serve as second trigger elements;
and determining a second entity element corresponding to the second trigger element according to the syntactic component of the second trigger element in the second corpus information.
13. The apparatus of claim 11, wherein the element extraction module is further configured to:
and when the second corpus information comprises the trigger word and an auxiliary word corresponding to the trigger word, no negative word exists between the trigger word and the auxiliary word, and the distance between the trigger word and the auxiliary word is not greater than a first preset distance, combining the trigger word and the auxiliary word to obtain a trigger phrase.
14. The apparatus of claim 11 or 12, further comprising:
the second word segmentation module is used for segmenting the event type name of the target event to obtain a plurality of target words;
the second adding module is used for adding the trigger words belonging to the verbs in the target words and the corresponding synonyms into the trigger word set;
the combination module is used for respectively combining each target word in the target words and the corresponding synonym into a word set to obtain a plurality of word sets;
and the third adding module is used for selecting any word from each word set, combining the selected words to obtain a trigger word pattern, and adding the trigger word pattern to the trigger word pattern set.
15. The apparatus of claim 11, wherein the second event element satisfies at least one of the following conditions:
the part of speech of the second trigger element in the second corpus information is a verb;
the syntactic component of the second trigger element in the second corpus information is a predicate;
the length of the second event element is not less than a preset value;
the part of speech of the second entity element in the second corpus information is a noun;
the second entity element does not belong to a preset type word, the preset type word representing a specific type but not representing a specific object belonging to the specific type.
16. The apparatus of claim 11, wherein the model training module is configured to:
labeling a plurality of characters in the second corpus information according to the second entity element to obtain sample identification information corresponding to the plurality of characters, wherein the plurality of characters at least comprise the characters in the second entity element, and the sample identification information is used for indicating whether the corresponding character is a prefix or a suffix of the second entity element;
splicing the character vectors corresponding to the plurality of characters with the corresponding word vectors to obtain character-word mixed vectors corresponding to the plurality of characters;
processing the character-word mixed vectors corresponding to the plurality of characters based on the current entity element extraction model to obtain training identification information corresponding to the plurality of characters, wherein the training identification information is used for indicating whether the corresponding character is a prefix or a suffix of an entity element;
and training the entity element extraction model according to the training identification information and the sample identification information corresponding to the plurality of characters to obtain the trained entity element extraction model.
17. The apparatus of claim 11, wherein the information processing module is configured to:
splicing character vectors corresponding to a plurality of characters in the first corpus information with corresponding word vectors to obtain character-word mixed vectors corresponding to the plurality of characters;
processing the character-word mixed vectors corresponding to the plurality of characters based on the entity element extraction model, and determining a prefix and a suffix of the first entity element;
and combining the prefix and the suffix of the first entity element and the characters between the prefix and the suffix of the first entity element according to the arrangement sequence in the first corpus information to obtain the first entity element.
18. The apparatus of claim 17, wherein the first entity element comprises an event-applying element of the target event, and wherein the information processing module is configured to:
acquiring at least one character within a second preset distance before the first trigger element;
processing, based on the entity element extraction model, the character-word mixed vector corresponding to each character in the at least one character and the first trigger element, and determining a prefix and a suffix of the event-applying element;
and combining the prefix and the suffix of the event-applying element and the characters between the prefix and the suffix of the event-applying element according to the arrangement sequence in the first corpus information to obtain the event-applying element.
19. The apparatus of claim 17, wherein the first entity element comprises a subject element of the target event, and wherein the information processing module is configured to:
acquiring at least one character within a third preset distance after the first trigger element;
processing, based on the entity element extraction model, the character-word mixed vector corresponding to each character in the at least one character and the first trigger element, and determining a prefix and a suffix of the subject element;
and combining the prefix and the suffix of the subject element and the characters between the prefix and the suffix of the subject element according to the arrangement sequence in the first corpus information to obtain the subject element.
20. The apparatus of claim 17, wherein the information processing module is configured to:
dividing the first corpus information into characters to obtain a character vector of each character in the first corpus information;
performing word segmentation on the first corpus information to obtain a word vector of each word in the first corpus information;
and splicing the character vector of each character with the word vector of the word containing that character to obtain a character-word mixed vector corresponding to each character.
21. A computer device comprising a processor and a memory, the memory having stored therein at least one program code, the at least one program code loaded and executed by the processor to implement the event element extraction method according to any one of claims 1 to 10.
22. A computer-readable storage medium having stored therein at least one program code, the at least one program code being loaded and executed by a processor, to implement the event element extraction method according to any one of claims 1 to 10.
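The vector splicing, context windows and prefix/suffix combination recited in the apparatus claims can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names, the toy embeddings and the example sentence are hypothetical, and the entity element extraction model itself is not implemented, so the prefix and suffix positions it would predict are passed in as plain indices.

```python
from typing import Dict, List, Tuple

Vector = List[float]


def mixed_vectors(words: List[str],
                  char_vecs: Dict[str, Vector],
                  word_vecs: Dict[str, Vector]) -> List[Vector]:
    """Splice each character vector with the vector of the word containing it."""
    mixed = []
    for word in words:              # words come from a word segmenter (assumed)
        for ch in word:             # one mixed vector per character
            mixed.append(char_vecs[ch] + word_vecs[word])  # list concatenation
    return mixed


def window_before(text: str, trigger_start: int, distance: int) -> Tuple[int, str]:
    """Characters within `distance` before the trigger, plus their start offset."""
    start = max(0, trigger_start - distance)
    return start, text[start:trigger_start]


def window_after(text: str, trigger_end: int, distance: int) -> Tuple[int, str]:
    """Characters within `distance` after the trigger (trigger_end is exclusive)."""
    return trigger_end, text[trigger_end:trigger_end + distance]


def assemble_span(text: str, prefix_idx: int, suffix_idx: int) -> str:
    """Combine prefix, suffix and the characters between them in text order."""
    if not 0 <= prefix_idx <= suffix_idx < len(text):
        raise ValueError("prefix/suffix indices out of range or out of order")
    return text[prefix_idx:suffix_idx + 1]


# Hypothetical example: a segmented sentence whose trigger word is "收购".
words = ["甲公司", "收购", "乙公司"]
text = "".join(words)
# Toy embeddings, purely illustrative: 1-d character vectors, 2-d word vectors.
char_vecs = {ch: [float(ord(ch))] for ch in set(text)}
word_vecs = {w: [float(len(w)), float(i)] for i, w in enumerate(words)}

vectors = mixed_vectors(words, char_vecs, word_vecs)  # one 3-d vector per character
before = window_before(text, 3, 3)   # candidate region for the event-applying element
after = window_after(text, 5, 3)     # candidate region for the subject element
# Indices 0 and 2 stand in for the prefix/suffix a trained model would predict.
entity = assemble_span(text, 0, 2)
```

In this sketch the trigger occupies character positions 3 and 4, so the window before it covers the first three characters and the window after it starts at position 5. A trained model would consume `vectors` for those windows and for the trigger characters, and emit the prefix and suffix positions used by `assemble_span`.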
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010041054.8A CN111310461B (en) | 2020-01-15 | 2020-01-15 | Event element extraction method, device, equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111310461A | 2020-06-19 |
CN111310461B | 2023-03-21 |
Family
ID=71160103
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010041054.8A (Active; published as CN111310461B) | Event element extraction method, device, equipment and storage medium | 2020-01-15 | 2020-01-15 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111310461B (en) |
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
REG | Reference to a national code | Ref country code: HK; Ref legal event code: DE; Ref document number: 40024712 |
GR01 | Patent grant | |