WO2023040493A1 - Event detection - Google Patents

Event detection Download PDF

Info

Publication number
WO2023040493A1
Authority
WO
WIPO (PCT)
Prior art keywords
entity
relationship
instance
text
graph
Prior art date
Application number
PCT/CN2022/109834
Other languages
English (en)
French (fr)
Inventor
黄伟鹏
Original Assignee
支付宝(杭州)信息技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 支付宝(杭州)信息技术有限公司
Publication of WO2023040493A1
Priority to US18/395,120 (published as US20240143644A1)

Links

Images

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 - Clustering; Classification
    • G06F16/358 - Browsing; Visualisation therefor
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/30 - Semantic analysis
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 - Details of database functions independent of the retrieved data types
    • G06F16/95 - Retrieval from the web
    • G06F16/951 - Indexing; Web crawling techniques
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 - Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 - Ontology
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 - Details of database functions independent of the retrieved data types
    • G06F16/95 - Retrieval from the web
    • G06F16/953 - Querying, e.g. by the use of web search engines
    • G06F16/9535 - Search customisation based on user profiles and personalisation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/205 - Parsing
    • G06F40/216 - Parsing using statistical methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/253 - Grammatical analysis; Style critique
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/279 - Recognition of textual entities
    • G06F40/289 - Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 - Named entity recognition
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/40 - Processing or translation of natural language
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/205 - Parsing
    • G06F40/211 - Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/279 - Recognition of textual entities
    • G06F40/284 - Lexical analysis, e.g. tokenisation or collocates

Definitions

  • This specification relates to the field of natural language processing, and in particular to an event detection method and system.
  • Event detection or event extraction is an important application of artificial intelligence technology; it can efficiently obtain the events people care about from massive amounts of data, for example by extracting target risk events in a timely manner from large numbers of news articles and reports in the financial field to help investors effectively avoid investment risks.
  • However, as new events emerge, existing event detection or extraction algorithms may also need to be updated for the new events, which increases the cost of technology updates or upgrades.
  • One aspect of this specification provides an event detection method, the method comprising: acquiring text to be processed; extracting one or more sets of instance data from the text to be processed based on an extraction model, wherein each set of instance data includes a first entity instance, a first entity type corresponding to the first entity instance, a second entity instance, a second entity type corresponding to the second entity instance, and a description of the relationship between the two entity types; determining one or more extraction triples based on the one or more sets of instance data, thereby obtaining an extraction graph, wherein each extraction triple includes the first entity type, the second entity type, and the relationship description between the two entity types of a set of instance data; acquiring graph ontology definition data of one or more candidate events and, based on this, obtaining an ontology definition graph corresponding to each candidate event, wherein the graph ontology definition data of an event includes entity types for defining entities and relationship descriptions for defining the relationships between entity types; determining the similarity between the extraction graph and the ontology definition graph of each of the one or more candidate events; and, based on the similarities, determining the event corresponding to the text to be processed from the one or more candidate events.
  • Another aspect of this specification provides an event detection system, comprising: a text acquisition module for acquiring text to be processed; an extraction module for extracting one or more sets of instance data from the text to be processed based on an extraction model, wherein each set of instance data includes a first entity instance, a first entity type corresponding to the first entity instance, a second entity instance, a second entity type corresponding to the second entity instance, and a description of the relationship between the two entity types; an extraction graph acquisition module for determining one or more extraction triples based on the one or more sets of instance data and thereby obtaining an extraction graph, wherein each extraction triple includes the first entity type, the second entity type, and the relationship description between the two entity types of a set of instance data; an ontology definition graph acquisition module for acquiring graph ontology definition data of one or more candidate events and, based on this, obtaining an ontology definition graph corresponding to each candidate event, wherein the graph ontology definition data of an event includes entity types for defining entities and relationship descriptions for defining the relationships between entity types; a similarity determination module for determining the similarity between the extraction graph and the ontology definition graph of each of the one or more candidate events; and an event determination module for determining, based on the similarities, the event corresponding to the text to be processed from the one or more candidate events.
  • Another aspect of the specification provides a computer-readable storage medium, wherein the storage medium stores computer instructions, and when the computer instructions are executed by a processor, an event detection method is implemented.
  • Another aspect of this specification provides an event detection device. The device includes at least one processor and at least one memory; the at least one memory is used to store computer instructions; and the at least one processor is used to execute at least some of the computer instructions to implement the event detection method.
  • Fig. 1 is an exemplary flowchart of an event detection method according to some embodiments of this specification;
  • Fig. 2 is a schematic diagram of the graph ontology definition data of an event according to some embodiments of this specification;
  • Fig. 3 is a schematic structural diagram of an extraction model according to some embodiments of this specification;
  • Fig. 4 is a schematic diagram of determining the similarity between an extraction graph and the ontology definition graphs of one or more candidate events according to some embodiments of this specification.
  • It should be understood that the terms "system", "device", "unit" and/or "module" used in this specification are one way of distinguishing different components, elements, parts, sections or assemblies at different levels.
  • However, these words may be replaced by other expressions if the other expressions achieve the same purpose.
  • In some embodiments, text data can be processed by an event detection model to detect and/or extract events from the text, thereby helping users quickly filter information of interest.
  • For example, the event detection model can detect the event "Company A lost the lawsuit against Company B" from the news report "...Company A sued Company B, and the court of first instance ruled that Company A lost the case...".
  • However, event detection models are limited by their training corpus: when a new event appears, the model may not have "seen" its keywords or trigger words, making it difficult to discover the new event in news reports and similar data.
  • One solution is to continuously collect new training corpus and keep training the event detection model to improve its detection ability.
  • However, collecting new training corpus or retraining the model consumes considerable manpower and time.
  • Some embodiments of this specification propose an event detection scheme that detects events or event elements based on event graph ontology definition data.
  • The event graph ontology definition data mainly consists of entity types and generalized descriptions of the relationships between entity types. Therefore, at least for newly emerging events in the same field, it is to a certain extent only necessary to define graph ontology definition data for the new event to detect it effectively, which greatly reduces the cost of technology upgrades or replacement.
  • Fig. 1 is an exemplary flowchart of an event detection method according to some embodiments of this specification.
  • In some embodiments, the process 100 may be executed by a processing device or implemented by an event detection system provided on the processing device.
  • the event detection system may include a text acquisition module, an extraction module, an extraction graph acquisition module, an ontology definition graph acquisition module, a similarity determination module and an event determination module.
  • the event detection method 100 may include the following steps.
  • Step 110: acquire the text to be processed.
  • Specifically, this step 110 may be performed by the text acquisition module.
  • The text to be processed is the text in which events need to be detected.
  • In some embodiments, the text to be processed may be chapter-level text, such as news articles, papers, research reports, or commentaries.
  • In some embodiments, the text to be processed may be sentence-level text, such as a sentence contained in any of the aforementioned chapter-level texts.
  • For example, the text to be processed may be the news text "The equity of Company A is frozen, how can the 12-billion new energy investment continue?...".
  • In some embodiments, the text acquisition module can obtain the text to be processed directly from information in text form.
  • For example, the text acquisition module can acquire the text to be processed from a text database.
  • As another example, the text acquisition module can also crawl the text to be processed from web pages.
  • In some embodiments, the text acquisition module can also obtain the text to be processed from image messages based on text recognition technology.
  • In some embodiments, the text to be processed can also be obtained from voice information based on automatic speech recognition (ASR) technology.
  • In some embodiments, the text to be processed may include multiple characters or words, such as Chinese characters, Japanese characters, or Western words such as English words.
  • Step 120: extract one or more sets of instance data from the text to be processed based on the extraction model.
  • Specifically, this step 120 may be performed by the extraction module.
  • Instance data consists of the entity types in the text to be processed, the entity instances (i.e., the data instances corresponding to the entities), and the relationship descriptions between the entity types.
  • An entity type is a broad abstraction of objective individuals and behaviors. It can refer to tangible objects in the physical world, such as people, law enforcement agencies, or corporate entities; to intangible objects, such as utterances, songs, movies, assets, or amounts of money; or to behavioral actions, such as management verbs, punishment verbs, or preservation verbs.
  • An entity instance is a concrete example that actually exists under the abstract concept of an entity type.
  • For example, the company-entity type can specifically be Company A, Company B, Company C, etc.
  • The asset type can specifically be real estate, 20 billion yuan in equity, 100,000 yuan in cash, etc.
  • Preservation verbs can specifically be sealing up, freezing, seizure, etc.
  • For example, the entity instances in the text to be processed "The equity of Company A is frozen, how can the 12-billion new energy investment continue" include: Company A, equity, freeze, etc., and the corresponding entity types include: company entity, asset, preservation verb, etc.
  • Entity instances can have relationships with one another, and the relationship between entity instances can be defined by the relationship description between their corresponding entity types. For example, the relationship between the asset entity type and the company-entity type is "belongs to", and the relationship between the punishment-verb entity type and the asset entity type can be "acts on". Accordingly, there can also be corresponding relationship descriptions between entity instances.
  • In the aforementioned text to be processed, "equity" belongs to "Company A", and the object acted on by "freeze" is "equity".
  • In some embodiments, more abstract and widely applicable relationship descriptions can be defined. For example, the relationship descriptions can include the verb-object relationship, the subject-verb relationship, the attributive (modifier-head) relationship, and the modifying relationship.
  • When the first entity type is a verb-class entity and the second entity type is a noun-class entity, the relationship description between the two may be a verb-object relationship (VOB).
  • For example, the relationship between the first entity type "preservation verb" and the second entity type "asset" is described as a verb-object relationship.
  • When the first entity type is a noun type and the second entity type is a verb type, the relationship description may be a subject-verb relationship (SBV).
  • For example, the relationship between the first entity type "law enforcement agency" and the second entity type "punishment verb" is described as a subject-verb relationship.
  • When the first entity type is a modifier expressing possession, scope, material, form, property, quantity, purpose, time, or location, and the second entity type is the head word, the relationship description can be an attributive relationship (ATT).
  • For example, the relationship between the first entity type "company entity" and the second entity type "asset" is described as an attributive relationship.
  • When the first entity type is the modified word and the second entity type is the modifier, the relationship description may be a modifying relationship (MOD). For example, the relationship between the first entity type "asset" and the second entity type "amount involved" is described as a modifying relationship.
  • In some embodiments, each set of instance data may include a first entity instance, a first entity type, a second entity instance, a second entity type, and a description of the relationship between the two entity types.
  • Continuing the above example, in the first set of instance data the first entity instance may be "freeze" and the corresponding second entity instance may be "equity".
  • The first entity type and the second entity type are the broad abstractions corresponding to the first entity instance and the second entity instance, respectively.
  • In the first set of instance data, the first entity type corresponding to the first entity instance "freeze" is "preservation verb",
  • and the second entity type corresponding to the second entity instance "equity" is "asset".
  • The relationship between the first entity type "preservation verb" corresponding to the first entity instance "freeze" and the second entity type "asset" corresponding to the second entity instance "equity" is described as a verb-object relationship.
  • The first set of instance data can be expressed as [preservation verb: freeze, VOB, asset: equity].
  • In some embodiments, the extraction module can extract multiple sets of instance data from the text to be processed.
  • It can be understood that the roles of the first and second entity instances are relative.
  • In some embodiments, the first entity instance and the second entity instance may be swapped to form a new set of instance data.
  • Continuing the above example, in the second set of instance data the first entity instance can be the second entity instance "equity" of the first set, and the second entity instance is the first entity instance "freeze" of the first set; the corresponding first and second entity types are "asset" and "preservation verb", respectively.
  • Accordingly, the description of the relationship between the two entity types changes.
  • For example, in the second set of instance data, the relationship between the first entity type "asset" and the second entity type "preservation verb" is described as a subject-verb relationship.
  • The second set of instance data can be expressed as [asset: equity, SBV, preservation verb: freeze].
  • In some embodiments, the entity instances and relationship descriptions of multiple sets of instance data may partially overlap.
  • Continuing the above example, the third set of instance data can be [asset: equity, MOD, amount involved: 12 billion], where the first entity instance is the same as that of the second set of instance data.
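  • The following is a minimal sketch of how such instance data could be represented in code, assuming a simple dataclass whose field names are illustrative and not taken from the patent:

```python
from dataclasses import dataclass

@dataclass
class InstanceData:
    """One set of instance data: two typed entity instances plus the
    relationship description between their entity types."""
    first_instance: str
    first_type: str
    relation: str        # e.g. "VOB", "SBV", "ATT", "MOD"
    second_instance: str
    second_type: str

# The three example groups extracted from
# "The equity of Company A is frozen, how can the 12-billion new energy investment continue?"
groups = [
    InstanceData("freeze", "preservation verb", "VOB", "equity", "asset"),
    InstanceData("equity", "asset", "SBV", "freeze", "preservation verb"),
    InstanceData("equity", "asset", "MOD", "12 billion", "amount involved"),
]
```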
  • Specifically, the extraction module can use the extraction model to process the text to be processed and obtain an annotation sequence and a relationship matrix for the text.
  • The annotation sequence marks the characters or words in the text that belong to entity instances, as well as the entity types to which those characters or words belong.
  • The relationship matrix marks the relationship description between any two characters or words in the text to be processed.
  • In some embodiments, the extraction model includes one or more of the following models: BERT, Transformer, StanfordNLP, or LTP.
  • The extraction module can determine the entity instances in the text to be processed and their entity types based on the annotation sequence.
  • As shown in Fig. 3, the extraction model processes "The equity of Company A is frozen, how can the 12-billion new energy investment continue?" and obtains the annotation sequence: "B-co", "I-co", ..., "B-pro", "I-pro", ..., "B-pre", "I-pre", ..., "O", "O", "O", "O".
  • The extraction module can obtain the corresponding entity instances from the entity labels "B-co", "I-co", "B-pro", "I-pro", "B-pre", and "I-pre": Company A, equity, and freeze, with entity types company entity, asset, and preservation verb, respectively.
  • The extraction module can determine the relationship description between any two entity instances in the text based on the relationship matrix, and use it as the relationship description between the two corresponding entity types.
  • As shown in Fig. 3, each character or word in the text to be processed can correspond to a relation vector r. The dimension of the relation vector can equal the total number of characters or words in the text, and the elements of the vector reflect the relationship descriptions between the corresponding character or word and every other character or word in the text.
  • The relation vector may also include an element reflecting the relationship description between the corresponding character or word and itself; for example, that element can default to null.
  • The relation vectors of the multiple characters or words constitute the relationship matrix.
  • It can be understood that the elements of the relationship matrix may include the aforementioned relationship descriptions such as VOB and MOD, and may also include null, where null means invalid or empty.
  • In some embodiments, the extraction module can determine the relationship description between two entity instances, or between the two corresponding entity types, based on the relationship description between their first characters or first words.
  • For example, based on the element of the relation vector r_1 (corresponding to "A", the first character of the entity instance "Company A") that corresponds to the first character of the entity instance "equity", the extraction module can determine that the relationship between "Company A" and "equity" is described as an attributive relationship. Further, this attributive relationship is taken as the relationship description between the entity type "company entity" corresponding to "Company A" and the entity type "asset" corresponding to "equity". As another example, the extraction module can similarly determine that the relationship between "freeze" and "equity" is described as a verb-object relationship.
  • This verb-object relationship is taken as the relationship description between the entity type "preservation verb" corresponding to "freeze" and the entity type "asset" corresponding to "equity".
  • As another example, the corresponding element of the relation vector for the character "被" ("being") is null, indicating that there is no explicit relationship between the two.
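  • As a rough illustration of the decoding step described above (not the patent's implementation), the sketch below groups B-/I- entity labels into entity spans and reads the relationship between two spans from the element of the relationship matrix indexed by their first tokens; the helper names are hypothetical:

```python
def decode_entities(tokens, labels):
    """Group B-x/I-x labels into (entity_text, type_tag, head_index) spans."""
    spans, current = [], None
    for i, (tok, lab) in enumerate(zip(tokens, labels)):
        if lab.startswith("B-"):
            if current:
                spans.append(current)
            current = [tok, lab[2:], i]          # text, type tag, index of first token
        elif lab.startswith("I-") and current:
            current[0] += tok
        else:
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return [tuple(s) for s in spans]

def relations_between(spans, relation_matrix):
    """Read the relation description between the first tokens of every span pair."""
    out = []
    for text_a, type_a, i in spans:
        for text_b, type_b, j in spans:
            rel = relation_matrix[i][j]
            if i != j and rel is not None:       # None plays the role of null
                out.append((type_a, text_a, rel, type_b, text_b))
    return out
```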
  • Step 130: based on the one or more sets of instance data, determine one or more extraction triples and thereby obtain the extraction graph.
  • Specifically, this step 130 may be performed by the extraction graph acquisition module.
  • An extraction triple is a set of three elements derived from a set of instance data.
  • In some embodiments, an extraction triple includes the first entity type, the second entity type, and the relationship description between the two entity types in the instance data.
  • Continuing the above example, the extraction graph acquisition module can extract the first triple [preservation verb, VOB, asset] from the first set of instance data [preservation verb: freeze, VOB, asset: equity], the second triple [asset, SBV, preservation verb] from the second set [asset: equity, SBV, preservation verb: freeze], the third triple [asset, MOD, amount involved] from the third set [asset: equity, MOD, amount involved: 12 billion], and so on.
  • The extraction graph acquisition module can construct the extraction graph based on the one or more extraction triples.
  • The extraction graph is a network graph composed of the entity types in the one or more triples and the relationship descriptions between those entity types.
  • In some embodiments, the entity types of the one or more triples are represented by nodes in the extraction graph, and the relationship descriptions between entity types are represented by edges connecting the corresponding nodes.
  • For example, the entity types "preservation verb" and "asset" in the first triple [preservation verb, VOB, asset] can serve as two nodes, and the relationship description "VOB" between them as the edge connecting the two nodes.
  • In some embodiments, the same entity type appearing in multiple triples may be represented by the same node in the extraction graph. For example, since the entity type "asset" in the third triple [asset, MOD, amount involved] is the same as the entity type "asset" in the first triple [preservation verb, VOB, asset], the two can be represented by the same node in the extraction graph.
  • In some embodiments, the edges of the extraction graph are directed, pointing from the first entity type to the second entity type.
  • For example, the edge corresponding to the relationship description "VOB" in the first triple points from "preservation verb" to "asset",
  • and the edge corresponding to "SBV" in the second triple [asset, SBV, preservation verb] points from "asset" to "preservation verb".
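  • A minimal sketch of constructing the extraction graph from the extraction triples, using networkx as an illustrative (not patent-specified) graph library; identical entity types collapse into one node and edges are directed and labeled with the relationship description:

```python
import networkx as nx

triples = [
    ("preservation verb", "VOB", "asset"),
    ("asset", "SBV", "preservation verb"),
    ("asset", "MOD", "amount involved"),
]

extraction_graph = nx.MultiDiGraph()
for first_type, relation, second_type in triples:
    # Identical entity types map to the same node; edges point from the
    # first entity type to the second and carry the relationship description.
    extraction_graph.add_edge(first_type, second_type, relation=relation)
```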
  • Step 140: acquire the graph ontology definition data of one or more candidate events and, based on this, obtain the ontology definition graph corresponding to each candidate event.
  • Specifically, this step 140 can be performed by the ontology definition graph acquisition module.
  • An ontology definition graph is a graph composed of a set of entity types and the relationship descriptions between those entity types.
  • In some embodiments, the entity types in the ontology definition graph may be represented by nodes, and the relationship descriptions between entity types may be represented by edges connecting the nodes.
  • Multiple candidate events may respectively correspond to multiple event types.
  • For example, the first candidate event, the second candidate event, ..., and the Nth candidate event may respectively correspond to the acquisition event type, the lawsuit-loss event type, ..., and the preservation event type.
  • In some embodiments, each candidate event may correspond to one ontology definition graph.
  • In some embodiments, the ontology definition graph of an event can be generated based on the event's graph ontology definition data, such as a schema.
  • In some embodiments, the event graph ontology definition data can be manually formulated or written according to the general elements of the event.
  • Fig. 2 is a schematic diagram of the graph ontology definition data of an event according to some embodiments of this specification.
  • Graph ontology definition data is the data that defines the entity types included in an ontology definition graph and the relationship descriptions between those entity types.
  • As shown in Fig. 2, each set of graph ontology definition data corresponds to one candidate event type.
  • For example, the acquisition graph ontology definition data, the lawsuit-loss graph ontology definition data, ..., and the preservation graph ontology definition data correspond to the acquisition event type, the lawsuit-loss event type, ..., and the preservation event type, respectively.
  • The event graph ontology definition data includes entity types for defining entities and relationship descriptions for defining the relationships between entity types.
  • In some embodiments, an entity type in the graph ontology definition data may define the entity instances belonging to it by means of a vocabulary or extraction rules; entity instances satisfying the extraction rules belong to the corresponding entity type.
  • For example, an entity type corresponding to an enumerable class can be defined by a vocabulary.
  • For instance, the entity type "preservation verb" corresponds to the vocabulary: sealing up, freezing, seizure, etc.
  • As another example, the entity type "law enforcement agency" can define its extraction rule as text extraction based on matching the keyword "court", and the entity type "amount involved" can be extracted based on data format.
  • In some embodiments, the graph ontology definition data may include relationship descriptions, and each relationship description is combined with entity types in the graph ontology definition data through a piece of definition data, thereby defining the relationship description between two entity types.
  • As shown in Fig. 2, the preservation graph definition data contains six pieces of definition data: the first piece specifies that the relationship description between the first entity type "preservation verb" and the second entity type "company entity" is defined as "VOB"; the second piece specifies that the relationship description between the first entity type "company entity" and the second entity type "preservation verb" is defined as "ATT"; the third piece specifies that the relationship description between the first entity type "preservation verb" and the second entity type "asset" is defined as "VOB"; ...; and the sixth piece specifies that the relationship description between the first entity type "asset" and the second entity type "preservation verb" is defined as "SBV".
  • In some embodiments, the graph ontology definition data can be obtained based on a preset ontology definition data set; specifically, the entity types and relationship descriptions in the event graph ontology definition data come from a preset ontology definition data set.
  • In some embodiments, the preset ontology definition data set can be a collection of entity types and relationship descriptions formulated for a specific field, such as finance or education. The ontology definition data set can be considered to cover the entity types and relationship descriptions of the corresponding field fairly comprehensively, so that the entity types and relationship descriptions of different events in the field can be found in the data set, or, equivalently, the entity types and relationship descriptions in the data set can be reused across events.
  • When constructing the graph ontology definition data for different events, entity types and relationship descriptions can be selected from the data set, the entity types can be further defined through vocabularies or extraction rules, and the relationship descriptions between entity types can be specified by the definition data.
  • Each event has its own graph ontology definition data.
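  • The sketch below shows one possible (assumed) shape for the graph ontology definition data of the preservation candidate event, with entity types defined by vocabularies or extraction rules and relation definition entries pairing entity types with relationship descriptions; rule strings other than those named in the text are illustrative placeholders:

```python
# Illustrative structure only; the patent does not prescribe a concrete format.
preservation_schema = {
    "event_type": "preservation",
    "entity_types": {
        "preservation verb": {"vocabulary": ["seal up", "freeze", "seize"]},
        "law enforcement agency": {"rule": "keyword_match:court"},
        "amount involved": {"rule": "data_format:currency"},
        "company entity": {"rule": "named_entity:organization"},   # placeholder rule
        "asset": {"vocabulary": ["equity", "real estate", "cash"]},  # placeholder vocabulary
    },
    # Each entry: (first entity type, relationship description, second entity type)
    "relation_definitions": [
        ("preservation verb", "VOB", "company entity"),
        ("company entity", "ATT", "preservation verb"),
        ("preservation verb", "VOB", "asset"),
        ("asset", "SBV", "preservation verb"),
    ],
}
```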
  • In some embodiments, the extraction model can also be trained based on the preset ontology definition data set, so that the entity instances and relationship descriptions extracted in step 120 map directly to the entity types and relationship descriptions in the graph ontology definition data of the candidate events, further improving the accuracy of subsequent graph matching.
  • For a detailed description of training the extraction model, refer to Fig. 3 and its related description, which will not be repeated here.
  • In some embodiments, the ontology definition graph acquisition module can obtain the ontology definition graphs of the one or more candidate events based on their graph ontology definition data.
  • Specifically, the first entity type and the second entity type in each piece of definition data can be used as nodes in the ontology definition graph corresponding to the candidate event type, and an edge is established between the corresponding nodes based on the definition data.
  • For example, the ontology definition graph acquisition module can use the first entity type "preservation verb" and the second entity type "company entity" in the first piece of definition data of the preservation candidate event as two nodes of the preservation candidate event's ontology definition graph, and establish an edge "VOB" between the node "preservation verb" and the node "company entity".
  • In some embodiments, the edges in the ontology definition graph of a candidate event are directed, pointing from the first entity type to the second entity type.
  • For example, the edge corresponding to the relationship description "VOB" between "preservation verb" and "asset" points from "preservation verb" to "asset", and the edge corresponding to the relationship description "SBV" points from "asset" to "preservation verb".
  • In some embodiments, the same entity type appearing in multiple pieces of definition data can be represented by the same node in the ontology definition graph of the candidate event.
  • For details, refer to step 130, which will not be repeated here.
  • In this way, the ontology definition graph acquisition module can obtain the ontology definition graph of the acquisition candidate event, the ontology definition graph of the lawsuit-loss candidate event, ..., and the ontology definition graph of the preservation candidate event.
  • Step 150: determine the similarity between the extraction graph and the ontology definition graph of each of the one or more candidate events.
  • Specifically, this step 150 may be performed by the similarity determination module.
  • In some embodiments, the similarity determination module may use a graph matching model to process the extraction graph and the ontology definition graph of a candidate event to obtain the similarity between the two.
  • In some embodiments, the graph matching model may include, but is not limited to, a graph matching network (GMN) model, a graph neural network (GNN) model, a graph convolutional network (GCN) model, a graph embedding model (GEM), etc.
  • Taking the GMN model as an example, the model can first obtain (initial) representation vectors for each node and each edge of the extraction graph and of the candidate event's ontology definition graph, compute the attention weights between each representation vector of the extraction graph and each representation vector of the candidate event's ontology definition graph based on an attention mechanism, and then aggregate the node and edge representation vectors of the two graphs together with these attention weights to obtain cross-graph information and, from it, the similarity between the two graphs.
  • Fig. 4 is a schematic diagram of determining the similarity between the extraction graph and the ontology definition graphs of one or more candidate events according to some embodiments of this specification.
  • As shown in Fig. 4, the similarity determination module can first obtain the representation vectors N1, N2, N3, ..., E1, E2, ... corresponding to the nodes "company entity", "preservation verb", "asset", ... and the edges E(company entity, VOB, preservation verb), E(asset, SBV, preservation verb), ... of the extraction graph 420, as well as the representation vectors n1, n2, n3, n4, ..., e1, e2, ... corresponding to the nodes "company entity", "preservation verb", "asset", "amount involved", ... and the edges e(company entity, VOB, preservation verb), e(asset, SBV, preservation verb), ... of the ontology definition graph 410 of the preservation candidate event.
  • The GMN model (i.e., the matching model 430) then aggregates the node and edge representation vectors N1, N2, N3, E1, E2, ... of the extraction graph and the node and edge representation vectors n1, n2, n3, n4, e1, e2, ... of the candidate event's ontology definition graph, together with the attention vectors a1, a2, a3, ..., to obtain cross-graph information, and based on this cross-graph information obtains the similarity 440 between the extraction graph 420 and the ontology definition graph 410 of the preservation candidate event, for example 0.8.
  • Similarly, the similarity determination module can obtain the similarity between the extraction graph 420 and the ontology definition graph of the acquisition candidate event, the ontology definition graph of the lawsuit-loss candidate event, and so on.
  • In some embodiments, the extraction graph and the ontology definition graph of a candidate event can also be processed with a GEM or GCN model to obtain vector representations of the two graphs respectively, and the similarity between the two graphs can be determined by computing the distance between the two vector representations.
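  • As a toy illustration of the "embed each graph, then compare" variant mentioned above (the GEM/GCN route rather than a trained GMN), the sketch below hashes node and edge labels into pseudo-embeddings, averages them into a graph vector, and compares graphs by cosine similarity; a real system would learn these representations:

```python
import hashlib
import numpy as np

def label_vector(label, dim=64):
    """Deterministic pseudo-embedding for a node or edge label (illustration only)."""
    seed = int(hashlib.md5(label.encode("utf-8")).hexdigest(), 16) % (2**32)
    return np.random.default_rng(seed).standard_normal(dim)

def graph_embedding(graph, dim=64):
    """Average the pseudo-embeddings of all node labels and edge relation labels."""
    vectors = [label_vector(n, dim) for n in graph.nodes]
    vectors += [label_vector(d["relation"], dim) for _, _, d in graph.edges(data=True)]
    return np.mean(vectors, axis=0)

def graph_similarity(g1, g2):
    """Cosine similarity between the two graph embeddings."""
    v1, v2 = graph_embedding(g1), graph_embedding(g2)
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))
```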
  • Step 160: based on the similarities, determine the event corresponding to the text to be processed from the one or more candidate events.
  • Specifically, this step 160 may be performed by the event determination module.
  • In some embodiments, the event determination module may determine the candidate event corresponding to the maximum of the multiple similarities and take that candidate event as the event corresponding to the text to be processed.
  • Continuing the above example, if the similarities between the extraction graph and the ontology definition graphs of the preservation, acquisition, and lawsuit-loss candidate events are 0.8, 0.5, 0.4, ..., then the preservation event corresponding to the maximum similarity 0.8 is taken as the event corresponding to the text to be processed.
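  • Step 160 then reduces to an argmax over the similarities, as in this minimal sketch using the example values from the text:

```python
# Candidate-event names and similarity values are the illustrative ones above.
similarities = {"preservation": 0.8, "acquisition": 0.5, "lawsuit-loss": 0.4}
detected_event = max(similarities, key=similarities.get)  # -> "preservation"
```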
  • In some embodiments, the event determination module can further be used to determine the event elements in the text to be processed.
  • Specifically, the event detection system may determine one or more instance triples based on the one or more sets of instance data.
  • An instance triple includes the first entity instance, the second entity instance, and the relationship description between the two corresponding entity types in the instance data.
  • Continuing the above example, the first instance triple can include the first entity instance "freeze", the second entity instance "equity", and the relationship description "VOB" between their two corresponding entity types, taken from the first set of instance data [preservation verb: freeze, VOB, asset: equity] of the text to be processed "The equity of Company A is frozen, how can the 12-billion new energy investment continue?"; that is, the first instance triple is [freeze, VOB, equity]. Similarly, the second instance triple is [equity, SBV, freeze], the third is [equity, MOD, 12 billion], ..., [Company A, ATT, equity].
  • The event detection system may then determine the event elements of the event corresponding to the text to be processed based on the one or more instance triples.
  • Event elements include the elements that make up the event and the relationships between those elements.
  • For example, the entity instances of each instance triple can be used as elements of the event, and the elements of the event can be expressed structurally based on the relationships between them.
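  • A minimal sketch (the record layout is an assumption, not defined by the patent) of assembling the instance triples into a structured event with its elements and their relations:

```python
instance_triples = [
    ("freeze", "VOB", "equity"),
    ("equity", "SBV", "freeze"),
    ("equity", "MOD", "12 billion"),
    ("Company A", "ATT", "equity"),
]

event_record = {
    "event_type": "preservation",   # e.g. the candidate event selected in step 160
    "elements": sorted({e for a, _, b in instance_triples for e in (a, b)}),
    "relations": instance_triples,
}
```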
  • Fig. 3 is a schematic structural diagram of the extraction model according to some embodiments of this specification.
  • As mentioned above, the extraction module can use the extraction model to process the text to be processed and obtain the annotation sequence and the relationship matrix of the text.
  • In some embodiments, the extraction model 300 may include a feature extraction layer 310, an annotation sequence layer 320, and a relationship identification layer 330.
  • The feature extraction layer 310 can extract feature vectors of the text to be processed.
  • In some embodiments, the feature extraction layer 310 may encode the text to be processed to obtain feature vectors that incorporate the information of the text.
  • In some embodiments, the text to be processed can be preprocessed as follows: add [CLS] before the text to be processed, and separate the sentences of the text with the delimiter [SEP]. For example, the text to be processed "The equity of Company A is frozen, how can the 12-billion new energy investment continue" is processed into "[CLS] The equity of Company A is frozen [SEP] How can the 12-billion new energy investment continue".
  • In some embodiments, the feature extraction layer 310 can obtain corresponding character vectors and position vectors based on the text to be processed.
  • A character vector is a vector representing the character-level feature information of the text to be processed.
  • For example, the text to be processed "The equity of Company A is frozen, how can the 12-billion new energy investment continue" includes 22 characters w_1, w_2, ..., w_22,
  • and the feature information of these characters can be represented by 22 character vectors t_1, t_2, ..., t_22.
  • For example, the feature information of the character "A" can be represented by the character vector [2, 3, 3]; in practical application scenarios, the dimension of the vector representation can be much higher.
  • In some embodiments, the character vectors can be obtained by looking up a word vector table or by using a word embedding model.
  • The word embedding model may include, but is not limited to, the Word2vec model, the term frequency-inverse document frequency (TF-IDF) model, the SSWE-C (skip-gram based combined-sentiment word embedding) model, etc.
  • A position vector is a vector reflecting the position of a character in the text to be processed, for example indicating that the character is the first or the second character of the text.
  • In some embodiments, the position vectors of the text to be processed can be obtained by sine-cosine encoding.
  • In some embodiments, a segment vector may also be included, reflecting the segment in which a character is located; for example, the character "A" is located in the first sentence (segment) of the text to be processed.
  • In some embodiments, the feature extraction layer 310 may first fuse the various vectors of the text to be processed, for example by concatenation or summation, and then encode the fused vectors to obtain the feature vectors.
  • For example, the feature extraction layer 310 can obtain the feature vectors T_1, T_2, ..., T_22.
  • Exemplarily, the feature extraction layer can be implemented by a BERT model or a Transformer.
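  • One possible realization of the feature extraction layer, assuming the Hugging Face transformers library and a Chinese BERT checkpoint (implementation choices assumed here; the patent only names BERT/Transformer). BERT inserts [CLS]/[SEP] and fuses token, position and segment embeddings internally:

```python
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
encoder = BertModel.from_pretrained("bert-base-chinese")

text = "A公司的股权被冻结，120亿新能源投资何以为继？"
inputs = tokenizer(text, return_tensors="pt")
outputs = encoder(**inputs)
feature_vectors = outputs.last_hidden_state  # shape: (1, seq_len, hidden_size)
```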
  • The annotation sequence layer 320 can obtain the annotation sequence based on the feature vectors.
  • The annotation sequence is the ordered arrangement of the multiple labels corresponding to the multiple characters or words of the text to be processed.
  • In some embodiments, the labels may include entity labels, which indicate whether the corresponding character or word belongs to an entity instance.
  • In some embodiments, the entity labels can be further subdivided, for example into company-entity labels and asset-entity labels, so as to further indicate the entity type to which the corresponding character or word belongs. In other words, the entity labels can mark the characters or words in the text that belong to entity instances as well as the entity types to which those characters or words belong.
  • In some embodiments, an entity label may consist of at least one of Chinese characters, numbers, letters and symbols.
  • For example, B can represent the first character or first word of an entity instance, and I can represent a non-first character or non-first word of an entity instance.
  • For example, the entity labels B-co and I-co can mark the characters or words in the text whose entity type is "company entity",
  • and the entity labels B-pro and I-pro can mark the characters or words whose entity type is "asset".
  • As shown in Fig. 3, based on the feature vectors T_1, T_2, ..., the annotation sequence layer 320 can label the entity instances "Company A", "equity" and "freeze" in the text to be processed "The equity of Company A is frozen, how can the 12-billion new energy investment continue" as "B-co, I-co", "B-pro, I-pro" and "B-pre, I-pre" respectively, meaning "first character of company entity, non-first character of company entity", "first character of asset, non-first character of asset" and "first character of preservation verb, non-first character of preservation verb".
  • In some embodiments, the labels may also include non-entity labels.
  • A non-entity label can likewise consist of at least one of Chinese characters, numbers, letters and symbols.
  • Characters or words in the text to be processed that do not belong to any entity instance can be annotated with the same non-entity label.
  • As shown in Fig. 3, the annotation sequence layer 320 uses four "O" labels to annotate the characters of "how can it continue", which do not belong to any entity instance.
  • In some embodiments, characters or words that do not belong to any entity instance may also be left unlabeled.
  • In some embodiments, based on the feature vectors, the annotation sequence layer 320 can obtain, for each character or word of the text, the probabilities of belonging to the different entity types as well as the probability of not belonging to any entity, and then take the entity label corresponding to the maximum probability, or the non-entity label, as the label of that character or word.
  • For example, based on the feature vector T_1, the annotation sequence layer 320 can obtain for "A" a probability of 0.8 of being the first character of a company entity, 0.5 of being a non-first character of a company entity, 0.3 of being the first character of a person, 0.3 of being a non-first character of a person, 0.3 of being the first character of a freezing verb, and 0.2 of not belonging to any entity; the maximum probability 0.8 corresponds to the entity type "company entity" and the first-character entity label "B-co", which is therefore taken as the entity label of "A".
  • In this way, the annotation sequence layer 320 can obtain the label (entity label or non-entity label) of each character or word in the text to be processed "The equity of Company A is frozen, how can the 12-billion new energy investment continue" and, following the order of the characters or words in the text, obtain the annotation sequence: "B-co", "I-co", ..., "B-pro", "I-pro", ..., "B-pre", "I-pre", ..., "O", "O", "O", "O".
  • In some embodiments, the annotation sequence layer 320 may include, but is not limited to, one or more of an N-gram model, a conditional random field (CRF) model and a hidden Markov model (HMM).
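  • The per-token labeling idea can be sketched as scoring each token's feature vector against every label and taking the argmax, as below; the label set and the plain linear classifier are illustrative, and a CRF/HMM layer as mentioned above could replace the simple argmax:

```python
import torch
import torch.nn as nn

labels = ["B-co", "I-co", "B-pro", "I-pro", "B-pre", "I-pre", "O"]
classifier = nn.Linear(768, len(labels))        # 768 = assumed BERT hidden size

# Reuses feature_vectors from the feature-extraction sketch above.
token_features = feature_vectors.squeeze(0)     # (seq_len, 768)
label_ids = classifier(token_features).argmax(dim=-1)
annotation_sequence = [labels[i] for i in label_ids]
```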
  • The relationship identification layer 330 can obtain the relationship matrix based on the feature vectors and the annotation sequence.
  • As mentioned above, the relationship matrix can mark the relationship description between any two characters or words in the text to be processed.
  • In some embodiments, each element of the relationship matrix marks the relationship description between two characters or words.
  • The dimensions of the relationship matrix are determined by the number of characters in the text to be processed; for example, if the text contains N characters or words, the relationship matrix has dimension N x N.
  • Each 1 x N column vector of the relationship matrix can mark the relationship descriptions between one character or word of the text and all other characters or words, and the N column vectors of size 1 x N can respectively mark the relationship descriptions between the N characters or words of the text and all the other characters or words.
  • In some embodiments, the relationship identification layer 330 may first embed each label of the annotation sequence into the corresponding feature vector through a word embedding network to obtain the label vectors.
  • As shown in Fig. 3, the relationship identification layer 330 can first embed the labels of the annotation sequence "B-co", "I-co", ..., "B-pro", "I-pro", ..., "B-pre", "I-pre", ..., "O", "O", "O", "O" into the feature vectors T_1, T_2, ..., T_22 respectively, obtaining the label vectors e_1, e_2, ..., e_22.
  • In some embodiments, the relationship identification layer 330 can obtain the relationship matrix from the label vectors through a multi-head nonlinear activation layer (multi-sigmoid layer).
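  • A rough sketch of the relationship identification idea under the assumptions above: label embeddings are added to the token features, every token pair is scored against each relationship type through a sigmoid head, and pairs below a threshold are treated as null; the module shapes and the thresholding rule are assumptions, not the patent's specification:

```python
import torch
import torch.nn as nn

relations = ["VOB", "SBV", "ATT", "MOD"]

class RelationLayer(nn.Module):
    def __init__(self, hidden=768, num_labels=7, num_relations=len(relations)):
        super().__init__()
        self.label_emb = nn.Embedding(num_labels, hidden)
        self.pair_scorer = nn.Linear(2 * hidden, num_relations)

    def forward(self, token_features, label_ids):
        # Fuse token features with embeddings of their predicted labels.
        x = token_features + self.label_emb(label_ids)              # (N, hidden)
        n = x.size(0)
        pairs = torch.cat(
            [x.unsqueeze(1).expand(n, n, -1), x.unsqueeze(0).expand(n, n, -1)], dim=-1
        )                                                            # (N, N, 2*hidden)
        return torch.sigmoid(self.pair_scorer(pairs))                # (N, N, num_relations)

def decode_relation_matrix(scores, threshold=0.5):
    """Return an N x N matrix of relation names, or None (null) below threshold."""
    best = scores.argmax(dim=-1)
    keep = scores.max(dim=-1).values >= threshold
    return [
        [relations[best[i, j]] if keep[i, j] else None for j in range(scores.size(1))]
        for i in range(scores.size(0))
    ]
```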
  • In some embodiments, the extraction model can also be implemented based on an end-to-end model, for example a BERT-based multi-head selection model, the Stanford Chinese syntactic analysis tool StanfordNLP, or the Harbin Institute of Technology Chinese language analysis tool LTP.
  • In some embodiments, the extraction model can be trained with training samples: specifically, labeled training samples are input into the extraction model, and the parameters of the extraction model are updated through training.
  • In some embodiments, the labels of the training samples can be determined based on the entity types and relationship descriptions in a preset ontology definition data set.
  • For example, the labels of the training sample "The equity of Company X is frozen" may include the known annotation sequence and relationship matrix of the sample, where the known annotation sequence indicates that "Company X", "equity" and "frozen" correspond respectively to the entity types "company entity", "asset" and "preservation verb" in the preset ontology definition data set, and the known relationship matrix indicates that the relationship description between "company entity" and "asset" is "ATT", that the relationship description between "asset" and "preservation verb" is "SBV", and so on.
  • During training, the extraction model processes the training samples to obtain the predicted annotation sequence and relationship matrix, and its parameters are adjusted against the known annotation sequence and relationship matrix to reduce the difference between the predictions and the known labels.
  • For a detailed description of the preset ontology definition data set, refer to step 140, which will not be repeated here.
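  • Under the same assumptions, a schematic training step could combine a cross-entropy loss on the known annotation sequence with a binary cross-entropy loss on the known relationship matrix, as sketched below (the loss weighting and optimizer choice are illustrative):

```python
import torch
import torch.nn as nn

label_loss_fn = nn.CrossEntropyLoss()
relation_loss_fn = nn.BCELoss()

def training_step(encoder_out, gold_label_ids, gold_relation_targets,
                  classifier, relation_layer, optimizer):
    # Predicted annotation-sequence logits and relationship-matrix scores.
    label_logits = classifier(encoder_out)                             # (N, num_labels)
    relation_scores = relation_layer(encoder_out, gold_label_ids)      # (N, N, num_relations)

    loss = label_loss_fn(label_logits, gold_label_ids) \
         + relation_loss_fn(relation_scores, gold_relation_targets)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```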
  • The embodiments of this specification also provide a computer-readable storage medium.
  • The storage medium stores computer instructions, and after a computer reads the computer instructions in the storage medium, the computer implements the aforementioned event detection method.
  • The embodiments of this specification also provide an event detection device, wherein the device includes at least one processor and at least one memory; the at least one memory is used to store computer instructions; and the at least one processor is used to execute at least some of the computer instructions to implement the aforementioned event detection method.
  • The possible beneficial effects of the embodiments of this specification include, but are not limited to: (1) events or event elements are detected based on event graph ontology definition data, so that new events can be accommodated at relatively low cost; (2) the extraction model is trained based on the preset ontology definition data set, so that it can map the instance data in the text to be processed to the entity types and relationship descriptions in the event graph ontology definition data, improving the accuracy of subsequent graph matching; (3) the relationship descriptions between entity types are further abstracted and generalized, which improves compatibility with the detection of new events.
  • It should be noted that the possible beneficial effects may be any one or a combination of the above, or any other beneficial effect that may be obtained.
  • Various aspects of this specification can be illustrated and described through several patentable categories or situations, including any new and useful process, machine, product or composition of matter, or any new and useful improvement thereof.
  • Various aspects of this specification may be implemented entirely by hardware, entirely by software (including firmware, resident software, microcode, etc.), or by a combination of hardware and software.
  • The above hardware or software may be referred to as a "block", "module", "engine", "unit", "component" or "system".
  • aspects of this specification may be embodied as a computer product comprising computer readable program code on one or more computer readable media.
  • a computer storage medium may contain a propagated data signal embodying a computer program code, for example, in baseband or as part of a carrier wave.
  • the propagated signal may have various manifestations, including electromagnetic form, optical form, etc., or a suitable combination.
  • a computer storage medium may be any computer-readable medium, other than a computer-readable storage medium, that can be used to communicate, propagate, or transfer a program for use by being coupled to an instruction execution system, apparatus, or device.
  • Program code residing on a computer storage medium may be transmitted over any suitable medium, including radio, electrical cable, fiber optic cable, RF, or the like, or combinations of any of the foregoing.
  • The computer program code required for the operation of each part of this specification can be written in any one or more programming languages, including object-oriented programming languages such as Java, Scala, Smalltalk, Eiffel, JADE, Emerald, C++, C#, VB.NET and Python, conventional procedural programming languages such as C, Visual Basic, Fortran 2003, Perl, COBOL 2002, PHP and ABAP, dynamic programming languages such as Python, Ruby and Groovy, or other programming languages.
  • The program code may run entirely on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or processing device.
  • The remote computer can be connected to the user's computer through any form of network, such as a local area network (LAN) or a wide area network (WAN), or to an external computer (for example through the Internet), or used in a cloud computing environment, or as a service, for example software as a service (SaaS).
  • Numbers describing quantities of components and attributes are used in places; it should be understood that such numbers used in the description of the embodiments are, in some examples, qualified by the modifiers "about", "approximately" or "substantially". Unless otherwise stated, "about", "approximately" or "substantially" indicates that the stated figure allows for a variation of plus or minus 20%. Accordingly, in some embodiments, the numerical parameters used in the specification and claims are approximations that can vary depending on the desired characteristics of individual embodiments. In some embodiments, numerical parameters should take into account the specified significant digits and adopt a general method of retaining digits. Although the numerical ranges and parameters used in some embodiments of this specification to confirm the breadth of their ranges are approximations, in specific embodiments such numerical values are set as precisely as practicable.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

An event detection method and system. The method includes: acquiring text to be processed; extracting one or more sets of instance data from the text to be processed based on an extraction model; determining one or more extraction triples based on the one or more sets of instance data, thereby obtaining an extraction graph; acquiring graph ontology definition data of one or more candidate events and, based on this, obtaining an ontology definition graph corresponding to each candidate event; determining the similarity between the extraction graph and the ontology definition graph of each of the one or more candidate events; and, based on the similarities, determining the event corresponding to the text to be processed from the one or more candidate events.

Description

Event detection
Technical Field
This specification relates to the field of natural language processing, and in particular to an event detection method and system.
Background
Event detection or event extraction is an important application of artificial intelligence technology; it can efficiently obtain the events people care about from massive amounts of data, for example by extracting target risk events in a timely manner from large numbers of news articles and reports in the financial field to help investors effectively avoid investment risks. However, as new events emerge, existing event detection or extraction algorithms may also need to be updated for the new events, which increases the cost of technology updates or upgrades.
Therefore, it is desirable to provide an event detection method and system that can not only effectively identify known events from massive data, but also accommodate new events at a relatively low cost when they appear.
Summary
One aspect of this specification provides an event detection method, the method comprising: acquiring text to be processed; extracting one or more sets of instance data from the text to be processed based on an extraction model, wherein each set of instance data includes a first entity instance, a first entity type corresponding to the first entity instance, a second entity instance, a second entity type corresponding to the second entity instance, and a description of the relationship between the two entity types; determining one or more extraction triples based on the one or more sets of instance data, thereby obtaining an extraction graph, wherein each extraction triple includes the first entity type, the second entity type, and the relationship description between the two entity types of a set of instance data; acquiring graph ontology definition data of one or more candidate events and, based on this, obtaining an ontology definition graph corresponding to each candidate event, wherein the graph ontology definition data of an event includes entity types for defining entities and relationship descriptions for defining the relationships between entity types; determining the similarity between the extraction graph and the ontology definition graph of each of the one or more candidate events; and, based on the similarities, determining the event corresponding to the text to be processed from the one or more candidate events.
Another aspect of this specification provides an event detection system, the system comprising: a text acquisition module for acquiring text to be processed; an extraction module for extracting one or more sets of instance data from the text to be processed based on an extraction model, wherein each set of instance data includes a first entity instance, a first entity type corresponding to the first entity instance, a second entity instance, a second entity type corresponding to the second entity instance, and a description of the relationship between the two entity types; an extraction graph acquisition module for determining one or more extraction triples based on the one or more sets of instance data and thereby obtaining an extraction graph, wherein each extraction triple includes the first entity type, the second entity type, and the relationship description between the two entity types of a set of instance data; an ontology definition graph acquisition module for acquiring graph ontology definition data of one or more candidate events and, based on this, obtaining an ontology definition graph corresponding to each candidate event, wherein the graph ontology definition data of an event includes entity types for defining entities and relationship descriptions for defining the relationships between entity types; a similarity determination module for determining the similarity between the extraction graph and the ontology definition graph of each of the one or more candidate events; and an event determination module for determining, based on the similarities, the event corresponding to the text to be processed from the one or more candidate events.
Another aspect of this specification provides a computer-readable storage medium, wherein the storage medium stores computer instructions, and when the computer instructions are executed by a processor, the event detection method is implemented.
Another aspect of this specification provides an event detection device, wherein the device includes at least one processor and at least one memory; the at least one memory is used to store computer instructions; and the at least one processor is used to execute at least some of the computer instructions to implement the event detection method.
Brief Description of the Drawings
This specification will be further illustrated by way of exemplary embodiments, which will be described in detail with reference to the accompanying drawings. These embodiments are not restrictive; in these embodiments, the same reference numerals denote the same structures.
Fig. 1 is an exemplary flowchart of an event detection method according to some embodiments of this specification;
Fig. 2 is a schematic diagram of the graph ontology definition data of an event according to some embodiments of this specification;
Fig. 3 is a schematic structural diagram of an extraction model according to some embodiments of this specification;
Fig. 4 is a schematic diagram of determining the similarity between an extraction graph and the ontology definition graphs of one or more candidate events according to some embodiments of this specification.
Detailed Description
In order to explain the technical solutions of the embodiments of this specification more clearly, the accompanying drawings required for describing the embodiments are briefly introduced below. Obviously, the drawings described below are only some examples or embodiments of this specification; for those of ordinary skill in the art, this specification can also be applied to other similar scenarios according to these drawings without creative effort. Unless it is obvious from the context or otherwise stated, the same reference numerals in the figures denote the same structures or operations.
It should be understood that the terms "system", "device", "unit" and/or "module" used in this specification are one way of distinguishing different components, elements, parts, sections or assemblies at different levels. However, these words may be replaced by other expressions if the other expressions achieve the same purpose.
As shown in this specification and the claims, unless the context clearly indicates otherwise, the words "a", "an", "one" and/or "the" do not specifically refer to the singular and may also include the plural. In general, the terms "comprise" and "include" only indicate that the explicitly identified steps and elements are included, and these steps and elements do not constitute an exclusive list; a method or device may also contain other steps or elements.
Flowcharts are used in this specification to illustrate the operations performed by the system according to the embodiments of this specification. It should be understood that the preceding or following operations are not necessarily performed precisely in order. Instead, the steps may be processed in reverse order or simultaneously. Moreover, other operations may be added to these processes, or one or more operations may be removed from them.
In an era of information explosion, a large amount of information appears every day, and the forms in which information is expressed are flexible and varied. Therefore, how to make existing event detection schemes better accommodate the detection of new events is a problem worth studying. In some embodiments, text data can be processed by an event detection model to detect and/or extract events from the text, thereby helping users quickly filter information of interest. For example, the event detection model can detect the event "Company A lost the lawsuit against Company B" from the news report "...Company A sued Company B, and the court of first instance ruled that Company A lost the case...". However, event detection models are limited by their training corpus. When a new event appears, the event detection model may not have "seen" the keywords or trigger words of the new event, making it difficult to discover the new event in news reports and other data. One solution is to continuously collect new training corpus and keep training the event detection model to improve its detection ability. However, collecting new training corpus or retraining the model consumes considerable manpower and time.
本说明书的一些实施例提出了一种事件检测方案,其基于事件图谱本体定义数据检测事件或事件要素,事件图谱本体定义数据中主要包括实体类型和基于实体类型之间关系的概括描述。因此,在一定程度上可以做到至少对于相同领域的新出现的事件,只需要为新事件定义事件本体定义数据,便可实现对新事件的有效检测,大大降低了技术升级或更新换代的成本。
FIG. 1 is an exemplary flowchart of an event detection method according to some embodiments of this specification. In some embodiments, process 100 may be performed by a processing device, or implemented by an event detection system provided on the processing device. The event detection system may include a text obtaining module, an extraction module, an extraction graph obtaining module, an ontology definition graph obtaining module, a similarity determination module, and an event determination module.
As shown in FIG. 1, the event detection method 100 may include the following steps.
Step 110: obtain a text to be processed.
Specifically, step 110 may be performed by the text obtaining module.
The text to be processed is text in which events need to be detected. In some embodiments, the text to be processed may be document-level text, such as news, papers, research reports, or commentaries on current affairs. In some embodiments, the text to be identified may be sentence-level text, for example a sentence included in any of the aforementioned document-level texts. As an example, the text to be processed may be the news text "Company A's equity has been frozen; how can the 12-billion new-energy investment continue?…".
In some embodiments, the text obtaining module may obtain the text to be processed directly from information in text form. For example, the text obtaining module may obtain the text to be processed from a text database. As another example, the text obtaining module may crawl the text to be processed from web page text.
In some embodiments, the text obtaining module may also obtain the text to be processed from picture messages based on optical character recognition. In some embodiments, the text to be processed may also be obtained from voice information based on automatic speech recognition (ASR) technology.
In some embodiments, the text to be processed may include multiple characters or words, such as Chinese characters, Japanese characters, or Western-language words such as English words.
Step 120: extract one or more groups of instance data from the text to be processed based on an extraction model.
Specifically, step 120 may be performed by the extraction module.
Instance data consists of the entity types in the text to be processed, the entity instances (i.e., the data instances corresponding to entities), and the relationship descriptions between entity types.
An entity type is a broad abstraction of objective individuals and behaviors. It may refer to tangible objects in the physical world, such as people, law-enforcement agencies, or company entities; it may refer to intangible objects, such as utterances, songs, movies, assets, or amounts of money; and it may also refer to actions, such as management verbs, punishment verbs, or preservation verbs. An entity instance is an actually existing example under the abstract concept of an entity type. For example, a company entity may specifically be Company A, Company B, or Company C; an asset may specifically be real estate, 20-billion-yuan equity, or 100,000 yuan in cash; a preservation verb may specifically be seize, freeze, or impound. As an example, the entity instances in the text to be processed, "Company A's equity has been frozen; how can the 12-billion new-energy investment continue?", include: Company A, equity, frozen, …; the corresponding entity types include: company entity, asset, preservation verb, ….
Entity instances may have relationships with each other, and the relationship between entity instances may be defined by the relationship description between their corresponding entity types. For example, the relationship between the entity type "asset" and the entity type "company entity" is "belongs to", and the relationship between the entity type "punishment verb" and the entity type "asset" may be "acts on". Accordingly, entity instances may also have corresponding relationship descriptions. In the aforementioned text to be processed, "equity" belongs to "Company A", the object acted on by "frozen" is "equity", and so on.
In some embodiments, more abstract and widely applicable relationship descriptions may be defined. For example, the relationship descriptions may include a verb-object relation, a subject-verb relation, an attribute (modifier-head) relation, and a modifying relation. When the first entity type is a verb-type entity and the second entity type is a noun-type entity, the relationship description between the two may be a verb-object relation (VOB). For example, the relationship description between the first entity type "preservation verb" and the second entity type "asset" is a verb-object relation. When the first entity type is a noun type and the second entity type is a verb type, the relationship description may be a subject-verb relation (SBV). For example, the relationship description between the first entity type "law-enforcement agency" and the second entity type "punishment verb" is a subject-verb relation. When the first entity type is a modifier indicating possession, scope, material, form, nature, quantity, purpose, time, location, etc., and the second entity type is the head word, the relationship description may be an attribute relation (ATT). For example, the relationship description between the first entity type "company entity" and the second entity type "asset" is an attribute relation. When the first entity type is the modified word and the second entity type is the modifier, the relationship description may be a modifying relation (MOD). For example, the relationship description between the first entity type "asset" and the second entity type "involved amount" is a modifying relation.
In some embodiments, each group of instance data may include a first entity instance, a first entity type, a second entity instance, a second entity type, and the relationship description between the two entity types.
Continuing the above example, in the first group of instance data, the first entity instance may be "frozen" and the corresponding second entity instance may be "equity". The first entity type and the second entity type are the broad abstractions corresponding to the first entity instance and the second entity instance respectively. For example, in the first group of instance data, the first entity type corresponding to the first entity instance "frozen" is "preservation verb", and the second entity type corresponding to the second entity instance "equity" is "asset". The relationship description between the first entity type "preservation verb" corresponding to "frozen" and the second entity type "asset" corresponding to "equity" is a verb-object relation. The first group of instance data may be represented as [preservation verb: frozen, VOB, asset: equity].
In some embodiments, the extraction module may extract multiple groups of instance data from the text to be processed.
It can be understood that the relationship between the first entity instance and the second entity instance is relative. In some embodiments, the first entity instance and the second entity instance may be swapped to form a new group of instance data. Continuing the above example, in the second group of instance data, the first entity instance may be the second entity instance "equity" of the first group of instance data, and the second entity instance is the first entity instance "frozen" of the first group of instance data; the corresponding first entity type and second entity type are "asset" and "preservation verb" respectively. Correspondingly, the relationship description between the two entity types changes accordingly. For example, in the second group of instance data, the relationship description between the first entity type "asset" and the second entity type "preservation verb" is a subject-verb relation. The second group of instance data may be represented as [asset: equity, SBV, preservation verb: frozen].
In some embodiments, the entity instances and relationship descriptions of multiple groups of instance data may be partially the same. Continuing the above example, the third group of instance data may be [asset: equity, MOD, involved amount: 12 billion], in which the first entity instance is the same as the first entity instance of the second group of instance data.
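The following is a non-limiting, illustrative Python sketch (not part of the original disclosure) of how the groups of instance data from the worked example above could be represented; the class name InstanceData and field names are assumptions chosen for illustration only.

```python
from dataclasses import dataclass

@dataclass
class InstanceData:
    """One group of instance data extracted from the text to be processed."""
    first_instance: str   # entity instance, e.g. "frozen"
    first_type: str       # entity type, e.g. "preservation verb"
    relation: str         # relationship description between the two entity types
    second_instance: str  # entity instance, e.g. "equity"
    second_type: str      # entity type, e.g. "asset"

# The three groups of instance data from the worked example above.
groups = [
    InstanceData("frozen", "preservation verb", "VOB", "equity", "asset"),
    InstanceData("equity", "asset", "SBV", "frozen", "preservation verb"),
    InstanceData("equity", "asset", "MOD", "12 billion", "involved amount"),
]
```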
Specifically, the extraction module may process the text to be processed with the extraction model to obtain a tag sequence and a relation matrix of the text to be processed.
The tag sequence is used to mark the characters or words in the text to be processed that belong to entity instances, and the entity types to which those characters or words belong. The relation matrix is used to mark the relationship description between any two characters or words in the text to be processed.
In some embodiments, the extraction model includes one or more of the following models: BERT, Transformer, StanfordNLP, or LTP.
For a detailed description of the extraction model, reference may be made to FIG. 3 and its related description, which is not repeated here.
The extraction module may determine the entity instances in the text to be processed and their entity types based on the tag sequence.
As shown in FIG. 3, the extraction model processes "Company A's equity has been frozen; how can the 12-billion new-energy investment continue?" and obtains the tag sequence: "B-co", "I-co" … "B-pro", "I-pro" … "B-pre", "I-pre" … "O", "O", "O", "O". Based on the entity tags "B-co", "I-co", "B-pro", "I-pro", "B-pre", "I-pre", the extraction module may obtain the corresponding entity instances (Company A, equity, frozen) and their entity types (company entity, asset, preservation verb).
The extraction module may determine, based on the relation matrix, the relationship description between any two entity instances in the text to be processed, and use it as the relationship description between the corresponding two entity types. As shown in FIG. 3, each character or word in the text to be processed may correspond to a relation vector r, whose dimension may equal the total number of characters or words of the text to be processed. The elements of the relation vector reflect the relationship descriptions between the character or word corresponding to the vector and the other characters or words in the text to be processed. The relation vector may also include an element reflecting the relationship description between its corresponding character or word and itself, whose value may default to null. The relation vectors of the multiple characters or words form the relation matrix. It can be understood that the elements of the relation matrix may include the aforementioned relationship descriptions such as VOB and MOD, and may also include null, where null indicates invalid or empty. In some embodiments, the extraction module may determine the relationship description between two entity instances, or between the two corresponding entity types, based on the relationship description between the respective first characters or first words of the two entity instances.
As shown in FIG. 3, the extraction module may determine, based on the element of the relation vector [r1] corresponding to "A" (the first character of the entity instance "Company A") at the position of the first character of the entity instance "equity", that the relationship description between "Company A" and "equity" is an attribute relation. Further, the attribute relation is used as the relationship description between the entity type "company entity" corresponding to "Company A" and the entity type "asset" corresponding to "equity". As another example, the extraction module may determine, based on the element of the relation vector [rv1] corresponding to the first character of the entity instance "frozen" at the position of the first character of the entity instance "equity", that the relationship description between "frozen" and "equity" is a verb-object relation. Further, the verb-object relation is used as the relationship description between the entity type "preservation verb" corresponding to "frozen" and the entity type "asset" corresponding to "equity". As yet another example, the element of the relation vector [rv1] corresponding to "frozen" at the position of the passive marker "被" is null, indicating that there is no explicit relationship between the two.
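The following is a non-limiting, illustrative Python sketch (not part of the original disclosure) of how entity instances could be decoded from B-/I-/O tags and how instance relationships could be read from the head-token positions of a relation matrix. It assumes the relation matrix is a list of lists whose entries are relation labels such as "VOB" or None (standing for null); the function names are assumptions for illustration only.

```python
from typing import List, Optional, Tuple

def decode_entities(tokens: List[str], tags: List[str]) -> List[Tuple[str, str, int]]:
    """Merge B-/I- tags into (entity instance, tag suffix, start index) triples.
    The tag suffix (e.g. "co", "pro", "pre") identifies the entity type."""
    entities, current, current_type, start = [], [], None, -1
    for i, (tok, tag) in enumerate(zip(tokens, tags)):
        if tag.startswith("B-"):
            if current:
                entities.append(("".join(current), current_type, start))
            current, current_type, start = [tok], tag[2:], i
        elif tag.startswith("I-") and current and tag[2:] == current_type:
            current.append(tok)
        else:
            if current:
                entities.append(("".join(current), current_type, start))
            current, current_type, start = [], None, -1
    if current:
        entities.append(("".join(current), current_type, start))
    return entities

def decode_relations(entities, relation_matrix: List[List[Optional[str]]]):
    """Read the relation between the first tokens of every pair of entity instances."""
    pairs = []
    for text_i, type_i, start_i in entities:
        for text_j, type_j, start_j in entities:
            rel = relation_matrix[start_i][start_j]
            if start_i != start_j and rel is not None:   # None plays the role of null
                pairs.append(((text_i, type_i), rel, (text_j, type_j)))
    return pairs
```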
Step 130: determine one or more extraction triples based on the one or more groups of instance data, and thereby obtain an extraction graph.
Specifically, step 130 may be performed by the extraction graph obtaining module.
An extraction triple is a set of three elements extracted from the instance data. In some embodiments, an extraction triple includes the first entity type, the second entity type, and the relationship description between the two entity types in the instance data.
For example, the extraction graph obtaining module may extract the first triple [preservation verb, VOB, asset] from the first group of instance data [preservation verb: frozen, VOB, asset: equity], extract the second triple [asset, SBV, preservation verb] from the second group of instance data [asset: equity, SBV, preservation verb: frozen], extract the third triple [asset, MOD, involved amount] from the third group of instance data [asset: equity, MOD, involved amount: 12 billion], and so on.
Further, the extraction graph obtaining module may construct an extraction graph based on the one or more extraction triples.
The extraction graph is a network graph formed by the entity types in the one or more triples and the relationship descriptions between those entity types. In some embodiments, the entity types of the one or more triples may be represented in the extraction graph by nodes, and the relationship descriptions between the entity types of the one or more triples may be represented by edges connecting the corresponding nodes.
For example, in the extraction graph, the entity types "preservation verb" and "asset" in the first triple [preservation verb, VOB, asset] may be two nodes, and the relationship description "VOB" between "preservation verb" and "asset" may be the edge connecting the two nodes.
In some embodiments, the same entity type appearing in multiple triples may be represented by the same node in the extraction graph. For example, the entity type "asset" in the third triple [asset, MOD, involved amount] is the same as the entity type "asset" in the first triple [preservation verb, VOB, asset], so in the extraction graph the entity type "asset" of the third triple and the entity type "asset" of the first triple may be represented by the same node.
In some embodiments, the edges in the extraction graph are directional, pointing from the first entity type to the second entity type. For example, the edge corresponding to the relationship description "VOB" in the first triple points from "preservation verb" to "asset", and the edge corresponding to the relationship description "SBV" in the second triple [asset, SBV, preservation verb] points from "asset" to "preservation verb".
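The following is a non-limiting, illustrative Python sketch (not part of the original disclosure) of building such a directed, labeled extraction graph from the example triples; it assumes the networkx library as one possible graph representation.

```python
import networkx as nx

def build_extraction_graph(triples):
    """Build a directed graph whose nodes are entity types and whose
    labeled edges are the relationship descriptions between them."""
    graph = nx.MultiDiGraph()   # MultiDiGraph keeps parallel edges such as VOB and SBV
    for first_type, relation, second_type in triples:
        graph.add_node(first_type)
        graph.add_node(second_type)   # repeated entity types collapse into a single node
        graph.add_edge(first_type, second_type, relation=relation)
    return graph

extraction_graph = build_extraction_graph([
    ("preservation verb", "VOB", "asset"),
    ("asset", "SBV", "preservation verb"),
    ("asset", "MOD", "involved amount"),
])
```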
Step 140: obtain graph ontology definition data of one or more candidate events, and obtain, based thereon, an ontology definition graph corresponding to each candidate event.
Specifically, step 140 may be performed by the ontology definition graph obtaining module.
An ontology definition graph is a graph formed by a series of entity types and the relationship descriptions between those entity types. In some embodiments, the entity types in the ontology definition graph may be represented by nodes, and the relationship descriptions between entity types may be represented by edges connecting the nodes.
Multiple candidate events may correspond to multiple event types respectively. For example, the first candidate event, the second candidate event, …, the N-th candidate event may correspond to an acquisition event type, a losing-a-lawsuit event type, …, a preservation event type, respectively.
In some embodiments, each candidate event may correspond to one ontology definition graph. The ontology definition graph of an event may be generated based on the graph ontology definition data of the event, such as a schema. The graph ontology definition data of an event may be manually formulated or written according to the general elements of the event.
FIG. 2 is a schematic diagram of graph ontology definition data of events according to some embodiments of this specification.
Graph ontology definition data is data that defines the entity types included in an ontology definition graph and the relationship descriptions between those entity types.
Correspondingly, each set of graph ontology definition data corresponds to one candidate event type. As shown in FIG. 2, the acquisition-type graph ontology definition data, the losing-a-lawsuit-type graph ontology definition data, …, and the preservation-type graph ontology definition data correspond to the acquisition event type, the losing-a-lawsuit event type, …, and the preservation event type, respectively.
In some embodiments, the graph ontology definition data of an event includes entity types used to define entities and relationship descriptions used to define relationships between entity types.
In some embodiments, for an entity type in the graph ontology definition data, the entity instances belonging to that entity type may be defined by a vocabulary list or by extraction rules. It can be understood that an entity instance satisfying the extraction rules belongs to the corresponding entity type. Specifically, an enumerable entity type may be defined by a vocabulary list. For example, the entity type "preservation verb" corresponds to a vocabulary list: impound, freeze, seize, and so on. An entity type that cannot be enumerated may be defined by extraction rules such as regular expressions, keyword matching, or restrictions on data format. For example, for the entity type "law-enforcement agency", the extraction rule may be defined as text extraction based on matching the keyword "court"; as another example, the entity type "involved amount" may be extracted based on data format.
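The following is a non-limiting, illustrative Python sketch (not part of the original disclosure) of defining entity types by vocabulary lists and extraction rules; the specific vocabularies and regular-expression patterns are assumptions for illustration only.

```python
import re

# Enumerable types use a vocabulary list; open-ended types use a rule.
VOCAB_TYPES = {
    "preservation verb": {"impound", "freeze", "seize"},
}
RULE_TYPES = {
    "law-enforcement agency": re.compile(r".*court.*"),                    # keyword rule
    "involved amount": re.compile(r"^\d+(\.\d+)?\s*(billion|million|yuan)$"),  # data-format rule
}

def entity_type_of(instance: str):
    """Return the entity type the instance belongs to, or None if no definition matches."""
    for etype, vocab in VOCAB_TYPES.items():
        if instance in vocab:
            return etype
    for etype, pattern in RULE_TYPES.items():
        if pattern.match(instance):
            return etype
    return None
```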
The graph ontology definition data may include relationship descriptions, and combine the relationship descriptions with the entity types in the graph ontology definition data through definition entries, thereby defining the relationship descriptions between different entity types. Taking the preservation-type graph definition data in FIG. 2 as an example, it contains six definition entries. The first definition entry specifies a first entity type "preservation verb" and a second entity type "company entity", with the relationship description defined as "VOB"; the second definition entry specifies a first entity type "company entity" and a second entity type "preservation verb", with the relationship description defined as "ATT"; the third definition entry specifies a first entity type "preservation verb" and a second entity type "asset", with the relationship description defined as "VOB"; …; the sixth definition entry specifies a first entity type "asset" and a second entity type "preservation verb", with the relationship description defined as "SBV".
The graph ontology definition data may be obtained based on a preset ontology definition dataset. Specifically, the entity types and relationship descriptions in the graph ontology definition data of an event come from the preset ontology definition dataset.
The preset ontology definition dataset may be a collection of entity types and relationship descriptions formulated for a particular domain, such as the financial domain or the education domain. It can be considered that the ontology definition dataset includes a relatively comprehensive set of entity types and relationship descriptions for the corresponding domain, so that the entity types and relationship descriptions of different events in that domain can all be found in the dataset, or so that the entity types and relationship descriptions in the dataset can be used universally. When formulating graph ontology definition data for different events, entity types and relationship descriptions can be selected from the dataset, the entity types can be further defined through vocabulary lists or extraction rules, and the relationship descriptions between entity types can be specified through definition entries, so that each event obtains its own graph ontology definition data.
In some embodiments, the extraction model may also be trained based on the preset ontology definition dataset, so that the entity instances and relationship descriptions extracted in step 120 map directly to the entity types and relationship descriptions in the graph ontology definition data of the candidate events, further improving the accuracy of the subsequent graph matching. For a detailed description of training the extraction model, reference may be made to FIG. 3 and its related description, which is not repeated here.
In some embodiments, the ontology definition graph obtaining module may also obtain the ontology definition graphs of the one or more candidate events based on the graph ontology definition data of the one or more candidate events.
Specifically, for the graph ontology definition data of each candidate event type, the first entity type and the second entity type therein may be used as nodes of the ontology definition graph of the corresponding candidate event type, and edges may be established between the corresponding nodes based on the definition entries.
As shown in FIG. 2, the ontology definition graph obtaining module may use the first entity type "preservation verb" and the second entity type "company entity" in the first definition entry of the graph ontology definition data of the preservation-type candidate event as two nodes of the ontology definition graph of the preservation-type candidate event, and establish an edge "VOB" between the node "preservation verb" and the node "company entity".
Similar to the extraction graph, the edges in the ontology definition graph of a candidate event are directional, pointing from the first entity type to the second entity type. Continuing with the ontology definition graph corresponding to the preservation-type candidate event in FIG. 2 as an example, the edge corresponding to the relationship description "VOB" in the first definition entry points from "preservation verb" to "company entity", and the edge corresponding to the relationship description "SBV" in the sixth definition entry points from "asset" to "preservation verb".
Similar to the extraction graph, the same entity type appearing in multiple definition entries may be represented by the same node in the ontology definition graph of the candidate event; for a detailed description, reference may be made to step 130, which is not repeated here.
Similarly, the ontology definition graph obtaining module may obtain the ontology definition graph of the acquisition-type candidate event, the ontology definition graph of the losing-a-lawsuit type, and so on.
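The following is a non-limiting, illustrative Python sketch (not part of the original disclosure) of turning schema-style definition entries into an ontology definition graph. Only four of the six preservation-type entries are described above, so the remaining two entries in this sketch are marked as assumptions; the same graph builder used for the extraction graph applies, since both graphs are directed and edge-labeled.

```python
# Hypothetical schema for the preservation-type candidate event, following the
# definition entries described for FIG. 2; entries 4 and 5 are assumed placeholders.
PRESERVATION_SCHEMA = [
    ("preservation verb", "VOB", "company entity"),
    ("company entity", "ATT", "preservation verb"),
    ("preservation verb", "VOB", "asset"),
    ("asset", "MOD", "involved amount"),        # assumed entry
    ("company entity", "ATT", "asset"),         # assumed entry
    ("asset", "SBV", "preservation verb"),
]

# build_extraction_graph is the helper sketched after step 130.
ontology_graphs = {
    "preservation": build_extraction_graph(PRESERVATION_SCHEMA),
    # "acquisition": build_extraction_graph(ACQUISITION_SCHEMA), and so on.
}
```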
Step 150: determine the similarities between the extraction graph and the ontology definition graphs of the one or more candidate events respectively.
Specifically, step 150 may be performed by the similarity determination module.
In some embodiments, the similarity determination module may process the extraction graph and the ontology definition graph of a candidate event with a graph matching model to obtain the similarity between the two.
In some embodiments, the graph matching model may include, but is not limited to, a Graph Matching Network (GMN) model, a Graph Neural Network (GNN) model, a Graph Convolutional Network (GCN) model, Graph Embedding Models (GEM), and the like.
Taking the GMN model as an example, the GMN model may first obtain the (initial) representation vector of each node and each edge of the extraction graph and of the ontology definition graph of the candidate event respectively, obtain the attention weights between each representation vector of the extraction graph and each representation vector of the ontology definition graph of the candidate event based on an attention mechanism, then aggregate the representation vectors of each node and each edge of both graphs together with the attention weights between the representation vectors to obtain cross-graph information that captures the nodes, edges and mutual relationships of the extraction graph and the ontology definition graph of the candidate event, and finally obtain the similarity based on the cross-graph information.
FIG. 4 is a schematic diagram of determining the similarities between an extraction graph and the ontology definition graphs of one or more candidate events according to some embodiments of this specification.
Taking FIG. 4 as an example, the similarity determination module may first obtain the representation vectors N1, N2, N3, E1, E2, … corresponding to the nodes "company entity", "preservation verb", "asset" and the edges "E(company entity, VOB, preservation verb)", "E(preservation verb, SBV, preservation verb)", … in the extraction graph 420, and the representation vectors n1, n2, n3, n4, e1, e2, … corresponding to the nodes "company entity", "preservation verb", "asset", "involved amount" and the edges "e(company entity, VOB, preservation verb)", "e(preservation verb, SBV, preservation verb)", … in the ontology definition graph 410 of the preservation-type candidate event; then obtain the attention weights between N1 and n1, n2, n3, n4, e1, e2, … to obtain an attention vector a1; similarly obtain the attention weights between N2 and n1, n2, n3, n4, e1, e2, … to obtain an attention vector a2; and so on. Further, the GMN model (i.e., the matching model 430) may aggregate the representation vectors N1, N2, N3, E1, E2, … of each node and each edge of the extraction graph, the representation vectors n1, n2, n3, n4, e1, e2, … of each node and each edge of the ontology definition graph of the candidate event, and the attention vectors a1, a2, a3, …, to obtain the cross-graph information, and then obtain, based on the cross-graph information, a similarity 440 of 0.8 between the extraction graph 420 and the ontology definition graph 410 of the preservation-type candidate event.
Similarly, the similarity determination module may obtain the similarities between the extraction graph 420 and the ontology definition graph of the acquisition-type candidate event, the ontology definition graph of the losing-a-lawsuit-type candidate event, and so on.
In some embodiments, the extraction graph and the ontology definition graph of a candidate event may also be processed based on GEM or GCN to obtain vector representations of the two graphs respectively, and the similarity between the two may then be determined by computing the distance between the two vector representations.
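The following is a non-limiting, illustrative Python sketch (not part of the original disclosure) of the embedding-and-distance variant mentioned in the preceding paragraph. It is deliberately simplified: mean pooling stands in for a trained GEM/GCN encoder and does not implement the cross-graph attention of the GMN. It assumes the networkx graphs built earlier and dictionaries node_vecs and edge_vecs mapping entity types and relation labels to pretrained embedding vectors of equal dimension.

```python
import numpy as np

def graph_embedding(graph, node_vecs, edge_vecs):
    """Mean-pool node and edge embeddings into one graph-level vector
    (a simplified stand-in for a GEM/GCN encoder)."""
    vectors = [node_vecs[n] for n in graph.nodes]
    vectors += [edge_vecs[d["relation"]] for _, _, d in graph.edges(data=True)]
    return np.mean(vectors, axis=0)

def cosine_similarity(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9))

def graph_similarity(extraction_graph, ontology_graph, node_vecs, edge_vecs):
    """Distance-based similarity between the two graph-level vectors."""
    g1 = graph_embedding(extraction_graph, node_vecs, edge_vecs)
    g2 = graph_embedding(ontology_graph, node_vecs, edge_vecs)
    return cosine_similarity(g1, g2)
```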
Step 160: determine, based on the similarities, the event corresponding to the text to be processed from among the one or more candidate events.
Specifically, step 160 may be performed by the event determination module.
In some embodiments, the event determination module may determine the candidate event corresponding to the maximum value among the multiple similarities, and use that candidate event as the event corresponding to the text to be processed.
For example, if the similarities between the extraction graph and the ontology definition graph of the preservation-type candidate event, the ontology definition graph of the acquisition-type candidate event, the ontology definition graph of the losing-a-lawsuit-type candidate event, … are 0.8, 0.5, 0.4, … respectively, then the preservation-type event corresponding to the maximum similarity of 0.8 is determined to be the event corresponding to the text to be processed.
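In Python terms, this selection is a simple argmax over the candidate similarities; the values below are the ones from the example above (illustrative only, not part of the original disclosure).

```python
similarities = {
    "preservation": 0.8,
    "acquisition": 0.5,
    "losing a lawsuit": 0.4,
}
detected_event = max(similarities, key=similarities.get)  # -> "preservation"
```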
In some embodiments, the event determination module may be further configured to determine the event elements in the text to be processed.
Specifically, the event detection system may determine one or more instance triples based on the one or more groups of instance data. An instance triple includes the first entity instance and the second entity instance in the instance data, and the relationship description between their two corresponding entity types. For example, the first instance triple may include the first entity instance "frozen", the second entity instance "equity", and the relationship description "VOB" between their two corresponding entity types, taken from the first group of instance data [preservation verb: frozen, VOB, asset: equity] of the text to be processed, "Company A's equity has been frozen; how can the 12-billion new-energy investment continue?…"; that is, the first instance triple is [frozen, VOB, equity]. Similarly, the second instance triple is [equity, SBV, frozen], the third instance triple is [equity, MOD, 12 billion], …, [Company A, ATT, equity].
Further, the event detection system may determine, based on the one or more instance triples, the event elements of the event corresponding to the text to be processed.
Event elements include the elements that make up an event and the relationships between those elements. In some embodiments, the instance entities of each instance triple may be used as the elements of the event, and the elements of the event may be structurally expressed based on the relationships between the elements.
For example, the elements "frozen", "equity", "12 billion", "Company A", … are first obtained from the first instance triple [frozen, VOB, equity], the second instance triple [equity, SBV, frozen], the third instance triple [equity, MOD, 12 billion], …, [Company A, ATT, equity]; then, based on the relationships "VOB", "SBV", etc. between the elements, the elements are structurally expressed as "Company A's equity frozen, 12 billion …", thereby obtaining the event elements of the preservation-type event.
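The following is a non-limiting, illustrative Python sketch (not part of the original disclosure) of collecting event elements and their relationships from instance triples; it stops short of rendering the structured natural-language expression, which would depend on the target language and template.

```python
def event_elements(instance_triples):
    """Collect the instance entities as event elements together with the
    relationship descriptions that connect them."""
    elements, relations = [], []
    for first_instance, relation, second_instance in instance_triples:
        for inst in (first_instance, second_instance):
            if inst not in elements:
                elements.append(inst)
        relations.append((first_instance, relation, second_instance))
    return {"elements": elements, "relations": relations}

event_elements([
    ("frozen", "VOB", "equity"),
    ("equity", "SBV", "frozen"),
    ("equity", "MOD", "12 billion"),
    ("Company A", "ATT", "equity"),
])
```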
FIG. 3 is a schematic structural diagram of an extraction model according to some embodiments of this specification.
In some embodiments, the extraction module may process the text to be processed with the extraction model to obtain a tag sequence and a relation matrix of the text to be processed.
As shown in FIG. 3, the extraction model 300 may include a feature extraction layer 310, a tag sequence layer 320, and a relation identification layer 330.
Specifically, the feature extraction layer 310 may extract feature vectors of the text to be processed.
In some embodiments, the feature extraction layer 310 may encode the text to be processed to obtain feature vectors that fuse the information of the text to be processed.
In some embodiments, before the feature extraction layer 310 encodes the text to be processed, the text may be processed as follows: a [CLS] token is added before the text to be processed, and the sentences in the text are separated from each other by the separator [SEP] for distinction. For example, the text to be processed "A公司的股权被冻结，120亿新能源投资何以为继" (Company A's equity has been frozen; how can the 12-billion new-energy investment continue) becomes "[CLS]A公司的股权被冻结[SEP]120亿新能源投资何以为继" after processing.
In some embodiments, the feature extraction layer 310 may obtain corresponding character vectors and position vectors based on the text to be processed.
A character vector (token embedding) is a vector representing the character feature information of the text to be processed. As shown in FIG. 3, the feature information of the 22 characters [w1] [w2] … [wa1] [wa2] … [was] of the text to be processed "A公司的股权被冻结，120亿新能源投资何以为继" may be represented by 22 character vectors [t1] [t2] … [ta1] [ta2] … [tas] respectively. As an example, the feature information of the character "A" may be represented by the character vector [2, 3, 3]; in practical application scenarios, the dimensionality of the vector representation may be much higher. In some embodiments, the character vectors may be obtained by looking up a word vector table or by a word embedding model. In some embodiments, the word embedding model may include, but is not limited to, the Word2vec model, the Term Frequency-Inverse Document Frequency (TF-IDF) model, or the SSWE-C (skip-gram based combined-sentiment word embedding) model.
A position vector (position embedding) is a vector reflecting the position of a character in the text to be processed, for example indicating that the character is the first character or the second character of the text. In some embodiments, the position vectors of the text to be processed may be obtained by sine-cosine encoding. In some embodiments, a segment vector (segment embedding) may also be included, reflecting the segment in which a character is located; for example, the character "A" is located in the first sentence (segment) of the text to be processed.
Further, the feature extraction layer 310 may first fuse the various vectors of the text to be processed, for example by concatenation or superposition, and then encode the fused vectors to obtain the feature vectors.
As shown in FIG. 3, the feature extraction layer 310 may obtain the feature vectors [T1] [T2] … [Ta1] [Ta2] … [Tas] based on the character vectors [t1] [t2] … [ta1] [ta2] … [tas] and the position vectors (not shown).
An exemplary feature extraction layer may be implemented by a BERT model or a Transformer.
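The following is a non-limiting, illustrative Python sketch (not part of the original disclosure) of obtaining per-token feature vectors with a BERT encoder via the Hugging Face transformers library; the checkpoint name "bert-base-chinese" is an assumption standing in for feature extraction layer 310.

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
encoder = BertModel.from_pretrained("bert-base-chinese")

text = "A公司的股权被冻结，120亿新能源投资何以为继"
inputs = tokenizer(text, return_tensors="pt")  # adds [CLS] and [SEP] automatically

with torch.no_grad():
    outputs = encoder(**inputs)

# One feature vector per token, analogous to [T1] [T2] ... in FIG. 3.
token_features = outputs.last_hidden_state  # shape: (1, sequence_length, 768)
```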
Further, the tag sequence layer 320 may obtain the tag sequence based on the feature vectors.
The tag sequence is the result of arranging, in order, the tags corresponding respectively to the multiple characters or words in the text to be processed. In some embodiments, the tags may include entity tags, which indicate whether the corresponding character or word belongs to an entity instance; further, the entity tags may be subdivided, for example into company-entity tags and asset-entity tags, so as to further indicate the entity type to which the corresponding character or word belongs. Thus, the tag sequence can be used to mark the characters or words in the text to be processed that belong to entity instances, and the entity types to which those characters or words belong.
In some embodiments, the entity tags may be at least one of Chinese characters, digits, letters, symbols, and the like. For example, B may denote the first character or first word of an entity instance, and I may denote a non-first character or non-first word of an entity instance. As another example, the entity tags B-co or I-co may mark characters or words in the text to be processed whose entity type is "company entity". As yet another example, the entity tags B-pro or I-pro may mark characters or words whose entity type is "asset".
As shown in FIG. 3, based on the feature vectors [T1] [T2] … [Ta1] [Ta2] … [Tas], the tag sequence layer 320 may mark the entity instances "Company A, equity, frozen, …" in the text to be processed as "B-co, I-co, B-pro, I-pro, B-pre, I-pre, …", respectively denoting "the first character of a company entity, a non-first character of a company entity, the first character of an asset, a non-first character of an asset, the first character of a preservation verb, a non-first character of a preservation verb".
In some embodiments, the tags may also include non-entity tags. A non-entity tag may likewise be at least one of Chinese characters, digits, letters, symbols, and the like. The characters or words in the text to be processed that do not belong to entity instances may be marked with the same non-entity tag. As shown in FIG. 3, the tag sequence layer 320 marks the four characters of the phrase "何以为继" ("how can it continue"), which does not belong to any entity instance, with four "O" tags. In some embodiments, the characters or words that do not belong to entity instances may also be left without any tag.
Specifically, the tag sequence layer 320 may obtain, based on the feature vectors, the probabilities that each character or word in the text to be processed belongs to each of the different entity types and the probability that it does not belong to any entity, and then use the entity tag of the entity type corresponding to the maximum probability, or the non-entity tag indicating that it does not belong to an entity, as the tag of that character or word.
Taking FIG. 3 as an example, based on the feature vector [T1], the tag sequence layer 320 may obtain that the probability that "A" is the first character of a company entity is 0.8, the probability that it is a non-first character of a company entity is 0.5, the probability that it is the first character of a person is 0.3, the probability that it is a non-first character of a person is 0.3, the probability that it is the first character of a preservation verb is …, and the probability that it does not belong to any entity is 0.2, and then use the entity tag "B-co" of the first character of the entity type "company entity", corresponding to the maximum probability 0.8, as the entity tag of the character "A".
Similarly, the tag sequence layer 320 may obtain the tag (entity tag or non-entity tag) of each character or word in the text to be processed "A公司的股权被冻结，120亿新能源投资何以为继" and arrange the tags in the order of the characters or words in the text, thereby obtaining the tag sequence: "B-co", "I-co" … "B-pro", "I-pro" … "B-pre", "I-pre" … "O", "O", "O", "O".
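The following is a non-limiting, illustrative Python sketch (not part of the original disclosure) of a per-token tagging head that picks the most probable tag, as described above; the label set LABELS is a reduced, assumed set, and token_features is the tensor from the BERT sketch earlier.

```python
import torch
import torch.nn as nn

LABELS = ["B-co", "I-co", "B-pro", "I-pro", "B-pre", "I-pre", "O"]  # illustrative label set

class TagSequenceLayer(nn.Module):
    """Per-token classifier over entity/non-entity tags (a stand-in for layer 320)."""
    def __init__(self, hidden_size: int = 768, num_labels: int = len(LABELS)):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, token_features: torch.Tensor):
        logits = self.classifier(token_features)   # (batch, seq_len, num_labels)
        probs = torch.softmax(logits, dim=-1)      # per-token label probabilities
        tag_ids = probs.argmax(dim=-1)             # keep the most probable tag
        return [[LABELS[i] for i in seq] for seq in tag_ids.tolist()]

tagger = TagSequenceLayer()
tags = tagger(token_features)  # token_features from the BERT sketch above
```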
In some embodiments, the tag sequence layer 320 may include, but is not limited to, one or more of an N-gram model, a Conditional Random Fields (CRF) model, and a Hidden Markov Model (HMM).
Furthermore, the relation identification layer 330 may obtain the relation matrix based on the feature vectors and the tag sequence.
The relation matrix may be used to mark the relationship description between any two characters or words in the text to be processed. In some embodiments, each element of the relation matrix may mark the relationship description between two characters or words. The dimensions of the relation matrix are determined based on the number of characters of the text to be processed. For example, if the text to be processed contains N characters or words, the dimensions of the relation matrix are N×N. Each 1×N vector of the relation matrix can mark the relationship descriptions between one character or word of the text and all the other characters or words, and the N vectors of size 1×N can respectively mark the relationship descriptions between the N characters or words of the text and all the other characters or words.
In some embodiments, the relation identification layer 330 may first embed each tag of the tag sequence into the corresponding feature vector through a word embedding network to obtain tagged vectors.
As shown in FIG. 3, the relation identification layer 330 may first embed the tags "B-co", "I-co" … "B-pro", "I-pro" … "B-pre", "I-pre" … "O", "O", "O", "O" of the tag sequence into the feature vectors [T1] [T2] … [Ta1] [Ta2] … [Tas] respectively, obtaining the tagged vectors [e1] [e2] … [ea1] [ea2] … [eas].
Further, the relation identification layer 330 may obtain the relation matrix from the tagged vectors through a multi-head non-linear activation layer (multi-sigmoid layer). For more details about the relation matrix, reference may also be made to the related description of step 120.
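The following is a non-limiting, illustrative Python sketch (not part of the original disclosure) of one possible pairwise relation-scoring head in the spirit of the multi-sigmoid layer described above. It is a simplified stand-in, not the exact architecture of the specification: LABELS is the label set from the tagging sketch, RELATIONS is a reduced assumed relation set, tag_ids is an integer tensor of shape (1, seq_len), and the hidden sizes and threshold are arbitrary assumptions.

```python
import torch
import torch.nn as nn

RELATIONS = ["VOB", "SBV", "ATT", "MOD"]  # illustrative relation labels; None plays the role of null

class RelationIdentificationLayer(nn.Module):
    """Pairwise relation scorer over tagged token vectors (a stand-in for layer 330)."""
    def __init__(self, hidden_size: int = 768, tag_dim: int = 32):
        super().__init__()
        self.tag_embedding = nn.Embedding(len(LABELS), tag_dim)   # embed each tag into its token vector
        self.pair_scorer = nn.Linear(2 * (hidden_size + tag_dim), len(RELATIONS))

    def forward(self, token_features: torch.Tensor, tag_ids: torch.Tensor, threshold: float = 0.5):
        tagged = torch.cat([token_features, self.tag_embedding(tag_ids)], dim=-1)  # [e1] [e2] ...
        n = tagged.size(1)
        rows = tagged.unsqueeze(2).expand(-1, n, n, -1)
        cols = tagged.unsqueeze(1).expand(-1, n, n, -1)
        scores = torch.sigmoid(self.pair_scorer(torch.cat([rows, cols], dim=-1)))  # multi-sigmoid scores
        best = scores.argmax(dim=-1)
        return [[RELATIONS[best[0, i, j]] if scores[0, i, j].max() > threshold else None
                 for j in range(n)] for i in range(n)]   # N x N relation matrix with None as null
```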
The above embodiments give one implementation structure of the extraction model. In still other embodiments, the extraction model may be implemented based on an end-to-end model, for example a BERT-based multi-head selection model, the Stanford Chinese syntactic analysis tool StanfordNLP, or the Harbin Institute of Technology Chinese language analysis tool LTP.
In some embodiments, the extraction model may be obtained by training with training samples. Specifically, labeled training samples are input into the extraction model, and the parameters of the extraction model are updated through training.
In some embodiments, the labels of the training samples may be determined based on the entity types and relationship descriptions in the preset ontology definition dataset. Continuing the above example, the label of the training sample "Company X's equity has been frozen" may include the known tag sequence and relation matrix of the training sample, where the known tag sequence indicates that the entity types corresponding to "Company X", "equity" and "frozen" are, respectively, the entity types "company entity", "asset" and "preservation verb" in the preset ontology definition dataset, and the known relation matrix indicates the relationship description "ATT" between "company entity" and "asset", the relationship description "SBV" between "asset" and "preservation verb", and so on.
Further, the extraction model may be used to process the training samples to obtain the tag sequence and relation matrix predicted by the model, and the parameters of the extraction model are adjusted with reference to the known tag sequence and relation matrix so as to reduce the difference between the predicted results and the known labels.
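The following is a non-limiting, illustrative Python sketch (not part of the original disclosure) of such a training step. It assumes the encoder and tagger modules from the earlier sketches, a hypothetical relation_head(token_features, tag_ids) that returns the raw multi-sigmoid scores of shape (batch, seq_len, seq_len, num_relations), and a train_loader that yields tokenized inputs, gold tag ids, and gold relation targets of matching shape; the loss choices and learning rate are assumptions.

```python
import torch
import torch.nn as nn

tag_loss_fn = nn.CrossEntropyLoss()
rel_loss_fn = nn.BCELoss()
params = list(encoder.parameters()) + list(tagger.parameters()) + list(relation_head.parameters())
optimizer = torch.optim.AdamW(params, lr=2e-5)

for inputs, gold_tag_ids, gold_relation_targets in train_loader:
    token_features = encoder(**inputs).last_hidden_state
    tag_logits = tagger.classifier(token_features)                 # predicted tag scores
    rel_scores = relation_head(token_features, gold_tag_ids)       # predicted multi-sigmoid scores

    # Compare predictions against the known tag sequence and relation matrix.
    loss = (tag_loss_fn(tag_logits.view(-1, tag_logits.size(-1)), gold_tag_ids.view(-1))
            + rel_loss_fn(rel_scores, gold_relation_targets))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```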
For a detailed description of the preset ontology definition dataset, reference may be made to step 140, which is not repeated here.
Embodiments of this specification also provide a computer-readable storage medium. The storage medium stores computer instructions, and after a computer reads the computer instructions in the storage medium, the computer implements the aforementioned event detection method.
Embodiments of this specification also provide an event detection apparatus. The apparatus includes at least one processor and at least one memory; the at least one memory is configured to store computer instructions; the at least one processor is configured to execute at least some of the computer instructions to implement the aforementioned event detection method.
Possible beneficial effects of the embodiments of this specification include, but are not limited to: (1) events or event elements are detected based on event graph ontology definition data, so that newly emerging events can be accommodated at relatively low cost; (2) the extraction model is trained based on the preset ontology definition dataset, so that the extraction model can map the instance data in the text to be processed to the entity types and relationship descriptions in the event graph ontology definition data, thereby improving the accuracy of subsequent graph matching; (3) the relationship descriptions between entity types are further abstracted and generalized, improving the compatibility of new event detection.
It should be noted that different embodiments may produce different beneficial effects. In different embodiments, the possible beneficial effects may be any one or a combination of the above, or any other beneficial effects that may be obtained.
The basic concepts have been described above. Obviously, for a person skilled in the art, the above detailed disclosure is merely an example and does not constitute a limitation on this specification. Although not explicitly stated here, a person skilled in the art may make various modifications, improvements and amendments to this specification. Such modifications, improvements and amendments are suggested in this specification, and therefore still fall within the spirit and scope of the exemplary embodiments of this specification.
Meanwhile, this specification uses specific words to describe the embodiments of this specification. For example, "one embodiment", "an embodiment" and/or "some embodiments" mean a certain feature, structure or characteristic related to at least one embodiment of this specification. Therefore, it should be emphasized and noted that "an embodiment", "one embodiment" or "an alternative embodiment" mentioned twice or more in different places in this specification does not necessarily refer to the same embodiment. In addition, certain features, structures or characteristics in one or more embodiments of this specification may be combined as appropriate.
In addition, a person skilled in the art will understand that the aspects of this specification may be illustrated and described in terms of several patentable classes or situations, including any new and useful process, machine, product or composition of matter, or any new and useful improvement thereof. Accordingly, the various aspects of this specification may be executed entirely by hardware, entirely by software (including firmware, resident software, microcode, etc.), or by a combination of hardware and software. The above hardware or software may all be referred to as a "data block", "module", "engine", "unit", "component" or "system". In addition, the aspects of this specification may be embodied as a computer product located in one or more computer-readable media, the product including computer-readable program code.
A computer storage medium may contain a propagated data signal containing computer program code, for example on a baseband or as part of a carrier wave. The propagated signal may have multiple manifestations, including an electromagnetic form, an optical form, etc., or a suitable combination. A computer storage medium may be any computer-readable medium other than a computer-readable storage medium, and the medium may be connected to an instruction execution system, apparatus or device to implement communication, propagation or transmission of a program for use. The program code located on a computer storage medium may be propagated through any suitable medium, including radio, cable, fiber-optic cable, RF or similar media, or any combination of the above media.
The computer program code required for the operation of the various parts of this specification may be written in any one or more programming languages, including object-oriented programming languages such as Java, Scala, Smalltalk, Eiffel, JADE, Emerald, C++, C#, VB.NET and Python, conventional procedural programming languages such as C, Visual Basic, Fortran 2003, Perl, COBOL 2002, PHP and ABAP, dynamic programming languages such as Python, Ruby and Groovy, or other programming languages. The program code may run entirely on the user's computer, run on the user's computer as an independent software package, run partly on the user's computer and partly on a remote computer, or run entirely on a remote computer or processing device. In the latter case, the remote computer may be connected to the user's computer through any form of network, such as a local area network (LAN) or a wide area network (WAN), or connected to an external computer (for example, via the Internet), or in a cloud computing environment, or used as a service such as software as a service (SaaS).
In addition, unless explicitly stated in the claims, the order of the processing elements and sequences described in this specification, the use of alphanumeric characters, or the use of other names is not intended to limit the order of the processes and methods of this specification. Although the above disclosure discusses, through various examples, some embodiments of the invention currently considered useful, it should be understood that such details serve only an illustrative purpose, and the appended claims are not limited to the disclosed embodiments; on the contrary, the claims are intended to cover all amendments and equivalent combinations that conform to the essence and scope of the embodiments of this specification. For example, although the system components described above may be implemented by hardware devices, they may also be implemented only by software solutions, such as installing the described system on an existing processing device or mobile device.
It should also be noted that, in order to simplify the presentation of the disclosure of this specification and thereby help the understanding of one or more embodiments of the invention, the foregoing description of the embodiments of this specification sometimes combines multiple features into one embodiment, one drawing or the description thereof. However, this method of disclosure does not mean that the subject matter of this specification requires more features than those mentioned in the claims. In fact, the features of an embodiment are fewer than all the features of the single embodiment disclosed above.
Some embodiments use numbers describing the quantities of components and attributes. It should be understood that such numbers used for the description of the embodiments are, in some examples, qualified by the modifiers "about", "approximately" or "substantially". Unless otherwise stated, "about", "approximately" or "substantially" indicates that the stated number allows a variation of ±20%. Accordingly, in some embodiments, the numerical parameters used in the specification and claims are approximations, which may change depending on the characteristics required by individual embodiments. In some embodiments, the numerical parameters should take into account the specified significant digits and adopt a general method of retaining digits. Although the numerical ranges and parameters used to confirm the breadth of their scope in some embodiments of this specification are approximations, in specific embodiments such numerical values are set as precisely as practicable.
For each patent, patent application, patent application publication and other material cited in this specification, such as articles, books, specifications, publications and documents, the entire contents thereof are hereby incorporated into this specification by reference. Application history documents that are inconsistent with or conflict with the contents of this specification are excluded, as are documents (currently or later appended to this specification) that limit the broadest scope of the claims of this specification. It should be noted that if there is any inconsistency or conflict between the descriptions, definitions and/or use of terms in the materials appended to this specification and the contents of this specification, the descriptions, definitions and/or use of terms in this specification shall prevail.
Finally, it should be understood that the embodiments described in this specification are only used to illustrate the principles of the embodiments of this specification. Other variations may also fall within the scope of this specification. Therefore, by way of example and not limitation, alternative configurations of the embodiments of this specification may be regarded as consistent with the teachings of this specification. Accordingly, the embodiments of this specification are not limited to the embodiments explicitly introduced and described in this specification.

Claims (12)

  1. An event detection method, the method comprising:
    obtaining a text to be processed;
    extracting one or more groups of instance data from the text to be processed based on an extraction model, wherein each group of instance data comprises a first entity instance, a first entity type corresponding to the first entity instance, a second entity instance, a second entity type corresponding to the second entity instance, and a relationship description between the two entity types;
    determining one or more extraction triples based on the one or more groups of instance data, and thereby obtaining an extraction graph, wherein an extraction triple comprises the first entity type, the second entity type and the relationship description between the two entity types in the instance data;
    obtaining graph ontology definition data of one or more candidate events, and obtaining, based thereon, an ontology definition graph corresponding to each candidate event, wherein the graph ontology definition data of an event comprises entity types used to define entities and relationship descriptions used to define relationships between entity types;
    determining similarities between the extraction graph and the ontology definition graphs of the one or more candidate events respectively;
    determining, based on the similarities, the event corresponding to the text to be processed from among the one or more candidate events.
  2. The method of claim 1, further comprising: defining graph ontology definition data for an event, wherein the entity types and relationship descriptions in the graph ontology definition data of the event come from a preset ontology definition dataset.
  3. The method of claim 1 or 2, wherein the extraction model is obtained by training with training samples, and labels of the training samples are determined based on entity types and relationship descriptions in a preset ontology definition dataset.
  4. The method of claim 1, wherein the relationship description comprises one or more of the following relations: a verb-object relation, a subject-verb relation, an attribute relation and a modifying relation.
  5. The method of claim 1, wherein the extracting one or more groups of instance data from the text to be processed based on an extraction model comprises:
    processing the text to be processed with the extraction model to obtain a tag sequence and a relation matrix of the text to be processed;
    determining entity instances in the text to be processed and their entity types based on the tag sequence;
    determining, based on the relation matrix, a relationship description between any two entity instances in the text to be processed, and using it as the relationship description between the corresponding two entity types.
  6. The method of claim 5, wherein the tag sequence is used to mark characters or words in the text to be processed that belong to entity instances, and the entity types to which the characters or words belong; and the relation matrix is used to mark the relationship description between any two characters or words in the text to be processed.
  7. The method of claim 1 or 5, wherein the extraction model comprises one or more of the following models: BERT, Transformer, StanfordNLP or LTP.
  8. The method of claim 1, wherein the determining the similarities between the extraction graph and the ontology definition graphs of the one or more candidate events respectively comprises, for the ontology definition graph of any candidate event:
    processing the extraction graph and the ontology definition graph of the candidate event with a graph matching model to obtain the similarity between the two.
  9. The method of claim 1, further comprising:
    determining one or more instance triples based on the one or more groups of instance data, wherein an instance triple comprises the first entity instance and the second entity instance in the instance data and the relationship description between their respectively corresponding two entity types;
    determining, based on the one or more instance triples, event elements of the event corresponding to the text to be processed.
  10. An event detection system, the system comprising:
    a text obtaining module, configured to obtain a text to be processed;
    an extraction module, configured to extract one or more groups of instance data from the text to be processed based on an extraction model, wherein each group of instance data comprises a first entity instance, a first entity type corresponding to the first entity instance, a second entity instance, a second entity type corresponding to the second entity instance, and a relationship description between the two entity types;
    an extraction graph obtaining module, configured to determine one or more extraction triples based on the one or more groups of instance data and thereby obtain an extraction graph, wherein an extraction triple comprises the first entity type, the second entity type and the relationship description between the two entity types in the instance data;
    an ontology definition graph obtaining module, configured to obtain graph ontology definition data of one or more candidate events and obtain, based thereon, an ontology definition graph corresponding to each candidate event, wherein the graph ontology definition data of an event comprises entity types used to define entities and relationship descriptions used to define relationships between entity types;
    a similarity determination module, configured to determine similarities between the extraction graph and the ontology definition graphs of the one or more candidate events respectively;
    an event determination module, configured to determine, based on the similarities, the event corresponding to the text to be processed from among the one or more candidate events.
  11. A computer-readable storage medium, wherein the storage medium stores computer instructions, and when the computer instructions are executed by a processor, the method of any one of claims 1 to 9 is implemented.
  12. An event detection apparatus, wherein the apparatus comprises at least one processor and at least one memory; the at least one memory is configured to store computer instructions; and the at least one processor is configured to execute at least some of the computer instructions to implement the method of any one of claims 1 to 9.
PCT/CN2022/109834 2021-09-14 2022-08-03 事件检测 WO2023040493A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US18/395,120 US20240143644A1 (en) 2021-09-14 2023-12-22 Event detection

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111075599.1 2021-09-14
CN202111075599.1A CN113779358B (zh) 2021-09-14 2021-09-14 一种事件检测方法和系统

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US18/395,120 Continuation US20240143644A1 (en) 2021-09-14 2023-12-22 Event detection

Publications (1)

Publication Number Publication Date
WO2023040493A1 true WO2023040493A1 (zh) 2023-03-23

Family

ID=78843676

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/109834 WO2023040493A1 (zh) 2021-09-14 2022-08-03 事件检测

Country Status (3)

Country Link
US (1) US20240143644A1 (zh)
CN (1) CN113779358B (zh)
WO (1) WO2023040493A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116109142A (zh) * 2023-04-03 2023-05-12 航科广软(广州)数字科技有限公司 基于人工智能的危险废物监管方法、系统及装置
CN116501898A (zh) * 2023-06-29 2023-07-28 之江实验室 适用于少样本和有偏数据的金融文本事件抽取方法和装置

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113779358B (zh) * 2021-09-14 2024-05-24 支付宝(杭州)信息技术有限公司 一种事件检测方法和系统
CN114528418B (zh) * 2022-04-24 2022-10-14 杭州同花顺数据开发有限公司 一种文本处理方法、系统和存储介质

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180082183A1 (en) * 2011-02-22 2018-03-22 Thomson Reuters Global Resources Machine learning-based relationship association and related discovery and search engines
CN110968700A (zh) * 2019-11-01 2020-04-07 数地科技(北京)有限公司 一种融合多类事理与实体知识的领域事件图谱构建方法和装置
CN111177315A (zh) * 2019-12-19 2020-05-19 北京明略软件系统有限公司 知识图谱的更新方法、装置及计算机可读存储介质
CN113191497A (zh) * 2021-05-28 2021-07-30 国家电网有限公司 一种面向变电站踏勘选址的知识图谱构建方法和系统
CN113779358A (zh) * 2021-09-14 2021-12-10 支付宝(杭州)信息技术有限公司 一种事件检测方法和系统

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101306667B1 (ko) * 2009-12-09 2013-09-10 한국전자통신연구원 지식 그래프 정제 장치 및 방법
US10210246B2 (en) * 2014-09-26 2019-02-19 Oracle International Corporation Techniques for similarity analysis and data enrichment using knowledge sources
US11973777B2 (en) * 2018-07-09 2024-04-30 Siemens Aktiengesellschaft Knowledge graph for real time industrial control system security event monitoring and management
CN109657037A (zh) * 2018-12-21 2019-04-19 焦点科技股份有限公司 一种基于实体类型和语义相似度的知识图谱问答方法及系统
CN110275965B (zh) * 2019-06-27 2021-12-21 卓尔智联(武汉)研究院有限公司 假新闻检测方法、电子装置及计算机可读存储介质
CN111291161A (zh) * 2020-02-20 2020-06-16 平安科技(深圳)有限公司 法律案件知识图谱查询方法、装置、设备及存储介质
CN111368175B (zh) * 2020-05-27 2020-08-28 支付宝(杭州)信息技术有限公司 一种事件抽取方法和系统及实体分类模型
CN112241457A (zh) * 2020-09-22 2021-01-19 同济大学 一种融合扩展特征的事理知识图谱事件检测方法
CN112632224B (zh) * 2020-12-29 2023-01-24 天津汇智星源信息技术有限公司 基于案例知识图谱的案件推荐方法、装置和电子设备
CN113239210B (zh) * 2021-05-25 2022-09-27 河海大学 基于自动化补全知识图谱的水利文献推荐方法及系统

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180082183A1 (en) * 2011-02-22 2018-03-22 Thomson Reuters Global Resources Machine learning-based relationship association and related discovery and search engines
CN110968700A (zh) * 2019-11-01 2020-04-07 数地科技(北京)有限公司 一种融合多类事理与实体知识的领域事件图谱构建方法和装置
CN111177315A (zh) * 2019-12-19 2020-05-19 北京明略软件系统有限公司 知识图谱的更新方法、装置及计算机可读存储介质
CN113191497A (zh) * 2021-05-28 2021-07-30 国家电网有限公司 一种面向变电站踏勘选址的知识图谱构建方法和系统
CN113779358A (zh) * 2021-09-14 2021-12-10 支付宝(杭州)信息技术有限公司 一种事件检测方法和系统

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116109142A (zh) * 2023-04-03 2023-05-12 航科广软(广州)数字科技有限公司 基于人工智能的危险废物监管方法、系统及装置
CN116501898A (zh) * 2023-06-29 2023-07-28 之江实验室 适用于少样本和有偏数据的金融文本事件抽取方法和装置
CN116501898B (zh) * 2023-06-29 2023-09-01 之江实验室 适用于少样本和有偏数据的金融文本事件抽取方法和装置

Also Published As

Publication number Publication date
US20240143644A1 (en) 2024-05-02
CN113779358A (zh) 2021-12-10
CN113779358B (zh) 2024-05-24

Similar Documents

Publication Publication Date Title
WO2023040493A1 (zh) 事件检测
CN106776711B (zh) 一种基于深度学习的中文医学知识图谱构建方法
US9183274B1 (en) System, methods, and data structure for representing object and properties associations
US8452772B1 (en) Methods, systems, and articles of manufacture for addressing popular topics in a socials sphere
CN111680173A (zh) 统一检索跨媒体信息的cmr模型
WO2021139247A1 (zh) 医学领域知识图谱的构建方法、装置、设备及存储介质
CN112069298A (zh) 基于语义网和意图识别的人机交互方法、设备及介质
WO2021121198A1 (zh) 基于语义相似度的实体关系抽取方法、装置、设备及介质
US10706045B1 (en) Natural language querying of a data lake using contextualized knowledge bases
CN107180026B (zh) 一种基于词嵌入语义映射的事件短语学习方法及装置
WO2021190662A1 (zh) 医学文献排序方法、装置、电子设备及存储介质
CN111858940A (zh) 一种基于多头注意力的法律案例相似度计算方法及系统
CN109522396B (zh) 一种面向国防科技领域的知识处理方法及系统
CN114764566B (zh) 用于航空领域的知识元抽取方法
CN113849657A (zh) 一种智慧监管黑匣子的结构化资料处理方法
CN114153994A (zh) 医保信息问答方法及装置
CN115827819A (zh) 一种智能问答处理方法、装置、电子设备及存储介质
WO2022267460A1 (zh) 基于事件的情感分析方法、装置、计算机设备及存储介质
CN117454884B (zh) 历史人物信息纠错方法、系统、电子设备和存储介质
KR20220074576A (ko) 마케팅 지식 그래프 구축을 위한 딥러닝 기반 신조어 추출 방법 및 그 장치
CN117112727A (zh) 适用于云计算业务的大语言模型微调指令集构建方法
CN113807102B (zh) 建立语义表示模型的方法、装置、设备和计算机存储介质
WO2021063089A1 (zh) 规则匹配方法、规则匹配装置、存储介质及电子设备
CN114417008A (zh) 一种面向建设工程领域的知识图谱构建方法及系统
KR20220074572A (ko) 딥러닝 기반 신조어 추출 방법 및 그 장치

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22868870

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE