CN114528418A - Text processing method, system and storage medium - Google Patents
Text processing method, system and storage medium
- Publication number
- CN114528418A (application CN202210433223.1A)
- Authority
- CN
- China
- Prior art keywords
- entity
- text
- processed
- type
- label
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
- G06F16/367—Ontology
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/335—Filtering based on additional data, e.g. user or group profiles
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295—Named entity recognition
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Computational Linguistics (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Life Sciences & Earth Sciences (AREA)
- Animal Behavior & Ethology (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- General Health & Medical Sciences (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The application discloses a text processing method, system, and storage medium, wherein the method comprises the following steps: acquiring a text to be processed; extracting a first entity from the text to be processed by using a first extraction model, and extracting a second entity satisfying a predefined relationship from the text to be processed based on the first entity, to obtain at least one type A entity triple, where each type A entity triple comprises a first entity, a second entity, and the predefined relationship between the first entity and the second entity; extracting a plurality of third entities from the text to be processed by using a second extraction model, and determining the open relationship between any two third entities, to obtain a plurality of type B entity triples, where each type B entity triple comprises two third entities and the open relationship between the two third entities; and acquiring target entity triples from the type A entity triples and the type B entity triples based on screening rules.
Description
Technical Field
The present disclosure relates to the field of natural language processing technologies, and in particular, to a text processing method, system, and storage medium.
Background
Text is an important way for people to acquire knowledge and information, and with the rapid development of Internet technology, the number of texts has grown explosively. To help a computer better understand text, and thereby help humans process massive amounts of text information, information in a text can be represented by entity triples, each composed of two entities and the relationship between the two entities, so that knowledge graphs and knowledge bases can be built from massive text collections. However, extraction models that obtain entity triples from text information are limited by the predefined relationships between entities and/or by training corpora, which results in low applicability and requires substantial human resources to predefine relationship types and/or label corpora.
Accordingly, it is desirable to provide a text processing method, system, and storage medium that can improve both the efficiency and accuracy of text processing.
Disclosure of Invention
One aspect of the present specification provides a text processing method, including: acquiring a text to be processed; extracting a first entity from the text to be processed by using a first extraction model, and extracting a second entity satisfying a predefined relationship from the text to be processed based on the first entity, to obtain at least one type A entity triple, where each type A entity triple includes a first entity, a second entity, and the predefined relationship between the first entity and the second entity; extracting a plurality of third entities from the text to be processed by using a second extraction model, and determining the open relationship between any two third entities, to obtain a plurality of type B entity triples, where each type B entity triple includes two third entities and the open relationship between the two third entities; and acquiring target entity triples from the type A entity triples and the type B entity triples based on screening rules.
Another aspect of the present specification provides a text processing system, the system comprising: a text acquisition module configured to acquire a text to be processed; a class A extraction module configured to extract a first entity from the text to be processed by using a first extraction model, and extract a second entity satisfying a predefined relationship from the text to be processed based on the first entity, to obtain at least one type A entity triple, where each type A entity triple includes a first entity, a second entity, and the predefined relationship between the first entity and the second entity; a class B extraction module configured to extract a plurality of third entities from the text to be processed by using a second extraction model, and determine the open relationship between any two third entities, to obtain a plurality of type B entity triples, where each type B entity triple includes two third entities and the open relationship between the two third entities; and a screening module configured to acquire target entity triples from the type A entity triples and the type B entity triples based on the screening rules.
Another aspect of the present specification provides a computer-readable storage medium storing computer instructions that, when executed by a processor, implement the text processing method.
Drawings
The present description is further explained by way of exemplary embodiments, which are described in detail with reference to the accompanying drawings. These embodiments are not limiting; in these embodiments, like numerals indicate like structures, wherein:
FIG. 1 is a diagram of an application scenario for a text processing system, according to some embodiments of the present description;
FIG. 2 is an exemplary block diagram of a text processing system according to some embodiments of the present description;
FIG. 3 is an exemplary flow diagram of a text processing method according to some embodiments of the present description;
- FIG. 4 is an exemplary flow diagram illustrating a method for obtaining at least one type A entity triple using the first extraction model according to some embodiments of the present description;
- FIG. 5 is a schematic diagram of the first extraction model according to some embodiments of the present description;
- FIG. 6 is an exemplary flow diagram illustrating the acquisition of a plurality of type B entity triples using the second extraction model according to some embodiments of the present description;
- FIG. 7 is a schematic diagram of the second extraction model according to some embodiments of the present description;
FIG. 8 is a schematic diagram of a structure of an entity extraction layer in accordance with some embodiments of the present description;
FIG. 9 is a schematic diagram of a text processing method according to some embodiments of the present description.
Detailed Description
In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings used in the description of the embodiments will be briefly described below. It is obvious that the drawings in the following description are only examples or embodiments of the present description, and that for a person skilled in the art, the present description can also be applied to other similar scenarios on the basis of these drawings without inventive effort. Unless otherwise apparent from the context, or otherwise indicated, like reference numbers in the figures refer to the same structure or operation.
It should be understood that "system", "device", "unit" and/or "module" as used in this specification is a method for distinguishing different components, elements, parts or assemblies at different levels. However, other words may be substituted by other expressions if they accomplish the same purpose.
As used in this specification and the appended claims, the singular forms "a," "an," and "the" include plural referents unless the context clearly dictates otherwise. In general, the terms "comprises" and "comprising" merely indicate that the explicitly identified steps and elements are included; the steps and elements do not form an exclusive list, and a method or apparatus may also include other steps or elements.
Flow charts are used in this description to illustrate operations performed by a system according to embodiments of the present description. It should be understood that the operations are not necessarily performed exactly in the order shown. Rather, the various steps may be processed in reverse order or simultaneously. Moreover, other operations may be added to the processes, or a certain step or several steps may be removed from them.
In the age of information explosion, a large amount of information appears every day, and its forms of expression are flexible and varied. How to enable a computer to better understand text, and thereby help humans process massive text information, is therefore a problem worth studying. In some embodiments, text information may be represented using entity triples, each consisting of two entities and the relationship between the two entities, so that a computer may construct a knowledge graph, build a knowledge base, and the like, based on a vast amount of text information.
In some embodiments, an extraction model may extract the first entity of an entity triple and then extract a second entity based on a predefined relationship, thereby forming the entity triple. However, an extraction model based on predefined relationships is limited by the number and types of those relationships. When new kinds of entity triples appear, the extraction model may have difficulty extracting them from texts such as news reports, because it has never "seen" the new relationships between entities. For example, after the extraction model is trained on a corpus for the predefined relationship "competition", it can extract two entities related by "competition" from a text to form an entity triple, but it cannot extract two entities related by "cooperation" to form a new entity triple. In some embodiments, an extraction model may instead extract the two entities of an entity triple first and then determine the open relationship between them, thereby forming the entity triple. However, improving the extraction accuracy of an extraction model based on open relationships requires training on a large amount of labeled corpora, which consumes considerable labor and time.
Some embodiments of the present disclosure provide a text processing scheme in which entity triples are jointly extracted by an extraction model based on predefined relationships (i.e., a first extraction model) and an extraction model based on open relationships (i.e., a second extraction model), and the extraction results are used to train the extraction models, so that extraction accuracy can be improved while the human resources and time costs of training the extraction models are reduced.
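As a minimal sketch of this joint-extraction-and-retraining loop (all class and method names here are hypothetical illustrations, not the disclosed system's API):

```python
# Hypothetical sketch: joint extraction with two models, screening, and
# feeding the screened triples back as weak training labels.
def process(texts, first_model, second_model, screening_rules):
    type_a, type_b = [], []
    for text in texts:
        type_a += first_model.extract_predefined(text)  # type A triples
        type_b += second_model.extract_open(text)       # type B triples
    # Screening keeps only the target entity triples (see step 340 below).
    targets = screening_rules.select(type_a + type_b)
    # The screened triples double as training labels, reducing the manual
    # annotation effort described in the background section.
    first_model.train(texts, targets)
    second_model.train(texts, targets)
    return targets
```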
FIG. 1 is a diagram of an application scenario for a text processing system according to some embodiments of the present description. As shown in FIG. 1, the application scenario 100 may include:
The network 140 may connect the various components of the system and/or connect the system with external resource components. The network 140 enables communication between the various components and with other components outside the system, to facilitate the exchange of data and/or information. In some embodiments, the network 140 may be any one or more of a wired network or a wireless network. The network connection between the parts may follow a single path or multiple paths. In some embodiments, the network may have a point-to-point, shared, centralized, or other topology, or a combination of such topologies. In some embodiments, the network 140 may include one or more network access points. For example, the network 140 may include wired or wireless network access points, such as base stations and/or network exchange points 140-1, 140-2, …, through which one or more components of the application scenario 100 may connect to the network 140 to exchange data and/or information.
In some embodiments, the storage device 120 may be included in the processor 110, the user terminal 130, or other possible system components. In some embodiments, the processor 110 may be included in the user terminal 130 or other possible system components.
FIG. 2 is a block diagram of a text processing system according to some embodiments of the present description.
In some embodiments, a text acquisition module 210, a class a extraction module 220, a class B extraction module 230, a filtering module 240, and a training module 250 may be included in the text processing system 200.
The text obtaining module 210 may be configured to obtain a text to be processed.
The class A extraction module 220 may be configured to extract a first entity from the text to be processed by using a first extraction model, and extract a second entity satisfying a predefined relationship from the text to be processed based on the first entity, so as to obtain at least one type A entity triple. In some embodiments, each type A entity triple may include a first entity, a second entity, and the predefined relationship between the first entity and the second entity. In some embodiments, the class A extraction module 220 may be used to perform one or more of the following operations: acquiring the first entity and a first joint encoding of the text to be processed; acquiring an entity labeling sequence of the text to be processed corresponding to each predefined relationship based on the first joint encoding; and extracting the second entity corresponding to each predefined relationship according to the entity labeling sequence of the text to be processed corresponding to that predefined relationship. In some embodiments, the entity labels may be used to indicate characters and/or words in the text to be processed that correspond to the predefined relationship. In some embodiments, the first entity and/or the second entity may be financial entities. In some embodiments, the types of financial entities may include companies, persons, industries, indicators, values, and addresses. In some embodiments, the first extraction model may include one or more of the following models: BERT, Transformer, Stanford NLP, or LTP.
The class B extraction module 230 may be configured to extract a plurality of third entities from the text to be processed by using a second extraction model, and determine the open relationship between any two third entities to obtain a plurality of type B entity triples. In some embodiments, each type B entity triple may include two third entities and the open relationship between the two third entities. In some embodiments, the class B extraction module 230 may be used to perform one or more of the following operations: adding a first label and a second label to each third entity in the text to be processed to obtain a labeled text, and obtaining a corresponding labeled-text representation vector based on the labeled text; acquiring a corresponding label encoding vector based on the labeled-text representation vector; acquiring a second joint encoding corresponding to any two third entities according to the label encoding vector; and acquiring the open relationship between any two third entities based on the second joint encoding. In some embodiments, the first label and the second label indicate the first character and the last character, respectively, of a third entity. In some embodiments, the class B extraction module 230 may be used to perform one or more of the following operations: obtaining at least one first label vector corresponding to at least one first label in the label encoding vector; acquiring a first label fusion vector based on any two first label vectors corresponding to any two third entities; and acquiring the second joint encoding corresponding to the any two third entities based on the first label fusion vector and the label encoding vector. In some embodiments, the third entity may be a financial entity. In some embodiments, the types of financial entities may include companies, persons, industries, indicators, values, and addresses.
In some embodiments, the second extraction model may include one or more of the following models: BERT, Transformer, Stanford NLP, or LTP.
The screening module 240 may be configured to obtain target entity triples from the class a entity triples and the class B entity triples based on a screening rule. In some embodiments, the screening rules may include a combination of one or more of the following: acquiring a target entity triplet based on the timeliness of the text to be processed corresponding to the type A entity triplet and/or the type B entity triplet; acquiring a target entity triple based on the occurrence frequency of the type A entity triple and/or the type B entity triple in the text to be processed; and/or obtaining the target entity triple according to the scoring result of the type A entity triple and/or the type B entity triple by the scoring model.
The training module 250 may be configured to train the first extraction model and/or the second extraction model by using the text to be processed as a training sample and the target entity triplet as a training label.
FIG. 3 is an exemplary flow diagram of a text processing method according to some embodiments of the present description.
In some embodiments, the text processing method 300 may be performed by a processing device or implemented by a text processing system disposed on a processing device.
As shown in fig. 3, text processing method 300 may include:
step 310, obtaining a text to be processed. In particular, this step 310 may be performed by the text acquisition module 210.
The text to be processed may be text from which entity triples need to be extracted. For example, the text to be processed may be text information in a financial scenario. As another example, the text to be processed may be text information in a robot customer service scenario. For convenience of explanation, the text processing method is described in this specification in conjunction with a financial scenario.
In some embodiments, the text to be processed may include chapter-level text. Illustratively, the text to be processed may include securities research reports, related industry research reports, audit reports, credit reports, announcements, news and current-affairs commentary, and the like. In some embodiments, the text to be processed may include sentence-level text. Illustratively, the text to be processed may include sentences contained in any of the aforementioned chapter-level texts.
In some embodiments, the text obtaining module 210 may obtain the text to be processed directly from the information in the form of words. For example, the text obtaining module 210 may obtain the text to be processed from a text database. For another example, the text obtaining module 210 may also crawl the text to be processed from the text of the web page.
In some embodiments, the text acquisition module 210 may also acquire the text to be processed from image information based on optical character recognition technology. In some embodiments, the text to be processed may also be obtained from speech information based on Automatic Speech Recognition (ASR) technology.
In some embodiments, the text acquisition module 210 may pre-process the text to be processed. In some embodiments, the pre-processing may include, but is not limited to, a combination of one or more of segmentation, deduplication, filtering, and the like.
Segmentation may divide a text to be processed in the form of a long text into a plurality of texts to be processed in the form of short texts. For example, segmentation may divide the aforementioned chapter-level securities research report into multiple sentence-level texts, such as "Company A is located in District F of City E of Province D" and "Company A's main competitors are Company B and Company C, and Company B is also Company C's contract manufacturer", ….
In some embodiments, the text obtaining module 210 may determine the length of the short text after segmentation according to the processing efficiency of the first extraction model and/or the second extraction model on the texts with different lengths, so as to improve the text processing efficiency. For the related description of the first extraction model and the second extraction model, refer to fig. 4 and fig. 5 and the related description thereof, which are not described herein again.
Some embodiments of the present description perform deduplication on the short texts after segmentation, which may improve the deduplication rate and reduce identical texts in the text to be processed.
Deduplication may be the process of removing duplicate text in the text to be processed.
Repeated text may be text with the same and/or similar content. In some embodiments, the text obtaining module 210 may use a word embedding model to obtain a text vector corresponding to each text in the text to be processed, then calculate the similarity between different text vectors, and finally treat the texts corresponding to text vectors whose similarity is greater than a threshold as repeated texts. For a description of the word embedding model, reference may be made to FIG. 4 and its related description, which are not repeated here. In some embodiments, the similarity between text vectors may be characterized by the distance between the text vectors. In some embodiments, the distance may include, but is not limited to: Euclidean distance, Manhattan distance, Chebyshev distance, Minkowski distance, Mahalanobis distance, cosine distance, etc.
For example, the text obtaining module 210 may obtain, from the aforementioned securities research report, short text 1 "Company A is located in District F of City E of Province D", and obtain, from a certain news item, short text 2 "Company A is located in District F of City E of Province D". The text obtaining module 210 may determine that short text 1 and short text 2 are repeated texts, and may remove at least one of them from the text to be processed.
Some embodiments of the present description perform deduplication on the text to be processed, which avoids extracting the same entity triple from identical texts, that is, avoids interference with the subsequent screening of target entity triples based on the occurrence frequency of entity triples, thereby improving the accuracy of text processing. For a detailed description of the screening of target entity triples, reference may be made to the related description of step 340, which is not repeated here.
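A minimal sketch of this embedding-based deduplication, assuming a generic sentence-embedding function (the `embed` interface and the 0.95 threshold are illustrative assumptions, not values fixed by the disclosure):

```python
import numpy as np

def cosine_similarity(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def deduplicate(texts, embed, threshold=0.95):
    """Keep the first occurrence of each group of near-duplicate texts.

    `embed` maps a text to a fixed-size vector, e.g., via a word
    embedding model as described above."""
    kept, kept_vecs = [], []
    for text in texts:
        vec = embed(text)
        # A text is a duplicate if it is too similar to any kept text.
        if all(cosine_similarity(vec, v) <= threshold for v in kept_vecs):
            kept.append(text)
            kept_vecs.append(vec)
    return kept
```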
The filtering can remove invalid text in the text to be processed. The invalid text may be text that does not conform to the target scene.
The target scene may be the application scenario of the text to be processed desired by the user. Illustratively, if a user desires to obtain entity triples in a financial scenario to build a financial relationship knowledge graph, the target scene may be the financial scenario. For example, the invalid text may be the researcher disclaimer in the aforementioned securities research report. As another example, the invalid text may be website links and advertisements in a web page. As another example, the invalid text may be spaces, garbled characters, incorrect characters, and the like in the text to be processed.
In some embodiments, the text obtaining module 210 may identify and remove invalid texts from the texts to be processed, so as to obtain the filtered texts to be processed. In some embodiments, the text obtaining module 210 may obtain the filtered texts to be processed using a classification model. Specifically, the classification model may map the input text to a value or a probability and then obtain a classification result based on that value or probability. Further, the text obtaining module 210 may take texts whose classification result is "financial scene" as the filtered texts to be processed, treat texts with other classification results as invalid texts, and remove the invalid texts from the text to be processed.
Some embodiments of the present description filter the text to be processed, so as to reduce interference of invalid texts on the extraction result, thereby improving the extraction accuracy and the extraction efficiency.
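As an illustrative sketch of such scene-based filtering (the classifier interface and the label set are assumptions for illustration; the disclosure does not fix a specific classification model):

```python
def filter_invalid(texts, classifier, target_scene="financial scene"):
    """Keep only texts the classification model assigns to the target
    scene; everything else is treated as invalid text and removed."""
    return [t for t in texts if classifier.predict(t) == target_scene]
```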
Step 320: extracting a first entity from the text to be processed by using the first extraction model, and extracting a second entity satisfying a predefined relationship from the text to be processed based on the first entity, to obtain at least one type A entity triple. In particular, this step 320 may be performed by the class A extraction module 220.
An entity may be a specific individual in the real world. In some embodiments, the entity may be a financial entity, i.e., an entity in a financial application scenario. For example, an entity may be Company A, Company B, Company C, and so on. As another example, an entity may be Zhang San (a shareholder), Li Si (a director), Wang Er (a legal representative), and so on. As another example, an entity may be the pig farming industry, the medical aesthetics industry, the real estate industry, and so on.
An entity type is a broad abstraction over concrete individuals. In some embodiments, the types of financial entities may include company, person, industry, indicator, value, and address.
In some embodiments, an entity may be an instance that actually exists under the abstraction of an entity type. For example, the entity type "company" may be instantiated by the entities "Company A", "Company B", "Company C", etc.; the entity type "person" by the entities "Zhang San (shareholder)", "Li Si (director)", "Wang Er (legal representative)", etc.; the entity type "industry" by the entities "pig farming industry", "medical aesthetics industry", "real estate industry", etc.; the entity type "indicator" by the entities "total monthly cost", "total annual sales", "total annual profit", etc.; the entity type "value" by the entities "1 million", "100 million", etc.; and the entity type "address" by the entities "District F of City E of Province D", "H Street of District G", etc.
Illustratively, the entities of the text to be processed "Company A is located in District F of City E of Province D" include: Company A and District F of City E of Province D, and the corresponding entity types include: company and address. As yet another example, the entities of the text to be processed "Company A's main competitors are Company B and Company C, and Company B is also Company C's contract manufacturer" include: Company A, Company B, and Company C, and the corresponding entity types include: company, company, and company.
Entities may have relationships between them, which may be described by the relationships between their corresponding entity types. For example, the relationship between the entity type "company" and the entity type "address" may be "located in", and the relationship between the corresponding entity "Company A" and the entity "District F of City E of Province D" may accordingly be "located in".
An entity triple may consist of two entities in the text to be processed and the relationship between the two entities. Illustratively, an entity triple may be represented by the structure [entity, relationship, entity]. As another example, an entity triple may also be represented by the structure [entity, entity, relationship]; the latter form is used in the examples below.
In some embodiments, one or more sets of entity triples may correspond to the text to be processed.
For example, the entity triples corresponding to the text to be processed "Company A is located in District F of City E of Province D" may include: the 1st entity triple [Company A, District F of City E of Province D, located in], etc. For another example, the entity triples corresponding to the text to be processed "Company A's main competitors are Company B and Company C, and Company B is also Company C's contract manufacturer" may include: the 2nd entity triple [Company A, Company B, competition], the 3rd entity triple [Company A, Company C, competition], the 4th entity triple [Company C, Company B, employment], and the like.
In some embodiments, the entities and relationships in the several entity triples may be partially identical. Continuing the above example, the entity "Company A" in the 1st and 2nd entity triples is the same, and the relationship "competition" in the 2nd and 3rd entity triples is the same.
A class a entity triplet may be an entity triplet containing a predefined relationship.
A predefined relationship may be a relationship defined in advance based on entity types. In some embodiments, the predefined relationships may be determined manually based on the relationships between entity types. For example, in a financial application scenario, the relationships between the entity type "company" and the entity type "company" may include competition, cooperation, employment, etc.; the relationships between "company" and the entity type "person" may include employment, control, etc.; and the relationships between "company" and the entity type "address" may include located in, registered in, etc. Accordingly, the predefined relationships corresponding to the entity type "company" may include competition, cooperation, employment, controlled, located in, registered in, etc.
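A minimal illustration of such a predefined relationship schema (the relation inventory below is an assumption drawn from the financial-scenario examples in this description, not an exhaustive list from the disclosure):

```python
# Predefined relationships keyed by (subject entity type, object entity type).
PREDEFINED_RELATIONS = {
    ("company", "company"): ["competition", "cooperation", "employment"],
    ("company", "person"):  ["employment", "control"],
    ("company", "address"): ["located in", "registered in"],
}

def relations_for(entity_type):
    """All predefined relationships in which the given entity type
    can participate as the subject."""
    return [r for (subj, _), rels in PREDEFINED_RELATIONS.items()
            if subj == entity_type for r in rels]
```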
In some embodiments, each class a entity triplet may include a first entity, a second entity, and a predefined relationship between the first entity and the second entity.
The first entity may be an entity extracted from the text to be processed.
In some embodiments, the class A extraction module 220 may extract the first entity from the text to be processed using the first extraction model. Specifically, the class A extraction module 220 may process the text to be processed with the first extraction model to obtain a text labeling sequence of the text to be processed. The text labeling sequence may be used to mark the characters and/or words in the text to be processed that belong to an entity, and the entity types to which they belong. In some embodiments, the first extraction model includes one or more of the following models: BERT, Transformer, Stanford NLP, or LTP.
For a detailed description of the first extraction model, refer to fig. 8 and its related description, which are not repeated herein.
As shown in FIG. 5, the first extraction model processes "Company A's main competitor is Company B …" and obtains the text labeling sequence: "O", "B-co", "I-co", "O", …, "B-co", …. The class A extraction module 220 may obtain the corresponding first entities based on the entity labels "B-co", "I-co", and "B-co" therein: Company A, Company B, …, and their corresponding entity types: company, company, ….
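A minimal sketch of decoding such a BIO-style labeling sequence back into entities (the tag names follow the "B-co"/"I-co"/"O" convention shown above; the helper function itself is an illustration, not the disclosed implementation):

```python
def decode_bio(tokens, tags):
    """Collect (entity_text, entity_type) pairs from a BIO tag sequence,
    where tags such as "B-co"/"I-co" mark the beginning/inside of an
    entity of type "co" (company) and "O" marks non-entity tokens.
    Character-level tokens are joined directly, as in the examples here."""
    entities, current, current_type = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                entities.append(("".join(current), current_type))
            current, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(token)
        else:  # "O" (or a dangling "I-") closes any open entity
            if current:
                entities.append(("".join(current), current_type))
            current, current_type = [], None
    if current:
        entities.append(("".join(current), current_type))
    return entities
```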
The second entity may be an entity extracted from the text to be processed based on the predefined relationship corresponding to the first entity.
In some embodiments, the class A extraction module 220 may use the first extraction model to extract the second entity from the text to be processed based on the predefined relationship corresponding to the first entity. For a detailed description of how the first extraction model extracts the second entity, refer to FIG. 4 and its related description, which are not repeated here.
As shown in FIG. 5, the first extraction model extracts a null result for the second entity from the text to be processed "Company A's main competitor is Company B …" based on the predefined relationship "cooperation" corresponding to the first entity "Company A"; based on the predefined relationship "competition" corresponding to the first entity "Company A", the second entity "Company B", …, may be extracted from the text to be processed "Company A's main competitor is Company B". Further, the class A extraction module 220 may obtain the 1st type A entity triple [Company A, Company B, competition] from the text to be processed "Company A's main competitor is Company B".
It is to be understood that the relationship between the first entity and the second entity is relative.
In some embodiments, the first entity and the second entity may be exchanged to form a new type A entity triple. For example, the first extraction model extracts a null result for the second entity from the text to be processed "Company A's main competitor is Company B …" based on the predefined relationship "cooperation" corresponding to the first entity "Company B"; based on the predefined relationship "competition" corresponding to the first entity "Company B", the second entity "Company A", …, may be extracted from the text to be processed "Company A's main competitor is Company B". Further, the class A extraction module 220 may obtain the 2nd type A entity triple [Company B, Company A, competition] from the text to be processed "Company A's main competitor is Company B …".
In some embodiments, the relationship between the exchanged first entity and second entity may change.
For example, the first extraction model may extract a first entity "a shift", "b", and "c" from the to-be-processed text "a shift is a main competitor is b and c, and b is a foundry of c at the same time", and then extract a second entity "b" from the to-be-processed text based on the predefined relationship "employment" corresponding to the first entity "c", thereby obtaining a 3 rd group of entity triples of type a [ c, b, employment ]; and taking the result of the second entity from the text to be processed as null based on the predefined relationship 'employment' corresponding to the first entity 'B', and taking the second entity 'C' from the text to be processed based on the predefined relationship 'employed' corresponding to the first entity 'B', so as to obtain a 4 th group of entity triples of class A [ B, C, employed ].
For another example, the first extraction model may extract the first entities "Company A" and "District F of City E of Province D" from the text to be processed "Company A is located in District F of City E of Province D", and then extract the second entity "District F of City E of Province D" from the text to be processed based on the predefined relationship "located in" corresponding to the first entity "Company A", thereby obtaining the 5th type A entity triple [Company A, District F of City E of Province D, located in]; the first entity "District F of City E of Province D" may have no corresponding predefined relationship, or the result of extracting a second entity from the text to be processed based on all predefined relationships corresponding to it may be null. That is, "District F of City E of Province D" and "Company A" cannot respectively serve as the first entity and the second entity of a type A entity triple.
Step 330: extracting a plurality of third entities from the text to be processed by using the second extraction model, and determining the open relationship between any two third entities to obtain a plurality of type B entity triples. In particular, this step 330 may be performed by the class B extraction module 230.
A type B entity triple may be an entity triple containing an open relationship.
An open relationship may be a relationship that is not predefined. In some embodiments, the open relationship may be obtained based on any two third entities.
The third entity may be an entity extracted from the text to be processed. In some embodiments, the class B extraction module 230 may extract the third entity from the text to be processed using a second extraction model. For a detailed description of the extraction of the third entity, reference may be made to the related description of the extraction of the first entity in step 320, which is not described herein again.
As shown in FIG. 7, the second extraction model processes "Company A's main competitor is Company B …" and obtains the text labeling sequence: "O", "B-co", "I-co", "O", …, "B-co", "O", …. The class B extraction module 230 may obtain the corresponding third entities based on the entity labels "B-co", "I-co", and "B-co" therein: Company A and Company B, ….
Further, in some embodiments, the second extraction model may determine an open relationship between any two third entities based on the any two third entities and the text to be processed.
Specifically, the class B extraction module 230 may process the text to be processed by using the second extraction model, so as to obtain a relationship labeling sequence of the text to be processed corresponding to any two third entities. The relation labeling sequence can be used for labeling characters and/or words corresponding to the open relation in the text to be processed. Further, the second extraction model may determine an open relationship between any two third entities in the text to be processed based on the relationship labeling sequence. For a detailed description of the second extraction model, refer to fig. 6 and its related description, which are not repeated herein.
In some embodiments, each type B entity triple may include two third entities and the open relationship between the two third entities. For example, the class B extraction module 230 may obtain the 1st type B entity triple [Company A, Company B, competition] based on the third entities "Company A" and "Company B" and the open relationship "competition" between them.
It will be appreciated that the relationship between the two third entities is relative.
In some embodiments, the positions of the two third entities may be exchanged to form a new type B entity triple. For example, the positions of the two third entities in the 1st type B entity triple may be exchanged to obtain the 2nd type B entity triple, denoted [Company B, Company A, competition].
In some embodiments, after the positions of the two third entities are exchanged, the open relationship between them may change accordingly. For example, the second extraction model may extract the third entities "Company A", "Company B", and "Company C" from the text to be processed "Company A's main competitors are Company B and Company C, and Company B is also Company C's contract manufacturer"; based on the third entities "Company B" and "Company C", the open relationship "contract manufacturing" between them may be obtained, giving the 3rd type B entity triple [Company B, Company C, contract manufacturing]; based on the third entities "Company C" and "Company B", the open relationship "commissioned processing" between them may be obtained, giving the 4th type B entity triple [Company C, Company B, commissioned processing].
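A minimal sketch of this pair-wise open relation extraction (the `extract_entities` and `label_relation` method names are illustrative assumptions; the disclosure realizes this with entity labels and joint encodings, as detailed in FIGS. 6-7):

```python
from itertools import permutations

def extract_type_b_triples(text, second_model):
    """For every ordered pair of third entities, ask the model to label
    the relation span in the text; keep the pairs with a non-empty span.
    Ordered pairs matter because exchanging the two entities may change
    the open relationship, as in the example above."""
    entities = second_model.extract_entities(text)
    triples = []
    for e1, e2 in permutations(entities, 2):
        relation = second_model.label_relation(text, e1, e2)  # "" if none
        if relation:
            triples.append((e1, e2, relation))  # [entity, entity, relation]
    return triples
```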
Step 340: acquiring target entity triples from the type A entity triples and the type B entity triples based on the screening rules. In particular, this step 340 may be performed by the screening module 240.
The target entity triples may be entity triples that satisfy the extraction requirements.
The screening rules may be rules for determining target entity triples.
In some embodiments, the filtering rule may include obtaining a target entity triplet based on timeliness of the text to be processed corresponding to the type a entity triplet and/or the type B entity triplet.
The timeliness can reflect the influence of the freshness of the text to be processed on the screening result. In some embodiments, the timeliness of the text to be processed may be assessed using the timeliness indicator. In some embodiments, the timeliness indicator may be determined based on the time indicator and the effectiveness indicator.
The time index may reflect the recency of the text to be processed. In some embodiments, the time indicator may include, but is not limited to, one or more of a publication time indicator, an occurrence time indicator, and an acquisition time indicator, among others.
The publication time indicator may be the interval between the publication time of the text to be processed and the current time. The publication time of the text to be processed may be the time at which a news item was published, the time at which a securities research report was uploaded to a website, the announcement time of an audit report, and the like. In some embodiments, the screening module 240 may obtain the publication time of the text to be processed by accessing time information in a database and/or by crawling website information.
The occurrence time indicator may be the interval between the occurrence time of the event described by the text to be processed and the current time. The occurrence time of the event described by the text to be processed may be, for example, the time at which an event reported in the news occurred. In some embodiments, the screening module 240 may identify text in a time format within the text to be processed, so as to obtain the occurrence time of the described event.
The acquisition time indicator may be the interval between the time at which the text processing system acquired the text to be processed and the current time. For example, the time at which the text to be processed was acquired may be the time at which the text processing system 200 crawled it from a website. In some embodiments, the screening module 240 may directly record the acquisition time when acquiring the text to be processed.
Illustratively, suppose the current time is January 20, 2022, and the text to be processed is a news item published on January 2, 2022, which reads "On January 3, 2022, Company A completed an acquisition …", and which the text processing system 200 crawled from a news website on January 10, 2022. The publication time, the occurrence time of the described event, and the acquisition time of the text to be processed are then January 2, 2022, January 3, 2022, and January 10, 2022, respectively, and the corresponding publication time indicator, occurrence time indicator, and acquisition time indicator may be 18d, 17d, and 10d, respectively.
It will be appreciated that the larger the value of the time indicator, the older the text to be processed.
In some embodiments, different texts to be processed may yield different time indicators. For example, the aforementioned news item yields all 3 corresponding time indicators, whereas if a securities research report does not describe the occurrence time of an event, only its corresponding publication time indicator (e.g., 20d) and acquisition time indicator (e.g., 9d) can be obtained.
In some embodiments, the screening module 240 may set weights for the different time indicators and combine the available time indicators based on the weights to obtain a final time indicator. For example, the screening module 240 may set weights of 0.4, 0.5, and 0.1 for the publication time indicator, the occurrence time indicator, and the acquisition time indicator, respectively. Continuing the above example, the time indicator corresponding to the news item is (18 × 0.4 + 17 × 0.5 + 10 × 0.1)/3 = 5.6d, and the time indicator corresponding to the securities research report is (20 × 0.4 + 9 × 0.1)/2 = 4.5d.
The effectiveness indicator may reflect how long the text to be processed influences the screening result. It can be understood that the larger the effectiveness indicator, the longer the influence of the text to be processed on the screening result lasts.
In some embodiments, the effectiveness indicator may be determined based on the type of the text to be processed. Illustratively, the effectiveness indicators corresponding to a securities research report, a monthly audit report, and news may be 60d, 30d, and 30d, respectively.
In some embodiments, the timeliness indicator may be the ratio of the effectiveness indicator to the time indicator. For example, the timeliness indicator of the aforementioned news item may be 30/5.6 = 5.4, and the timeliness indicator of the securities research report may be 60/4.5 = 13.
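A small sketch reproducing this timeliness computation (the weights and effectiveness values are the illustrative ones from the examples above):

```python
def time_indicator(values, weights):
    """Weighted combination of the available time indicators (in days)
    divided by their count, mirroring the worked example above; results
    are rounded to one decimal as in the description."""
    weighted = sum(values[k] * weights[k] for k in values)
    return round(weighted / len(values), 1)

weights = {"publication": 0.4, "occurrence": 0.5, "acquisition": 0.1}

news_time = time_indicator(
    {"publication": 18, "occurrence": 17, "acquisition": 10}, weights)  # 5.6
report_time = time_indicator(
    {"publication": 20, "acquisition": 9}, weights)                     # 4.5

# Timeliness indicator = effectiveness indicator / time indicator.
print(round(30 / news_time, 1))    # 5.4 for the news item
print(round(60 / report_time, 1))  # 13.3, rounded to 13 in the text above
```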
In some embodiments, the screening module 240 may sort the type A entity triples and/or type B entity triples by the timeliness of their corresponding texts to be processed, from largest to smallest timeliness indicator, and take the triples whose rank is smaller than a first ranking threshold as the target entity triples. For example, suppose the sorted order is: the 1st type A entity triple [Company A, Company B, competition] = the 2nd type A entity triple [Company B, Company A, competition] = the 3rd type A entity triple [Company C, Company B, employment] = the 4th type A entity triple [Company B, Company C, employed] = the 1st type B entity triple [Company A, Company B, competition] = … = the 4th type B entity triple [Company C, Company B, commissioned processing] > the 5th type A entity triple [Company A, District F of City E of Province D, located in]. The screening module 240 may then take the 1st to 4th type A entity triples and the 1st to 4th type B entity triples, which are tied at rank 1 and whose rank is smaller than the first ranking threshold 2, as the target entity triples.
In some embodiments, the screening module 240 may take the type A entity triples and/or type B entity triples whose corresponding text's timeliness indicator is greater than a timeliness threshold as the target entity triples. For example, if the timeliness indicators of the texts to be processed corresponding to the aforementioned 1st to 4th type A entity triples and 1st to 4th type B entity triples are all 4.5, and the timeliness indicator of the text corresponding to the 5th type A entity triple [Company A, District F of City E of Province D, located in] is 3, the screening module 240 may take the 1st to 4th type A entity triples and the 1st to 4th type B entity triples, whose timeliness indicator is greater than the timeliness threshold 4, as the target entity triples.
In some embodiments, the first ranking threshold and the timeliness threshold may be determined based on the number of texts to be processed and the number of training iterations of the extraction model (the first extraction model and/or the second extraction model). It can be understood that the larger the number of texts to be processed and the more training iterations, the smaller the first ranking threshold and the larger the timeliness threshold.
Some embodiments of the present description determine the target entity triples based on the timeliness of the texts to be processed corresponding to the entity triples (i.e., the type A entity triples and/or type B entity triples), which helps keep the target entity triples current.
In some embodiments, the target entity triples are obtained based on the number of occurrences of the type A entity triples and/or type B entity triples in the text to be processed.
In some embodiments, the screening module 240 may take the type A entity triples and/or type B entity triples whose number of occurrences in the text to be processed is greater than a frequency threshold as the target entity triples.
For example, the type A and/or type B entity triples extracted from the texts to be processed "Company A is located in District F of City E of Province D" and "Company A's main competitors are Company B and Company C, and Company B is also Company C's contract manufacturer" include the aforementioned 1st to 5th type A entity triples and 1st to 4th type B entity triples. Among them, [Company A, Company B, competition] and [Company B, Company A, competition] each occur twice (once as a type A entity triple and once as a type B entity triple), while the remaining triples each occur once. The screening module 240 may therefore take [Company A, Company B, competition] and [Company B, Company A, competition], whose occurrence count is greater than the frequency threshold 1, as the target entity triples.
In some embodiments, the frequency threshold may be determined based on the number of texts to be processed and the number of training iterations of the extraction model (the first extraction model and/or the second extraction model). It can be understood that the larger the number of texts to be processed and the more training iterations, the larger the frequency threshold.
Some embodiments of the present description determine the target entity triples based on the number of occurrences of the entity triples (i.e., the type A entity triples and/or type B entity triples) in the text to be processed, which gives the target entity triples practical relevance.
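A minimal sketch of this frequency-based screening (the threshold value is the illustrative one used above):

```python
from collections import Counter

def screen_by_frequency(triples, frequency_threshold=1):
    """Keep triples whose occurrence count across all extracted results
    exceeds the frequency threshold. `triples` is a list of
    (entity, entity, relation) tuples, possibly with repeats."""
    counts = Counter(triples)
    return [t for t, n in counts.items() if n > frequency_threshold]

extracted = [
    ("Company A", "Company B", "competition"),  # 1st type A entity triple
    ("Company B", "Company A", "competition"),  # 2nd type A entity triple
    ("Company C", "Company B", "employment"),   # 3rd type A entity triple
    ("Company A", "Company B", "competition"),  # 1st type B entity triple
    ("Company B", "Company A", "competition"),  # 2nd type B entity triple
]
print(screen_by_frequency(extracted))  # the two "competition" triples survive
```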
In some embodiments, the target entity triples are obtained according to the scoring result of the scoring model on the type a entity triples and/or the type B entity triples.
In some embodiments, the inputs of the scoring model may include type a entity triplets and/or type B entity triplets, and the outputs may be the scoring results corresponding to the type a entity triplets and/or type B entity triplets.
In some embodiments, the scoring model may include, but is not limited to, a Text Rank model, a Logistic regression model, a naive bayes classification model, a gaussian distributed bayes classification model, a decision tree model, a random forest model, a KNN classification model, a neural network model, and the like.
For example, the scoring model may process the aforementioned 1st to 5th type A entity triples and 1st to 4th type B entity triples, obtaining scoring results for the respective triples such as 0.8, 0.4, 0.3, 0.7, … for the type A entity triples and 0.8, 0.6, …, 0.2 for the type B entity triples.
In some embodiments, the inputs to the scoring model may also include timeliness indicators and the number of occurrences of the type a entity triplets and/or the type B entity triplets in the text to be processed.
Further, the type A entity triples and/or type B entity triples whose scoring result exceeds a score threshold and/or whose scoring-result rank is smaller than a second ranking threshold may be taken as the target entity triples.
For example, the type A entity triples and/or type B entity triples whose scoring result exceeds the score threshold 0.5, namely the 1st type A entity triple [Company A, Company B, competition], the 2nd type A entity triple [Company B, Company A, competition], the 3rd type A entity triple [Company C, Company B, employment], the 5th type A entity triple [Company A, District F of City E of Province D, located in], the 1st type B entity triple [Company A, Company B, competition], and the 2nd type B entity triple [Company B, Company A, competition], may be taken as the target entity triples.
For another example, the type A entity triples and/or type B entity triples may be ordered by scoring result and those whose rank is smaller than the second ranking threshold 4 may be taken as the target entity triples. Suppose the ordering by scoring result is: the 1st type A entity triple [Company A, Company B, competition] = the 2nd type A entity triple [Company B, Company A, competition] = the 1st type B entity triple [Company A, Company B, competition] = the 2nd type B entity triple [Company B, Company A, competition] > the 5th type A entity triple [Company A, District F of City E of Province D, located in] > … > the 3rd type B entity triple [Company B, Company C, contract manufacturing]. The 1st and 2nd type A entity triples and the 1st and 2nd type B entity triples, which are tied at rank 1 and whose rank is smaller than the second ranking threshold 4, are then taken as the target entity triples.
In some embodiments, the second ranking threshold and the score threshold may be determined based on the number of texts to be processed and the number of training iterations of the extraction model (the first extraction model and/or the second extraction model). It can be understood that the larger the number of texts to be processed and the more training iterations, the smaller the second ranking threshold and the larger the score threshold.
Some embodiments of the present description obtain the target entity triplet based on the scoring model, so that the obtained target entity triplet can be evaluated from multiple dimensions, and the accuracy of the extraction result is improved.
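A small sketch of this score-based screening step (the 0.5 score threshold matches the example above; the `score` interface of the scoring model is an illustrative assumption):

```python
def screen_by_score(triples, scoring_model, score_threshold=0.5,
                    rank_threshold=None):
    """Score each candidate triple, keep those above the score threshold,
    and optionally also require the rank to be smaller than a second
    ranking threshold (ties are not specially handled in this sketch)."""
    scored = sorted(((t, scoring_model.score(t)) for t in triples),
                    key=lambda pair: pair[1], reverse=True)
    kept = [t for t, s in scored if s > score_threshold]
    if rank_threshold is not None:
        kept = kept[: rank_threshold - 1]  # ranks 1 .. rank_threshold - 1
    return kept
```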
In some embodiments, the training module 250 may train the scoring model individually based on a number of first training samples with first training labels. Specifically, a first training sample with a first training label is input into the scoring model, and parameters of the scoring model are updated through training. In some embodiments, the first training sample may be a sample entity triplet. In some embodiments, the first training label may be a manually labeled true (1) or false (0). For example, when the sample entity triplet in the first training sample is indeed the target entity triplet, the first training label may be true or 1, and when the sample entity triplet in the first training sample is not the target entity triplet, the first training label may be false or 0.
In some embodiments, the screening module 240 may further send the type a entity triplets and/or the type B entity triplets to the user terminal 130 through the network 140, and further, the screening module 240 may receive the user-selected type a entity triplets and/or the type B entity triplets from the user terminal 130 through the network 140 as target entity triplets. In some embodiments, the number of times the filtering module 240 determines the target entity triplet in conjunction with the user interaction may be determined based on the number of texts to be processed and the number of training times of the extraction model. It is to be appreciated that the filtering module 240 may increase the number of times that the type a entity triplets and/or the type B entity triplets are sent to the user terminal 130 when the amount of text to be processed is small and/or the number of times the model is trained is small. In some embodiments, the screening module 240 may also determine the timing for sending the type a entity triplets and/or the type B entity triplets to the user terminal 130 based on the user's selection.
It can be understood that, compared with the manual extraction of sample entity triples from a large amount of sample texts as training labels, the manual labeling of the embodiment only needs to judge whether the extracted type a entity triples and/or type B entity triples are target entity triples, so that the human resources and the time cost are saved. Some embodiments of the present description filter target entity triples in combination with user interaction, and may also appropriately guide a filtering result based on user settings, thereby improving accuracy of an extraction result while saving human resources.
FIG. 9 is a schematic diagram of a text processing method according to some embodiments of the present description. As shown in fig. 9, after the text processing system 100 obtains the type a entity triplet and the type B entity triplet from the text to be processed by using the first extraction model and the second extraction model, the target entity triplet is screened from the type a entity triplet and the type B entity triplet based on the screening rule, and the first extraction model and/or the second extraction model may be trained based on the target entity triplet.
In some embodiments, the training module 250 may train the first extraction model and/or the second extraction model using the text to be processed as a training sample and the target entity triplet as a training label.
For example, the training module 250 may use the text to be processed "A is located in the F area of E city, D province" as second training sample 1, and use the corresponding target entity triple [A, D province E city F area, Located] as the second training label of second training sample 1; and may use the text to be processed "the main competitors of A are B and C, and B is also a generation processing factory of C" as second training sample 2, and use the corresponding target entity triples [A, B, Competition], [A, C, Competition], [B, A, Competition], [C, A, Competition], and [B, C, Generation processing] as the second training labels of second training sample 2; and so on.
In some embodiments, the training module 250 may train the first extraction model alone.
Specifically, a second training sample with a second training label is input into an initial first extraction model; the initial first extraction model processes the second training sample to output a type A entity triple; and the parameters of the initial first extraction model are adjusted according to the difference between the second training label and the type A entity triple, until the intermediate first extraction model obtained in training meets a preset condition, so as to obtain the trained first extraction model. The preset condition may be that the loss function is smaller than a threshold, that the loss converges, or that the training period reaches a threshold.
In some embodiments, the training module 250 may train the second extraction model separately.
Specifically, a second training sample with a second training label is input into an initial second extraction model; the initial second extraction model processes the second training sample to output a type B entity triple; and the parameters of the initial second extraction model are adjusted according to the difference between the second training label and the type B entity triple, until the intermediate second extraction model obtained in training meets a preset condition, so as to obtain the trained second extraction model. The preset condition may be that the loss function is smaller than a threshold, that the loss converges, or that the training period reaches a threshold.
In some embodiments, the training module 250 may jointly train the first extraction model, the second extraction model, and the scoring model.
Specifically, a second training sample with a second training label is input into an initial first extraction model and an initial second extraction model respectively; the second training sample is processed by the initial first extraction model, the initial second extraction model, and the initial scoring model to obtain the target entity triple output by the initial scoring model; and the parameters of the initial first extraction model, the initial second extraction model, and the initial scoring model are adjusted according to the difference between the second training label and the target entity triple, until the intermediate first extraction model, the intermediate second extraction model, and the intermediate scoring model obtained in training meet a preset condition, so as to obtain the trained first extraction model, second extraction model, and scoring model. The preset condition may be that the loss function is smaller than a threshold, that the loss converges, or that the training period reaches a threshold.
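The preset stopping condition shared by the training procedures above can be sketched as follows; `train_epoch` and the threshold values are placeholders, and this is an illustration rather than the disclosed implementation.

```python
# Sketch of the preset stopping condition shared by the training procedures
# above: stop when the loss falls below a threshold, when it converges, or
# when the training period reaches a limit. `train_epoch` is a placeholder
# for one pass over the second training samples.
def train_until_preset_condition(train_epoch, loss_threshold=1e-3,
                                 converge_eps=1e-5, max_epochs=100):
    prev_loss = float("inf")
    for _ in range(max_epochs):               # training period reaches threshold
        loss = train_epoch()
        if loss < loss_threshold:             # loss function smaller than threshold
            break
        if abs(prev_loss - loss) < converge_eps:  # convergence
            break
        prev_loss = loss
```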
In some embodiments, the initial first and second extraction models may be models trained on a small number of manually labeled sample entity triples.
Some embodiments of the present description combine a first extraction model with predefined relationships and a second extraction model with open relationships to extract target entity triples from the text to be processed, and train the first extraction model and the second extraction model using the target entity triples as training data. On the one hand, the two extraction models can thereby learn from each other, so that the extraction results output by the trained models combine higher accuracy with a wider range of application; on the other hand, the initial first extraction model and the initial second extraction model, trained on a small number of manually labeled sample entity triples, can undergo unsupervised learning, saving human resources and the time cost of labeling.
FIG. 4 is an exemplary flow diagram illustrating a method for obtaining at least one type A entity triple using the first extraction model according to some embodiments of the present description. In particular, the method of FIG. 4 may be performed by the class A extraction module 220.
As shown in fig. 5, the first extraction model may include: a first entity extraction layer 510, a first joint coding layer 520, a first annotation sequence layer 530, and an entity identification layer 540.
As shown in fig. 4, the method 400 for obtaining at least one type A entity triple using the first extraction model may include:

Step 410, acquiring a first joint code of the first entity and the text to be processed.
In some embodiments, the first joint encoding layer 520 may obtain the first joint encoding based on the feature vector and the first entity vector of the text to be processed.
The feature vector of the text to be processed may be a vector characterizing features of the text to be processed. In some embodiments, the first entity extraction layer 510 may obtain a feature vector of the text to be processed based on the text to be processed. For a detailed description of the feature vector of the text to be processed, refer to fig. 8 and its related description, which are not repeated herein.
As previously mentioned, the first entity may be an entity extracted from the text to be processed. In some embodiments, the first entity extraction layer 510 may extract the first entities in the text to be processed. Specifically, the first entity extraction layer 510 may obtain a text annotation sequence corresponding to the text to be processed, and then extract the first entities based on the text annotation sequence. As shown in FIG. 5, the first entity extraction layer 510 may obtain the corresponding first entities "A", "B", … based on the text annotation sequence "B-co", "I-co", "O", "O", …, "B-co", …. For a detailed description of extracting the first entity, reference may be made to the related description of step 320, which is not repeated here.
The first entity vector may be a vector of features characterizing the first entity. In some embodiments, the first extraction model may obtain a first entity vector corresponding to the first entity based on a word and/or word feature vector corresponding to the first entity in the feature vector of the text to be processed.
In some embodiments, the first extraction model may pool the character and/or word feature vectors corresponding to the first entity in the feature vector of the text to be processed, so as to obtain the first entity vector. Pooling reduces the size of the data by representing a particular region of the data with, for example, the average, minimum, and/or maximum of the values in that region. Accordingly, in some embodiments, pooling may include, but is not limited to, average pooling, minimum pooling, maximum pooling, and the like.
For example, the first extraction model may perform average pooling on the elements at the same position in the plurality of character and/or word feature vectors corresponding to the first entity, so as to obtain a first entity vector with the same dimension as each character and/or word feature vector. As shown in fig. 5, the first extraction model may obtain, from the feature vector [Ta1], [Ta2], [Tb1], [Tb2], … of the text to be processed, the character feature vectors [Ta1] and [Ta2] corresponding to the first entity "A", and then average the elements at the same position of [Ta1] and [Ta2] to obtain the first entity vector [Ta] of "A". For example, if [Ta1] = [2,4,6] and [Ta2] = [4,6,8], then [Ta] = [3,5,7]. For a detailed description of the character and/or word feature vectors, reference may be made to fig. 8 and its related description, which are not repeated here.
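The average-pooling step above may be sketched as follows, reproducing the numerical example; the array shapes are toy values for illustration.

```python
import numpy as np

# Sketch of average pooling: average the elements at the same position in
# the character feature vectors of the first entity, reproducing the
# numerical example above; shapes are toy values for illustration.
def entity_vector(char_vectors):
    return np.mean(np.stack(char_vectors), axis=0)

t_a1 = np.array([2, 4, 6])    # feature vector of the 1st character of entity "A"
t_a2 = np.array([4, 6, 8])    # feature vector of the 2nd character of entity "A"
print(entity_vector([t_a1, t_a2]))   # -> [3. 5. 7.]
```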
The first joint encoding may be a vector fusing the features of any one of the first entities and the features of the text to be processed. In some embodiments, the first joint encoding layer 520 may encode any one of the first entity vector and the feature vector of the text to be processed, to obtain the first joint encoding.
In some embodiments, the first joint coding layer 520 may fuse any one of the first entity vectors with each word and/or word feature vector in the feature vectors of the text to be processed, so as to obtain the first joint coding.
In some embodiments, the manner of fusion may include, but is not limited to, a combination of one or more of addition, averaging, weighted summation, and the like. For example, as shown in fig. 5, the first joint coding layer 520 may add the first entity vector [Ta] corresponding to the first entity "A" to each of [Ta1], [Ta2], [Tb1], [Tb2], … in the feature vector of the text to be processed, so as to obtain the first joint code [Ua1], [Ua2], [Ub1], [Ub2], …, where [Ua1] = [Ta1] + [Ta], [Ua2] = [Ta2] + [Ta], [Ub1] = [Tb1] + [Ta], [Ub2] = [Tb2] + [Ta], ….
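The fusion-by-addition variant may be sketched as follows; the sequence length and feature dimension are toy values chosen for illustration.

```python
import numpy as np

# Sketch of fusion by addition: add the first entity vector to every
# character feature vector of the text, giving [U_i] = [T_i] + [Ta].
# Sequence length and dimension are toy values chosen for illustration.
def first_joint_code(text_features, entity_vec):
    # text_features: (seq_len, dim); entity_vec: (dim,)
    return text_features + entity_vec          # broadcast addition

text_features = np.random.rand(23, 8)          # 23 characters, 8-dim features
entity_vec = np.random.rand(8)                 # first entity vector [Ta]
joint_code = first_joint_code(text_features, entity_vec)
```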
Some embodiments of the present description encode the first entity vector and the feature vector of the text to be processed to obtain the first joint code, so that the first joint code simultaneously includes information of the first entity, information of the text to be processed, and relationship information between the first entity and the text to be processed, and accuracy of extracting a second entity corresponding to each predefined relationship in a subsequent first extraction model can be improved.
In some embodiments, the first joint encoding layer 520 may be a feed-forward neural network. The feedforward neural network can fuse the first entity vector and the feature vector of the text to be processed through an activation function to obtain a first joint code.
It can be understood that a first entity extracted from the text to be processed may itself serve as the second entity of another first entity under some predefined relationship. Therefore, in some embodiments, after the first entity extraction layer 510 extracts a plurality of first entities from the text to be processed, it may be determined, based on the order of appearance and the character distance between each first entity and the other first entities, whether the current first entity can form a type A entity triple with another first entity as its second entity. For example, when a first entity appears at the end of the text and its distances to the other first entities all exceed a preset distance (e.g., 10 characters), it may be determined that this first entity has no corresponding second entity in the text to be processed.
Further, when the judgment result is negative, the extraction of a second entity corresponding to a predefined relationship based on this first entity is abandoned; when the judgment result is positive, the second entity corresponding to the predefined relationship is extracted based on this first entity, and the subsequent steps continue.
Some embodiments of the present description may determine in advance whether each first entity has a second entity corresponding to a predefined relationship, so as to improve the extraction efficiency.
Step 420, acquiring an entity annotation sequence of the text to be processed corresponding to each predefined relationship based on the first joint code.
As previously described, the predefined relationship may be a predefined relationship based on the entity type.
In some embodiments, each first entity may correspond to at least one predefined relationship. For example, continuing the foregoing example, the predefined relationships corresponding to the entity type "company" of the first entity "A" may include competition, cooperation, employment, controlled, located, registered, etc., and the predefined relationships corresponding to the first entity "B" may likewise include competition, cooperation, employment, controlled, located, registered, etc.
The entity annotation sequence may be the result of arranging, in order, a plurality of entity labels respectively corresponding to the characters or words in the text to be processed. In some embodiments, an entity label may be used to indicate whether the corresponding character or word in the text to be processed belongs to a second entity. Further, in some embodiments, entity labels may be used to indicate the characters and/or words corresponding to each predefined relationship. Illustratively, the entity labels may be further classified into "competition" relationship entity labels, "cooperation" relationship entity labels, and the like, based on the type of predefined relationship corresponding to the first entity, so as to further indicate the predefined relationship to which the corresponding character or word corresponds. Thus, the entity annotation sequence can be used to mark the characters or words in the text to be processed that belong to a second entity, together with the predefined relationship to which they correspond.
In some embodiments, the entity labels may be at least one of Chinese characters, numbers, letters, symbols, and the like. For example, the first character or first word of a second entity may be represented by B, and a non-first character or non-first word of the second entity by I. As another example, the predefined relationships "cooperation" and "competition" may be denoted by r1 and r2, respectively.
In some embodiments, each predefined relationship may correspond to a set of entity annotation sequences. For example, the entity labels B-r1 or I-r1 may mark characters or words belonging to a second entity whose predefined relationship in the text to be processed is "cooperation". As another example, the entity labels B-r2 or I-r2 may mark characters or words belonging to a second entity whose predefined relationship is "competition" in the text to be processed.
In some embodiments, the first annotation sequence layer 530 may annotate the first joint code to obtain the entity annotation sequence of the text to be processed corresponding to each predefined relationship. As shown in FIG. 5, the first annotation sequence layer 530 may, based on the first joint code [Ua1], [Ua2], [Ub1], [Ub2], …, mark "B" in the text to be processed "the main competitors of A are B and C, and B is also a generation processing factory of C" as "B-r2" in entity annotation sequence 2, which represents "the first character of the second entity corresponding to the predefined relationship competition".
In some embodiments, the entity annotations may also include non-predefined-relationship labels. The non-predefined-relationship labels may likewise be at least one of Chinese characters, numbers, letters, symbols, and the like. Characters or words in the text to be processed that do not belong to a second entity corresponding to a predefined relationship may be marked with the same non-predefined-relationship label. As shown in fig. 5, the first annotation sequence layer 530 marks the characters in the text to be processed that do not belong to a second entity corresponding to the predefined relationship "cooperation" with "O" in entity annotation sequence 1. In some embodiments, characters or words in the text to be processed that do not belong to a second entity corresponding to a predefined relationship may also be left without any label.
Specifically, the first labeling sequence layer 530 may obtain, based on the first joint encoding, a probability that each word or phrase in the text to be processed belongs to the second entity corresponding to each predefined relationship and a probability that each word or phrase does not belong to the second entity corresponding to any predefined relationship, and then use a label corresponding to a maximum value of the probabilities as an entity label of the word or phrase.
Taking fig. 5 as an example, the first annotation sequence layer 530 may, based on [Ua1] in the first joint code, obtain that the probability that "A" belongs to the first character of the second entity corresponding to the predefined relationship "cooperation" is 0.2, the probability that "A" belongs to a non-first character of that second entity is 0.2, and the probability that "A" does not belong to the second entity corresponding to the predefined relationship "cooperation" is 0.6, and then use the non-predefined-relationship label "O" corresponding to the maximum probability 0.6 as the entity label of the character "A".
Similarly, as shown in fig. 5, the first annotation sequence layer 530 may obtain the entity annotation, in entity annotation sequence 1, of each character or word in the text to be processed "the main competitors of A are B and C, and B is also a generation processing factory of C", and arrange them in the order of the characters or words in the text to be processed, so as to obtain entity annotation sequence 1: "O", "O", …, "O", "O"; and likewise arrange the entity annotations in entity annotation sequence 2 in the order of the characters or words in the text to be processed, so as to obtain entity annotation sequence 2: "O", "O", …, "O", "B-r2".
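The maximum-probability labeling described above may be sketched as follows; the label set and the probability matrix are illustrative assumptions.

```python
import numpy as np

# Sketch of maximum-probability labeling: one row of label probabilities per
# character, the label with the largest probability wins. The label set and
# the probability values are illustrative assumptions.
LABELS = ["B-r2", "I-r2", "O"]    # labels for the predefined relationship r2

def annotation_sequence(prob_matrix):
    # prob_matrix: (seq_len, n_labels)
    return [LABELS[int(np.argmax(row))] for row in prob_matrix]

probs = np.array([[0.2, 0.2, 0.6],    # character "A" -> "O"
                  [0.7, 0.2, 0.1]])   # a second-entity first character -> "B-r2"
print(annotation_sequence(probs))     # ['O', 'B-r2']
```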
In some embodiments, the first annotation sequence layer 530 can include, but is not limited to, one of an N-Gram (N-Gram) Model, a Conditional Random Field (CRF) Model, and a Hidden Markov Model (HMM).
Step 430, extracting the second entity corresponding to each predefined relationship according to the entity annotation sequence of the text to be processed corresponding to each predefined relationship.
In some embodiments, the entity identification layer 540 may extract the second entity corresponding to each predefined relationship.
Specifically, the entity identification layer 540 may take the characters and/or words corresponding to the entity labels of each predefined relationship in the entity annotation sequence as the characters and/or words of the second entity corresponding to that predefined relationship. For example, as shown in fig. 5, the entity identification layer 540 may determine, based on the entity label "B-r2" in entity annotation sequence 2, that the second entity corresponding to the predefined relationship "competition" in the text to be processed is "B", i.e., the second entity forming a "competition" relationship with the first entity "A" in the text to be processed may be "B".
It can be understood that, in some embodiments, a predefined relationship may have no corresponding second entity. For example, as shown in fig. 5, the entity identification layer 540 may determine, based on the non-predefined-relationship labels "O", "O", …, "O", "O" in entity annotation sequence 1, that the predefined relationship "cooperation" has no corresponding second entity in the text to be processed, i.e., there is no second entity forming a "cooperation" relationship with the first entity "A" in the text to be processed.
Further, the class A extraction module 220 may combine the first entity, a predefined relationship corresponding to the first entity, and the second entity corresponding to that predefined relationship in the text to be processed into a type A entity triple. For example, the class A extraction module 220 may combine the first entity "A", the predefined relationship "competition" corresponding to "A", and the second entity "B" corresponding to the predefined relationship "competition" into the group 1 type A entity triple [A, B, Competition].
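Decoding an entity annotation sequence into second entities and composing type A entity triples may be sketched as follows; the B-/I- decoding convention follows the figures, and the helper names are assumptions.

```python
# Sketch of decoding an entity annotation sequence into second entities and
# composing type A entity triples. The B-/I- convention follows the figures;
# the helper names are assumptions.
def decode_second_entities(text, annotations, relation_tag):
    """Collect characters labeled B-<tag>/I-<tag> into second entities."""
    entities, current = [], ""
    for ch, label in zip(text, annotations):
        if label == f"B-{relation_tag}":
            if current:
                entities.append(current)
            current = ch
        elif label == f"I-{relation_tag}" and current:
            current += ch
        else:
            if current:
                entities.append(current)
            current = ""
    if current:
        entities.append(current)
    return entities

def compose_type_a_triples(first_entity, relation, second_entities):
    return [[first_entity, e, relation] for e in second_entities]

# e.g. compose_type_a_triples("A", "competition", ["B"]) -> [["A", "B", "competition"]]
```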
The above embodiment provides an implementation structure of the first extraction model. In some embodiments, the first extraction model may also be implemented based on an end-to-end model, such as a BERT-based multi-head selection model, Stanford NLP, or LTP.
Some embodiments of the present description utilize the first extraction model to obtain the type A entity triples based on the predefined relationships, so that the first entity and the second entity in each type A entity triple necessarily satisfy, or are close to, a predefined relationship, which may improve the accuracy of the type A entity triples.
FIG. 6 is an exemplary flow diagram illustrating the acquisition of multiple type B entity triples using the second extraction model according to some embodiments of the present description. In particular, the method of FIG. 6 may be performed by the class B extraction module 230.
As shown in fig. 7, the second extraction model may include: a second entity extraction layer 710, a label coding layer 720, a second joint coding layer 730, and a second label sequence layer 740.
As shown in fig. 6, the method 600 for obtaining a plurality of type B entity triples using the second extraction model may include:

Step 610, adding a first tag and a second tag to each third entity in the text to be processed to obtain a tag text, and obtaining a corresponding tag text representation vector based on the tag text.
As previously mentioned, the third entity may be an entity extracted from the text to be processed.
In some embodiments, the second entity extraction layer 710 may extract a third entity in the text to be processed. Specifically, the second entity extraction layer 710 may obtain a text labeling sequence corresponding to the text to be processed, and then extract a third entity based on the text labeling sequence.
As shown in FIG. 7, the second entity extraction layer 710 may obtain the corresponding third entities "A", "B", … based on the text annotation sequence "B-co", "I-co", "O", "O", …, "B-co", …. For a detailed description of extracting the third entity, reference may be made to the related description of step 330, which is not repeated here.
The first tag and the second tag may be used to indicate the first word and the last word of a third entity, respectively. In some embodiments, the first tag and/or the second tag may be a number (e.g., 1, 2), a Chinese character, a letter (e.g., a, b), or another symbol, and combinations thereof. For example, the first tag and the second tag of the first third entity in the text to be processed may be "label1" and "label2", respectively, and the first tag and the second tag of the second third entity may be "label3" and "label4", respectively.
The tag text may be the text to be processed that contains the tag.
In some embodiments, the second extraction model may add the first tag and the second tag before and after each third entity in the text to be processed, respectively, to obtain the tag text. For example, as shown in fig. 7, the second extraction model may add the first tag "label1" and the second tag "label2" before and after the first third entity "A" in the text to be processed "the main competitors of A are B …", and add the first tag "label3" and the second tag "label4" before and after the second third entity "B", …, so as to obtain the tag text: "The main competitors of label1 A label2 are label3 B label4 …".
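The tag-insertion step may be sketched as follows, assuming the character offsets of the third entities are already known; the function name and the offset representation are hypothetical.

```python
# Sketch of the tag-insertion step: wrap each third entity with a first tag
# and a second tag. Entity character offsets are assumed known, sorted, and
# non-overlapping; the function name is hypothetical.
def build_tag_text(text, entity_spans):
    pieces, prev, tag_id = [], 0, 1
    for start, end in entity_spans:
        pieces.append(text[prev:start])
        pieces.append(f"label{tag_id} {text[start:end]} label{tag_id + 1}")
        tag_id += 2
        prev = end
    pieces.append(text[prev:])
    return "".join(pieces)

# e.g. build_tag_text("The main competitors of A are B", [(24, 25), (30, 31)])
# -> "The main competitors of label1 A label2 are label3 B label4"
```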
The tag text representation vector may be a vector characterizing the tag text information.
In some embodiments, the second extraction model may utilize a word embedding model to obtain the tag text representation vector based on the tag text. For a detailed description of the word embedding model, reference may be made to fig. 8 and its related description, which are not repeated here. As shown in fig. 7, the second extraction model may obtain, based on the tag text, the tag text representation vector: [l1], [Ta1], [Ta2], [l2], [Tb1], …, [l3], [Tb2], [l4], ….
Step 620, acquiring a corresponding tag encoding vector based on the tag text representation vector.
The tag encoding vector may be a vector fusing the third entity information and the text information to be processed. It is understood that the tag encoding vector may include the text to be processed and an encoding vector corresponding to the tag, and the encoding vector corresponding to the tag and the text to be processed in the tag encoding vector includes features of other encoding vectors.
In some embodiments, the tag encoding layer 720 may encode the tag text representation vector to obtain the corresponding tag encoding vector. As shown in fig. 7, the tag encoding layer 720 may encode the tag text representation vector [l1], [Ta1], [Ta2], [l2], [Tb1], …, [l3], [Tb2], [l4], … to obtain the corresponding tag encoding vector [L1], [La1], [La2], [L2], [Lb1], …, [L3], [Lb2], [L4], ….
For example, the tag encoding layer 720 may be implemented by a BERT model or a Transformer.
Step 630, acquiring second joint codes corresponding to any two third entities according to the tag encoding vector.

The first tag vector may be the vector element corresponding to the first tag in the tag encoding vector. As shown in fig. 7, the first tag vector of the third entity "A" may be the vector element [L1] corresponding to the first tag "label1" in the tag encoding vector, and the first tag vector of the third entity "B" may be the vector element [L3] corresponding to the first tag "label3" in the tag encoding vector.
In some embodiments, the second extraction model may obtain at least one first tag vector corresponding to at least one first tag in the tag encoding vector. Specifically, the second extraction model may obtain the first tag vector from the tag encoding vector by position order, based on the position order of the first tag in the tag text. As shown in fig. 7, the first tag "label1" occupies the 1st position in the tag text, and the corresponding first tag vector is the vector element [L1] at the 1st position of the tag encoding vector.
The first tag fusion vector may be a vector in which any two pieces of third entity information and text information to be processed are fused.
In some embodiments, the second extraction model may obtain the first label fusion vector based on any two first label vectors corresponding to any two third entities. For example, the second extraction model may first splice any two first tag vectors, and then map the spliced any two first tag vectors into a first tag fusion vector by using the full connection layer, where a dimension of the first tag fusion vector is the same as a dimension of any one first tag vector.
As shown in fig. 7, the second extraction model may splice the first tag vector [L1] corresponding to the third entity "A" and the first tag vector [L3] corresponding to the third entity "B", and then map the spliced [L1] and [L3] into the first label fusion vector [L] using a fully connected layer.
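The splice-then-map step may be sketched as follows; the vector dimension and layer configuration are assumptions for illustration.

```python
import torch
import torch.nn as nn

# Sketch of the splice-then-map step: concatenate two first tag vectors and
# map them back to the original dimension with a fully connected layer.
# The dimension (768) is an assumption for illustration.
dim = 768
fuse = nn.Linear(2 * dim, dim)       # fully connected layer

L1 = torch.randn(dim)                # first tag vector of third entity "A"
L3 = torch.randn(dim)                # first tag vector of third entity "B"
label_fusion = fuse(torch.cat([L1, L3], dim=-1))   # first label fusion vector [L]
```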
The second joint code may be a vector fusing the features of any two third entities and the features of the text to be processed.
In some embodiments, the second joint encoding layer 730 may fuse the first tag fusion vector with each vector element in the tag text representation vector, respectively, to obtain the second joint encoding. In some embodiments, the manner of fusion may include, but is not limited to, a combination of one or more of addition, averaging, weighted summation, and the like.
Illustratively, as shown in fig. 7, the second joint coding layer 730 may add the first label fusion vector [L] to each of [l1], [Ta1], [Ta2], [l2], [Tb1], …, [l3], [Tb2], [l4], … in the tag text representation vector, so as to obtain the second joint code [V1], [Va1], [Va2], [V2], [Vb1], …, [V3], [Vb2], [V4], …, where [V1] = [l1] + [L], [Va1] = [Ta1] + [L], [Va2] = [Ta2] + [L], [V2] = [l2] + [L], [Vb1] = [Tb1] + [L], ….
In some embodiments, the second joint encoding layer 730 may be a feed-forward neural network. For a detailed description of the feedforward neural network, refer to step 410, which is not described herein.
Some embodiments of the present description obtain the second joint code based on encoding the first tag fusion vector and the tag text representation vector, so that the second joint code simultaneously includes information of two third entities, information of a text to be processed, relationship information between the two third entities, and relationship information between the two third entities and the text to be processed, and accuracy of extracting an open relationship between the two third entities in a subsequent second extraction model can be improved.
Step 640, acquiring the open relationship between any two third entities based on the second joint code.
In some embodiments, the second annotation sequence layer 740 can obtain the relationship annotation sequence corresponding to the tag text based on the second joint encoding.
The relationship annotation sequence may be the result of arranging, in order, a plurality of relation labels corresponding to the characters or words in the tag text. In some embodiments, each relation label may reflect whether the corresponding character and/or word in the tag text is a character and/or word corresponding to an open relationship.
As shown in fig. 7, characters and/or words in the tag text that do not correspond to an open relationship may be marked "O", where "O" indicates invalid or empty; characters and/or words in the tag text that correspond to an open relationship may be marked "B-r" and/or "I-r", where "B-r" and "I-r" denote the first character and a non-first character of the open relationship, respectively.
Specifically, the second annotation sequence layer 740 may obtain, based on the second joint code, the probability that each character or word in the tag text belongs to an open relationship and the probability that it does not belong to an open relationship, and then use the label corresponding to the maximum probability as the relation label of that character or word.
Taking fig. 7 as an example, the second annotation sequence layer 740 may, based on [Va1] in the second joint code, obtain that the probability that "A" belongs to the first character of an open relationship is 0.2, the probability that it belongs to a non-first character of an open relationship is 0.2, and the probability that it does not belong to an open relationship is 0.6, and then use the non-open-relationship label "O" corresponding to the maximum probability 0.6 as the relation label of the character "A". For another example, the second annotation sequence layer 740 may, based on [Vc1] in the second joint code, obtain that the probability that the first character of "competition" belongs to the first character of an open relationship is 0.7, the probability that it belongs to a non-first character of an open relationship is 0.2, and the probability that it does not belong to an open relationship is 0.1, and then use the open-relationship first-character label "B-r" corresponding to the maximum probability 0.7 as the relation label of that character.
Similarly, the second annotation sequence layer 740 may obtain the relation label of each character or word in the tag text "The main competitors of label1 A label2 are label3 B label4 …", and arrange them in the order of the characters or words in the tag text, so as to obtain the relationship annotation sequence: "O", "O", …, "B-r", "I-r", "O", ….
In some embodiments, the second annotation sequence layer 740 can include, but is not limited to, one of an N-Gram (N-Gram) Model, a Conditional Random Field (CRF) Model, and a Hidden Markov Model (HMM).
In some embodiments, the second extraction model may determine the open relationship between any two third entities in the text to be processed based on the relationship annotation sequence. For example, continuing with fig. 7, the second extraction model may obtain the open relationship "competition" between the third entities "A" and "B" based on the characters bearing the relation labels "B-r" and "I-r" in the aforementioned relationship annotation sequence corresponding to the text to be processed.
Further, the class B extraction module 230 may compose any two third entities and the open relationship between them into a type B entity triple. For example, the class B extraction module 230 may combine the third entities "A" and "B" and the open relationship "competition" between them into the group 1 type B entity triple [A, B, Competition].
The above embodiment shows an implementation structure of the second extraction model. In some embodiments, the second extraction model may also be implemented based on an end-to-end model, such as a BERT-based multi-head selection model, Stanford NLP, or LTP (a Chinese language processing toolkit).
Some embodiments of the present description extract any two third entities using the second extraction model and obtain the open relationship between the two third entities, so as to obtain the type B entity triples.
FIG. 8 is a block diagram illustrating an entity extraction layer according to some embodiments of the present description. In some embodiments, the entity extraction layer may be the first entity extraction layer and/or the second entity extraction layer.
As shown in fig. 8, the entity extraction layer (the first entity extraction layer and/or the second entity extraction layer) may include a word embedding layer 810, a feature extraction layer 820, and a text labeling layer 830.
Specifically, the word embedding layer 810 may obtain a text vector of the text to be processed.
The text vector of the text to be processed may be a vector characterizing the text information to be processed.
In some embodiments, before the word embedding layer 810 obtains the text vector of the text to be processed, the text to be processed may be preprocessed as follows: adding [CLS] before the text to be processed; and dividing the sentences in the text to be processed with the separator [SEP] to distinguish them. For example, the text to be processed "The main competitors of A are B and C, and B is also a generation processing factory of C" becomes, after processing, "[CLS] The main competitors of A are B and C [SEP] B is also a generation processing factory of C".
In some embodiments, the word embedding layer 810 may derive corresponding character vectors and position vectors, respectively, based on the text to be processed.
A character vector (token embedding) is a vector characterizing the character information of the text to be processed. As shown in FIG. 8, the 23 characters of the text to be processed "the main competitors of A are B and C, and B is also a generation processing factory of C" may be represented by 23 character vectors [wa1], [wa2], [wb1], [wb2], …, respectively. For example, the character information of the character "A" may be characterized by the character vector [2,3]; in a practical application scenario, the dimensionality of the vector representation may be much higher. In some embodiments, the character vector may be obtained by querying a word vector table or a word embedding model. In some embodiments, the word embedding model may include, but is not limited to: the word2vec model, the Term Frequency-Inverse Document Frequency (TF-IDF) model, the SSWE-C (Skip-Gram Based Combined-Sentiment Word Embedding) model, and the like.
The position vector (position embedding) is a vector reflecting the position of a character in the text to be processed, e.g., indicating that the character is the 1st character, the 2nd character, etc. In some embodiments, the position vector of the text to be processed may be obtained by sine-cosine encoding. In some embodiments, a segment vector (segment embedding) may also be included, reflecting the sentence (segment) in which the character is located, e.g., that the character "A" is in sentence 1 of the text to be processed.
In some embodiments, the word embedding layer 810 may fuse, e.g., concatenate or add, the various vectors of the text to be processed to obtain the text vector of the text to be processed. As shown in fig. 8, the word embedding layer 810 may obtain, based on the character vectors [wa1], [wa2], [wb1], [wb2], … and the position vectors (not shown), the text vector [ta1], [ta2], [tb1], [tb2], … of the text to be processed.
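The word embedding layer's combination of [CLS]/[SEP] preprocessing, character vectors, and sine-cosine position vectors may be sketched as follows; the vocabulary table, the dimension, and the encoding details are illustrative assumptions.

```python
import numpy as np

# Toy sketch of the word embedding layer: add [CLS]/[SEP], look up one
# character vector per character, and add a sine-cosine position vector.
# The vocabulary table, dimension, and encoding details are assumptions.
def embed(sentences, char_table, dim=8):
    chars = ["[CLS]"]
    for sent in sentences:
        chars.extend(list(sent) + ["[SEP]"])
    char_vecs = np.stack([char_table.get(c, np.zeros(dim)) for c in chars])
    pos = np.arange(len(chars))[:, None]
    i = np.arange(dim)[None, :]
    pos_vecs = np.where(i % 2 == 0,
                        np.sin(pos / 10000 ** (i / dim)),
                        np.cos(pos / 10000 ** ((i - 1) / dim)))
    return char_vecs + pos_vecs      # text vector = character + position vectors

# usage: text_vector = embed(["sentence one", "sentence two"], {}, dim=8)
```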
Further, the feature extraction layer 820 may obtain the feature vector of the text to be processed.
The feature vector of the text to be processed may be a vector characterizing features of the text to be processed.
In some embodiments, the feature vector of the text to be processed may include a word feature vector and/or a word feature vector corresponding to each word and/or word in the text to be processed. It is understood that the dimension of the feature vector of the text to be processed may be the same as the number of words and/or words in the text to be processed.
In some embodiments, the feature extraction layer 820 may encode the text vector of the text to be processed to obtain the feature vector of the text to be processed. As shown in fig. 8, the feature extraction layer 820 may encode the text vector [ta1], [ta2], [tb1], [tb2], … to obtain the feature vector [Ta1], [Ta2], [Tb1], [Tb2], … of the text to be processed, where [Ta1], [Ta2], [Tb1], [Tb2], … are the character feature vectors corresponding to the successive characters of the text to be processed.
An exemplary feature extraction layer may be implemented by a BERT model or a Transformer.
Still further, the text annotation layer 830 can obtain annotation sequences based on the feature vectors.
The text annotation sequence is the result of arranging, in order, a plurality of text labels respectively corresponding to the characters or words in the text to be processed. In some embodiments, a text label may be used to indicate whether the corresponding character or word in the text to be processed belongs to an entity; further, text labels may be divided into company entity labels, industry entity labels, and the like, so as to further indicate the entity type to which the corresponding character or word belongs. Therefore, the text annotation sequence can be used to mark the characters or words in the text to be processed that belong to entities, together with the entity types to which they belong.
In some embodiments, the text labels may be at least one of Chinese characters, numbers, letters, symbols, and the like. For example, the first character or first word of an entity may be represented by B, and a non-first character or non-first word of an entity by I. As another example, the text labels B-co or I-co may mark characters or words in the text to be processed whose entity type is "company subject". As another example, the text labels B-ind or I-ind may mark characters or words in the text to be processed whose entity type is "industry".
As shown in FIG. 8, the text annotation layer 830 may, based on the feature vectors [Ta1], [Ta2], [Tb1], [Tb2], …, label the entities in the text to be processed "the main competitors of A are B and C, and B is also a generation processing factory of C" as "B-co, I-co, …, B-co, …", denoting "first character of a company subject, non-first character of a company subject, …, first character of a company subject, …" respectively.
In some embodiments, the text annotations may also include non-entity labels. The non-entity labels may likewise be at least one of Chinese characters, numbers, letters, symbols, and the like. Characters or words in the text to be processed that do not belong to any entity may be marked with the same non-entity label. As shown in fig. 8, the text annotation layer 830 marks the characters of the phrase "the main competitors of … are", which do not belong to any entity, with seven "O" labels. In some embodiments, characters or words in the text to be processed that do not belong to any entity may also be left unmarked.
Specifically, the text labeling layer 830 may obtain, based on the feature vector, probabilities that each word or word in the text to be processed belongs to different entity types and probabilities that each word or word does not belong to any entity, and then use an entity label of an entity type corresponding to the maximum probability value or a non-entity label that does not belong to an entity as the text label of the word or word.
Taking FIG. 8 as an example, the text annotation layer 830 may, based on the feature vector [Ta1], obtain that the probability that the first character of "A" belongs to the first character of a company subject is 0.8, the probability that it belongs to a non-first character of a company subject is 0.5, the probability that it belongs to the first character of a person is 0.3, the probability that it belongs to a non-first character of a person is 0.3, the probability that it belongs to the first character of an industry is …, and the probability that it does not belong to any entity is 0.2; the entity label "B-co" of the first character of the entity type "company subject", corresponding to the maximum probability 0.8, is then used as the text label of that character.
Similarly, the text annotation layer 830 may obtain the text label of each character or word in the text to be processed "the main competitors of A are B and C, and B is also a generation processing factory of C", and arrange them in the order of the characters or words in the text to be processed, so as to obtain the text annotation sequence: "B-co", "I-co", "O", …, "O", "B-co".
In some embodiments, the text annotation layer 830 can include, but is not limited to, one of an N-Gram (N-Gram) Model, a Conditional Random Field (CRF) Model, and a Hidden Markov Model (HMM).
The embodiment of the specification also provides a computer readable storage medium. The storage medium stores computer instructions, and after the computer reads the computer instructions in the storage medium, the computer realizes the text processing method.
The beneficial effects that may be brought by the embodiments of the present description include, but are not limited to:
(1) a first extraction model with predefined relationships and a second extraction model with open relationships are combined to extract target entity triples from the text to be processed, and the target entity triples are used as training data to train the first extraction model and the second extraction model, so that, on the one hand, the two extraction models can learn from each other and the extraction results output by the trained models combine higher accuracy with a wider range of application, and, on the other hand, the initial first extraction model and the initial second extraction model, trained on a small number of manually labeled sample entity triples, can undergo unsupervised learning, saving human resources and the time cost of labeling;
(2) the first entity vector and the feature vector of the text to be processed are encoded to obtain the first joint code, so that the first joint code simultaneously contains the information of the first entity, the information of the text to be processed, and the relationship information between the first entity and the text to be processed, which can improve the accuracy with which the first extraction model subsequently extracts the second entity corresponding to each predefined relationship;
(3) the first label fusion vector and the tag text representation vector are encoded to obtain the second joint code, so that the second joint code simultaneously contains the information of the two third entities, the information of the text to be processed, and the relationship information between the two third entities and between them and the text to be processed, which can improve the accuracy with which the second extraction model subsequently extracts the open relationship between the two third entities;
(4) target entity triples are screened from multiple dimensions based on the timeliness of the text to be processed, the number of occurrences of the type A entity triples and/or type B entity triples in the text to be processed, and the scoring results of the scoring model, which can improve the timeliness, practicability, and richness of the target entity triples.
It is to be noted that different embodiments may produce different advantages, and in different embodiments, any one or combination of the above advantages may be produced, or any other advantages may be obtained.
Having thus described the basic concept, it will be apparent to those skilled in the art that the foregoing detailed disclosure is to be regarded as illustrative only and not as limiting the present specification. Various modifications, improvements and adaptations to the present description may occur to those skilled in the art, although not explicitly described herein. Such modifications, improvements and adaptations are proposed in the present specification and thus fall within the spirit and scope of the exemplary embodiments of the present specification.
Also, the description uses specific words to describe embodiments of the description. Reference throughout this specification to "one embodiment," "an embodiment," and/or "some embodiments" means that a particular feature, structure, or characteristic described in connection with at least one embodiment of the specification is included. Therefore, it is emphasized and should be appreciated that two or more references to "an embodiment" or "one embodiment" or "an alternative embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, some features, structures, or characteristics of one or more embodiments of the specification may be combined as appropriate.
Moreover, those skilled in the art will appreciate that aspects of the present description may be illustrated and described in terms of several patentable species or situations, including any new and useful combination of processes, machines, manufacture, or materials, or any new and useful improvement thereof. Accordingly, aspects of this description may be performed entirely by hardware, entirely by software (including firmware, resident software, micro-code, etc.), or by a combination of hardware and software. The above hardware or software may be referred to as a "data block," "module," "engine," "unit," "component," or "system." Furthermore, aspects of the present description may be represented as a computer product, including computer readable program code, embodied in one or more computer readable media.
For each patent, patent application publication, and other material, such as articles, books, specifications, publications, documents, etc., cited in this specification, the entire contents of each are hereby incorporated by reference into this specification. Except where the application history document does not conform to or conflict with the contents of the present specification, it is to be understood that the application history document, as used herein in the present specification or appended claims, is intended to define the broadest scope of the present specification (whether presently or later in the specification) rather than the broadest scope of the present specification. It is to be understood that the descriptions, definitions and/or uses of terms in the accompanying materials of this specification shall control if they are inconsistent or contrary to the descriptions and/or uses of terms in this specification.
Finally, it should be understood that the embodiments described herein are merely illustrative of the principles of the embodiments of the present disclosure. Other variations are also possible within the scope of the present description. Thus, by way of example, and not limitation, alternative configurations of the embodiments of the specification can be considered consistent with the teachings of the specification. Accordingly, the embodiments of the present description are not limited to only those embodiments explicitly described and depicted herein.
Claims (10)
1. A method of text processing, the method comprising:
acquiring a text to be processed;
extracting a first entity from the text to be processed by using a first extraction model, and extracting a second entity meeting a predefined relationship from the text to be processed based on the first entity to obtain at least one type A entity triple; wherein each said class A entity triplet includes said first entity, said second entity, and a predefined relationship between said first entity and said second entity;
extracting a plurality of third entities from the text to be processed by using a second extraction model, and determining an open relationship between any two third entities to obtain a plurality of class B entity triples; each type B entity triplet comprises two third entities and an open relationship between the two third entities;
and acquiring target entity triples from the type A entity triples and the type B entity triples based on a screening rule.
2. The method of claim 1, further comprising:
and taking the text to be processed as a training sample, taking the target entity triple as a training label, and training the first extraction model and/or the second extraction model.
3. The method of claim 1, the extracting, from the text to be processed based on the first entity, a second entity that satisfies a predefined relationship, comprising:
acquiring a first joint code of the first entity and the text to be processed;
acquiring an entity labeling sequence of the text to be processed corresponding to each predefined relationship based on the first joint code; the entity labels are used for indicating characters and/or words corresponding to the predefined relation in the text to be processed;
and extracting the second entity corresponding to each predefined relationship according to the entity labeling sequence of the text to be processed corresponding to each predefined relationship.
4. The method of claim 1, the determining an open relationship between any two of the third entities comprising:
adding a first label and a second label to each third entity in the text to be processed, obtaining a label text, and obtaining a corresponding label text expression vector based on the label text; wherein the first tag and the second tag are to indicate a first word and a last word of the third entity, respectively;
acquiring a corresponding label coding vector based on the label text representation vector;
acquiring second joint codes corresponding to any two third entities according to the label coding vector;
and acquiring the open relation between any two third entities based on the second joint codes.
5. The method of claim 4, wherein the obtaining of the second joint codes corresponding to any two third entities according to the tag coding vector comprises:
obtaining at least one first label vector corresponding to at least one first label in the label coding vectors;
acquiring a first label fusion vector based on any two first label vectors corresponding to any two third entities;
and acquiring second joint codes corresponding to any two third entities based on the first label fusion vector and the label coding vector.
6. The method of claim 1, the first and/or second extraction models comprising one or more of the following models: BERT, Transformer, Stanford NLP, or LTP.
7. The method of claim 1, the screening rule comprising:
acquiring the target entity triples based on the timeliness of the texts to be processed corresponding to the type A entity triples and/or the type B entity triples;
acquiring the target entity triple based on the occurrence frequency of the type A entity triple and/or the type B entity triple in the text to be processed; and/or
and obtaining the target entity triple according to the scoring result of the scoring model for the type A entity triples and/or the type B entity triples.
8. The method of claim 1, the first, second, and/or third entities being financial entities of a type comprising a company, a person, an industry, a metric, a value, and an address.
9. A text processing system comprising:
the text acquisition module is used for acquiring a text to be processed;
the class A extraction module is used for extracting a first entity from the text to be processed by using a first extraction model, and extracting a second entity meeting a predefined relationship from the text to be processed based on the first entity so as to obtain a class A entity triple; wherein each said class A entity triplet includes said first entity, said second entity, and a predefined relationship between said first entity and said second entity;
the class B extraction module is used for extracting a plurality of third entities from the text to be processed by utilizing a second extraction model and determining the open relationship between any two third entities so as to obtain a plurality of class B entity triples; each type B entity triplet comprises two third entities and an open relationship between the two third entities;
and the screening module is used for acquiring the target entity triples from the type A entity triples and the type B entity triples based on a screening rule.
10. A computer-readable storage medium storing computer instructions which, when read by a computer, cause the computer to perform the method of entity triplet extraction as claimed in any one of claims 1 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210433223.1A CN114528418B (en) | 2022-04-24 | 2022-04-24 | Text processing method, system and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210433223.1A CN114528418B (en) | 2022-04-24 | 2022-04-24 | Text processing method, system and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114528418A true CN114528418A (en) | 2022-05-24 |
CN114528418B CN114528418B (en) | 2022-10-14 |
Family
ID=81628023
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210433223.1A Active CN114528418B (en) | 2022-04-24 | 2022-04-24 | Text processing method, system and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114528418B (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115620722A (en) * | 2022-12-15 | 2023-01-17 | 广州小鹏汽车科技有限公司 | Voice interaction method, server and computer readable storage medium |
CN116226408A (en) * | 2023-03-27 | 2023-06-06 | 中国科学院空天信息创新研究院 | Agricultural product growth environment knowledge graph construction method and device and storage medium |
Patent Citations (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20140032209A1 (en) * | 2012-07-27 | 2014-01-30 | University Of Washington Through Its Center For Commercialization | Open information extraction |
CN105138507A (en) * | 2015-08-06 | 2015-12-09 | 电子科技大学 | Pattern self-learning based Chinese open relationship extraction method |
US20190122145A1 (en) * | 2017-10-23 | 2019-04-25 | Baidu Online Network Technology (Beijing) Co., Ltd. | Method, apparatus and device for extracting information |
US20200012953A1 (en) * | 2018-07-03 | 2020-01-09 | Baidu Online Network Technology (Beijing) Co., Ltd. | Method and apparatus for generating model |
CN110781683A (en) * | 2019-11-04 | 2020-02-11 | 河海大学 | Entity relation joint extraction method |
CN111027324A (en) * | 2019-12-05 | 2020-04-17 | 电子科技大学广东电子信息工程研究院 | Method for extracting open type relation based on syntax mode and machine learning |
CN111428493A (en) * | 2020-03-06 | 2020-07-17 | 中国平安人寿保险股份有限公司 | Entity relationship acquisition method, device, equipment and storage medium |
US20210390464A1 (en) * | 2020-06-16 | 2021-12-16 | Baidu Usa Llc | Learning interpretable relationships between entities, relations, and concepts via bayesian structure learning on open domain facts |
CN114372454A (en) * | 2020-10-14 | 2022-04-19 | 腾讯科技(深圳)有限公司 | Text information extraction method, model training method, device and storage medium |
CN113011189A (en) * | 2021-03-26 | 2021-06-22 | 深圳壹账通智能科技有限公司 | Method, device and equipment for extracting open entity relationship and storage medium |
CN113051356A (en) * | 2021-04-21 | 2021-06-29 | 深圳壹账通智能科技有限公司 | Open relationship extraction method and device, electronic equipment and storage medium |
CN113779358A (en) * | 2021-09-14 | 2021-12-10 | 支付宝(杭州)信息技术有限公司 | Event detection method and system |
CN113887211A (en) * | 2021-10-22 | 2022-01-04 | 中国人民解放军战略支援部队信息工程大学 | Entity relation joint extraction method and system based on relation guidance |
CN114385812A (en) * | 2021-12-24 | 2022-04-22 | 思必驰科技股份有限公司 | Relation extraction method and system for text |
Non-Patent Citations (3)
Title |
---|
YOHANES GULTOM ET AL.: "Automatic open domain information extraction from Indonesian text", 2017 International Workshop on Big Data and Information Security (IWBIS) *
WU Wenya et al.: "A Survey of Research on Chinese Entity Relation Extraction", Computer and Modernization *
LUO Yaodong: "Research on Wetland Entity Recognition and Open Relation Extraction", China Masters' Theses Full-text Database, Information Science and Technology Series *
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115620722A (en) * | 2022-12-15 | 2023-01-17 | 广州小鹏汽车科技有限公司 | Voice interaction method, server and computer readable storage medium |
CN116226408A (en) * | 2023-03-27 | 2023-06-06 | 中国科学院空天信息创新研究院 | Agricultural product growth environment knowledge graph construction method and device and storage medium |
CN116226408B (en) * | 2023-03-27 | 2023-12-19 | 中国科学院空天信息创新研究院 | Agricultural product growth environment knowledge graph construction method and device and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN114528418B (en) | 2022-10-14 |
Similar Documents
Publication | Title |
---|---|
CN109522553B (en) | Named entity identification method and device |
CN111222305B (en) | Information structuring method and device |
CN114528418B (en) | Text processing method, system and storage medium |
CN111767716B (en) | Method and device for determining enterprise multi-level industry information and computer equipment |
CN110502738A (en) | Chinese name entity recognition method, device, equipment and inquiry system |
CN112434535B (en) | Element extraction method, device, equipment and storage medium based on multiple models |
CN113535963B (en) | Long text event extraction method and device, computer equipment and storage medium |
WO2023071745A1 (en) | Information labeling method, model training method, electronic device and storage medium |
CN115017303A (en) | Method, computing device and medium for enterprise risk assessment based on news text |
CN113779358A (en) | Event detection method and system |
CN105335350A (en) | Language identification method based on ensemble learning |
CN112966117A (en) | Entity linking method |
CN114153978A (en) | Model training method, information extraction method, device, equipment and storage medium |
CN111091002A (en) | Method for identifying Chinese named entity |
CN111831810A (en) | Intelligent question and answer method, device, equipment and storage medium |
Celikyilmaz et al. | A graph-based semi-supervised learning for question-answering |
De Bruyne et al. | An emotional mess! deciding on a framework for building a Dutch emotion-annotated corpus |
CN112905796B (en) | Text emotion classification method and system based on re-attention mechanism |
CN114330318A (en) | Method and device for recognizing Chinese fine-grained entities in financial field |
CN115526176A (en) | Text recognition method and device, electronic equipment and storage medium |
CN113486649A (en) | Text comment generation method and electronic equipment |
JP6942759B2 (en) | Information processing equipment, programs and information processing methods |
CN112200674A (en) | Stock market emotion index intelligent calculation information system |
CN115879669A (en) | Comment score prediction method and device, electronic equipment and storage medium |
Fourli-Kartsouni et al. | A Bayesian network approach to semantic labelling of text formatting in XML corpora of documents |
Legal Events
Code | Title |
---|---|
PB01 | Publication |
SE01 | Entry into force of request for substantive examination |
GR01 | Patent grant |