CN114528418B - Text processing method, system and storage medium - Google Patents

Text processing method, system and storage medium

Info

Publication number
CN114528418B
CN114528418B
Authority
CN
China
Prior art keywords
entity
text
processed
label
type
Prior art date
Legal status
Active
Application number
CN202210433223.1A
Other languages
Chinese (zh)
Other versions
CN114528418A (en)
Inventor
汤甘
Current Assignee
Hangzhou Tonghuashun Data Development Co., Ltd.
Original Assignee
Hangzhou Tonghuashun Data Development Co., Ltd.
Priority date
Filing date
Publication date
Application filed by Hangzhou Tonghuashun Data Development Co., Ltd.
Priority to CN202210433223.1A
Publication of CN114528418A
Application granted granted Critical
Publication of CN114528418B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F 16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/335 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F 40/295 Named entity recognition

Abstract

The application discloses a text processing method, system and storage medium, wherein the method comprises the following steps: acquiring a text to be processed; extracting a first entity from the text to be processed by using a first extraction model, and extracting a second entity meeting a predefined relationship from the text to be processed based on the first entity to obtain at least one type A entity triple, wherein each type A entity triple comprises a first entity, a second entity and the predefined relationship between the first entity and the second entity; extracting a plurality of third entities from the text to be processed by using a second extraction model, and determining the open relationship between any two third entities to obtain a plurality of type B entity triples, wherein each type B entity triple comprises two third entities and the open relationship between the two third entities; and acquiring target entity triples from the type A entity triples and the type B entity triples based on screening rules.

Description

Text processing method, system and storage medium
Technical Field
The present disclosure relates to the field of natural language processing technologies, and in particular, to a text processing method, system, and storage medium.
Background
Text is an important way for people to acquire knowledge and information, and with the rapid development of Internet technology, the amount of text has grown explosively. In order to enable a computer to better understand text and thus help humans process massive amounts of text information, entity triples composed of two entities and the relationship between those two entities can be used to represent the information in a text, so as to build knowledge graphs and knowledge bases based on massive text information. However, extraction models that obtain entity triples from text information are limited by the predefined relationships between entities and/or by their training corpora, resulting in low applicability, and predefining the relationship types and/or labeling the corpora requires a large amount of human resources.
Accordingly, it is desirable to provide a text processing method, system, and storage medium that can improve both the efficiency and accuracy of text processing.
Disclosure of Invention
One aspect of the present specification provides a text processing method, including: acquiring a text to be processed; extracting a first entity from the text to be processed by using a first extraction model, and extracting a second entity meeting a predefined relationship from the text to be processed based on the first entity to obtain at least one type A entity triple, wherein each type A entity triple comprises a first entity, a second entity and the predefined relationship between the first entity and the second entity; extracting a plurality of third entities from the text to be processed by using a second extraction model, and determining the open relationship between any two third entities to obtain a plurality of type B entity triples, wherein each type B entity triple comprises two third entities and the open relationship between the two third entities; and acquiring target entity triples from the type A entity triples and the type B entity triples based on screening rules.
Another aspect of the specification provides a text processing system, the system comprising: the text acquisition module is used for acquiring a text to be processed; the class A extraction module is used for extracting a first entity from the text to be processed by using a first extraction model, and extracting a second entity meeting a predefined relationship from the text to be processed based on the first entity so as to obtain a class A entity triple; each type A entity triple comprises a first entity, a second entity and a predefined relationship between the first entity and the second entity; the class B extraction module is used for extracting a plurality of third entities from the text to be processed by utilizing a second extraction model and determining the open relationship between any two third entities so as to obtain a plurality of class B entity triples; each type B entity triple comprises two third entities and an open relation between the two third entities; and the screening module is used for acquiring the target entity triples from the type-A entity triples and the type-B entity triples based on the screening rule.
Another aspect of the present specification provides a computer-readable storage medium characterized in that the storage medium stores computer instructions that, when executed by a processor, implement a text processing method.
Drawings
The present description will be further explained by way of exemplary embodiments, which will be described in detail by way of the accompanying drawings. These embodiments are not intended to be limiting, and in these embodiments like numerals are used to indicate like structures, wherein:
FIG. 1 is a diagram of an application scenario for a text processing system, according to some embodiments of the present description;
FIG. 2 is an exemplary block diagram of a text processing system according to some embodiments of the present description;
FIG. 3 is an exemplary flow diagram of a text processing method, shown in accordance with some embodiments of the present description;
FIG. 4 is an exemplary flow diagram illustrating a method for obtaining at least one type A entity triplet using a first extraction model in accordance with some embodiments of the present description;
FIG. 5 is a schematic illustration of a first extraction model according to some embodiments of the present description;
FIG. 6 is an exemplary flow diagram illustrating the acquisition of multiple type B entity triples using a second extraction model according to some embodiments of the present description;
FIG. 7 is a schematic diagram of a second extraction model according to some embodiments of the present description;
FIG. 8 is a schematic diagram of a structure of an entity extraction layer in accordance with some embodiments of the present description;
FIG. 9 is a schematic illustration of a text processing method according to some embodiments of the present description.
Detailed Description
In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings used in the description of the embodiments will be briefly described below. It is obvious that the drawings in the following description are only examples or embodiments of the present description, and that for a person skilled in the art, without inventive effort, the present description can also be applied to other similar contexts on the basis of these drawings. Unless otherwise apparent from the context, or otherwise indicated, like reference numbers in the figures refer to the same structure or operation.
It should be understood that "system", "device", "unit" and/or "module" as used in this specification is a method for distinguishing different components, elements, parts or assemblies at different levels. However, other words may be substituted by other expressions if they accomplish the same purpose.
As used in this specification and the appended claims, the singular forms "a," "an," and "the" may include plural referents unless the context clearly dictates otherwise. In general, the terms "comprise" and "include" merely indicate that the explicitly identified steps or elements are included, without constituting an exclusive list; the method or apparatus may also include other steps or elements.
Flow charts are used in this description to illustrate operations performed by a system according to embodiments of the present description. It should be understood that the operations need not be performed exactly in the order shown. Rather, the various steps may be processed in reverse order or simultaneously. Meanwhile, other operations may be added to the processes, or one or more steps may be removed from them.
In the age of information explosion, a large amount of information appears every day and its forms of expression are flexible and varied, so how to enable a computer to better understand text, and thereby help humans process massive text information, is a problem worth studying. In some embodiments, textual information may be represented using entity triples consisting of two entities and the relationship between those two entities, so that a computer may construct a knowledge graph, build a knowledge base, and the like, based on a vast amount of textual information.
In some embodiments, the extraction model may extract a second entity based on a predefined relationship after extracting a first entity of the entity triple, thereby forming the entity triple. However, an extraction model based on predefined relationships is limited by the number and types of those relationships. When new entity triples appear, the extraction model may have difficulty extracting them from texts such as news reports because it has never "seen" the new relationships between entities. For example, after the extraction model is trained on a corpus for the predefined relationship "competition", entity triples composed of two entities whose relationship is "competition" can be extracted from a text, but new entity triples composed of two entities whose relationship is "cooperation" cannot. In some embodiments, the extraction model may instead determine an open relationship between two entities after extracting both entities of the triple. However, improving the extraction accuracy of an extraction model based on open relationships requires a large training corpus, which consumes much labor and time.
Some embodiments of the present disclosure provide a text processing scheme that jointly extracts entity triples using an extraction model based on predefined relationships (i.e., a first extraction model) and an extraction model based on open relationships (i.e., a second extraction model), and uses the extraction results to train the extraction models, so as to reduce the human resources and time cost of training while improving extraction accuracy.
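For orientation, the overall flow described above can be sketched in Python as follows. This is a minimal illustration only; the callables extract_predefined, extract_open and screen are hypothetical stand-ins for the first extraction model, the second extraction model and the screening rule, not names defined by this disclosure.

    from typing import Callable, List, Tuple

    Triple = Tuple[str, str, str]  # (head entity, tail entity, relation)

    def process_text(
        text: str,
        extract_predefined: Callable[[str], List[Triple]],  # first extraction model
        extract_open: Callable[[str], List[Triple]],        # second extraction model
        screen: Callable[[List[Triple]], List[Triple]],     # screening rule
    ) -> List[Triple]:
        a_triples = extract_predefined(text)  # type A triples (predefined relations)
        b_triples = extract_open(text)        # type B triples (open relations)
        return screen(a_triples + b_triples)  # target entity triples

The triples selected by screen() can then be fed back as training labels for both extraction models, as described below.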
FIG. 1 is a diagram of an application scenario for a text processing system, shown in some embodiments in accordance with the present description. As shown in fig. 1, the application scenario 100 may include:
The processor 110 may process data and/or information obtained from other devices or system components. The processor may execute program instructions based on such data, information, and/or processing results to perform one or more of the functions described herein. For example, the processor 110 may obtain a type A entity triplet and/or a type B entity triplet from the user terminal 130. For another example, the processor 110 may extract the type A entity triples from the text to be processed by using the first extraction model, and extract the type B entity triples from the text to be processed by using the second extraction model. As another example, the processor 110 may also train the first and/or second extraction models based on the target entity triplets. In some embodiments, the processor 110 may include one or more sub-processing devices (e.g., single-core processing devices or multi-core processing devices).
Storage device 120 may be used to store data and/or instructions. For example, the storage device 120 may store the text to be processed, type A entity triplets, and/or type B entity triplets, among others. As another example, the storage device 120 may store parameters of the first extraction model, the second extraction model, and the scoring model. Storage device 120 may include one or more storage components, each of which may be a separate device or part of another device. In some embodiments, storage device 120 may include Random Access Memory (RAM), Read-Only Memory (ROM), mass storage, removable storage, volatile read-write memory, and the like, or any combination thereof. In some embodiments, the storage device 120 may be implemented on a cloud platform.
User terminal 130 refers to one or more terminal devices or software used by a user. In some embodiments, user terminal 130 may be used for interaction and display with a user. For example, the user terminal 130 may display the type a entity triplets and/or the type B entity triplets to the user. As another example, the user terminal 130 may obtain a target entity triplet selected by the user from the user. In some embodiments, the user terminal 130 may be used by one or more users, may include users who directly use the service, and may also include other related users. In some embodiments, the user terminal 130 may be one or any combination of a mobile device 130-1, a tablet computer 130-2, a laptop computer 130-3, a desktop computer 130-4, or other device having input and/or output capabilities.
Network 140 may connect the various components of the system and/or connect the system with external resource components. Network 140 enables communication between the various components and with other components outside the system to facilitate the exchange of data and/or information. In some embodiments, the network 140 may be any one or more of a wired network or a wireless network. The network connections between the parts can be made in one way or in multiple ways. In some embodiments, the network may have a point-to-point, shared, centralized, or other topology, or a combination of such topologies. In some embodiments, network 140 may include one or more network access points, such as wired or wireless base stations and/or network switching points 140-1, 140-2, …, through which one or more components of the system may connect to the network 140 to exchange data and/or information.
In some embodiments, storage 120 may be included in processor 110, user terminal 130, and possibly other system components. In some embodiments, the processor 110 may be included in the user terminal 130, as well as other possible system components.
FIG. 2 is a block diagram of a text processing system, shown in accordance with some embodiments of the present description.
In some embodiments, a text acquisition module 210, a class a extraction module 220, a class B extraction module 230, a filtering module 240, and a training module 250 may be included in the text processing system 200.
The text obtaining module 210 may be configured to obtain a text to be processed.
The class A extraction module 220 may be configured to extract a first entity from the to-be-processed text by using a first extraction model, and extract a second entity satisfying a predefined relationship from the to-be-processed text based on the first entity, so as to obtain at least one class A entity triple. In some embodiments, each class A entity triplet may include a first entity, a second entity, and a predefined relationship between the first entity and the second entity. In some embodiments, the class A extraction module 220 may be used to perform one or more of the following operations: acquiring a first entity and a first joint code of a text to be processed; acquiring an entity tagging sequence of the text to be processed corresponding to each predefined relationship based on the first joint coding; and extracting a second entity corresponding to each predefined relationship according to the entity tagging sequence of the text to be processed corresponding to each predefined relationship. In some embodiments, the entity labels may be used to indicate words and/or phrases in the text to be processed that correspond to predefined relationships. In some embodiments, the first entity and/or the second entity may be financial entities. In some embodiments, the types of financial entities may include companies, people, industries, metrics, values, and addresses. In some embodiments, the first extraction model may include one or more of the following models: BERT, Transformer, Stanford NLP, or LTP.
The class B extraction module 230 may be configured to extract a plurality of third entities from the text to be processed by using a second extraction model, and determine an open relationship between any two third entities to obtain a plurality of class B entity triples. In some embodiments, each class B entity triplet may include two third entities and an open relationship between the two third entities. In some embodiments, the class B extraction module 230 may be used to perform one or more of the following operations: adding a first label and a second label to each third entity in the text to be processed to obtain a label text, and obtaining a corresponding label text expression vector based on the label text; acquiring a corresponding label coding vector based on the label text representation vector; acquiring second joint codes corresponding to any two third entities according to the label coding vectors; and acquiring the open relation between any two third entities based on the second joint codes. In some embodiments, the first tag and the second tag are used to indicate a first word and a last word of the third entity, respectively. In some embodiments, the class B extraction module 230 may be configured to perform one or more of the following operations: obtaining at least one first label vector corresponding to at least one first label in the label coding vectors; acquiring a first label fusion vector based on any two first label vectors corresponding to any two third entities; and acquiring second joint codes corresponding to any two third entities based on the first label fusion vector and the label coding vector. In some embodiments, the third entity may be a financial entity. In some embodiments, the types of financial entities may include companies, people, industries, metrics, values, and addresses.
In some embodiments, the second extraction model may include one or more of the following models: BERT, Transformer, Stanford NLP, or LTP.
The screening module 240 may be configured to obtain target entity triples from the type A entity triples and the type B entity triples based on a screening rule. In some embodiments, the screening rules may include a combination of one or more of the following: acquiring a target entity triple based on the timeliness of the text to be processed corresponding to the type A entity triple and/or the type B entity triple; acquiring a target entity triple based on the number of occurrences of the type A entity triple and/or the type B entity triple in the text to be processed; and/or obtaining the target entity triple according to the scoring result of a scoring model on the type A entity triple and/or the type B entity triple.
The training module 250 may be configured to train the first extraction model and/or the second extraction model by using the text to be processed as a training sample and the target entity triplet as a training label.
FIG. 3 is an exemplary flow diagram of a text processing method, shown in accordance with some embodiments of the present description.
In some embodiments, the text processing method 300 may be performed by a processing device or implemented by a text processing system disposed on a processing device.
As shown in fig. 3, text processing method 300 may include:
step 310, obtaining a text to be processed. In particular, this step 310 may be performed by the text acquisition module 210.
The text to be processed may be text for which entity triples need to be extracted. For example, the text to be processed may be text information in a financial scenario. As another example, the pending text may be textual information in a robot customer service scenario. For convenience of explanation, the text processing method is described in the specification in conjunction with a financial scenario.
In some embodiments, the pending text may include chapter-level text. Illustratively, pending text may include securities research reports, related industry research reports, audit reports, credit reports, announcements, news and current comments, and the like. In some embodiments, the pending text may comprise sentence-level text. Illustratively, the text to be processed may include sentences included in any of the aforementioned chapter-level texts.
In some embodiments, the text obtaining module 210 may obtain the text to be processed directly from the information in the form of words. For example, the text obtaining module 210 may obtain the text to be processed from a text database. For another example, the text obtaining module 210 may also crawl the text to be processed from the text of the web page.
In some embodiments, the text acquisition module 210 may also acquire the text to be processed from image information based on character recognition (e.g., OCR) technology. In some embodiments, the text to be processed may also be obtained from speech information based on Automatic Speech Recognition (ASR) technology.
In some embodiments, the text acquisition module 210 may pre-process the text to be processed. In some embodiments, the pre-processing may include, but is not limited to, a combination of one or more of segmentation, deduplication, filtering, and the like.
Segmentation may divide a text to be processed in the form of a long text into a plurality of texts to be processed in the form of short texts. For example, segmentation may divide the aforementioned chapter-level securities research report into a plurality of sentence-level texts: "Company A is located in district F of city E, province D", "Company A's main competitors are Company B and Company C, and Company B is also a foundry of Company C", ….
In some embodiments, the text obtaining module 210 may determine the length of the short text after segmentation according to the processing efficiency of the first extraction model and/or the second extraction model on the texts with different lengths, so as to improve the text processing efficiency. For the related description of the first extraction model and the second extraction model, reference may be made to fig. 4 and fig. 5 and the related description thereof, which are not described herein again.
Some embodiments of the present description perform deduplication on the short texts obtained after segmentation, which can improve the deduplication rate and reduce duplicate texts among the texts to be processed.
Deduplication can be a process of removing duplicate text in the text to be processed.
Repeated text may be text with the same and/or similar content. In some embodiments, the text obtaining module 210 may obtain a text vector corresponding to each text in the text to be processed by using a word embedding model, then calculate the similarity between different text vectors, and finally take the texts corresponding to text vectors whose similarity is greater than a threshold as repeated texts. For a description of the word embedding model, reference may be made to fig. 4 and its related description, which are not repeated here. In some embodiments, the similarity between text vectors may be characterized by the distance between the text vectors. In some embodiments, the distance may include, but is not limited to: Euclidean distance, Manhattan distance, Chebyshev distance, Minkowski distance, Mahalanobis distance, cosine distance, etc.
For example, the text obtaining module 210 may obtain, from the aforementioned securities research report, short text 1 "Company A is located in district F of city E, province D", and obtain, from a certain news item, short text 2 "Company A is located in district F of city E, province D"; the text obtaining module 210 may then determine that short text 1 and short text 2 are repeated texts and remove at least one of them from the texts to be processed.
Some embodiments of the present description perform deduplication on a text to be processed, so that it is possible to avoid extracting the same entity triplet based on the same text, that is, to avoid generating interference on subsequent screening of target entity triples based on the occurrence frequency of the entity triplet, thereby improving the accuracy of text processing. For a detailed description of the screening of the triples of the target entity, reference may be made to the related description of step 340, which is not described herein again.
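A minimal sketch of this deduplication step is shown below, assuming any text embedding is available; a bag-of-characters count is used here purely as a stand-in for the word embedding model described above:

    import math
    from collections import Counter
    from typing import Dict, List

    def char_embed(text: str) -> Dict[str, int]:
        # Stand-in embedding: bag-of-characters counts
        return dict(Counter(text))

    def cosine_similarity(a: Dict[str, int], b: Dict[str, int]) -> float:
        dot = sum(v * b.get(k, 0) for k, v in a.items())
        norm_a = math.sqrt(sum(v * v for v in a.values()))
        norm_b = math.sqrt(sum(v * v for v in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    def deduplicate(texts: List[str], threshold: float = 0.9) -> List[str]:
        # Keep a text only if it is not too similar to any already-kept text
        kept: List[str] = []
        for t in texts:
            if all(cosine_similarity(char_embed(t), char_embed(k)) <= threshold
                   for k in kept):
                kept.append(t)
        return kept

The threshold value 0.9 is illustrative; in practice it would be tuned to the embedding in use.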
The filtering may remove invalid text from the text to be processed. The invalid text may be text that does not conform to the target scene.
The target scene may be the application scenario of the text to be processed desired by the user. Illustratively, if a user desires to obtain entity triples in a financial scenario to build a financial relationship knowledge graph, the target scene may be the financial scene. For example, the invalid text may be a researcher disclaimer in the aforementioned securities research report. As another example, the invalid text may be website links and advertisements in a web page. As another example, the invalid text may be spaces, garbled characters, erroneous characters, and the like in the text to be processed.
In some embodiments, the text obtaining module 210 may identify and remove invalid texts from the texts to be processed, so as to obtain the filtered texts to be processed. In some embodiments, the text obtaining module 210 may obtain the filtered text to be processed by using a classification model. Specifically, the classification model may map the input text to a value or a probability, and then obtain a classification result based on that value or probability. Further, the text obtaining module 210 may take the texts whose classification result is "financial scene" as the filtered texts to be processed, treat texts with other classification results as invalid texts, and remove the invalid texts from the texts to be processed.
Some embodiments of the present description filter the text to be processed, so as to reduce interference of invalid texts on the extraction result, thereby improving the extraction accuracy and the extraction efficiency.
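Reduced to a sketch, the filter keeps only the texts the classification model assigns to the target scene; the classify callable below is a hypothetical stand-in for that model:

    from typing import Callable, List

    def filter_invalid(texts: List[str],
                       classify: Callable[[str], float],
                       threshold: float = 0.5) -> List[str]:
        # classify() maps a text to the probability of the target scene
        # (e.g. the financial scene); low-probability texts are invalid
        return [t for t in texts if classify(t) > threshold]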
And 320, extracting a first entity from the text to be processed by using the first extraction model, and extracting a second entity meeting the predefined relationship from the text to be processed based on the first entity to obtain at least one type A entity triple. In particular, this step 320 may be performed by the class A extraction module 220.
An entity may be a specific individual in the real world. In some embodiments, the entity may be a financial entity. A financial entity may be an entity in a financial application scenario. For example, an entity may be Company A, Company B, Company C, and so on. For another example, an entity may be Zhang San (a shareholder), Li Si (a director), or Wang Er (a legal representative). For another example, an entity may be the pig farming industry, the medical aesthetics industry, the real estate industry, and the like.
An entity type may be a generalized abstraction of objective individuals. In some embodiments, the types of financial entities may include companies, people, industries, metrics, values, and addresses.
In some embodiments, an entity may be an instance that actually exists under the abstract concept of an entity type. For example, the entity type "company" may be instantiated as the entities "Company A", "Company B", "Company C", etc.; the entity type "person" as "Zhang San (shareholder)", "Li Si (director)", "Wang Er (legal representative)", etc.; the entity type "industry" as "pig farming industry", "medical aesthetics industry", "real estate industry", etc.; the entity type "metric" as "total monthly cost", "total annual sales", "total annual profit", etc.; the entity type "value" as "one million", "one hundred million", etc.; and the entity type "address" as "district F of city E, province D", "No. … H street, district G", etc.
Illustratively, the entities of the pending text "Company A is located in district F of city E, province D" include: Company A and district F of city E, province D; the corresponding entity types include: company and address. As yet another example, the entities of the pending text "Company A's main competitors are Company B and Company C, and Company B is also a foundry of Company C" include: Company A, Company B and Company C; the corresponding entity types include: company, company and company.
Entities may have relationships between them, which may be described by the relationships between their corresponding entity types. For example, a relationship between the entity type "company" and the entity type "address" may be "located in", and the relationship between the corresponding entity "Company A" and the entity "district F of city E, province D" may be "located in".
An entity triple may consist of two entities in the text to be processed and the relationship between the two entities. Illustratively, an entity triple may be represented by the structure [entity, relation, entity]; as another example, it may also be represented by the structure [entity, entity, relation], which is the form used in the examples below.
In some embodiments, one or more sets of entity triples may correspond to the text to be processed.
For example, the entity triples corresponding to the text to be processed "Company A is located in district F of city E, province D" may include: the 1st group entity triple [Company A, district F of city E province D, located in], etc. For another example, the entity triples corresponding to the pending text "Company A's main competitors are Company B and Company C, and Company B is also a foundry of Company C" may include: the 2nd group entity triple [Company A, Company B, competition], the 3rd group entity triple [Company A, Company C, competition], the 4th group entity triple [Company C, Company B, employing], and the like.
In some embodiments, the entities and relationships in the groups of entity triples may be partially identical. Continuing with the example above, the entity "Company A" in the 1st and 2nd groups of entity triples is the same, and the relation "competition" in the 2nd and 3rd groups of entity triples is the same.
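A triple can be held in a small record type; the sketch below is illustrative only (using the [entity, entity, relation] form of the examples above), not a structure prescribed by this disclosure:

    from typing import NamedTuple

    class EntityTriple(NamedTuple):
        head: str      # e.g. "Company A"
        tail: str      # e.g. "Company B", or an address entity
        relation: str  # predefined or open relation

    group1 = EntityTriple("Company A", "district F of city E, province D", "located in")
    group2 = EntityTriple("Company A", "Company B", "competition")
    group3 = EntityTriple("Company A", "Company C", "competition")
    # Entities and relations may partially coincide across groups:
    assert group1.head == group2.head and group2.relation == group3.relation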
A class a entity triplet may be an entity triplet containing a predefined relationship.
The predefined relationship may be a relationship predefined based on entity types. In some embodiments, the predefined relationships may be determined manually based on the relationships between entity types. For example, in a financial application scenario, the relationships between the entity type "company" and the entity type "company" may include competition, cooperation, employing, employed, etc.; the relationships between "company" and the entity type "person" may include employment, control, etc.; and the relationships between "company" and the entity type "address" may include located in, registered in, etc. The predefined relationships corresponding to the entity type "company" may thus include competition, cooperation, employment, controlled, located in, registered in, etc.
In some embodiments, each class a entity triplet may include a first entity, a second entity, and a predefined relationship between the first entity and the second entity.
The first entity may be an entity extracted from the text to be processed.
In some embodiments, the class A extraction module 220 may extract the first entity from the text to be processed using a first extraction model. Specifically, the class A extraction module 220 may process the text to be processed by using the first extraction model to obtain a text labeling sequence of the text to be processed. The text labeling sequence can be used for marking the characters and/or words in the text to be processed that belong to an entity, together with the entity types to which they belong. In some implementations, the first extraction model includes one or more of the following models: BERT, Transformer, Stanford NLP, or LTP.
For a detailed description of the first extraction model, refer to fig. 8 and its related description, which are not repeated herein.
As shown in FIG. 5, the first extraction model processes "Company A's main competitors are Company B …" to obtain the text labeling sequence: "O", "B-co", "I-co", "O", …, "B-co", …. The class A extraction module 220 can then obtain the corresponding first entities based on the entity labels "B-co", "I-co" and "B-co" therein: Company A, Company B, …, together with their corresponding entity types: company, company, ….
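Recovering entity strings from such a labeling sequence is a routine BIO decoding step; a generic helper is sketched below, assuming tags of the form "B-type"/"I-type"/"O" as in FIG. 5:

    from typing import List, Tuple

    def decode_bio(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
        # Collect (entity, entity type) pairs from a BIO tag sequence
        entities: List[Tuple[str, str]] = []
        current: List[str] = []
        etype = ""
        for token, tag in zip(tokens, tags):
            if tag.startswith("B-"):
                if current:
                    entities.append(("".join(current), etype))
                current, etype = [token], tag[2:]
            elif tag.startswith("I-") and current:
                current.append(token)
            else:
                if current:
                    entities.append(("".join(current), etype))
                current, etype = [], ""
        if current:
            entities.append(("".join(current), etype))
        return entities

    print(decode_bio(list("ABCD"), ["B-co", "I-co", "O", "B-co"]))
    # [('AB', 'co'), ('D', 'co')]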
The second entity may be an entity extracted from the text to be processed based on the predefined relationship corresponding to the first entity.
In some embodiments, the class A extraction module 220 may use the first extraction model to extract the second entity from the text to be processed based on the predefined relationship corresponding to the first entity. For a detailed description of how the first extraction model extracts the second entity, refer to fig. 4 and its related description, which are not repeated here.
As shown in fig. 5, based on the predefined relationship "cooperation" corresponding to the first entity "Company A", the first extraction model extracts a null result for the second entity from the text to be processed "Company A's main competitors are Company B …"; based on the predefined relationship "competition" corresponding to the first entity "Company A", the second entities "Company B", … can be extracted. Further, the class A extraction module 220 may obtain the 1st group class A entity triple [Company A, Company B, competition] from the text to be processed.
It will be appreciated that the relationship between the first entity and the second entity is relative.
In some embodiments, the first entity and the second entity may be exchanged to form a new group of class A entity triples. For example, based on the predefined relationship "cooperation" corresponding to the first entity "Company B", the first extraction model extracts a null result for the second entity from the text to be processed "Company A's main competitors are Company B …"; based on the predefined relationship "competition" corresponding to the first entity "Company B", the second entities "Company A", … can be extracted. Further, the class A extraction module 220 may obtain the 2nd group class A entity triple [Company B, Company A, competition] from the text to be processed.
In some embodiments, the relationship between the exchanged first entity and second entity may change.
For example, the first extraction model may extract the first entities "Company A", "Company B" and "Company C" from the text to be processed "Company A's main competitors are Company B and Company C, and Company B is also a foundry of Company C", and then extract the second entity "Company B" based on the predefined relationship "employing" corresponding to the first entity "Company C", thereby obtaining the 3rd group class A entity triple [Company C, Company B, employing]. Based on the predefined relationship "employing" corresponding to the first entity "Company B" the result for the second entity is null, while based on the predefined relationship "employed" corresponding to the first entity "Company B" the second entity "Company C" can be extracted, thereby obtaining the 4th group class A entity triple [Company B, Company C, employed].
For another example, the first extraction model may extract the first entities "Company A" and "district F of city E, province D" from the text to be processed "Company A is located in district F of city E, province D", and then extract the second entity "district F of city E, province D" based on the predefined relationship "located in" corresponding to the first entity "Company A", thereby obtaining the 5th group class A entity triple [Company A, district F of city E province D, located in]. The first entity "district F of city E, province D" may have no corresponding predefined relationship, or the results of extracting a second entity based on all of its predefined relationships may be null; that is, "district F of city E, province D" and "Company A" cannot serve as the first and second entities, respectively, of a class A entity triple.
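Assembling class A triples thus amounts to iterating over each first entity's predefined relationships and keeping any non-null second entities. A sketch is given below, where extract_second is a hypothetical stand-in for the first extraction model's per-relationship extraction:

    from typing import Callable, Dict, List, Tuple

    def build_class_a_triples(
        text: str,
        first_entities: Dict[str, str],    # entity -> entity type
        predefined: Dict[str, List[str]],  # entity type -> predefined relations
        extract_second: Callable[[str, str, str], List[str]],
    ) -> List[Tuple[str, str, str]]:
        triples = []
        for e1, etype in first_entities.items():
            for rel in predefined.get(etype, []):         # e.g. "competition"
                for e2 in extract_second(text, e1, rel):  # may be empty (null)
                    triples.append((e1, e2, rel))
        return triples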
Step 330, using the second extraction model to extract a plurality of third entities from the text to be processed, and determining an open relationship between any two third entities to obtain a plurality of entity triples of class B. In particular, this step 330 may be performed by the class B extraction module 230.
A class B entity triplet may be an entity triplet that contains an open relationship.
An open relationship may be a relationship that is not predefined. In some embodiments, the open relationship may be obtained based on any two third entities.
The third entity may be an entity extracted from the text to be processed. In some embodiments, the class B extraction module 230 may extract the third entity from the text to be processed using a second extraction model. For a detailed description of the extraction of the third entity, reference may be made to the related description of the extraction of the first entity in step 320, which is not described herein again.
As shown in FIG. 7, the second extraction model processes "Company A's main competitors are Company B …" to obtain the text labeling sequence: "O", "B-co", "I-co", "O", …, "B-co", "O", …; the class B extraction module 230 may obtain the corresponding third entities based on the entity labels "B-co", "I-co" and "B-co" therein: Company A, Company B, ….
Further, in some embodiments, the second extraction model may determine an open relationship between any two third entities based on the any two third entities and the text to be processed.
Specifically, the class B extraction module 230 may process the text to be processed by using the second extraction model, to obtain a relationship labeling sequence of the text to be processed corresponding to any two third entities. The relation labeling sequence can be used for labeling characters and/or words corresponding to the open relation in the text to be processed. Further, the second extraction model may determine an open relationship between any two third entities in the text to be processed based on the relationship labeling sequence. For a detailed description of the second extraction model, refer to fig. 6 and its related description, which are not repeated herein.
In some embodiments, each class B entity triple may include two third entities and the open relationship between them. For example, the class B extraction module 230 may obtain the 1st group class B entity triple [Company A, Company B, competition] based on the third entities "Company A" and "Company B" and the open relationship "competition" between them.
It will be appreciated that the relationship between the two third entities is relative.
In some embodiments, the positions of the two third entities may be swapped to form a new group of class B entity triples. For example, the positions of the two third entities in the 1st group class B entity triple may be exchanged to obtain the 2nd group class B entity triple, denoted [Company B, Company A, competition].
In some embodiments, after the positions of the two third entities are exchanged, the open relationship between them may change accordingly. For example, the second extraction model may extract the third entities "Company A", "Company B" and "Company C" from the text to be processed "Company A's main competitors are Company B and Company C, and Company B is also a foundry of Company C". Based on the third entities "Company B" and "Company C", the open relationship "contract manufacturing" can be obtained, yielding the 3rd group class B entity triple [Company B, Company C, contract manufacturing]; based on the third entities "Company C" and "Company B", the open relationship "outsourced processing" can be obtained, yielding the 4th group class B entity triple [Company C, Company B, outsourced processing].
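Because the order of the two entities matters, class B extraction can be sketched as enumerating ordered pairs of third entities; extract_relation below is a hypothetical stand-in for the second extraction model's relation labeling:

    from itertools import permutations
    from typing import Callable, List, Tuple

    def build_class_b_triples(
        text: str,
        third_entities: List[str],
        extract_relation: Callable[[str, str, str], str],
    ) -> List[Tuple[str, str, str]]:
        # Ordered pairs: swapping the entities may change the open relation
        # (e.g. "contract manufacturing" vs. "outsourced processing")
        triples = []
        for e1, e2 in permutations(third_entities, 2):
            rel = extract_relation(text, e1, e2)  # "" when no relation found
            if rel:
                triples.append((e1, e2, rel))
        return triples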
And step 340, acquiring a target entity triple from the type A entity triple and the type B entity triple based on the screening rule. In particular, this step 340 may be performed by the screening module 240.
The target entity triples may be entity triples that satisfy the extraction requirements.
The screening rules may be rules for determining target entity triples.
In some embodiments, the screening rule may include obtaining target entity triples based on the timeliness of the texts to be processed corresponding to the type A entity triples and/or the type B entity triples.
The timeliness can reflect the influence of the newness degree of the text to be processed on the screening result. In some embodiments, the timeliness of the text to be processed may be assessed using the timeliness indicator. In some embodiments, the timeliness indicator may be determined based on the time indicator and the effectiveness indicator.
The time index may reflect the recency of the text to be processed. In some embodiments, the time indicator may include, but is not limited to, one or more of a publication time indicator, an occurrence time indicator, and an acquisition time indicator, among others.
The publication time index may be the interval between the publication time of the text to be processed and the current time. The publication time of the text to be processed can be the release time of a news item, the time a securities research report was uploaded to a website, the announcement time of an audit report, and the like. In some embodiments, the filtering module 240 may obtain the publication time of the text to be processed by accessing database time information and/or crawling website information.
The occurrence time index may be an interval between the occurrence time of the event described by the pending text and the current time. Wherein, the occurrence time of the event described by the text to be processed can be the occurrence time of the event reported by news. In some embodiments, the filtering module 240 may identify a text in a time format from the text to be processed, so as to obtain an occurrence time of an event described by the text to be processed.
The acquisition time index may be an interval between the time when the text processing system acquires the text to be processed and the current time. For example, the time at which the to-be-processed text is obtained may be the time at which the text processing system 200 crawls the to-be-processed text from a website. In some embodiments, the filtering module 240 may record the time for acquiring the text to be processed directly when acquiring the text to be processed.
Illustratively, suppose the current time is January 20, 2022. The text to be processed may be a news item published on January 2, 2022, reporting that Company A completed an acquisition in early January 2022; the text processing system 200 crawled this news from a news website on January 10, 2022, as the text to be processed. The publication time, the occurrence time of the described event, and the acquisition time of the text to be processed can be determined accordingly, and the corresponding publication time index, occurrence time index and acquisition time index may be 18d, 17d and 10d, respectively.
It will be appreciated that the larger the value of the time indicator, the older the text to be processed.
In some embodiments, different texts to be processed may yield different time indexes. For example, the aforementioned news item may yield the corresponding 3 time indexes, while if the occurrence time of an event is not described in the securities research report, only the corresponding publication time index (e.g., 20d) and acquisition time index (e.g., 9d) may be obtained.
In some embodiments, the filtering module 240 may set weights for different time indicators, and perform weighted averaging on a plurality of time indicators based on the weights, so as to obtain a final time indicator. For example, the screening module 240 may set weights of 0.4, 0.5, and 0.1 for the release time index, the occurrence time index, and the acquisition time index, respectively, and continuing with the above example, the time index corresponding to the news is (18 × 0.4+17 × 0.5+10 × 0.1)/3 =5.6d, and the time index corresponding to the securities research report is (20 × 0.4+9 × 0.1)/2 =4.5d.
The effectiveness index may reflect how long the text to be processed influences the screening result. It can be understood that the larger the effectiveness index, the longer the text to be processed influences the screening result.
In some embodiments, the effectiveness index may be determined based on the type of the text to be processed. Illustratively, the securities research report, a monthly audit report, and a news item may have corresponding effectiveness indexes of 60d, 30d, and 30d, respectively.
In some embodiments, the timeliness index may be the ratio of the effectiveness index to the time index. For example, the timeliness index of the aforementioned news item may be 30/5.6=5.4, and the timeliness index of the securities research report may be 60/4.5=13.
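The two indexes can be computed as sketched below, mirroring the arithmetic of the examples above (the weights and the divide-by-count step follow those examples and are illustrative, not prescribed values):

    from typing import Dict, Optional

    WEIGHTS = {"publication": 0.4, "occurrence": 0.5, "acquisition": 0.1}

    def time_index(indicators: Dict[str, Optional[float]]) -> float:
        # Weighted sum over the available time indexes, divided by their count
        available = {k: v for k, v in indicators.items() if v is not None}
        return sum(WEIGHTS[k] * v for k, v in available.items()) / len(available)

    def timeliness_index(effectiveness: float,
                         indicators: Dict[str, Optional[float]]) -> float:
        return effectiveness / time_index(indicators)

    news = {"publication": 18, "occurrence": 17, "acquisition": 10}
    report = {"publication": 20, "occurrence": None, "acquisition": 9}
    print(round(timeliness_index(30, news), 1))  # 5.4
    print(round(timeliness_index(60, report)))   # 13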
In some embodiments, the screening module 240 may sort the type A entity triples and/or type B entity triples by the timeliness of their corresponding texts to be processed, from largest to smallest timeliness index, and take those whose rank is smaller than the first ranking threshold as target entity triples. For example, suppose the candidates are the 1st group type A entity triple [Company A, Company B, competition], the 2nd group type A entity triple [Company B, Company A, competition], the 3rd group type A entity triple [Company C, Company B, employing], the 4th group type A entity triple [Company B, Company C, employed], the 5th group type A entity triple [Company A, district F of city E province D, located in], and the 1st to 4th groups of type B entity triples. Sorting by timeliness may place the 1st to 4th groups of type A entity triples and the 1st to 4th groups of type B entity triples jointly at rank 1, and the 5th group type A entity triple at rank 2. With a first ranking threshold of 2, the screening module 240 may take the 1st to 4th groups of type A entity triples and the 1st to 4th groups of type B entity triples as target entity triples.
In some embodiments, the screening module 240 may take, as target entity triples, the type A entity triples and/or type B entity triples whose corresponding texts to be processed have a timeliness index greater than the timeliness threshold. For example, if the timeliness indexes of the texts to be processed corresponding to the 1st to 4th groups of type A entity triples and the 1st to 4th groups of type B entity triples are all 4.5, while the timeliness index of the text to be processed corresponding to the 5th group type A entity triple [Company A, district F of city E province D, located in] is 3, then with a timeliness threshold of 4 the screening module 240 may take the 1st to 4th groups of type A entity triples and the 1st to 4th groups of type B entity triples as target entity triples.
In some embodiments, the first ranking threshold and timeliness threshold may be determined based on the number of texts to be processed and the number of training times of the extraction model (first extraction model and/or second extraction model). It can be understood that the larger the number of texts to be processed, the more training times of the extraction model, the smaller the first ordering threshold value, and the larger the timeliness threshold value.
Some embodiments of the present description determine the target entity triples based on the timeliness of the texts to be processed corresponding to the entity triples (i.e., the type A entity triples and/or type B entity triples), which helps keep the target entity triples current.
In some embodiments, the target entity triples are obtained based on the number of occurrences of the type A entity triples and/or the type B entity triples in the texts to be processed.
In some embodiments, the screening module 240 may count how many times each identical entity triple (type A and/or type B) appears in the texts to be processed, and take those whose count is greater than the frequency threshold as target entity triples.
For example, the type A entity triples and/or type B entity triples extracted from the texts to be processed "Company A is located in district F of city E, province D" and "Company A's main competitors are Company B and Company C, and Company B is also a foundry of Company C" include: the 1st group type A entity triple [Company A, Company B, competition], the 2nd group type A entity triple [Company B, Company A, competition], the 3rd group type A entity triple [Company C, Company B, employing], the 4th group type A entity triple [Company B, Company C, employed], the 5th group type A entity triple [Company A, district F of city E province D, located in], the 1st group type B entity triple [Company A, Company B, competition], the 2nd group type B entity triple [Company B, Company A, competition], the 3rd group type B entity triple [Company B, Company C, contract manufacturing] and the 4th group type B entity triple [Company C, Company B, outsourced processing]. Counting occurrences, [Company A, Company B, competition] appears 2 times (the 1st group type A and the 1st group type B), [Company B, Company A, competition] appears 2 times (the 2nd group type A and the 2nd group type B), and each of the remaining triples appears 1 time. With a frequency threshold of 1, the screening module 240 may take [Company A, Company B, competition] and [Company B, Company A, competition] as target entity triples.
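A sketch of this frequency-based screening, counting type A and type B results together:

    from collections import Counter
    from typing import List, Tuple

    Triple = Tuple[str, str, str]

    def screen_by_frequency(candidates: List[Triple],
                            threshold: int = 1) -> List[Triple]:
        # Keep triples extracted more than `threshold` times in total
        counts = Counter(candidates)
        return [t for t, n in counts.items() if n > threshold]

    candidates = [
        ("Company A", "Company B", "competition"),  # 1st group, type A
        ("Company B", "Company A", "competition"),  # 2nd group, type A
        ("Company A", "Company B", "competition"),  # 1st group, type B
        ("Company B", "Company A", "competition"),  # 2nd group, type B
        ("Company C", "Company B", "employing"),    # 3rd group, type A
    ]
    print(screen_by_frequency(candidates))
    # [('Company A', 'Company B', 'competition'), ('Company B', 'Company A', 'competition')]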
In some embodiments, the frequency threshold may be determined based on the number of texts to be processed and the number of times the extraction models (the first extraction model and/or the second extraction model) have been trained. It can be understood that the larger the number of texts to be processed and the more the extraction models have been trained, the larger the frequency threshold may be.
Some embodiments of the present description determine the target entity triples based on the number of occurrences of the entity triples (i.e., the type A entity triples and/or type B entity triples) in the texts to be processed, which helps make the target entity triples practically useful.
In some embodiments, the target entity triples are obtained according to the scoring results of a scoring model on the type A entity triples and/or the type B entity triples.
In some embodiments, the inputs of the scoring model may include type A entity triples and/or type B entity triples, and the outputs may be the scoring results corresponding to those type A entity triples and/or type B entity triples.
In some embodiments, the scoring model may include, but is not limited to, a Text Rank model, a Logistic regression model, a naive bayes classification model, a gaussian distributed bayes classification model, a decision tree model, a random forest model, a KNN classification model, a neural network model, and the like.
For example, the scoring model may process the 1st to 5th groups of type A entity triples and the 1st to 4th groups of type B entity triples, respectively, obtaining scoring results such as 0.8, 0.4, 0.3, 0.7, … for the former and 0.8, 0.6, 0.2, … for the latter.
In some embodiments, the inputs to the scoring model may also include the timeliness indexes and the numbers of occurrences of the type A entity triples and/or the type B entity triples in the texts to be processed.
Further, the scoring model may take the type A entity triples and/or the type B entity triples whose scoring result exceeds the score threshold and/or whose scoring-result rank is smaller than the second ranking threshold as the target entity triples.
For example, with a score threshold of 0.5, the scoring model may take the type A entity triples and/or type B entity triples whose scoring results exceed the threshold, namely the 1st group type A entity triple [Company A, Company B, competition], the 2nd group type A entity triple [Company B, Company A, competition], the 3rd group type A entity triple [Company C, Company B, employing], the 5th group type A entity triple [Company A, district F of city E province D, located in], the 1st group type B entity triple [Company A, Company B, competition] and the 2nd group type B entity triple [Company B, Company A, competition], as target entity triples.
For another example, with a second ranking threshold of 4, the scoring model may take the type A entity triples and/or type B entity triples whose scoring-result rank is smaller than the threshold as target entity triples: for instance, the 1st group type A entity triple [Company A, Company B, competition] and the 1st group type B entity triple [Company A, Company B, competition] juxtaposed at rank 1, the 2nd group type A entity triple [Company B, Company A, competition], the 2nd group type B entity triple [Company B, Company A, competition] and the 5th group type A entity triple [Company A, district F of city E province D, located in] juxtaposed at rank 2, and the 3rd group type B entity triple [Company B, Company C, contract manufacturing] at rank 3.
In some embodiments, the second ranking threshold and the score threshold may be determined based on the number of texts to be processed and the number of training times of the extraction model (the first extraction model and/or the second extraction model). It can be understood that the larger the number of texts to be processed, the more training times of the extraction model, the smaller the second sorting threshold value, and the larger the score threshold value.
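To make the screening by scoring results concrete, the following is a minimal Python sketch; the triples, scores, tie-aware dense ranking, and threshold values are illustrative stand-ins drawn from the examples above, not the patent's actual implementation.

```python
def select_target_triples(scored, score_threshold=0.5, rank_threshold=4):
    """Keep triples whose score exceeds the score threshold, or whose
    dense rank (ties share a rank) is smaller than the ranking threshold."""
    by_score = sorted(scored, key=lambda item: item[1], reverse=True)
    selected, rank, prev = [], 0, None
    for triple, score in by_score:
        if score != prev:
            rank, prev = rank + 1, score  # tied scores keep the same rank
        if score > score_threshold or rank < rank_threshold:
            selected.append(triple)
    return selected

scored = [(("A si", "B", "competition"), 0.8),
          (("B", "A si", "competition"), 0.8),
          (("A si", "D province E city F district", "located"), 0.7),
          (("B", "C", "generation processing"), 0.6),
          (("C", "B", "employment"), 0.3)]
print(select_target_triples(scored))  # keeps the first four triples
```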
Some embodiments of the present description obtain the target entity triples based on a scoring model, so that the candidate entity triples can be evaluated from multiple dimensions, improving the accuracy of the extraction result.
In some embodiments, the training module 250 may train the scoring model individually based on a number of first training samples with first training labels. Specifically, a first training sample with a first training label is input into the scoring model, and parameters of the scoring model are updated through training. In some embodiments, the first training sample may be a sample entity triplet. In some embodiments, the first training label may be a manually labeled true (1) or false (0). For example, when the sample entity triplet in the first training sample is indeed the target entity triplet, the first training label may be true or 1, and when the sample entity triplet in the first training sample is not the target entity triplet, the first training label may be false or 0.
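As a rough sketch of this training setup, the scoring model could be a binary classifier fit on sample entity triples with manual true/false labels; the feature choices and the scikit-learn logistic regression below are assumptions for illustration, since the description allows many model families.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def triple_features(occurrences, timeliness, head, tail, relation):
    """Illustrative features only: occurrence count, a timeliness
    indicator, and entity-name lengths; a real system would use richer ones."""
    return [occurrences, timeliness, len(head), len(tail), len(relation)]

# First training samples (sample entity triples) with first training labels.
X = np.array([triple_features(2, 0.9, "A si", "B", "competition"),
              triple_features(1, 0.4, "C", "B", "employment")])
y = np.array([1, 0])  # 1 = true (is a target triple), 0 = false

scoring_model = LogisticRegression().fit(X, y)
print(scoring_model.predict_proba(X)[:, 1])  # scoring results in [0, 1]
```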
In some embodiments, the screening module 240 may further send the type A entity triples and/or type B entity triples to the user terminal 130 through the network 140, and then receive, from the user terminal 130 through the network 140, the type A entity triples and/or type B entity triples selected by the user as the target entity triples. In some embodiments, the number of times the screening module 240 determines the target entity triples in conjunction with user interaction may be determined based on the number of texts to be processed and the number of times the extraction model has been trained. It is to be appreciated that the screening module 240 may increase the number of times the type A entity triples and/or type B entity triples are sent to the user terminal 130 when the number of texts to be processed is small and/or the extraction model has been trained few times. In some embodiments, the screening module 240 may also determine the timing of sending type A entity triples and/or type B entity triples to the user terminal 130 based on the user's selection.
It can be understood that, compared with manually extracting sample entity triples from a large amount of sample text as training labels, the manual labeling in this embodiment only needs to judge whether the extracted type A entity triples and/or type B entity triples are target entity triples, which saves human resources and time cost. Some embodiments of the present description screen target entity triples in combination with user interaction and may also appropriately guide the screening result based on user settings, thereby improving the accuracy of the extraction result while saving human resources.
FIG. 9 is a schematic diagram of a text processing method according to some embodiments of the present description. As shown in fig. 9, after the text processing system 100 obtains the type a entity triplet and the type B entity triplet from the text to be processed by using the first extraction model and the second extraction model, the target entity triplet is screened from the type a entity triplet and the type B entity triplet based on the screening rule, and the first extraction model and/or the second extraction model may be trained based on the target entity triplet.
In some embodiments, the training module 250 may train the first extraction model and/or the second extraction model using the text to be processed as a training sample and the target entity triplet as a training label.
For example, the training module 250 may take the text to be processed "A si is located in area F of E city, D province" as second training sample 1, with the corresponding target entity triple [A si, D province E city F district, located] as the second training label of second training sample 1; and take the text to be processed "A si's main competitors are B and C, and B is at the same time a generation processing plant of C" as second training sample 2, with the corresponding target entity triples [A si, B, competition], [B, A si, competition], [A si, C, competition], [C, A si, competition] and [B, C, generation processing] as the second training labels of second training sample 2; and so on.
In some embodiments, the training module 250 may train the first extraction model alone.
Specifically, a second training sample with a second training label is input into the initial first extraction model, and the initial first extraction model processes the second training sample to output type A entity triples. The parameters of the initial first extraction model are adjusted according to the difference between the second training label and the output type A entity triples, until the intermediate first extraction model in training meets a preset condition, yielding the trained first extraction model. The preset condition may be that the loss function is smaller than a threshold, that the loss converges, or that the number of training epochs reaches a threshold.
In some embodiments, the training module 250 may train the second extraction model separately.
Specifically, a second training sample with a second training label is input into the initial second extraction model, and the initial second extraction model processes the second training sample to output type B entity triples. The parameters of the initial second extraction model are adjusted according to the difference between the second training label and the output type B entity triples, until the intermediate second extraction model in training meets a preset condition, yielding the trained second extraction model. The preset condition may be that the loss function is smaller than a threshold, that the loss converges, or that the number of training epochs reaches a threshold.
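Both per-model procedures follow the same loop; a generic PyTorch-style sketch is shown below. The model interface, optimizer, and learning rate are assumptions; the description only fixes the stopping conditions (loss below a threshold, convergence, or an epoch budget).

```python
import torch

def train_extraction_model(model, loader, epochs=10, loss_threshold=1e-3):
    """Sketch: adjust parameters from the difference between the second
    training labels and the model's output until a preset condition holds."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):                   # epoch budget as one preset condition
        total = 0.0
        for sample, label in loader:          # second training samples and labels
            optimizer.zero_grad()
            logits = model(sample)            # predicted annotation scores
            loss = loss_fn(logits, label)     # difference from the training label
            loss.backward()
            optimizer.step()
            total += loss.item()
        if total / max(len(loader), 1) < loss_threshold:  # loss-threshold condition
            break
    return model
```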
In some embodiments, the training module 250 may jointly train the first extraction model, the second extraction model, and the scoring model.
Specifically, a second training sample with a second training label is input into the initial first extraction model and the initial second extraction model respectively, and the second training sample is processed by the initial first extraction model, the initial second extraction model and the initial scoring model to obtain the target entity triples output by the initial scoring model. The parameters of the initial first extraction model, the initial second extraction model and the initial scoring model are adjusted according to the difference between the second training label and the target entity triples, until the intermediate first extraction model, intermediate second extraction model and intermediate scoring model in training meet a preset condition, yielding the trained first extraction model, second extraction model and scoring model. The preset condition may be that the loss function is smaller than a threshold, that the loss converges, or that the number of training epochs reaches a threshold.
In some embodiments, the initial first and second extraction models may be models trained on a small number of manually labeled sample entity triples.
Some embodiments of the present description combine a first extraction model with predefined relationships and a second extraction model with open relationships to extract target entity triples from the text to be processed, and at the same time train the first and second extraction models using the target entity triples as training data. On one hand, the two extraction models can learn from each other, so that the extraction results output by the trained models combine higher accuracy with a wider application range; on the other hand, the initial first and second extraction models, trained on a small number of manually labeled sample entity triples, can continue to learn without supervision, saving human resources and the time cost of labeling.
FIG. 4 is an exemplary flow diagram of a method for obtaining at least one type A entity triple using the first extraction model, according to some embodiments of the present description. In particular, the process of fig. 4 may be performed by the type A extraction module 220.
As shown in fig. 5, the first extraction model may include: a first entity extraction layer 510, a first joint coding layer 520, a first annotation sequence layer 530, and an entity identification layer 540.
As shown in fig. 4, a method 400 for obtaining at least one triple of a class a entity using a first extraction model may include:
step 410, a first joint encoding of the first entity and the text to be processed is obtained.
In some embodiments, the first joint encoding layer 520 may obtain the first joint encoding based on the feature vector and the first entity vector of the text to be processed.
The feature vector of the text to be processed may be a vector characterizing features of the text to be processed. In some embodiments, the first entity extraction layer 510 may obtain a feature vector of the text to be processed based on the text to be processed. For a detailed description of the feature vector of the text to be processed, reference may be made to fig. 8 and its related description, which are not repeated herein.
As previously mentioned, the first entity may be an entity extracted from the text to be processed. In some embodiments, the first entity extraction layer 510 may extract the first entities in the text to be processed. Specifically, the first entity extraction layer 510 may obtain the text annotation sequence corresponding to the text to be processed, and then extract the first entities based on the text annotation sequence. As shown in FIG. 5, the first entity extraction layer 510 may obtain the corresponding first entities, such as "A si" and "B", based on the text annotation sequence "O", "B-co", "I-co", "O", …, "B-co", …. For a detailed description of extracting the first entity, reference may be made to the related description of step 320, which is not repeated here.
The first entity vector may be a vector characterizing the features of the first entity. In some embodiments, the first extraction model may obtain the first entity vector corresponding to the first entity based on the word and/or character feature vectors corresponding to the first entity in the feature vector of the text to be processed.
In some embodiments, the first extraction model may pool the word and/or character feature vectors corresponding to the first entity in the feature vector of the text to be processed, so as to obtain the first entity vector. Pooling reduces the size of the data by representing a particular region of the data with, for example, the average, minimum, and/or maximum of the multiple values in that region. Accordingly, in some embodiments, pooling may include, but is not limited to, average pooling, minimum pooling, maximum pooling, and the like.
For example, the first extraction model may average-pool the elements at the same positions of the multiple word and/or character feature vectors corresponding to the first entity, obtaining a first entity vector with the same dimension as each word and/or character feature vector. As shown in fig. 5, the first extraction model may obtain, from the feature vectors [T_a1], [T_a2], [T_b1], [T_b2], … of the text to be processed, the character feature vectors [T_a1] and [T_a2] corresponding to the first entity "A si", and then average the elements at the same positions of [T_a1] and [T_a2] to obtain the first entity vector [T_a] of "A si". For example, if [T_a1] = [2, 4, 6] and [T_a2] = [4, 6, 8], then [T_a] = [3, 5, 7]. For a detailed description of the word and/or character feature vectors, reference may be made to fig. 8 and its related description, which are not repeated here.
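A minimal sketch of this average pooling step, reproducing the worked example above (the NumPy implementation is an illustrative assumption):

```python
import numpy as np

def entity_vector(char_feature_vectors):
    """Average-pool an entity's character feature vectors element-wise."""
    return np.asarray(char_feature_vectors).mean(axis=0)

# The worked example: [T_a1] = [2, 4, 6], [T_a2] = [4, 6, 8].
print(entity_vector([[2, 4, 6], [4, 6, 8]]))  # -> [3. 5. 7.], i.e. [T_a] of "A si"
```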
The first joint encoding may be a vector fusing the features of any one of the first entities and the features of the text to be processed. In some embodiments, the first joint encoding layer 520 may encode any one of the first entity vector and the feature vector of the text to be processed, to obtain the first joint encoding.
In some embodiments, the first joint encoding layer 520 may fuse any one first entity vector with each word and/or character feature vector in the feature vector of the text to be processed, so as to obtain the first joint encoding.
In some embodiments, the fusion may include, but is not limited to, one or a combination of addition, averaging, weighted summation, and the like. For example, as shown in fig. 5, the first joint encoding layer 520 may add the first entity vector [T_a] corresponding to the first entity "A si" to each of [T_a1], [T_a2], [T_b1], [T_b2], … in the feature vector of the text to be processed, obtaining the first joint encoding [U_a1], [U_a2], [U_b1], [U_b2], …, where [U_a1] = [T_a1] + [T_a], [U_a2] = [T_a2] + [T_a], [U_b1] = [T_b1] + [T_a], [U_b2] = [T_b2] + [T_a], ….
Some embodiments of the present specification encode the first entity vector and the feature vector of the text to be processed to obtain the first joint encoding, so that the first joint encoding simultaneously contains the information of the first entity, the information of the text to be processed, and the relationship information between the first entity and the text to be processed, which can improve the accuracy with which the first extraction model subsequently extracts the second entity corresponding to each predefined relationship.
In some embodiments, the first joint encoding layer 520 may be a feed-forward neural network. The feedforward neural network can fuse the first entity vector and the feature vector of the text to be processed through an activation function to obtain a first joint code.
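A sketch of such a feed-forward first joint encoding layer; the additive fusion follows the example above, while the hidden sizes and the ReLU activation are assumptions.

```python
import torch
import torch.nn as nn

class FirstJointEncodingLayer(nn.Module):
    """Add the first entity vector to every token feature vector, then
    pass the sums through a small feed-forward network."""
    def __init__(self, dim):
        super().__init__()
        self.ffn = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, token_vectors, entity_vec):
        # token_vectors: (seq_len, dim); entity_vec: (dim,)
        fused = token_vectors + entity_vec  # [U_i] = [T_i] + [T_a]
        return self.ffn(fused)              # first joint encoding

layer = FirstJointEncodingLayer(dim=8)
print(layer(torch.randn(23, 8), torch.randn(8)).shape)  # torch.Size([23, 8])
```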
It is to be understood that a first entity extracted from the text to be processed may itself serve as the second entity of another first entity under some predefined relationship. Therefore, in some embodiments, after the first entity extraction layer 510 extracts multiple first entities from the text to be processed, it may be determined, based on the front-back order of each first entity relative to the other first entities and the character distance between them, whether the current first entity can take another first entity as its second entity to form a type A entity triple. For example, when a first entity appears at the end of the text and its distance to every other first entity exceeds a preset distance (e.g., 10 characters), it may be determined that this first entity has no corresponding second entity in the text to be processed.
Further, when the judgment result is negative, extracting a second entity corresponding to the predefined relationship based on this first entity is abandoned; when the judgment result is positive, the second entity corresponding to the predefined relationship is extracted based on this first entity and the subsequent steps continue.
Some embodiments of the present description may determine in advance whether each first entity has a second entity corresponding to a predefined relationship, so as to improve the extraction efficiency.
And step 420, acquiring an entity labeling sequence of the text to be processed corresponding to each predefined relationship based on the first joint coding.
As previously described, the predefined relationship may be a predefined relationship based on the entity type.
In some embodiments, each first entity may correspond to at least one predefined relationship. For example, continuing the foregoing example, the predefined relationships corresponding to the entity type "company" of the first entity "A si" may include competition, cooperation, employment, controlled, located, registered, and the like; the predefined relationships corresponding to the first entity "B" may likewise include competition, cooperation, employment, controlled, located, registered, and the like.
The entity annotation sequence may be the result of arranging, in order, the entity annotations corresponding to the words or characters in the text to be processed. In some embodiments, an entity annotation may be used to indicate whether the corresponding word or character in the text to be processed belongs to a second entity. Further, in some embodiments, entity annotations may be used to indicate the words and/or characters corresponding to each predefined relationship. Illustratively, the entity annotations may be further classified into "competition" relationship entity annotations, "cooperation" relationship entity annotations, and so on, based on the type of predefined relationship corresponding to the first entity, so as to further indicate the predefined relationship corresponding to each word or character. Thus, the entity annotation sequence can mark the words or characters in the text to be processed that belong to a second entity, together with the predefined relationship to which they correspond.
In some embodiments, the entity annotations may be at least one of Chinese characters, numbers, letters, symbols, and the like. For example, the first character of a second entity may be represented by B, and a non-first character of a second entity by I. As another example, r1 and r2 may denote the predefined relationships "cooperation" and "competition", respectively.
In some embodiments, each predefined relationship may correspond to its own entity annotation sequence. For example, the entity labels B-r1 or I-r1 may mark the words or characters belonging to a second entity whose predefined relationship in the text to be processed is "cooperation". As another example, the entity labels B-r2 or I-r2 may mark the words or characters belonging to a second entity whose predefined relationship is "competition".
In some embodiments, the first annotation sequence layer 530 can annotate the first joint encoding to obtain the entity annotation sequence of the text to be processed corresponding to each predefined relationship. As shown in FIG. 5, based on the first joint encoding [U_a1], [U_a2], [U_b1], [U_b2], …, the first annotation sequence layer 530 can mark the character "B" in the text to be processed "A si's main competitors are B and C, and B is at the same time a generation processing plant of C" as "B-r2" in entity annotation sequence 2, representing the first character of the second entity corresponding to the predefined relationship "competition".
In some embodiments, the entity annotations may also include non-predefined-relationship annotations, which may likewise be at least one of Chinese characters, numbers, letters, symbols, and the like. Words or characters in the text to be processed that do not belong to a second entity corresponding to the predefined relationship may be marked with the same non-predefined-relationship label. As shown in fig. 5, the first annotation sequence layer 530 marks the words of the text to be processed that do not belong to a second entity corresponding to the predefined relationship "cooperation" with "O" in entity annotation sequence 1. In some embodiments, such words or characters may also be left without any label.
Specifically, the first annotation sequence layer 530 may obtain, based on the first joint encoding, the probability that each word or character in the text to be processed belongs to the second entity corresponding to each predefined relationship and the probability that it does not belong to the second entity of any predefined relationship, and then take the label corresponding to the maximum probability as the entity label of that word or character.
Taking fig. 5 as an example, based on [U_a1] in the first joint encoding, the first annotation sequence layer 530 may find that the probability that the character "A" is the first character of a second entity corresponding to the predefined relationship "cooperation" is 0.2, the probability that it is a non-first character of such a second entity is 0.2, and the probability that it does not belong to a second entity of the predefined relationship "cooperation" is 0.6; the non-predefined-relationship label "O", corresponding to the maximum probability 0.6, is then used as the entity label of this character.
Similarly, as shown in fig. 5, the first annotation sequence layer 530 can obtain the entity annotation of each word or character of the text to be processed "A si's main competitors are B and C, and B is at the same time a generation processing plant of C" in entity annotation sequence 1, arranged in the order of the words or characters in the text to be processed, obtaining entity annotation sequence 1: "O", "O", …, "O", "O"; and likewise the entity annotations in entity annotation sequence 2, obtaining entity annotation sequence 2: "O", "O", …, "O", "B-r2".
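The per-character labeling step is an argmax over label probabilities; a minimal sketch with the probabilities from the worked example (the label set and array layout are illustrative assumptions):

```python
import numpy as np

LABELS = ["B-r1", "I-r1", "O"]  # labels for one predefined relationship, e.g. "cooperation"

def annotate(prob_matrix):
    """Keep, for each character, the label with the highest probability."""
    return [LABELS[i] for i in np.asarray(prob_matrix).argmax(axis=1)]

# Worked example for the character "A": P(B-r1)=0.2, P(I-r1)=0.2, P(O)=0.6.
print(annotate([[0.2, 0.2, 0.6]]))  # -> ['O']
```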
In some embodiments, the first annotation sequence layer 530 can include, but is not limited to, one of an N-Gram Model, a Conditional Random Fields (CRF) Model, and a Hidden Markov Model (HMM).
And step 430, extracting a second entity corresponding to each predefined relationship according to the entity labeling sequence of the text to be processed corresponding to each predefined relationship.
In some embodiments, the entity identification layer 540 may extract the second entity corresponding to each predefined relationship.
Specifically, the entity identification layer 540 may take the words and/or characters corresponding to the entity labels of each predefined relationship in the entity annotation sequence as the words and/or characters of the second entity corresponding to that predefined relationship. For example, as shown in fig. 5, the entity identification layer 540 may determine, based on the entity label "B-r2" in entity annotation sequence 2, that the second entity corresponding to the predefined relationship "competition" in the text to be processed is "B", i.e., the second entity forming a "competition" relationship with the first entity "A si" in the text to be processed may be "B".
It is to be understood that, in some embodiments, a predefined relationship may have no corresponding second entity. For example, as shown in fig. 5, the entity identification layer 540 may determine, based on the non-predefined-relationship labels "O", "O", …, "O" in entity annotation sequence 1, that the predefined relationship "cooperation" has no corresponding second entity in the text to be processed, i.e., there is no second entity forming a "cooperation" relationship with the first entity "A si" in the text to be processed.
Further, the type A extraction module 220 may combine a first entity, a predefined relationship corresponding to the first entity, and the second entity corresponding to that predefined relationship in the text to be processed into a type A entity triple. For example, the type A extraction module 220 may combine the first entity "A si", its predefined relationship "competition", and the corresponding second entity "B" into the group 1 type A entity triple [A si, B, competition].
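Reading the second entities out of an entity annotation sequence and assembling the type A triples reduces to a standard B/I label decode; the tokenization below is an illustrative assumption.

```python
def decode_entities(tokens, labels, tag):
    """Collect the spans marked B-<tag>/I-<tag> in an annotation sequence."""
    entities, current = [], []
    for tok, lab in zip(tokens, labels):
        if lab == f"B-{tag}":
            if current:
                entities.append(" ".join(current))
            current = [tok]
        elif lab == f"I-{tag}" and current:
            current.append(tok)
        else:
            if current:
                entities.append(" ".join(current))
            current = []
    if current:
        entities.append(" ".join(current))
    return entities

tokens = ["A si", "'s", "main", "competitors", "are", "B", "and", "C"]
labels = ["O", "O", "O", "O", "O", "B-r2", "O", "O"]  # entity annotation sequence 2
seconds = decode_entities(tokens, labels, "r2")
print([("A si", s, "competition") for s in seconds])  # [('A si', 'B', 'competition')]
```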
The above embodiment provides one implementation structure of the first extraction model; in some embodiments, the first extraction model may also be implemented as an end-to-end model, such as a BERT-based multi-head selection model, Stanford NLP, or LTP.
Some embodiments of the present description use the first extraction model to obtain the type A entity triples based on predefined relationships, so that the first entity and the second entity in a type A entity triple necessarily approach or satisfy the predefined relationship, which may improve the accuracy of the type A entity triples.
FIG. 6 is an exemplary flow diagram of obtaining multiple type B entity triples using the second extraction model, according to some embodiments of the present description. In particular, the process of fig. 6 may be performed by the type B extraction module 230.
As shown in fig. 7, the second extraction model may include: a second entity extraction layer 710, a label coding layer 720, a second joint coding layer 730, and a second label sequence layer 740.
As shown in fig. 6, the method 600 for obtaining at least one triple of a class B entity may include:
Step 610, adding a first tag and a second tag to each third entity in the text to be processed to obtain a tag text, and obtaining a corresponding tag text representation vector based on the tag text.
As previously mentioned, the third entity may be an entity extracted from the text to be processed.
In some embodiments, the second entity extraction layer 710 may extract a third entity in the text to be processed. Specifically, the second entity extraction layer 710 may obtain a text labeling sequence corresponding to the text to be processed, and extract a third entity based on the text labeling sequence.
As shown in FIG. 7, the second entity extraction layer 710 may obtain the corresponding third entities, such as "A si" and "B", based on the text annotation sequence "O", "B-co", "I-co", "O", …, "B-co", …. For a detailed description of extracting the third entity, reference may be made to the related description of step 330, which is not repeated here.
The first tag and the second tag may be used to indicate a first word and a last word of a third entity, respectively. In some embodiments, the first label and/or the second label may be a number (e.g., 1, 2), a chinese character, a letter (e.g., a, b), or other symbol, and combinations thereof. For example, the first label and the second label of the first third entity in the text to be processed may be "label1" and "label2", respectively, and the first label and the second label of the second third entity may be "label3" and "label4", respectively.
The tag text may be the text to be processed that contains the tag.
In some embodiments, the second extraction model may add the first tag and the second tag immediately before and after each third entity in the text to be processed, respectively, to obtain the tag text. For example, as shown in fig. 7, the second extraction model may add the first tag "label1" and the second tag "label2" before and after the first third entity "A si" in the text to be processed "A si's main competitors are B …", and add the first tag "label3" and the second tag "label4" before and after the second third entity "B", obtaining the tag text "label1 A si label2 's main competitors are label3 B label4 …".
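A minimal sketch of the tag insertion step; the naive string replacement is an assumption made for brevity, whereas a real implementation would use the character offsets reported by the second entity extraction layer.

```python
def add_entity_tags(text, entities):
    """Wrap each third entity with a numbered tag pair:
    label1/label2 for the first entity, label3/label4 for the second, ..."""
    k = 1
    for ent in entities:
        text = text.replace(ent, f"label{k} {ent} label{k + 1}", 1)
        k += 2
    return text

print(add_entity_tags("A si's main competitors are B and C", ["A si", "B"]))
# -> "label1 A si label2's main competitors are label3 B label4 and C"
```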
The tag text representation vector may be a vector characterizing the tag text information.
In some embodiments, the second extraction model may use a word embedding model to obtain the tag text representation vector based on the tag text. For a detailed description of the word embedding model, reference may be made to fig. 8 and its related description, which are not repeated here. As shown in fig. 7, the second extraction model may obtain, based on the tag text, the tag text representation vector [l1], [T_a1], [T_a2], [l2], [T_b1], …, [l3], [T_b2], [l4], ….
And step 620, acquiring a corresponding tag encoding vector based on the tag text representation vector.
The tag encoding vector may be a vector fusing the third-entity information and the information of the text to be processed. It is understood that the tag encoding vector may include encoding vectors corresponding to the text to be processed and to the tags, and each encoding vector in the tag encoding vector incorporates features of the other encoding vectors.
In some embodiments, the tag encoding layer 720 may encode the tag text representation vector to obtain the corresponding tag encoding vector. As shown in fig. 7, the tag encoding layer 720 may encode the tag text representation vector [l1], [T_a1], [T_a2], [l2], [T_b1], …, [l3], [T_b2], [l4], … to obtain the corresponding tag encoding vector [L1], [L_a1], [L_a2], [L2], [L_b1], …, [L3], [L_b2], [L4], ….
The exemplary tag encoding layer 720 may be implemented by a BERT model or a Transformer.
Step 630, acquiring the second joint encoding corresponding to any two third entities according to the tag encoding vector.
The first tag vector may be the vector element corresponding to the first tag in the tag encoding vector. As shown in fig. 7, the first tag vector of the third entity "A si" may be the vector element [L1] corresponding to the first tag "label1" in the tag encoding vector, and the first tag vector of the third entity "B" may be the vector element [L3] corresponding to the first tag "label3".
In some embodiments, the second extraction model may obtain at least one first tag vector corresponding to the at least one first tag in the tag encoding vector. Specifically, the second extraction model may locate each first tag vector by its position in the tag encoding vector, based on the position of the corresponding first tag in the tag text. As shown in fig. 7, the first tag "label1" is at the 1st position in the tag text, so the corresponding first tag vector is the vector element [L1] at the 1st position in the tag encoding vector.
The first tag fusion vector may be a vector fusing the information of any two third entities with the information of the text to be processed.
In some embodiments, the second extraction model may obtain the first tag fusion vector based on the two first tag vectors corresponding to any two third entities. Illustratively, the second extraction model may first splice the two first tag vectors and then map the spliced vector to the first tag fusion vector using a fully connected layer, so that the dimension of the first tag fusion vector is the same as that of either first tag vector.
As shown in fig. 7, the second extraction model may splice the first tag vector [L1] corresponding to the third entity "A si" and the first tag vector [L3] corresponding to the third entity "B", and then map the spliced [L1] and [L3] to the first tag fusion vector [L] using the fully connected layer.
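A sketch of the splice-and-project step that yields the first tag fusion vector [L]; the single fully connected layer follows the description, the dimension is an assumption.

```python
import torch
import torch.nn as nn

class TagFusion(nn.Module):
    """Splice (concatenate) two first tag vectors, then map the result
    back to the original dimension with a fully connected layer."""
    def __init__(self, dim):
        super().__init__()
        self.fc = nn.Linear(2 * dim, dim)

    def forward(self, l_head, l_tail):
        return self.fc(torch.cat([l_head, l_tail], dim=-1))  # -> [L]

fusion = TagFusion(dim=8)
L = fusion(torch.randn(8), torch.randn(8))  # e.g. [L1] of "A si" and [L3] of "B"
print(L.shape)  # torch.Size([8])
```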
The second joint code may be a vector fusing the features of any two third entities and the features of the text to be processed.
In some embodiments, the second joint encoding layer 730 may separately fuse the first tag fusion vector with each vector element in the tag text representation vector to obtain the second joint encoding. In some embodiments, the manner of fusing may include, but is not limited to, a combination of one or more of adding, averaging, weighted summing, and the like.
For example, as shown in fig. 7, the second joint encoding layer 730 may add the first tag fusion vector [L] to each of [l1], [T_a1], [T_a2], [l2], [T_b1], …, [l3], [T_b2], [l4], … in the tag text representation vector, obtaining the second joint encoding [V1], [V_a1], [V_a2], [V2], [V_b1], …, [V3], [V_b2], [V4], …, where [V1] = [l1] + [L], [V_a1] = [T_a1] + [L], [V_a2] = [T_a2] + [L], [V_b1] = [T_b1] + [L], ….
In some embodiments, the second joint encoding layer 730 may be a feed-forward neural network. For a detailed description of the feedforward neural network, refer to step 410, which is not described herein.
Some embodiments of the present description encode the first tag fusion vector and the tag text representation vector to obtain the second joint encoding, so that the second joint encoding simultaneously contains the information of the two third entities, the information of the text to be processed, the relationship information between the two third entities, and the relationship information between the two third entities and the text to be processed, which can improve the accuracy with which the second extraction model subsequently extracts the open relationship between the two third entities.
And step 640, acquiring an open relationship between any two third entities based on the second joint code.
In some embodiments, the second annotation sequence layer 740 can obtain the relationship annotation sequence corresponding to the tag text based on the second joint encoding.
The relationship annotation sequence may be the result of arranging, in order, the relationship labels corresponding to the characters or words in the tag text. In some embodiments, each relationship label may reflect whether the corresponding word and/or character in the tag text is a word and/or character corresponding to an open relationship.
As shown in fig. 7, a relationship label of "O" in the relationship annotation sequence represents invalid or empty; when the corresponding characters and/or words in the tag text belong to an open relationship, the corresponding relationship labels in the relationship annotation sequence may be "B-r" and/or "I-r", where "B-r" and "I-r" denote the first character of an open relationship and a non-first character of an open relationship, respectively.
Specifically, the second annotation sequence layer 740 may obtain, based on the second joint encoding, the probability that each word or character in the tag text belongs to an open relationship and the probability that it does not, and then take the label corresponding to the maximum probability as the relationship label of that word or character.
Taking fig. 7 as an example, based on [V_a1] in the second joint encoding, the second annotation sequence layer 740 may find that the probability that the character "A" is the first character of an open relationship is 0.2, the probability that it is a non-first character of an open relationship is 0.2, and the probability that it does not belong to an open relationship is 0.6; the non-open-relationship label "O", corresponding to the maximum probability 0.6, is then used as the relationship label of this character. For another example, based on [V_c1] in the second joint encoding, the probability that the first character of "competition" is the first character of an open relationship is 0.7, the probability that it is a non-first character of an open relationship is 0.2, and the probability that it does not belong to an open relationship is 0.1; the open-relationship label "B-r", corresponding to the maximum probability 0.7, is then used as the relationship label of this character.
Similarly, the second annotation sequence layer 740 can obtain the relationship label of each word or character in the tag text "label1 A si label2 's main competitors are label3 B label4 …", arranged in the order of the words or characters in the tag text, obtaining the relationship annotation sequence: "O", "O", …, "B-r", "I-r", "O", "O", ….
In some embodiments, the second annotation sequence layer 740 can include, but is not limited to, one of an N-Gram (N-Gram) Model, a Conditional Random Field (CRF) Model, and a Hidden Markov Model (HMM).
In some embodiments, the second extraction model may determine the open relationship between any two third entities in the text to be processed based on the relationship annotation sequence. For example, continuing with fig. 7, the second extraction model may obtain the open relationship "competition" between the third entities "A si" and "B" based on the characters of the word "competition" carrying the relationship labels "B-r" and "I-r" in the relationship annotation sequence corresponding to the text to be processed.
Further, the type B extraction module 230 may compose any two third entities and the open relationship between them into a type B entity triple. For example, the type B extraction module 230 may combine the third entities "A si" and "B" and the open relationship "competition" between them into the group 1 type B entity triple [A si, B, competition].
The above embodiment shows one implementation structure of the second extraction model; in some embodiments, the second extraction model may also be implemented as an end-to-end model, such as a BERT-based multi-head selection model, Stanford NLP, or LTP (Language Technology Platform), a Chinese language analysis toolkit.
Some embodiments of the present description extract any two third entities using the second extraction model and obtain the open relationship between the two third entities based on them, thereby obtaining the type B entity triples.
FIG. 8 is a block diagram of an entity extraction layer according to some embodiments of the present description. In some embodiments, the entity extraction layer may be the first entity extraction layer and/or the second entity extraction layer.
As shown in fig. 8, the entity extraction layer (the first entity extraction layer and/or the second entity extraction layer) may include a word embedding layer 810, a feature extraction layer 820, and a text labeling layer 830.
In particular, word embedding layer 810 may obtain a text vector for the text to be processed.
The text vector of the text to be processed may be a vector characterizing the text information to be processed.
In some embodiments, before the word embedding layer 810 obtains the text vector of the text to be processed, the text to be processed may be preprocessed as follows: add [CLS] before the text to be processed, and divide the sentences in the text to be processed with the separator [SEP]. For example, after processing, the text to be processed "A si's main competitors are B and C, and B is at the same time a generation processing plant of C" becomes "[CLS] A si's main competitors are B and C [SEP] B is at the same time a generation processing plant of C".
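A minimal sketch of this preprocessing; the punctuation-based sentence split is an assumption, since the description does not fix how sentence boundaries are detected.

```python
import re

def preprocess(text):
    """Prepend [CLS] and separate sentences with [SEP]."""
    sentences = [s.strip() for s in re.split(r"[,，。.;；]", text) if s.strip()]
    return "[CLS] " + " [SEP] ".join(sentences)

text = "A si's main competitors are B and C, and B is a generation processing plant of C"
print(preprocess(text))
# -> [CLS] A si's main competitors are B and C [SEP] and B is a generation processing plant of C
```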
In some embodiments, the word embedding layer 810 may obtain corresponding character vectors and position vectors, respectively, based on the text to be processed.
A character vector (token embedding) is a vector characterizing the character information of the text to be processed. As shown in FIG. 8, the 23 characters contained in the text to be processed "A si's main competitors are B and C, and B is at the same time a generation processing plant of C" can be characterized by 23 character vectors [w_a1], [w_a2], [w_b1], [w_b2], …, respectively. For example, the character information of the character "A" may be characterized by the character vector [2, 3]; in a practical application scenario, the dimensionality of the vector representation may be much higher. In some embodiments, the character vectors may be obtained by querying a word vector table or a word embedding model. In some embodiments, the word embedding model may include, but is not limited to: the Word2vec model, the Term Frequency–Inverse Document Frequency (TF-IDF) model, or the SSWE-C (Skip-Gram Based Combined-driven Word Embedding) model.
The position vector (position embedding) is a vector reflecting the position of a character in the text to be processed, e.g., indicating that the character is the 1st character, the 2nd character, and so on, of the text to be processed. In some embodiments, the position vectors of the text to be processed may be obtained by sine-cosine encoding. In some embodiments, a segment vector (segment embedding), reflecting the sentence in which a character is located, may also be included; for example, the character "A" is located in the 1st sentence (segment) of the text to be processed.
In some embodiments, the word embedding layer 810 may fuse, e.g., concatenate or add, the various vectors of the text to be processed to obtain the text vector of the text to be processed. As shown in fig. 8, the word embedding layer 810 may obtain, based on the character vectors [w_a1], [w_a2], [w_b1], [w_b2], … and the position vectors (not shown), the text vector [t_a1], [t_a2], [t_b1], [t_b2], … of the text to be processed.
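A sketch of how the word embedding layer could combine a character vector with a sine-cosine position vector; the random stand-in embedding table and the dimension are assumptions.

```python
import numpy as np

def text_vector(char_ids, dim=16):
    """One text vector per character: character embedding + position encoding."""
    rng = np.random.default_rng(0)
    char_table = rng.normal(size=(1000, dim))  # stand-in for a word vector table
    pos = np.arange(len(char_ids))[:, None]
    i = np.arange(dim)[None, :]
    angle = pos / np.power(10000.0, (2 * (i // 2)) / dim)
    pos_enc = np.where(i % 2 == 0, np.sin(angle), np.cos(angle))
    return char_table[char_ids] + pos_enc      # [t_1], [t_2], ...

print(text_vector([12, 47, 5]).shape)  # (3, 16)
```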
Further, the feature extraction layer 820 may obtain the feature vector of the text to be processed.
The feature vector of the text to be processed may be a vector characterizing features of the text to be processed.
In some embodiments, the feature vector of the text to be processed may include a word feature vector and/or a word feature vector corresponding to each word and/or word in the text to be processed. It is understood that the dimension of the feature vector of the text to be processed may be the same as the number of words and/or phrases in the text to be processed.
In some embodiments, the feature extraction layer 820 may encode the text vector of the text to be processed to obtain the feature vector of the text to be processed. As shown in fig. 8, the feature extraction layer 820 may encode the text vector [t_a1], [t_a2], [t_b1], [t_b2], … to obtain the feature vector [T_a1], [T_a2], [T_b1], [T_b2], … of the text to be processed, where [T_a1], [T_a2], [T_b1], [T_b2], … are the character feature vectors corresponding to the characters "A", "si" and the subsequent characters of the text to be processed, respectively.
An exemplary feature extraction layer may be implemented by a BERT model or a Transformer.
Still further, the text annotation layer 830 can obtain annotation sequences based on the feature vectors.
The text label sequence is a result of arranging a plurality of text labels respectively corresponding to a plurality of characters or a plurality of words in the text to be processed according to a sequence. In some embodiments, the text label may be used to indicate whether a corresponding word or word in the text to be processed belongs to an entity, and further, the text label may be further divided into a company entity label, an industry entity label, and the like, so as to further indicate the entity type to which the corresponding word or word belongs. Therefore, the text labeling sequence can be used for marking the words or the words belonging to the entity in the text to be processed and the entity type belonging to the words or the words.
In some embodiments, the text labels may be at least one of Chinese characters, numbers, letters, symbols, and the like. For example, a first word or initial word of an entity may be represented by B and a non-first word or non-initial word of an entity may be represented by I. As another example, the text labels B-co or I-co may mark words or phrases in the text to be processed that have an entity type of "company principal". As another example, a text label B-ind or I-ind may mark a word or phrase in the text to be processed whose entity type is "industry".
As shown in FIG. 8, based on the feature vectors [T_a1], [T_a2], [T_b1], [T_b2], …, the text annotation layer 830 can mark the entities "A si", "B", … in the text to be processed "A si's main competitors are B and C, and B is at the same time a generation processing plant of C" with the text labels "B-co", "I-co", …, "B-co", …, which respectively represent the first character of a company entity, a non-first character of a company entity, …, the first character of a company entity, and so on.
In some embodiments, the text annotations may also include non-entity annotations, which may likewise be at least one of Chinese characters, numbers, letters, symbols, and the like. Words or characters in the text to be processed that do not belong to an entity may be marked with the same non-entity label. As shown in fig. 8, the text annotation layer 830 marks the words "main competitors are", which do not belong to any entity, with seven "O" labels. In some embodiments, words or characters in the text to be processed that do not belong to an entity may also be left unmarked.
Specifically, the text annotation layer 830 may obtain, based on the feature vectors, the probability that each word or character in the text to be processed belongs to each entity type and the probability that it does not belong to any entity, and then take the entity label of the entity type corresponding to the maximum probability, or the non-entity label, as the text label of that word or character.
Taking FIG. 8 as an example, based on the feature vector [T_a1], the text annotation layer 830 may find that the probability that the character "A" is the first character of a company entity is 0.8, the probability that it is a non-first character of a company entity is 0.5, the probability that it is the first character of a person entity is 0.3, the probability that it is the first character of an industry entity is …, and the probability that it does not belong to any entity is 0.2; the entity label "B-co", the first character of the entity type "company" corresponding to the maximum probability 0.8, is then used as the text label of the character "A".
Similarly, the text annotation layer 830 may obtain the text label of each word or character in the text to be processed "A si's main competitors are B and C, and B is at the same time a generation processing plant of C", arranged in the order of the words or characters in the text to be processed, obtaining the text annotation sequence: "B-co", "I-co", "O", …, "O", "B-co".
In some embodiments, the text annotation layer 830 can include, but is not limited to, one of an N-Gram (N-Gram) Model, a Conditional Random Field (CRF) Model, and a Hidden Markov Model (HMM).
The embodiment of the specification further provides a computer readable storage medium. The storage medium stores computer instructions, and after the computer reads the computer instructions in the storage medium, the computer realizes the text processing method.
The beneficial effects that may be brought by the embodiments of the present description include, but are not limited to: (1) combining a first extraction model with predefined relationships and a second extraction model with open relationships to extract target entity triples from the text to be processed, while training the first and second extraction models using the target entity triples as training data, so that on one hand the two extraction models can learn from each other and the extraction results output by the trained models combine higher accuracy with a wider application range, and on the other hand the initial first and second extraction models, trained on a small number of manually labeled sample entity triples, can continue to learn without supervision, saving human resources and the time cost of labeling; (2) encoding the first entity vector and the feature vector of the text to be processed into the first joint encoding, so that the first joint encoding simultaneously contains the information of the first entity, the information of the text to be processed, and the relationship information between them, improving the accuracy with which the first extraction model extracts the second entity corresponding to each predefined relationship; (3) encoding the first tag fusion vector and the tag text representation vector into the second joint encoding, so that the second joint encoding simultaneously contains the information of the two third entities, the information of the text to be processed, the relationship information between the two third entities, and the relationship information between the two third entities and the text to be processed, improving the accuracy with which the second extraction model extracts the open relationship between the two third entities; (4) screening the target entity triples along multiple dimensions, based on the timeliness of the text to be processed, the number of occurrences of the type A and/or type B entity triples in the text to be processed, and the scoring results of the scoring model, improving the timeliness, practicality, and richness of the target entity triples.
It is to be noted that different embodiments may produce different advantages, and in different embodiments, any one or combination of the above advantages may be produced, or any other advantages may be obtained.
Having thus described the basic concept, it will be apparent to those skilled in the art that the foregoing detailed disclosure is to be regarded as illustrative only and not as limiting the present specification. Various modifications, improvements and adaptations to the present description may occur to those skilled in the art, although not explicitly described herein. Such modifications, improvements and adaptations are proposed in the present specification and thus fall within the spirit and scope of the exemplary embodiments of the present specification.
Also, the description uses specific words to describe embodiments of the description. Reference to "one embodiment," "an embodiment," and/or "some embodiments" means a feature, structure, or characteristic described in connection with at least one embodiment of the specification. Therefore, it is emphasized and should be appreciated that two or more references to "an embodiment" or "one embodiment" or "an alternative embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, some features, structures, or characteristics of one or more embodiments of the specification may be combined as appropriate.
Moreover, those skilled in the art will appreciate that aspects of the present description may be illustrated and described in terms of several patentable categories or situations, including any new and useful combination of processes, machines, manufacture, or materials, or any new and useful modification thereof. Accordingly, aspects of this description may be performed entirely by hardware, entirely by software (including firmware, resident software, micro-code, etc.), or by a combination of hardware and software. The above hardware or software may be referred to as "data block," module, "" engine, "" unit, "" component, "or" system. Furthermore, aspects of the present description may be represented as a computer product, including computer readable program code, embodied in one or more computer readable media.
For each patent, patent application publication, and other material, such as articles, books, specifications, publications, documents, etc., cited in this specification, the entire contents of each are hereby incorporated by reference into this specification. Except where the application history document does not conform to or conflict with the contents of the present specification, it is to be understood that the application history document, as used herein in the present specification or appended claims, is intended to define the broadest scope of the present specification (whether presently or later in the specification) rather than the broadest scope of the present specification. It is to be understood that the descriptions, definitions and/or uses of terms in the accompanying materials of this specification shall control if they are inconsistent or contrary to the descriptions and/or uses of terms in this specification.
Finally, it should be understood that the embodiments described herein are merely illustrative of the principles of the embodiments of the present disclosure. Other variations are also possible within the scope of the present description. Thus, by way of example, and not limitation, alternative configurations of the embodiments of the specification can be considered consistent with the teachings of the specification. Accordingly, the embodiments of the present description are not limited to only those explicitly described and depicted herein.

Claims (8)

1. A method of text processing, the method comprising:
acquiring a text to be processed;
extracting a first entity from the text to be processed by using a first extraction model, fusing a vector of any one first entity with each character and/or word feature vector in the feature vector of the text to be processed respectively to obtain a first joint code of the first entity and the text to be processed, obtaining an entity tagging sequence of the text to be processed corresponding to each predefined relationship based on the first joint code, and extracting a second entity corresponding to each predefined relationship from the text to be processed according to the entity tagging sequence of the text to be processed corresponding to each predefined relationship to obtain at least one entity triplet of type A; the entity labels are used for indicating characters and/or words corresponding to the predefined relation in the text to be processed; each said class A entity triplet including said first entity, said second entity and a predefined relationship between said first entity and said second entity;
extracting a plurality of third entities from the text to be processed by using a second extraction model, adding a first label and a second label to each third entity in the text to be processed to obtain a label text, and obtaining a corresponding label text expression vector based on the label text; acquiring a corresponding label coding vector based on the label text representation vector; acquiring second combined codes corresponding to any two third entities according to the label coding vectors, and determining an open relationship between any two third entities based on the second combined codes to acquire a plurality of B-type entity triples; wherein the first tag and the second tag are to indicate a first word and a last word of the third entity, respectively; each type B entity triplet comprises two third entities and an open relationship between the two third entities;
and acquiring target entity triples from the type A entity triples and the type B entity triples based on a screening rule.
2. The method of claim 1, further comprising:
and taking the text to be processed as a training sample, taking the target entity triple as a training label, and training the first extraction model and/or the second extraction model.
3. The method of claim 1, wherein the obtaining of the second joint codes corresponding to any two third entities according to the tag coding vector comprises:
obtaining at least one first label vector corresponding to at least one first label in the label coding vectors;
acquiring a first label fusion vector based on any two first label vectors corresponding to any two third entities;
and acquiring second joint codes corresponding to any two third entities based on the first label fusion vector and the label coding vector.
4. The method of claim 1, the first and/or second extraction models comprising one or more of the following models: BERT, Transformer, Stanford NLP, or LTP.
5. The method of claim 1, the filtering rule comprising:
acquiring the target entity triplet based on the timeliness of the text to be processed corresponding to the type A entity triplet and/or the type B entity triplet;
acquiring the target entity triples based on the occurrence times of the type A entity triples and/or the type B entity triples in the text to be processed; and/or
And obtaining the target entity triples according to the scoring results of the type A entity triples and/or the type B entity triples by the scoring model.
6. The method of claim 1, the first, second, and/or third entities being financial entities of a type comprising a company, a person, an industry, a metric, a value, and an address.
7. A text processing system comprising:
the text acquisition module is used for acquiring a text to be processed;
the type A extraction module is used for extracting a first entity from the text to be processed by using a first extraction model; fusing a vector of any one first entity with each character and/or word feature vector in the feature vector of the text to be processed to obtain a first joint code of the first entity and the text to be processed; obtaining, based on the first joint code, an entity tagging sequence of the text to be processed for each predefined relationship; and extracting, according to the entity tagging sequence for each predefined relationship, a second entity corresponding to that predefined relationship from the text to be processed, so as to obtain at least one type A entity triple; wherein the entity tags indicate the characters and/or words in the text to be processed that correspond to the predefined relationship, and each type A entity triple comprises the first entity, the second entity, and a predefined relationship between the first entity and the second entity;
the type B extraction module is used for extracting a plurality of third entities from the text to be processed by using a second extraction model; adding a first label and a second label to each third entity in the text to be processed to obtain a label text, and obtaining a corresponding label text representation vector based on the label text; obtaining a corresponding label coding vector based on the label text representation vector; obtaining, according to the label coding vector, a second joint code corresponding to any two third entities, and determining an open relationship between those two third entities based on the second joint code, so as to obtain a plurality of type B entity triples; wherein the first label and the second label indicate the first word and the last word of a third entity, respectively, and each type B entity triple comprises two third entities and the open relationship between them; and
the screening module is used for obtaining the target entity triples from the type A entity triples and the type B entity triples based on the screening rule.
8. A computer-readable storage medium storing computer instructions, wherein, when the computer instructions in the storage medium are read by a computer, the computer performs the text processing method of any one of claims 1 to 6.
CN202210433223.1A 2022-04-24 2022-04-24 Text processing method, system and storage medium Active CN114528418B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210433223.1A CN114528418B (en) 2022-04-24 2022-04-24 Text processing method, system and storage medium

Publications (2)

Publication Number Publication Date
CN114528418A (en) 2022-05-24
CN114528418B (en) 2022-10-14

Family

ID=81628023

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210433223.1A Active CN114528418B (en) 2022-04-24 2022-04-24 Text processing method, system and storage medium

Country Status (1)

Country Link
CN (1) CN114528418B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115620722B (en) * 2022-12-15 2023-03-31 广州小鹏汽车科技有限公司 Voice interaction method, server and computer readable storage medium
CN116226408B (en) * 2023-03-27 2023-12-19 中国科学院空天信息创新研究院 Agricultural product growth environment knowledge graph construction method and device and storage medium

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140032209A1 (en) * 2012-07-27 2014-01-30 University Of Washington Through Its Center For Commercialization Open information extraction
CN105138507A (en) * 2015-08-06 2015-12-09 电子科技大学 Pattern self-learning based Chinese open relationship extraction method
CN107783960B (en) * 2017-10-23 2021-07-23 百度在线网络技术(北京)有限公司 Method, device and equipment for extracting information
CN110737758B (en) * 2018-07-03 2022-07-05 百度在线网络技术(北京)有限公司 Method and apparatus for generating a model
CN111428493A (en) * 2020-03-06 2020-07-17 中国平安人寿保险股份有限公司 Entity relationship acquisition method, device, equipment and storage medium
WO2021253238A1 (en) * 2020-06-16 2021-12-23 Baidu.Com Times Technology (Beijing) Co., Ltd. Learning interpretable relationships between entities, relations, and concepts via bayesian structure learning on open domain facts
CN114372454A (en) * 2020-10-14 2022-04-19 腾讯科技(深圳)有限公司 Text information extraction method, model training method, device and storage medium
CN113887211A (en) * 2021-10-22 2022-01-04 中国人民解放军战略支援部队信息工程大学 Entity relation joint extraction method and system based on relation guidance

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110781683A (en) * 2019-11-04 2020-02-11 河海大学 Entity relation joint extraction method
CN111027324A (en) * 2019-12-05 2020-04-17 电子科技大学广东电子信息工程研究院 Method for extracting open type relation based on syntax mode and machine learning
CN113011189A (en) * 2021-03-26 2021-06-22 深圳壹账通智能科技有限公司 Method, device and equipment for extracting open entity relationship and storage medium
CN113051356A (en) * 2021-04-21 2021-06-29 深圳壹账通智能科技有限公司 Open relationship extraction method and device, electronic equipment and storage medium
CN113779358A (en) * 2021-09-14 2021-12-10 支付宝(杭州)信息技术有限公司 Event detection method and system
CN114385812A (en) * 2021-12-24 2022-04-22 思必驰科技股份有限公司 Relation extraction method and system for text

Also Published As

Publication number Publication date
CN114528418A (en) 2022-05-24

Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant