CN114372454B - Text information extraction method, model training method, device and storage medium

Text information extraction method, model training method, device and storage medium

Info

Publication number
CN114372454B
CN114372454B (application CN202011098112.7A)
Authority
CN
China
Prior art keywords
data
text
processed
model
relation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011098112.7A
Other languages
Chinese (zh)
Other versions
CN114372454A (en)
Inventor
张倩汶
闫昭
张士卫
饶孟良
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202011098112.7A priority Critical patent/CN114372454B/en
Publication of CN114372454A publication Critical patent/CN114372454A/en
Application granted granted Critical
Publication of CN114372454B publication Critical patent/CN114372454B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/205 Natural language analysis: parsing
    • G06F16/288 Entity relationship models (relational databases)
    • G06F16/35 Clustering; classification (unstructured textual data)
    • G06F16/367 Ontology (creation of semantic tools, e.g. ontology or thesauri)
    • G06F40/295 Named entity recognition (phrasal analysis, recognition of textual entities)
    • G06F40/30 Semantic analysis
    • G06N3/045 Combinations of networks (neural network architecture, e.g. interconnection topology)
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G06N3/08 Learning methods (computing arrangements based on biological models)

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biophysics (AREA)
  • Animal Behavior & Ethology (AREA)
  • Machine Translation (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The application discloses a text information extraction method, a model training method, a text information extraction device, a model training device and a storage medium. Subject data in a text to be processed and relation data associated with the subject data are obtained by identifying the acquired text to be processed; the text to be processed, the subject data and the relation data are then input into an object extraction model, which identifies the objects in the text to be processed and obtains the object data corresponding to the subject data and the relation data; finally, triplet data are generated from the subject data, the relation data and the object data.

Description

Text information extraction method, model training method, device and storage medium
Technical Field
The application relates to natural language processing technology, and in particular to a text information extraction method, a model training method, a device and a storage medium.
Background
With the development of artificial intelligence (AI) technology and the increasing demand for its application in specific fields, research into applying artificial intelligence to specific fields such as medicine has advanced. Natural language processing (NLP) is an important branch of artificial intelligence. Within natural language processing, the construction of a knowledge graph plays an important role in artificial intelligence applications; for example, artificial intelligence can use a knowledge graph to complete search, question answering and similar tasks.
A knowledge graph is composed of relationships between many entity pairs, for example SPO triplet data: a triplet composed of an entity pair (subject S and object O) and the relationship (P) between them. The SPO triplet data in a knowledge graph can be widely applied to knowledge question answering, search and recommendation products. For the work of building a knowledge graph, extracting SPO triplet data from massive amounts of text is an important and basic task.
Existing SPO triplet extraction models are usually based on pre-trained models such as word2vec, OpenAI-GPT, ELMo (Embeddings from Language Models) and BERT (Bidirectional Encoder Representations from Transformers). These models first extract the entities and then determine the relationships between them in order to extract the triples in a text. This works well for entity-relation extraction in short sentences, but as text length increases, the diversity of entity types in the text increases correspondingly, and a single entity may participate in several relationships at once; the performance of the prior art in extracting SPO triplet data from long, entity-dense texts is therefore poor.
Disclosure of Invention
In order to solve at least one of the above technical problems, the present application provides a text information extraction method, a model training method, a text information extraction device, a model training device and a computer readable storage medium, so as to improve the accuracy of text triplet information extraction.
According to a first aspect of the present application, there is provided a text information extraction method including the steps of:
acquiring a text to be processed;
identifying the text to be processed to obtain subject data in the text to be processed and relationship data associated with the subject data;
inputting the text to be processed, the subject data and the relation data into an object extraction model, and identifying an object in the text to be processed to obtain object data corresponding to the subject data and the relation data;
And generating triplet data according to the subject data, the relation data and the object data.
According to a second aspect of the present application, there is provided a model training method comprising the steps of:
acquiring training text data, and determining triplet information of the training text data, wherein the triplet information comprises third subject data, third object data and third relation data, and the third relation data is relation attribute of the third subject data and the third object data;
Inputting the training text data, the third subject data and the third relation data into an object extraction model, identifying the object in the training text data, obtaining an object identification result, and correcting parameters of the object extraction model according to the object identification result and the third object data.
According to a third aspect of the present application, there is provided a text information extracting apparatus comprising:
the first acquisition unit is used for acquiring the text to be processed;
the first recognition unit is used for recognizing the text to be processed to obtain subject data in the text to be processed and relation data associated with the subject data;
the first object identification unit is used for inputting the text to be processed, the subject data and the relation data into an object extraction model, and identifying the object in the text to be processed to obtain object data corresponding to the subject data and the relation data;
and the generation unit is used for generating triplet data according to the subject data, the relation data and the object data.
According to a fourth aspect of the present application, there is provided a model training apparatus comprising:
The second acquisition unit is used for acquiring training text data and determining triplet information of the training text data, wherein the triplet information comprises third subject data, third object data and third relation data, and the third relation data is a relation attribute of the third subject data and the third object data;
And the model training unit is used for inputting the training text data, the third subject data and the third relation data into an object extraction model, identifying the object in the training text data to obtain an object identification result, and correcting parameters of the object extraction model according to the object identification result and the third object data.
According to a fifth aspect of the present application, there is provided a text information extracting apparatus comprising:
At least one memory for storing a program;
and the at least one processor is used for loading the program to execute the text information extraction method.
According to a sixth aspect of the present application, there is provided a model training apparatus comprising:
At least one memory for storing a program;
at least one processor for loading the program to perform the model training method described above.
According to a seventh aspect of the present application, there is provided a storage medium storing a program which, when executed by a processor, implements the above-described text information extraction method or the above-described model training method.
The embodiment of the application has the beneficial effects that:
Subject data in the text to be processed and relation data associated with the subject data are obtained by identifying the text to be processed; the text to be processed, the subject data and the relation data are input into an object extraction model, which identifies the objects in the text to be processed and obtains the object data corresponding to the subject data and the relation data; triplet data are then generated from the subject data, the relation data and the object data. Unlike existing methods and devices, which first identify all entities and then predict the relationship between entity pairs, the present application first identifies the subject and its corresponding relation, and then extracts the object from the text to be processed according to the identified subject and relation. Relations in the text to be processed are thereby identified more comprehensively and objects more accurately, the problem of diversified object types can be handled, and the accuracy of triplet identification and the precision of triplet-based recommendation are improved.
Drawings
FIG. 1 is a schematic diagram of a convolutional neural network algorithm of the prior art;
FIG. 2 is a schematic illustration of an implementation environment provided by an embodiment of the present application;
FIG. 3 is a flowchart of a text information extraction method according to an embodiment of the present application;
FIG. 4 is a flowchart of an exemplary method of one embodiment of step 320 of FIG. 3;
FIG. 5 is a flowchart of a specific method of step 420 of FIG. 4;
FIG. 6 is a flowchart of an embodiment of the method of step 330 of FIG. 3;
FIG. 7 is a flowchart of an exemplary method of one embodiment of step 320 of FIG. 3;
FIG. 8 is a flowchart of a specific method of step 710 of FIG. 7;
FIG. 9 is a flowchart of a specific method of one embodiment of step 330 of FIG. 3;
FIG. 10 is a flowchart of a specific method of step 910 of FIG. 9;
FIG. 11 is a schematic structural diagram of an RSO model;
FIG. 12 is a schematic structural diagram of an SRO model;
FIG. 13 is a schematic diagram of classification of text to be processed;
FIG. 14 is a schematic diagram of an information extraction system according to an embodiment of the present application;
FIG. 15 is a flowchart of a text information extraction method according to an embodiment of the present application;
FIG. 16 is a flow chart of a model training method provided by an embodiment of the present application;
FIG. 17 is a schematic diagram of an extraction interface of the knowledge-graph extraction system according to an embodiment of the present application;
FIG. 18 is a schematic diagram of an extraction result interface of the knowledge-graph extraction system according to an embodiment of the present application;
FIG. 19 is a schematic diagram of a knowledge editing interface of the knowledge-graph extraction system according to an embodiment of the present application.
Detailed Description
The application will be further described with reference to the drawings and specific examples. The described embodiments should not be taken as limitations of the present application, and all other embodiments that would be obvious to one of ordinary skill in the art without making any inventive effort are intended to be within the scope of the present application.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is to be understood that "some embodiments" can be the same subset or different subsets of all possible embodiments and can be combined with one another without conflict.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the application only and is not intended to be limiting of the application.
Before describing embodiments of the present application in further detail, the terms involved in the embodiments of the present application are explained as follows.
1) Entity: something that is distinguishable and independent in the real world, such as: name of person, place, game name, etc.
2) Entity type: refers to a collection of entities having the same attributes. For example, in the travel domain, entities can be divided into types such as: scenic spot level, suitable season of a scenic spot, location of a scenic spot, scenic spot name, scenic spot opening time, and so on. For example, "the Palace Museum" is an entity of the type "scenic spot name"; "Beijing" is an entity of the type "location of the scenic spot".
3) Relation extraction: a relationship is defined between two or more entities, and relation extraction is the identification of that relationship by learning the semantic relations between entities in text. The input of relation extraction is a segment or sentence of text, and the output is typically a triplet: <entity 1, relation, entity 2>. For example, for the text to be processed "the Palace Museum is in Beijing", the output triplet after relation extraction is <Palace Museum, address, Beijing>, which can also be expressed as "address(Palace Museum, Beijing)". Of course, in some cases the two entities may have no relationship, which can be represented as <entity 1, NA, entity 2>.
4) SPO triples: knowledge graphs store knowledge in the form of triples, i.e. "subject (S), predicate (P, the relation), object (O)", where the subject and object are typically named entities and the predicate is typically an attribute. Knowledge-graph question-answer data are composed of questions and corresponding answers, where a question contains a head entity and a relation and the answer contains the tail entity, so the triplet information can be expressed as <subject, relation, object>. The process of identifying the knowledge-graph relation corresponding to a question is referred to as relation matching. An open-domain relation matching data set (Q-R) is a data set formed by matching the questions Q of open-domain question-answer data with the relations R of the corresponding knowledge graph. Because the open domain has more complete data accumulation, acquiring an open-domain relation matching data set provides a data basis for the subsequent construction of a knowledge-graph relation matching model in, for example, the game domain.
5) Knowledge graph: a knowledge carrier represented by a graph data structure that describes the things of the objective world and their interrelations; the nodes represent things of the objective world, the edges represent the relations among things, and a piece of knowledge is represented by an SPO triplet.
6) word2vec, OpenAI-GPT, Recurrent Neural Networks (RNN), Long Short-Term Memory (LSTM), ELMo (Embeddings from Language Models) and BERT (Bidirectional Encoder Representations from Transformers) are common models in the natural language processing field.
7) Named Entity Recognition (NER), also known as entity Recognition, entity chunking, and entity extraction, is a subtask of information extraction, aimed at locating and classifying named entities in text into predefined categories such as people, organizations, locations, temporal expressions, numbers, monetary values, percentages, etc.
8) Natural language processing (NLP) is a branch discipline of artificial intelligence and linguistics. The field studies how natural language is processed and used; natural language processing basically includes cognition, understanding, generation and similar steps. Natural language cognition and understanding let a computer turn the input language into meaningful symbols and relations and then reprocess them according to a purpose; a natural language generation system converts computer data into natural language.
9) Structured data, which is logically expressed and implemented by a two-dimensional table structure, strictly follows the data format and length specifications, and is mainly stored and managed through a relational database.
10) Unstructured data: data that is not structured by a predefined data model or schema. Typical unstructured data include text files, emails, social media, website data, mobile data, communication data, and the like.
11) Semi-structured data: data that has a certain structure which requires further analysis to obtain, such as encyclopedia data, web page data, etc.
12) Error propagation: if an error occurring within a function is not caught by the function itself, the error is thrown to the outer calling function; if the outer function does not catch it either, the error is thrown up the function call chain until it is caught by the high-level programming engine, and the code terminates execution.
The text information extraction method and the model training method provided by the embodiments of the present application can be applied to artificial intelligence. Artificial intelligence (AI) is the theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain optimal results. In other words, artificial intelligence is an integrated technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
Artificial intelligence technology is a comprehensive subject covering a wide range of fields, involving both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technology mainly comprises computer vision, speech processing, natural language processing, and machine learning/deep learning.
An artificial intelligence cloud service is also commonly referred to as AIaaS (AI as a Service). This is currently the mainstream service mode for artificial intelligence platforms; specifically, an AIaaS platform splits several common AI services and provides independent or packaged services in the cloud. This service mode is similar to an AI-themed mall: all developers can access one or more of the artificial intelligence services provided by the platform through an API interface, and some experienced developers can also use the AI framework and AI infrastructure provided by the platform to deploy, operate and maintain their own proprietary cloud artificial intelligence services.
Natural language processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies the theories and methods that enable effective communication between people and computers in natural language. Natural language processing is a science that integrates linguistics, computer science and mathematics; research in this field involves natural language, i.e. the language people use daily, so it is closely related to the study of linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering, knowledge graph techniques, and the like.
Machine learning (ML) is a multi-domain interdiscipline involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory and other disciplines. It specially studies how a computer can simulate or implement the learning behaviour of humans to acquire new knowledge or skills, and reorganize existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way of giving computers intelligence, and it is applied throughout all areas of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from instruction.
A knowledge graph is a semantic network that reveals the relationships between entities and formally describes real-world things and their interrelations. The term is now used broadly to refer to various large-scale knowledge bases. Knowledge graphs have strong semantic processing and open interconnection capabilities and great potential in fields such as natural language processing and artificial intelligence. They provide a more effective way to express, organize, manage and use the massive, heterogeneous and dynamic big data on the Internet, making the network more intelligent and closer to human cognitive thinking. To form high-quality knowledge, quality assessment of the knowledge graph is also required; its significance is that by quantifying the credibility of each piece of knowledge, retaining knowledge with higher confidence and discarding knowledge with lower confidence, the quality of the knowledge is effectively ensured.
At present, knowledge graphs are applied in intelligent search, deep question answering, social networks and some vertical industries, and have become a power source supporting the development of these applications. In vertical industries such as medical care, finance and e-commerce, industry knowledge graphs are constructed from industry-specific data, giving the knowledge graph specific industrial significance.
The three types of data sources for information extraction are structured data, semi-structured data and unstructured data. Vast amounts of real-world data exist in unstructured form, for example history books, government documents, encyclopedias and news reports. For machines to conduct question answering, dialogue and search more intelligently, this information must be converted into structured knowledge, and that conversion is inseparable from the strong support of information extraction technology. Unstructured knowledge extraction is the process of automatically finding and extracting structured information from plain text data, and it is an important link in constructing a large-scale knowledge graph. When building open knowledge-graph construction capability, data collection is very difficult: information is mixed, the structured and semi-structured data that an operator can quickly provide are limited, and the large volume of unstructured text similar to encyclopedia introductions would otherwise require a large amount of manually annotated knowledge. Automatic extraction capability is therefore very important; in other words, a tool is needed that can extract the triplet data hidden in unstructured text. The purpose of the unstructured text information extraction task is to extract all of an entity's triplet data from the large amount of text corresponding to that entity.
One current approach in industry is to extract entity relations in the text to be processed through designed rules, e.g., "X is Y" or "Y comprises X". By setting rules this can act accurately on a vertical scenario, but language cannot be exhaustively enumerated, coverage is low, and rule conflicts or redundancy arise easily; there is an overall effect, but it is far from sufficient.
There are also statistical machine learning methods, whose performance depends on how good the extracted features are; the impact of feature engineering on the overall model is very large. Feature extraction in turn depends on the output of existing NLP systems, relying on the part-of-speech tagging, syntactic analysis and similar outputs of NLP tools, under which the accumulation and transfer of errors is unavoidable. Take convolutional neural networks (CNN) as an example. Referring to fig. 1, the CNN algorithm has a three-layer structure: word representation (Word Representation), feature extraction (Feature Extraction) and output (Output). The first layer is the word representation layer, through which word tokens are converted into word vectors; for example, a semi-supervised word representation method or the neural network model word2vec can be used here. The second layer is the feature extraction layer, which extracts lexical-level and sentence-level features and concatenates them directly to form the final features. The third layer is the output layer, where the features are passed through a softmax classifier (a layer of logistic regression) to obtain the confidence of the various relations, and the relation with high confidence is taken as the relation between the two marked nouns. As can be seen, this approach first labels all entities and then predicts the relationship between entity pairs, which can suffer from error propagation.
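For concreteness, the prior-art three-layer CNN classifier of fig. 1 might look roughly as follows in PyTorch; the dimensions and layer sizes are illustrative assumptions, not taken from the patent.

import torch
import torch.nn as nn

class CNNRelationClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim=100, num_filters=128, num_relations=10):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)          # word representation layer
        self.conv = nn.Conv1d(embed_dim, num_filters, kernel_size=3)  # feature extraction layer
        self.fc = nn.Linear(num_filters, num_relations)               # output layer

    def forward(self, token_ids):
        x = self.embedding(token_ids).transpose(1, 2)   # (batch, embed_dim, seq_len)
        x = torch.relu(self.conv(x)).max(dim=2).values  # pool sentence-level features
        return torch.softmax(self.fc(x), dim=-1)        # confidence of each relation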
The solution provided by the embodiments of the present application relates to artificial intelligence technologies such as knowledge graphs, machine learning and deep learning, and is specifically described by the following embodiments.
FIG. 2 is a schematic diagram of an implementation environment provided by an embodiment of the present application. Referring to fig. 2, the implementation environment includes a server 201 and a terminal 202.
The server 201 may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs (Content Delivery Networks), big data and artificial intelligence platforms.
The server 201 has at least a quality evaluation function for knowledge graphs: it can quantify the credibility of the knowledge tuples in a knowledge graph to obtain the confidence of each knowledge tuple, and the accuracy of the knowledge graph is ensured by keeping tuples with higher confidence and discarding those with lower confidence. The knowledge-graph construction function and the function of providing background services to the terminal 202 based on the knowledge graph may be realized by the server 201 or by another server associated with the server 201. In the embodiments of the present application, the description takes as an example the case in which the server 201 provides the knowledge-graph construction function, the quality evaluation function, and the function of providing background services based on the knowledge graph.
Terminal 202 may be, but is not limited to, a smartphone, tablet, notebook computer, desktop computer, smart speaker, smart watch, etc. Optionally, a client, such as a browser client, a medical client or a shopping client, runs on terminal 202. The terminal 202 and the server 201 may be directly or indirectly connected through wired or wireless communication, which is not limited in the present application. Optionally, server 201 provides background services to the clients running on terminal 202 based on the knowledge graph.
In an alternative implementation, server 201 provides services such as attraction introduction, attraction information search, attraction guidance, etc., to tour guide class clients running on terminal 202 based on the attraction knowledge graph. Taking the scenic spot information search as an example, a user inputs search information through a tour guide client on the terminal 202 and performs search operation; the terminal 202 acquires search information in response to a search operation, and transmits a search request carrying the search information to the server 201; based on the search information in the received search request, server 201 queries the entity corresponding to the search information, the relationship between the entities, the attribute of the entity, and the like from the sight point knowledge base, obtains a search result, and returns the search result to terminal 202.
In another alternative implementation, server 201 provides services such as intelligent customer service and intelligent merchandise recommendation to shopping clients running on terminal 202 based on the merchandise knowledge graph. Taking intelligent customer service as an example, a user opens a customer service chat interface through a shopping client on the terminal 202, inputs question information and performs a customer service inquiry operation; the terminal 202 responds to the customer service inquiry operation, acquires the question information and sends an inquiry request carrying the question information to the server 201; based on the question information in the received request, the server 201 queries the entity corresponding to the question information, the relationships between entities, the attributes of the entity, and the like from the commodity knowledge graph, obtains an answer, and returns the answer to the terminal 202.
Fig. 3 is a flowchart of a text information extraction method according to an embodiment of the present application. In the embodiment of the present application, a server is taken as an execution body for illustration, and referring to fig. 3, the embodiment includes the following steps 310 to 340.
In step 310, the text to be processed is obtained.
In this step, the text to be processed may include only one sentence, a passage composed of a plurality of sentences, or even an article composed of a plurality of paragraphs. The text to be processed may be retrieved from the Internet, entered by a local input device, or read from a memory; for example, the text to be processed may be automatically searched for and retrieved from the Internet.
And 320, performing recognition processing on the text to be processed to obtain subject data in the text to be processed and relationship data associated with the subject data.
In this step, a trained model may be used to perform named entity recognition (NER) on the text to be processed to obtain its subject data and relationship data. For example, an existing BERT prediction model may be used to identify the subject data of the text to be processed and its relationship data. The subject data and the relationship data can be identified at the same time, or identified by two prediction models respectively; for example, the entities and entity types can be identified by a NER tool, and the relations of the entities then identified by a BERT prediction model. For example, for the text to be processed "Zhang San's birth place is Hong Kong", the identified subject data and relationship data are: <Zhang San, birth place>. For the text to be processed "the well-known Palace Museum is located in Beijing, which is also the capital of China", the identified subject data and relationship data are: <Palace Museum, address>, <Beijing, country>.
And 330, inputting the text to be processed, the subject data and the relationship data into an object extraction model, and identifying the object in the text to be processed to obtain object data corresponding to the subject data and the relationship data.
In this step, the object extraction model is a machine learning model trained in advance and used to identify the objects in the text to be processed. In an embodiment, a BERT prediction model may be used as the object extraction model; the text to be processed, the subject data and the relationship data are input into the BERT prediction model to obtain the object data corresponding to the input subject data and relationship data. For example, the text to be processed "Zhang San's birth place is Hong Kong" and the subject and relationship data <Zhang San, birth place> identified in step 320 are input into the BERT prediction model for calculation, and the model identifies the object data "Hong Kong" corresponding to <Zhang San, birth place>. Likewise, for the text to be processed "the well-known Palace Museum is located in Beijing, which is also the capital of China", the subject and relationship data <Palace Museum, address> and <Beijing, country> are input into the BERT prediction model; the object data "Beijing" corresponding to <Palace Museum, address> is identified, and the object data "China" corresponding to <Beijing, country> is identified.
And 340, generating triplet data according to the subject data, the relation data and the object data.
In this step, since step 320 recognized the subject data and the relationship data of the text to be processed and step 330 recognized the object data, triplet data can be constructed from the recognized subject data, relationship data and object data. For example, for the text to be processed "Zhang San's birth place is Hong Kong", the identified subject data, relationship data and object data generate the triplet information <Zhang San, birth place, Hong Kong>; for the text to be processed "the well-known Palace Museum is located in Beijing, which is also the capital of China", the generated triplet information includes <Palace Museum, address, Beijing> and <Beijing, country, China>. The generated triplet information may be output, stored in a database of the knowledge-graph extraction system or in the server 201 or terminal 202 shown in fig. 2, or displayed in a display interface of the knowledge-graph extraction system.
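As an illustration of steps 310 to 340, the following sketch outlines the extraction flow in Python; the model objects and their predict methods are hypothetical placeholders, since the patent does not prescribe a concrete programming interface.

def extract_triplets(text, subject_relation_model, object_model):
    # Step 320: identify subjects and their associated relations, e.g.
    # [("Zhang San", "birth place")] for "Zhang San's birth place is Hong Kong".
    subject_relation_pairs = subject_relation_model.predict(text)
    triplets = []
    for subject, relation in subject_relation_pairs:
        # Step 330: the object model receives the text together with the
        # already-identified subject and relation and returns the object span.
        obj = object_model.predict(text, subject, relation)
        # Step 340: assemble the SPO triplet.
        triplets.append((subject, relation, obj))
    return triplets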
According to the technical solution provided by the embodiments of the present application, subject data in the text to be processed and relation data associated with the subject data are obtained by identifying the acquired text to be processed; the text to be processed, the subject data and the relation data are then input into the object extraction model, which identifies the objects in the text to be processed and obtains the object data corresponding to the subject data and the relation data; finally, the triplet data are generated from the subject data, the relation data and the object data.
Referring to fig. 4, step 320 is further explained; it specifically includes steps 410 to 420. In steps 410 to 420 of the present embodiment, the relationship data of the text to be processed is identified first, and the subject data in the text to be processed is identified afterwards. Specifically:
And 410, inputting the text to be processed into a relation extraction model, identifying relations among all entities in the text to be processed, and determining relation data in the text to be processed.
In this step, a BERT prediction model is used as the basis for relationship judgment. To save training resources, this embodiment fine-tunes a pre-trained Google model (fine-tuning is a means of transfer learning: a model pre-trained by others is trained a second time for one's own application). The BERT prediction model is implemented with a normalized exponential (softmax) classifier composed of an input layer, a first hidden layer, a second hidden layer and an output layer. The (n+m)-dimensional vector is normalized by the softmax classifier and finally mapped into a z-dimensional output vector; the softmax classifier in effect maps the input vector to the classification result. The output result is normalized by the softmax classifier and the labels are then one-hot encoded, so that the relationship data are formed from one or more relations marked in the text to be processed. For example, an activation function may be used to obtain the probability of each label, and the distance between the predicted labels and the true labels then serves as a loss term when predicting the relationship data of the text to be processed. For example, for the text to be processed "AAAA is a web novel written by BBBB and serialized on CCCC", after identification and classification by the BERT prediction model, the relationship data corresponding to the text to be processed is marked or output as "author".
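For illustration, the following is a minimal sketch of the relation classification in step 410, assuming a HuggingFace-style fine-tuned BERT classifier; the checkpoint path and the relation label list are hypothetical, and a sigmoid is used here because a sentence may express several relations (the embodiment above describes a softmax classifier).

import torch
from transformers import BertTokenizer, BertForSequenceClassification

RELATION_LABELS = ["author", "birth place", "address"]  # hypothetical label set

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertForSequenceClassification.from_pretrained(
    "path/to/finetuned-relation-model",  # hypothetical fine-tuned checkpoint
    num_labels=len(RELATION_LABELS),
)

def predict_relations(text, threshold=0.5):
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.sigmoid(logits).squeeze(0)  # per-relation confidence
    return [RELATION_LABELS[i] for i, p in enumerate(probs) if p > threshold]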
And step 420, inputting the text to be processed and the relation data into a subject extraction model, and identifying the subjects in the text to be processed to obtain the subject data corresponding to the relation data.
In this step, similarly to step 410, the text to be processed and the relationship data are input into the BERT prediction model, the output of the softmax classifier is normalized and the labels are one-hot encoded, and the subject data are formed from one or more subjects marked in the text to be processed. For example, the text to be processed "AAAA is a web novel written by BBBB and serialized on CCCC" and the relationship "author" are input into the BERT prediction model; after identification and classification by the BERT prediction model, the subject data corresponding to the text to be processed is marked or output as "AAAA".
According to the technical solution provided by this embodiment of the application, the relation data and the subject data of the text to be processed are extracted by the relation extraction model and the subject extraction model respectively. Since each model is trained on a different extraction target, the performance in identifying subjects and relations is good and the identification accuracy can be improved; once the relation is determined, the extraction of the subject and the object becomes clearer.
Referring to fig. 5, a further explanation of step 420 is provided for an embodiment of the present application, and step 420 specifically includes steps 510 to 520.
And 510, merging the relation data into the text to be processed to generate a second text.
In this step, the relationship data identified in step 410 is merged into the text to be processed; the relationship data may be added to the beginning or the end of the text to be processed, or inserted into its middle. In this embodiment, the identified relationship data is merged into the end of the text to be processed to generate the second text, i.e. the relationship data is spliced onto the end of the text to be processed. For example, for the text to be processed "AAAA is a web novel written by BBBB and serialized on CCCC" identified in step 410, the relationship data "author" is added to the end of the text to form the second text "AAAA is a web novel written by BBBB and serialized on CCCC author".
And step 520, inputting the second text to the subject extraction model, and identifying the subject in the second text to obtain subject data corresponding to the relationship data.
In this embodiment, the second text generated in step 510 is fed as input into a subject extraction model, for example a pre-trained BERT prediction model, to identify the subject in the second text; the output of the softmax classifier is normalized and the labels are one-hot encoded, so that one or more subjects are marked in the text to be processed. The data may be processed using the text vectorization tool Tokenizer: the Tokenizer generates a dictionary and represents statistics such as word frequency information as vectors. The Tokenizer may be defined to mark the text, e.g. with predefined token_labels:
token_labels:["[Padding]","[##WordPiece]","[CLS]","[SEP]","B-SUB","I-SUB","O"];
where [Padding] corresponds to the 0 position, the subject start bit is B-SUB and the following bits are I-SUB, [CLS] corresponds to the beginning of the sentence, [SEP] corresponds to the end of the sentence, and [##WordPiece] is used to label the words split by the tokenizer, which typically start with "##". For example, the recognition result of the second text generates the following token identification sentence:
O B-SUB I-SUB I-SUB I-SUB O O O O O O O O O O O O O O O O O O O O O O;
Therefore, from this recognition result it can be determined that the subject data identified by the BERT prediction model is "AAAA".
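The following sketch shows how the token identification sentence of step 520 can be decoded into subject strings under the B-SUB/I-SUB/O scheme described above; the function is illustrative and not taken from the patent.

def decode_subjects(tokens, labels):
    # tokens: wordpiece tokens of the second text; labels: predicted tags.
    subjects, current = [], []
    for token, label in zip(tokens, labels):
        if label == "B-SUB":  # a new subject span starts here
            if current:
                subjects.append("".join(current))
            current = [token]
        elif label == "I-SUB" and current:  # continue the current span
            current.append(token.lstrip("#"))  # strip wordpiece "##" markers
        else:  # any other tag closes the open span
            if current:
                subjects.append("".join(current))
                current = []
    if current:
        subjects.append("".join(current))
    return subjects

# e.g. decode_subjects(["A", "A", "A", "A", "is", ...],
#                      ["B-SUB", "I-SUB", "I-SUB", "I-SUB", "O", ...]) -> ["AAAA"]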
In one embodiment, several relationship data may be identified for the text to be processed in step 410. For example, for the text to be processed "In 2011, the 36-year-old Zhang Mou played Chu Liuxiang in The New Legend of Chu Liuxiang", step 410 may identify several relationship data: leading actor, work, and character played. When several relations are identified for the text to be processed, one second text is generated for each relation; i.e. step 510 further comprises: combining each relationship data with the text to be processed respectively to generate a plurality of second texts. For example, for the relationship data identified above (leading actor, work, character played), each relationship data generates one second text, i.e. the following three texts are generated in total:
1) In 2011, the 36-year-old Zhang Mou played Chu Liuxiang in The New Legend of Chu Liuxiang leading actor;
2) In 2011, the 36-year-old Zhang Mou played Chu Liuxiang in The New Legend of Chu Liuxiang work;
3) In 2011, the 36-year-old Zhang Mou played Chu Liuxiang in The New Legend of Chu Liuxiang character played.
Step 520 is then executed on each of these second texts, and a plurality of subject data are correspondingly obtained; the following token identification sentences are output:
1) O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O B-SUB I-SUB I-SUB O O O O O O O O O O O O O O O O O O O O O O O O O O;
2) O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O B-SUB I-SUB I-SUB I-SUB I-SUB O O O O O O O O O O O O O O O O O O O O O O;
3) O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O B-SUB I-SUB I-SUB O O O O O O O O O O O O O O O O O O O O O O O O.
As can be seen, the subject data are marked at different positions in the different second texts: the position marked in token identification sentence 1 corresponds to the subject data "Zhang Mou", the position in sentence 2 corresponds to "The New Legend of Chu Liuxiang", and the position in sentence 3 corresponds to "Chu Liuxiang". Since the output structure can be recognized through the predefined Tokenizer, the subject data of each second text can be extracted.
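The one-second-text-per-relation procedure just described can be sketched as follows; extract_subjects stands for the subject extraction of step 520 and is a hypothetical wrapper.

def subjects_for_relations(text, relations, extract_subjects):
    pairs = []
    for relation in relations:
        second_text = text + relation  # step 510: splice the relation onto the end
        for subject in extract_subjects(second_text):
            pairs.append((subject, relation))
    return pairs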
Referring to fig. 6, in one embodiment of the application, step 330 is further described, where step 330 specifically includes steps 610 through 620.
At step 610, the subject data and the relationship data are combined into the text to be processed to generate a first text.
In this step, the subject data and the relationship data identified in step 320 are combined into the text to be processed to generate the first text; the subject data and the relationship data may be added to the beginning or the end of the text to be processed, or inserted into its middle. In one embodiment, the relationship data identified in step 410 and the subject data identified in step 420 may be merged into the text to be processed to generate the first text. The subject data identified in step 520 may also be added to the second text of step 510 to generate the first text.
For example, in the above embodiment the text to be processed "AAAA is a web novel written by BBBB and serialized on CCCC" forms in step 510 the second text "AAAA is a web novel written by BBBB and serialized on CCCC author"; step 520 identifies the second text to obtain the subject data "AAAA", and merging the subject data "AAAA" into the second text generates the first text "AAAA is a web novel written by BBBB and serialized on CCCC author AAAA".
And 620, inputting the first text into the object extraction model, and identifying the object in the first text to obtain object data corresponding to the subject data and the relation data.
In this embodiment, the first text generated in step 610 is fed as input into an object extraction model, for example a pre-trained BERT prediction model, to identify the objects in the first text; the output of the softmax classifier is normalized and the labels are one-hot encoded, so that one or more objects are marked in the text to be processed. The data may be processed using the text vectorization tool Tokenizer, defined with predefined token_labels:
token_labels:["[Padding]","[##WordPiece]","[CLS]","[SEP]","B-OBJ","I-OBJ","O"];
where [Padding] corresponds to the 0 position, the object start bit is B-OBJ and the following bits are I-OBJ, [CLS] corresponds to the beginning of the sentence, [SEP] corresponds to the end of the sentence, and [##WordPiece] is used to label the words split by the tokenizer, which typically start with "##".
For example, the recognition result of the first text generated in step 610 is the following token identification sentence:
O O O O O O O B-OBJ I-OBJ I-OBJ I-OBJ O O O O O O O O O O O O O O O O O O;
Therefore, it can be determined from this recognition result that the object data identified by the BERT prediction model is "BBBB".
The BERT prediction models used in step 620 and step 520 are trained by converting the training data into token identification sentences one by one according to the subjects and relations in the training data, splicing the subject name and the attribute name onto the end of each sentence, and generating one row of object token identification sentence for each subject-attribute pair. In addition, the BERT prediction model can adopt the following formula to obtain the optimal sequence labeling result:

ŷ = argmax_y p(y | X)

where X is the sentence input, ŷ is the optimal label sequence, and p(y | X) is the confidence of each candidate label sequence y.
In the above embodiments, the result of each sub-model is fed as auxiliary information into the subsequent model: the recognition result of the relation extraction model is input into the subject extraction model, and the recognition results of the subject extraction model and of the relation extraction model are input into the object extraction model. The two entity extraction models (for subject and object) use the context and the auxiliary information packed into a single sequence, with the parts separated by a [SEP] token, and a [CLS] token is placed at the beginning of each sequence as the collective sequence representation for the relationship classifier. In addition, there may be multiple relations in one sentence, so relation classification can be treated as a multi-label learning task, and the focal loss function can be used to address the imbalance between positive and negative samples. This method reduces the cost of traversing the relation set and can quickly find the relations relevant to the text among a large number of candidate relations. Once the relation is determined, the extraction of the subject and the object becomes clearer.
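For reference, a sketch of the focal loss mentioned above for the multi-label relation classifier, following the standard formulation; the alpha and gamma values are illustrative assumptions, not taken from the patent.

import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    # logits, targets: (batch, num_relations); targets are 0/1 multi-hot vectors.
    probs = torch.sigmoid(logits)
    p_t = probs * targets + (1 - probs) * (1 - targets)  # prob. of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    # Down-weight easy examples so the rare positive relations dominate the loss.
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()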
Referring to fig. 7, step 320 is further explained; it specifically includes steps 710 to 720. In steps 710 to 720 of this embodiment, the entity data of the text to be processed is identified first, and the relationship data in the text to be processed is identified afterwards. Specifically:
And step 710, inputting the text to be processed into a subject extraction model, identifying the subjects of the text to be processed, and determining the subject data existing in the text to be processed.
In this step, the subject data in the text to be processed is identified by a pre-trained subject extraction model. The subject data can be predicted and identified by a BERT prediction model, identified by a NER (Named Entity Recognition) tool such as Stanford NLP, or identified by a machine reading comprehension (MRC) model built on the BERT prediction model. For example, for the text "Zhang San's birth place is Hong Kong", the subject data "Zhang San" is recognized by the MRC model.
And step 720, inputting the subject data into a relation extraction model, and identifying the relation of each subject in the subject data to obtain relation data.
In this step, the subject data identified in step 710 is input into a relationship extraction model to identify the relationship of each subject, where the relationship extraction model may be a classification model, and a classification manner may be adopted to obtain possible relationships between each entity.
Referring to fig. 8, step 710 is further explained; it specifically includes steps 810 to 820.
Step 810, obtaining an entity question set, wherein the entity question set comprises a plurality of preset entity questions, and the entity questions are used for inquiring entities existing in the text to be processed.
In this step, an entity question set is preset, where the entity question set includes T different questions Q = {Q1, Q2, ..., QT}, where T is a positive integer. These questions are used to query entities. The questions can be structured like {Which words are person entities?}, {Which words are attraction entities?}, and so on.
Step 820, inputting the text to be processed and each entity question in the entity question set into a subject extraction model, identifying the entity corresponding to each entity question in the text to be processed, and taking the identified entities as subjects to obtain subject data.
In this embodiment, an MRC model is used for recognition: the text to be processed is input into the MRC model, and the questions in the entity question set are answered in turn based on the MRC model. By answering the entity questions in sequence, the candidate entity spans of the subject are located in the input context, and the final subject is selected with a weighted voting strategy, for example:

a* = argmax_a Σ_{t=1..T} w_t · 1(a_t = a)

where w_t and a_t respectively represent the confidence and the answer corresponding to the t-th question.
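As an illustrative sketch (not code from the patent), the weighted voting over MRC answers might be implemented as follows; the question wording and the answer_question helper are hypothetical stand-ins for the MRC model.

```python
from collections import defaultdict

# Hypothetical entity question set Q = {Q1, ..., QT}; the wording is illustrative.
ENTITY_QUESTIONS = [
    "Which words are person entities?",
    "Which words are attraction entities?",
]

def vote_subject(context, answer_question):
    """Select the final subject a* = argmax_a sum_t w_t * 1(a_t = a).

    answer_question(context, question) -> (answer_span, confidence)
    is assumed to wrap the MRC model.
    """
    scores = defaultdict(float)
    for question in ENTITY_QUESTIONS:
        answer, confidence = answer_question(context, question)
        scores[answer] += confidence  # accumulate w_t for each candidate a_t
    return max(scores, key=scores.get)
```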
In one embodiment of the present application, step 720 is further described. Step 720 specifically includes: inputting the subject data into a relation extraction model, classifying each subject in the subject data through the relation extraction model, and determining the relations existing for each subject to obtain the relation data. For example, the entities recognized in the text to be processed "Zhang San's birth place is Hong Kong" include "Zhang San" and "Hong Kong", and the relation between the two entities is recognized as "birth place" by the classification model. The classification model may use the formula Pr(relation = r_k | e_i) = σ(W_r · h_i + b_r), which yields the set of all attribute relations whose probability exceeds a threshold under the condition of subject e_i, where W_r is a weight matrix and b_r is a bias vector.
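A minimal sketch of such a per-subject relation classifier, under stated assumptions: the hidden size, relation count, and threshold are illustrative, and h_i is assumed to be the subject's encoder representation.

```python
import torch
import torch.nn as nn

class RelationClassifier(nn.Module):
    """Pr(relation = r_k | e_i) = sigmoid(W_r · h_i + b_r) for every relation r_k."""

    def __init__(self, hidden_size=768, num_relations=50):
        super().__init__()
        self.linear = nn.Linear(hidden_size, num_relations)  # W_r and b_r

    def forward(self, h_i, threshold=0.5):
        # h_i: (hidden_size,) representation of subject e_i
        probs = torch.sigmoid(self.linear(h_i))  # (num_relations,)
        # keep every relation whose probability exceeds the threshold
        return (probs > threshold).nonzero(as_tuple=True)[0]
```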
Referring to fig. 9, step 330 is further explained; step 330 specifically includes steps 910 to 920.
Step 910, generating a third text from the combination of the subject data and the relationship data.
In this step, the subject data and the relation data are combined. In one embodiment, the entity data identified in step 710 and the relation data identified in step 720 may be combined; in this embodiment, the subject data and the relation data are combined into the third text according to a template in a preset format.
Step 920, inputting the text to be processed and the third text into the object extraction model, and identifying the object in the text to be processed according to the third text to obtain object data corresponding to the subject data and the relation data.
In this step, the MRC model may be used for recognition; that is, the text to be processed is input into the MRC model, and the corresponding object data is recognized in turn according to the third text based on the MRC model.
Referring to fig. 10, step 910 is further explained in an embodiment of the present application; step 910 specifically includes steps 1010 to 1020.
Step 1010, obtaining a question template corresponding to the relation data, wherein the question template comprises a subject mark position corresponding to the relation data.
In this step, a corresponding question template is determined for the identified subject data and relation data. When the relation data is in a preset relation data set, the question template corresponding to the relation data is obtained from a preset question template set; question templates are preset in the preset relation data set and are looked up by relation data. For example, if the subject data and relation data identified in the above steps are <Zhang San, birth place>, the relation "birth place" has a corresponding question template {In which city was [person name] born?}, and this question template is acquired. When the relation data is not in the preset relation data set, a general question template is acquired; the general question template contains a subject mark position and a relation mark position, and the relation data is filled into the relation mark position of the general question template to generate the question template. For example, if the identified subject data and relation data are <the Palace Museum, address> and the relation data "address" has no corresponding question template in the preset relation data set, the general question template is acquired; it expresses a fuzzy question, for example {What is the [attribute name] of [subject name]?}, where [subject name] is the subject mark position and [attribute name] is the relation mark position. Substituting the relation data into the general question template generates the question template {What is the address of [subject name]?}.
In one embodiment, when the identified relation data has no corresponding question template in the preset relation data set, the user may be prompted to input a custom question template for the relation data. One way is to prompt the user during the execution of step 1010, for example by showing the relation data without a corresponding question template in a window of the terminal, so that the user inputs a custom question template to improve the accuracy of the current recognition. Another way is to store the relation data that has no preset question template, so that the user can later fill in a custom question template or supplement the corresponding question template in the preset relation data set according to the stored relation data. For example, for the non-predefined relation "publisher", a custom question template such as {Which publisher published [subject name]?} or {What is the publication unit of the novel [subject name]?} can be added in the above way.
Step 1020, filling the subject data into the subject mark position of the question template to generate the natural language question.
In this step, the subject data is filled into the subject mark position of the question template by insertion or replacement: insertion is used when the subject mark position is, for example, a character offset in the question template, and replacement is used when the subject mark position is a specific marker in the question template. For example, in the question template {In which city was [person name] born?}, [person name] is the marker, so the subject data replaces [person name]; filling the subject data and relation data <Zhang San, birth place> into the question template generates the natural language question {In which city was Zhang San born?}. For another example, with the subject data and relation data <the Palace Museum, address> and the question template {What is the address of [subject name]?} from step 1010, the generated natural language question is {What is the address of the Palace Museum?}.
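A minimal sketch of steps 1010 to 1020, assuming templates are stored in a dictionary keyed by relation; the template texts and marker strings are illustrative assumptions.

```python
# Hypothetical preset question template set; the wording is illustrative.
PRESET_TEMPLATES = {
    "birth place": "In which city was [subject name] born?",
}
GENERAL_TEMPLATE = "What is the [attribute name] of [subject name]?"

def build_question(subject, relation):
    """Steps 1010-1020: pick a template for the relation, then fill in the subject."""
    template = PRESET_TEMPLATES.get(relation)
    if template is None:
        # non-preset relation: fill the relation mark position of the general template
        template = GENERAL_TEMPLATE.replace("[attribute name]", relation)
    # step 1020: replace the subject mark position with the subject data
    return template.replace("[subject name]", subject)

print(build_question("Zhang San", "birth place"))
# -> In which city was Zhang San born?
print(build_question("the Palace Museum", "address"))
# -> What is the address of the Palace Museum?
```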
In one embodiment, referring to fig. 10, the step 920 specifically includes the following steps:
Step 1030, inputting the text to be processed and the natural language question into the object extraction model, and identifying the object data corresponding to the subject data and the relation data from the text to be processed according to the natural language question.
In this step, the object extraction model is a Machine Reading Comprehension (MRC) model. Because the MRC model is trained on a large amount of open-domain corpora, it can handle not only the extraction of predefined relations but also the extraction of non-predefined attributes. The text to be processed and the natural language question determined in step 1020 are input into the MRC model for recognition, so that the MRC model extracts the object data from the text to be processed based on the natural language question. For example, for the natural language question {In which city was Zhang San born?}, the object data "Hong Kong" is identified from the corresponding text to be processed "Zhang San's birth place is Hong Kong"; for the natural language question {What is the address of the Palace Museum?}, the object data "Beijing city" is identified from the corresponding text to be processed "the famous Palace Museum is located in Beijing city".
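As an illustration of such MRC-based object extraction (not the patent's own model), an extractive question-answering pipeline could be used; the checkpoint name below is an assumption.

```python
from transformers import pipeline

# Hypothetical checkpoint; any extractive QA model fine-tuned for MRC would do.
qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")

context = "Zhang San's birth place is Hong Kong"
question = "In which city was Zhang San born?"

result = qa(question=question, context=context)
# result["answer"] is the extracted object span, result["score"] its confidence
print(result["answer"], result["score"])
```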
In this embodiment, non-preset relation extraction is realized based on subject extraction, relation data extraction, and the use of question templates, which improves the recognition performance for non-preset relations of the text to be processed.
Referring to fig. 11, an information extraction model is provided in an embodiment of the present application. The model is composed, in order, of a relation (R, Relation) sub-model, a subject (S, Subject) sub-model, and an object (O, Object) sub-model, and is referred to as the RSO model. In one embodiment, each of the sub-models has a feature generation layer, a feature fusion layer, and an enhancement layer for executing the prediction model algorithm.
The RSO model performs steps 410 to 420 in fig. 4, steps 510 to 520 in fig. 5, and steps 610 to 620 in fig. 6 of the text information extraction methods of the above embodiments. The relation sub-model is the relation extraction model of step 410 in fig. 4, the subject sub-model is the subject extraction model of step 420 in fig. 4, and the object sub-model is the object extraction model of step 620 in fig. 6. With the RSO model, after the relation set of the sentence is compressed, subject extraction and object extraction are performed in turn, which solves the overlap problem.
Referring to fig. 12, an information extraction model is provided in an embodiment of the present application. The model is composed, in order, of a subject (S, Subject) sub-model, a relation (R, Relation) sub-model, and an object (O, Object) sub-model, and is referred to as the SRO model. In one embodiment, each of the sub-models has a feature generation layer, a feature fusion layer, and an enhancement layer for executing the prediction model algorithm.
The SRO model performs steps 710 to 720 in fig. 7, steps 810 to 820 in fig. 8, steps 910 to 920 in fig. 9, and steps 1010 to 1030 in fig. 10 of the text information extraction methods of the above embodiments. The subject sub-model is the subject extraction model of step 710 in fig. 7, the relation sub-model is the relation extraction model of step 720 in fig. 7, and the object sub-model is the object extraction model of step 920 in fig. 9. The SRO model is based on subject extraction and the use of question templates, and realizes the extraction of both preset and non-preset relations.
Both the RSO model and the SRO model support diversified object extraction. By setting the parameters of the RSO model and the SRO model, both models perform the last extraction step in a long-answer form for object extraction, which gives stronger adaptability to diversified object types.
Referring to fig. 13, three types of text to be processed are shown: general text, entity pair overlap EPO (EntityPairOverlap) text, and single entity overlap SEO (SingleEntityOverlap) text. EPO text means that the same entity pair has multiple relations; for example, the text to be processed "The director of movie A is Zhang San, who is also the dubbing actor of this animation" contains two sets of triplet information <movie A, director, Zhang San> and <movie A, dubbing actor, Zhang San>. SEO text means that an entity repeats across multiple triplets of a sentence; for example, the text to be processed "The famous Palace Museum is in Beijing city, which is also the capital of China" contains two sets of triplet information <the Palace Museum, address, Beijing city> and <Beijing city, country, China>. The rest is general text, such as "Zhang San's birth place is Hong Kong", which contains one set of triplet information <Zhang San, birth place, Hong Kong>. The solution provided in this embodiment naturally handles the overlap problem on the basis of the subject extraction and relation prediction results. The RSO model first performs relation classification and then extracts the subject data and the object data in turn in sequence labeling mode; it reduces the cost of traversing the relation set and finds the relations related to the text among a large number of candidate relation sets. Once the relation is determined, the extraction of the subject and the object becomes clearer, and the overlap problem of the text to be processed is handled better. The SRO model first extracts the subject data and then completes the extraction of the relation and the object with a diversified questioning mechanism; it uses natural language questions to enhance the model's understanding of the relation, and uses question templates to make non-preset relation extraction possible in this scenario.
In order to improve the accuracy of the generated triplet information, the RSO model and the SRO model may be integrated in a larger system. Referring to the information extraction system shown in fig. 14, the two models run in parallel. It should be noted that running in parallel only means that the RSO model and the SRO model run independently; it does not require the two models to run simultaneously. The RSO model and the SRO model may run at the same time or one after the other, for example the RSO model first and then the SRO model, or vice versa. As in the information extraction system shown in fig. 14, the input text to be processed is processed in parallel by the RSO model and the SRO model respectively, each performing the text information extraction method of the above embodiments.

As shown in fig. 14, each sub-model of the RSO model (in order, the relation sub-model, the subject sub-model, and the object sub-model) and each sub-model of the SRO model (in order, the subject sub-model, the relation sub-model, and the object sub-model) has a feature generation layer and a feature fusion and enhancement layer for executing the prediction model algorithm. The feature generation layer may use BERT (Bidirectional Encoder Representations from Transformers) or part-of-speech tags (POS tags); both are common models in the natural language processing field. The feature fusion and enhancement layer includes Highway layers: the feature vector produced by the feature generation layer is mapped through the Highway layers to obtain the mapping vector of the text to be processed. More than one Highway layer may be arranged; two Highway layers are arranged in fig. 14. The feature fusion and enhancement layer also includes a BiLSTM-CRF model, which mainly consists of a BiLSTM (Bidirectional Long Short-Term Memory) layer and a CRF (Conditional Random Field) layer. The word vectors corresponding to the words of the text to be processed serve as the model input; the BiLSTM layer predicts the probability of each category for each word, these per-category probabilities are then fed to the CRF layer, and the CRF layer applies a dynamic programming method such as the Viterbi algorithm to the predicted probabilities to determine and label the category to which each word finally belongs.

Referring to fig. 14, the training data is input into the RSO model and the SRO model for training, and the test cases are used to test the trained RSO model and SRO model until the recognition rates of both models meet the requirements. For the SRO model, the entity question set also needs to be configured, because the entities present in the text to be processed are queried in the form of entity questions.
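A condensed sketch of the feature fusion and enhancement layer described above: Highway layers over encoder features feeding a BiLSTM whose emissions are decoded by a CRF. The layer sizes are assumptions, and the CRF relies on the third-party pytorch-crf package rather than anything specified in the patent.

```python
import torch
import torch.nn as nn
from torchcrf import CRF  # third-party pytorch-crf package

class HighwayLayer(nn.Module):
    """y = T(x) * H(x) + (1 - T(x)) * x, a gated mix of transform and carry."""

    def __init__(self, dim):
        super().__init__()
        self.transform = nn.Linear(dim, dim)
        self.gate = nn.Linear(dim, dim)

    def forward(self, x):
        t = torch.sigmoid(self.gate(x))
        h = torch.relu(self.transform(x))
        return t * h + (1.0 - t) * x

class FusionEnhancementLayer(nn.Module):
    """Two Highway layers, then BiLSTM emissions decoded by a CRF."""

    def __init__(self, dim=768, hidden=256, num_tags=9):
        super().__init__()
        self.highway = nn.Sequential(HighwayLayer(dim), HighwayLayer(dim))
        self.bilstm = nn.LSTM(dim, hidden, bidirectional=True, batch_first=True)
        self.emissions = nn.Linear(2 * hidden, num_tags)
        self.crf = CRF(num_tags, batch_first=True)

    def forward(self, features):  # features: (batch, seq_len, dim) from BERT
        x = self.highway(features)
        x, _ = self.bilstm(x)
        # Viterbi decoding over per-tag scores yields the final label sequence
        return self.crf.decode(self.emissions(x))
```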
With the information extraction system shown in fig. 14, the first triplet data produced by the RSO model and the second triplet data produced by the SRO model are obtained. If the first triplet data and the second triplet data are identical, one copy of the triplet data is kept, for example the first triplet data is used as the final result; if they are different, both the first triplet data and the second triplet data are kept and together used as the target triplet data. The target triplet data is output to the knowledge graph, optionally after a manual review.
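As a sketch of this merge strategy (assuming triplets are represented as (subject, relation, object) tuples), a set union keeps identical triplets once and differing triplets side by side:

```python
def merge_triplets(rso_triplets, sro_triplets):
    """Combine the RSO and SRO outputs into target triplet data.

    Identical triplets are kept once; differing triplets are all kept.
    """
    return sorted(set(rso_triplets) | set(sro_triplets))

rso = [("Zhang San", "birth place", "Hong Kong")]
sro = [("Zhang San", "birth place", "Hong Kong"),
       ("the Palace Museum", "address", "Beijing city")]
print(merge_triplets(rso, sro))
```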
Referring to fig. 15, a text information extraction method using the RSO model and the SRO model in parallel is provided in this embodiment; the embodiment includes the following steps 1510 to 15100.
In step 1510, the text to be processed is obtained.
In this step, the text to be processed may be acquired in the same manner as in the above embodiments; for example, it may be acquired from the Internet, input from a local input device, or read from a memory. The text to be processed is the common input of the RSO model and the SRO model running in parallel; that is, the text to be processed input to the RSO model and to the SRO model described below is identical.
In step 1520, the relationships existing between the entities in the text to be processed are identified by the first relationship extraction model, so as to obtain first relationship data.
In this step, the first relation extraction model is the relation sub-model in the RSO model, and the first relation data may be the relation data identified by the RSO model from the text to be processed. For example, step 410 shown in fig. 4 in the above embodiments may be performed.
Step 1530, based on the first relationship data, identifying the subject in the text to be processed through the first subject extraction model, thereby obtaining first subject data.
In this step, the first subject extraction model is the subject sub-model in the RSO model, and the first subject data may be the subject data obtained by the RSO model performing subject recognition on the text to be processed. For example, step 420 shown in fig. 4 in the above embodiment, or steps 510 to 520 shown in fig. 5, may be performed.
In step 1540, the object in the text to be processed is identified by the first object extraction model based on the first relationship data and the first subject data, so as to obtain first object data.
In this step, the first object extraction model is the object sub-model in the RSO model, and the first object data may be the object data obtained by the RSO model performing object recognition on the text to be processed. For example, step 330 shown in fig. 3 in the above embodiment, or steps 610 to 620 shown in fig. 6, may be performed.

Step 1550, generating first triplet data from the first relation data, the first subject data, and the first object data.
Step 1560, identifying the subject in the text to be processed through the second subject extraction model to obtain second subject data.
In this step, the second subject extraction model is the subject sub-model in the SRO model, and the second subject data may be the subject data identified by the SRO model from the text to be processed. For example, step 710 described in the above embodiment in fig. 7, or steps 810 to 820 as shown in fig. 8, may be performed.
Step 1570, based on the second subject data, identifying the relationship in the text to be processed through the second relationship extraction model to obtain second relationship data.
In this step, the second relation extraction model is the relation sub-model in the SRO model, and the second relation data may be the relation data identified by the SRO model from the text to be processed. For example, step 720 of the embodiment described above with respect to fig. 7 may be performed.
Step 1580, based on the second relation data and the second subject data, identifying the object in the text to be processed through a second object extraction model to obtain second object data.
In this step, the second object extraction model is an object sub-model in the SRO model, and the second object data may be object data identified by the SRO model for the text to be processed. For example, step 330 shown in fig. 3 in the above embodiment, or steps 910 to 920 shown in fig. 9 may be performed.
Step 1590, generating second triplet data according to the second relation data, the second subject data, and the second object data.
Step 15100, generating target triplet data according to the first triplet data and the second triplet data.
In this step, the first triplet data and the second triplet data are further processed to generate the target triplet data, where the first triplet data is the output of the RSO model and the second triplet data is the output of the SRO model. A policy may be used to obtain the target triplet data: for example, if the first triplet data and the second triplet data are identical, one copy is kept and the first triplet data is used as the final result; if they are different, both the first triplet data and the second triplet data are kept and together used as the target triplet data.
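A sketch of steps 1510 to 15100 under stated assumptions: run_rso and run_sro are hypothetical wrappers around the two pipelines, each returning a set of (subject, relation, object) tuples.

```python
from concurrent.futures import ThreadPoolExecutor

def extract_triplets(text, run_rso, run_sro):
    """Run the RSO and SRO pipelines on the same text and merge their outputs."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        rso_future = pool.submit(run_rso, text)  # steps 1520-1550
        sro_future = pool.submit(run_sro, text)  # steps 1560-1590
        first, second = rso_future.result(), sro_future.result()
    # step 15100: identical triplets are kept once, differing ones are all kept
    return first | second
```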
Referring to fig. 16, a flowchart of a model training method is provided in this embodiment. The model trained by the method can be applied to the text information extraction method shown in fig. 3, where the model can be the RSO model shown in fig. 11 or the SRO model shown in fig. 12. In the embodiment of the present application, a server is taken as an example of the execution body; referring to fig. 16, the embodiment includes the following steps 1610 to 1620.
In step 1610, training text data is obtained, and triplet information of the training text data is determined, where the triplet information includes third subject data, third object data, and third relationship data, and the third relationship data is a relationship attribute of the third subject data and the third object data.
In this step, the training data includes training text data and the corresponding triplet information, where the training text may be the general text, EPO text, or SEO text of the above embodiments, and the triplet information is annotated manually from the training text data in advance. For example, the training text data "The famous Palace Museum is in Beijing city, which is also the capital of China" and the corresponding triplet information <the Palace Museum, address, Beijing city> and <Beijing city, country, China> form a set of training data.
Step 1620, inputting the training text data, the third subject data, and the third relation data into an object extraction model, identifying the object in the training text data to obtain an object identification result, and correcting the parameters of the object extraction model according to the object identification result and the third object data.
In this step, the training text data, the third subject data, and the third relation data are used to train the object extraction model. The trained model may be the object sub-model of the RSO model shown in fig. 11 or the object sub-model of the SRO model shown in fig. 12. For example, the training text data "The famous Palace Museum is in Beijing city, which is also the capital of China" and the corresponding third subject data and third relation data, i.e., <the Palace Museum, address> and <Beijing city, country>, form the training data used to train the object extraction model; the recognition result is compared with the third object data, and the parameters of the object extraction model are corrected.
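For illustration only, one training example for the object extraction model might be assembled from an annotated triplet as follows; the field names are assumptions, not the patent's data format.

```python
# Hypothetical assembly of training pairs for the object extraction model.
text = ("The famous Palace Museum is in Beijing city, "
        "which is also the capital of China")
triplets = [("the Palace Museum", "address", "Beijing city"),
            ("Beijing city", "country", "China")]

training_examples = [
    {
        "text": text,
        "subject": s,       # third subject data
        "relation": r,      # third relation data
        "gold_object": o,   # third object data used to correct parameters
    }
    for s, r, o in triplets
]
```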
In one embodiment, the model training method shown in fig. 16 further includes the following steps:
Inputting the training text data into a relation extraction model, identifying the relation among all entities in the training text data to obtain a relation identification result, and correcting parameters of the relation extraction model according to the relation identification result and the third relation data.
In this step, the relation extraction model is trained using the training text data; the trained model may be the relation sub-model of the RSO model shown in fig. 11. For example, the training text data "The famous Palace Museum is in Beijing city, which is also the capital of China" is used as training data to train the relation extraction model; the recognition result is compared with the third relation data, and the parameters of the relation extraction model are corrected.
In one embodiment, the model training method shown in fig. 16 further includes the following steps:
Inputting the training text data and the third relation data into a main body extraction model, identifying the main body in the training text data according to the third relation data to obtain a main body identification result, and correcting parameters of the main body extraction model according to the main body identification result and the third main body data.
In this step, the training text data and the third relation data are used to train the subject extraction model; the trained model may be the subject sub-model of the RSO model shown in fig. 11. For example, the training text data "The famous Palace Museum is in Beijing city, which is also the capital of China" and the third relation data "address" and "country" are used as training data to train the subject extraction model; the recognition result is compared with the third subject data, and the parameters of the subject extraction model are corrected.
In one embodiment, the model training method shown in fig. 16 further includes the following steps:
Inputting the training text data into a main body extraction model, identifying a main body in the training text data to obtain a main body identification result, and correcting parameters of the main body extraction model according to the main body identification result and the third main body data.
In this step, the training text data is used to train the subject extraction model; the trained model may be the subject sub-model of the SRO model shown in fig. 12. For example, the training text data "The famous Palace Museum is in Beijing city, which is also the capital of China" is used as training data to train the subject extraction model; the recognition result is compared with the third subject data, and the parameters of the subject extraction model are corrected.
In one embodiment, the model training method shown in fig. 16 further includes the following steps:
And inputting the third main body data into a relation extraction model, identifying the relation among entities in the third main body data to obtain a relation identification result, and correcting parameters of the relation extraction model according to the relation identification result and the third relation data.
In this step, the third subject data is used to train the relation extraction model; the trained model may be the relation sub-model of the SRO model shown in fig. 12. For example, the third subject data "the Palace Museum", "Beijing city", and "China" are used as training data to train the relation extraction model; the recognition result is compared with the third relation data, and the parameters of the relation extraction model are corrected.
Compared with existing similar algorithms, the text information extraction method provided by the embodiments of the present application has better recognition performance. The information extraction model provided by the embodiments of the present application corresponds, according to the different orders of the subtasks, to the two modes of the RSO model and the SRO model, which can effectively solve the overlap problem and the non-preset relation extraction problem respectively. The information extraction model provided by the embodiments of the present application extends the object to a text fragment and supports various value types, so that the application range of the knowledge extraction system is wider.
The following is a performance comparison test between the information extraction model provided in the embodiments of the present application and other existing algorithm models.
Table 1 dataset test results 1
Referring to the data set test results shown in Table 1, four Chinese and English data sets are used for the comparison experiments, and tests are performed on comprehensive performance, general text, EPO text, and SEO text. P is Precision, the proportion of predicted triplets that are correct; R is Recall, the proportion of true triplets that are correctly predicted. F1 jointly considers precision and recall and is calculated as F1 = 2 × P × R / (P + R). MHE and ERNIE Tagging are the baseline algorithms used for comparison. Table 1 shows the advantages of the RSO model's extraction performance, in particular of the RSO model in solving the overlap problem.
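For reference, a small sketch of how P, R, and F1 in these tables can be computed over predicted and gold triplet sets; the example data is hypothetical.

```python
def precision_recall_f1(predicted, gold):
    """P = correct / predicted, R = correct / gold, F1 = 2PR / (P + R)."""
    correct = len(predicted & gold)
    p = correct / len(predicted) if predicted else 0.0
    r = correct / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

gold = {("Zhang San", "birth place", "Hong Kong"),
        ("the Palace Museum", "address", "Beijing city")}
predicted = {("Zhang San", "birth place", "Hong Kong")}
print(precision_recall_f1(predicted, gold))  # (1.0, 0.5, 0.666...)
```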
Table 2 dataset test results 2
Table 2 shows a comparative experiment on the English ACE05 and CoNLL04 datasets, with MTQA and MHE as the comparison algorithms. The overlap problem occurs less frequently in the ACE05 and CoNLL04 datasets, and the experimental results show the superiority of the SRO model on the simple triplet extraction task.
Action                P     R     F1
SRO model + QP        59.7  48.7  53.6
SRO model + S         85.5  75.4  80.1
SRO model + S + NLQ   90.3  86.2  88.2
TABLE 3 SRO model data test results
Table 3 shows a comparative experiment on the Scene20 dataset, which provides the SRO model in turn with question templates (QP), subjects (S), and subjects plus natural language questions (S+NLQ). The experiment shows that the SRO model has better extraction capability for non-predefined attribute triplets.
The knowledge graph construction method provided by the embodiment obtains the triplet data generated by the text information extraction method of any embodiment, and constructs the knowledge graph according to the obtained triplet data.
The knowledge graph extraction system provided by this embodiment is used to perform the text information extraction method of any of the above embodiments. The user can upload the text to be processed and view the triplet data extraction result through the knowledge graph extraction system. Referring to fig. 17, which illustrates the extraction interface of the knowledge graph extraction system, the user can click the upload button 1710 to upload the text data to be extracted, as illustrated in the list 1720; after uploading the text and completing the extraction of the triplet data, the extraction result interface illustrated in fig. 18 is entered, and the identified knowledge graph can be edited on the knowledge editing interface illustrated in fig. 19.
The embodiment discloses an open text information extraction device, including:
the first acquisition unit is used for acquiring the text to be processed;
The first recognition unit is used for recognizing the text to be processed to obtain main body data in the text to be processed and relation data associated with the main body data;
the first object identification unit is used for inputting the text to be processed, the subject data, and the relation data into an object extraction model, and identifying the object in the text to be processed to obtain object data corresponding to the subject data and the relation data;
the generation unit is used for generating triplet data from the subject data, the relation data, and the object data.
The embodiment discloses a model training device, includes:
The second acquisition unit is used for acquiring training text data and determining the triplet information of the training text data, wherein the triplet information includes first subject data, first object data, and first relation data, and the first relation data is the relation attribute of the first subject data and the first object data;
And the model training unit is used for inputting the training text data, the first subject data and the first relation data into an object extraction model, identifying the object in the training text data to obtain an object identification result, and correcting parameters of the object extraction model according to the object identification result and the first object data.
The embodiment discloses a text information extraction device, including:
At least one memory for storing a program;
at least one processor configured to load the program to perform the text information extraction method according to any of the embodiments described above.
The embodiment discloses a model training device, includes:
At least one memory for storing a program;
At least one processor configured to load the program to perform the model training method of any of the embodiments described above.
The present embodiment discloses a storage medium storing a program which, when executed by a processor, implements the information extraction method of any of the above embodiments or implements the model training method of any of the above embodiments.
The present embodiments disclose a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device performs the information extraction method of any of the above embodiments or implements the model training method of any of the above embodiments.
The terms "first," "second," "third," "fourth," and the like in the description of the application and in the above figures, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the application described herein may be implemented, for example, in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed or inherent to such process, method, article, or apparatus.
It should be understood that in the present application, "at least one (item)" means one or more, and "a plurality" means two or more. The term "and/or" describes the association relationship of associated objects and indicates that three relationships may exist; for example, "A and/or B" may mean: only A exists, only B exists, or both A and B exist, where A and B may be singular or plural. The character "/" generally indicates that the associated objects are in an "or" relationship. "At least one of the following" or similar expressions means any combination of these items, including any combination of single or plural items. For example, at least one (one) of a, b, or c may mean: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b, and c may be single or plural.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on this understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art or in whole or in part in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory RAM), a magnetic disk, or an optical disk, etc., which can store program codes.
The step numbers in the above method embodiments are set for convenience of illustration, and the order of steps is not limited in any way, and the execution order of the steps in the embodiments may be adaptively adjusted according to the understanding of those skilled in the art.
While the preferred embodiment of the present application has been described in detail, the present application is not limited to the embodiments described above, and various equivalent modifications and substitutions can be made by those skilled in the art without departing from the spirit of the present application, and these equivalent modifications and substitutions are intended to be included in the scope of the present application as defined in the appended claims.

Claims (9)

1. A text information extraction method, characterized by comprising the steps of:
acquiring a text to be processed;
identifying the relations existing among the entities in the text to be processed through a first relation extraction model to obtain first relation data; based on the first relation data, identifying the subject in the text to be processed through a first subject extraction model to obtain first subject data; based on the first relation data and the first subject data, identifying the object in the text to be processed through a first object extraction model to obtain first object data; and generating first triplet data from the first relation data, the first subject data, and the first object data;

identifying the subject in the text to be processed through a second subject extraction model to obtain second subject data; based on the second subject data, identifying the relation in the text to be processed through a second relation extraction model to obtain second relation data; based on the second relation data and the second subject data, identifying the object in the text to be processed through a second object extraction model to obtain second object data; and generating second triplet data according to the second relation data, the second subject data, and the second object data;

and generating target triplet data according to the first triplet data and the second triplet data.
2. The text information extraction method according to claim 1, wherein the identifying the object in the text to be processed by the first object extraction model based on the first relationship data and the first subject data, to obtain first object data, includes:
Merging the first main body data and the first relation data into the text to be processed to generate a first text;
and inputting the first text into the first object extraction model, and identifying the object in the first text to obtain the first object data corresponding to the first subject data and the first relation data.
3. The text information extraction method according to claim 1, wherein: the identifying the main body in the text to be processed through a first main body extraction model based on the first relation data to obtain first main body data comprises the following steps:
merging the first relation data into the text to be processed to generate a second text;
And inputting the second text into the first subject extraction model, and identifying the subject in the second text to obtain first subject data corresponding to the first relation data.
4. The text information extraction method according to claim 1, wherein the identifying, based on the second relationship data and the second subject data, the object in the text to be processed by the second object extraction model, to obtain second object data includes:
generating a third text by combining the second main body data and the second relation data;
and inputting the text to be processed and the third text to a second object extraction model, and identifying objects in the text to be processed according to the third text to obtain second object data corresponding to the second subject data and the second relation data.
5. The text information extraction method of claim 4, wherein the third text is a natural language question; the generating a third text by combining the second main body data and the second relation data comprises:
acquiring a question template corresponding to the second relation data, wherein the question template comprises a subject mark position corresponding to the second relation data;

filling the second subject data into the subject mark position of the question template to generate the natural language question;
Inputting the text to be processed and the third text to a second object extraction model, identifying objects in the text to be processed according to the third text, and obtaining second object data corresponding to the second subject data and the second relation data, wherein the method comprises the following steps:
inputting the text to be processed and the natural language question into the second object extraction model, and identifying the second object data corresponding to the second subject data and the second relation data from the text to be processed according to the natural language question.
6. The text information extraction method of claim 5, wherein the obtaining a question template corresponding to the second relationship data includes one of:
when the second relation data is in a preset relation data set, acquiring the question template corresponding to the second relation data from a preset question template set;
or, when the second relation data is not in the preset relation data set, acquiring a general question template, wherein the general question template comprises a subject mark position and a relation mark position, and filling the second relation data into the relation mark position of the general question template to generate the question template.
7. The method for extracting text information according to claim 1, wherein the identifying the subject in the text to be processed by the second subject extraction model to obtain the second subject data includes:
acquiring an entity question set, wherein the entity question set comprises a plurality of preset entity questions, and the entity questions are used for querying the entities existing in the text to be processed;
and inputting the text to be processed and each entity question in the entity question set into the second subject extraction model, identifying the entity corresponding to the entity question in the text to be processed, and taking the identified entity as a subject to obtain the second subject data.
8. A text information extracting apparatus, characterized by comprising:
At least one memory for storing a program;
at least one processor for loading the program to perform the text information extraction method of any of the preceding claims 1 to 7.
9. A computer-readable storage medium storing computer-executable instructions, characterized in that: the computer-executable instructions are for performing the information extraction method of any one of claims 1 to 7.
CN202011098112.7A 2020-10-14 2020-10-14 Text information extraction method, model training method, device and storage medium Active CN114372454B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011098112.7A CN114372454B (en) 2020-10-14 2020-10-14 Text information extraction method, model training method, device and storage medium

Publications (2)

Publication Number Publication Date
CN114372454A CN114372454A (en) 2022-04-19
CN114372454B CN114372454B (en) 2024-08-16

Family

ID=81137805


Country Status (1)

Country Link
CN (1) CN114372454B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114528418B (en) * 2022-04-24 2022-10-14 杭州同花顺数据开发有限公司 Text processing method, system and storage medium
CN114816577A (en) * 2022-05-11 2022-07-29 平安普惠企业管理有限公司 Method, device, electronic equipment and medium for configuring service platform function

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110413746A (en) * 2019-06-25 2019-11-05 阿里巴巴集团控股有限公司 The method and device of intention assessment is carried out to customer problem
CN110795543A (en) * 2019-09-03 2020-02-14 腾讯科技(深圳)有限公司 Unstructured data extraction method and device based on deep learning and storage medium
CN111708899A (en) * 2020-06-13 2020-09-25 广州华建工智慧科技有限公司 Engineering information intelligent searching method based on natural language and knowledge graph

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7987416B2 (en) * 2007-11-14 2011-07-26 Sap Ag Systems and methods for modular information extraction
US11514012B2 (en) * 2013-03-15 2022-11-29 Refinitiv Us Organization Llc Method and system for generating and using a master entity associative data network
CN111143536B (en) * 2019-12-30 2023-06-20 腾讯科技(深圳)有限公司 Information extraction method based on artificial intelligence, storage medium and related device
CN111192692B (en) * 2020-01-02 2023-12-08 上海联影智能医疗科技有限公司 Entity relationship determination method and device, electronic equipment and storage medium
CN111241209B (en) * 2020-01-03 2023-07-11 北京百度网讯科技有限公司 Method and device for generating information
CN111309921A (en) * 2020-01-19 2020-06-19 上海方立数码科技有限公司 Text triple extraction method and extraction system
CN111339774B (en) * 2020-02-07 2022-11-29 腾讯科技(深圳)有限公司 Text entity relation extraction method and model training method
CN111291172B (en) * 2020-03-05 2023-08-04 支付宝(杭州)信息技术有限公司 Method and device for processing text



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant