CN117094396A - Knowledge extraction method, knowledge extraction device, computer equipment and storage medium

Info

Publication number
CN117094396A
Authority
CN
China
Prior art keywords
knowledge
query
text
words
topic
Prior art date
Legal status
Granted
Application number
CN202311352348.2A
Other languages
Chinese (zh)
Other versions
CN117094396B (en)
Inventor
王伟
贾惠迪
邹克旭
郭东宸
常鹏慧
孙悦丽
朱珊娴
田启明
Current Assignee
Beijing Yingshi Ruida Technology Co ltd
Original Assignee
Beijing Yingshi Ruida Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Beijing Yingshi Ruida Technology Co ltd filed Critical Beijing Yingshi Ruida Technology Co ltd
Priority to CN202311352348.2A
Publication of CN117094396A
Application granted
Publication of CN117094396B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 5/00 Computing arrangements using knowledge-based models
    • G06N 5/02 Knowledge representation; Symbolic representation
    • G06N 5/022 Knowledge engineering; Knowledge acquisition
    • G06N 5/025 Extracting rules from data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/332 Query formulation
    • G06F 16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/205 Parsing
    • G06F 40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/237 Lexical tools
    • G06F 40/247 Thesauruses; Synonyms
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/042 Knowledge-based neural networks; Logical representations of neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 5/00 Computing arrangements using knowledge-based models
    • G06N 5/04 Inference or reasoning models
    • G06N 5/041 Abduction

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Human Computer Interaction (AREA)
  • Probability & Statistics with Applications (AREA)
  • Databases & Information Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the invention provides a knowledge extraction method, a knowledge extraction device, computer equipment and a storage medium, relating to the technical field of data processing. The method comprises the following steps: receiving a query sentence for querying knowledge, dividing the query sentence into a plurality of first text blocks, and extracting a query word from the query sentence; matching the query word against a pre-stored query word data set to obtain a synonymous word group of the query word, wherein the synonymous word group comprises query words having the same semantics as the extracted query word; dividing the knowledge text data for query into a plurality of second text blocks; performing similarity matching between the query words in the synonymous word group and the first text blocks, on the one hand, and each second text block, on the other hand, through a matching model obtained by training a large language model; determining the second text blocks whose similarity meets a preset threshold as target text blocks and extracting them; and integrating the extracted target text blocks into a knowledge response to the query sentence. The knowledge extraction method and device can realize knowledge extraction accurately, conveniently and efficiently.

Description

Knowledge extraction method, knowledge extraction device, computer equipment and storage medium
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a knowledge extraction method, a knowledge extraction device, a computer device, and a storage medium.
Background
Knowledge extraction refers to the automatic extraction of useful information and knowledge from a large volume of text. Current knowledge extraction methods mainly include the following:
Rule/pattern matching extracts particular types of knowledge based on manually defined rules or patterns. By designing matching patterns or rules, the system can identify entities, relationships and the like from text.
However, the rule/pattern matching method has the following drawbacks: rules must be written manually, which may not be flexible and efficient enough for complex knowledge extraction tasks and large-scale text processing; when language structures or text variations not covered by the rules are encountered, rule matching struggles to extract accurately; and rich context information cannot be captured, because rule matching is typically based on local grammar and keyword matching, making global semantic understanding difficult.
The machine learning method includes supervised learning and unsupervised learning. In supervised learning, labeled training data may be used to train a classifier or sequence labeling model to identify entities and relationships. In unsupervised learning, techniques such as clustering or association rules may be used to discover potential knowledge patterns.
However, the machine learning method has the following drawbacks: although rules can be learned automatically from the data, a large amount of labeled data and feature engineering are required; in addition, this approach typically employs feature engineering and shallow models for knowledge extraction, and such models have limited generalization capability.
The knowledge graph method is a structured knowledge representation that can be used to store and organize a large number of entities, attributes and relationships. The method extracts entities of specific categories from text, identifies the association relationships between the entities, and performs entity linking, thereby forming a structured, queryable and inferable knowledge base.
However, the knowledge graph method has the following drawbacks: a large amount of high-quality data is required, implementation is difficult in many fields, and the knowledge must be continuously updated.
Thus, how to accurately and conveniently implement knowledge extraction based on a large model remains an unsolved problem.
Disclosure of Invention
In view of the above, the embodiment of the invention provides a knowledge extraction method to solve the technical problem that knowledge extraction cannot be accurately and conveniently realized in the prior art. The method comprises the following steps:
receiving a query sentence of query knowledge, dividing the query sentence into a plurality of first text blocks, and extracting a query word in the query sentence;
matching the query words with a pre-stored query word data set to obtain synonymous word groups of the query words, wherein the synonymous word groups comprise the query words with the same semantic meaning as the query words;
dividing knowledge text data for query into a plurality of second text blocks;
respectively carrying out similarity matching on the query words and the first text blocks in the synonymous word groups and each second text block through a matching model, determining the second text blocks with the similarity meeting a preset threshold as target text blocks, and extracting the target text blocks, wherein the matching model is obtained by training a large model;
and integrating each extracted target text block into a knowledge response of the query statement.
The embodiment of the invention also provides a knowledge extraction device to solve the technical problem that knowledge extraction cannot be accurately and conveniently realized in the prior art. The device comprises:
the data receiving module is used for receiving a query sentence of query knowledge, dividing the query sentence into a plurality of first text blocks and extracting a query word in the query sentence;
the matching module is used for matching the query words with a pre-stored query word data set to obtain synonymous word groups of the query words, wherein the synonymous word groups comprise the query words with the same semantic meaning as the query words;
the data partitioning module is used for partitioning knowledge text data for query into a plurality of second text blocks;
the extraction module is used for carrying out similarity matching on the query words and the first text blocks in the synonymous word groups and each second text block through a matching model, determining the second text blocks with the similarity meeting a preset threshold as target text blocks, and extracting the target text blocks, wherein the matching model is obtained by training a large model;
and the integration module is used for integrating each extracted target text block into a knowledge response of the query statement.
The embodiment of the invention also provides computer equipment, which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor realizes any knowledge extraction method when executing the computer program so as to solve the technical problem that knowledge extraction cannot be realized accurately and conveniently in the prior art.
The embodiment of the invention also provides a computer readable storage medium which stores a computer program for executing any knowledge extraction method, so as to solve the technical problem that knowledge extraction cannot be accurately and conveniently realized in the prior art.
Compared with the prior art, the beneficial effects achievable by at least one of the above technical solutions of the embodiments of the present specification include at least the following. A synonymous word group is determined for the query word in the query sentence; the query sentence is divided into a plurality of first text blocks and the knowledge text data for query is divided into a plurality of second text blocks; the query words in the synonymous word group and the first text blocks are then each matched for similarity against each second text block through a matching model; the second text blocks whose similarity meets a preset threshold are determined as target text blocks and extracted; and finally the target text blocks are integrated into the knowledge response of the query sentence. Matching the knowledge text against the synonymous word group of the question word in the form of text blocks ensures that knowledge extraction remains focused on and accurate with respect to the question words in the synonymous word group, which improves the accuracy of knowledge extraction. Meanwhile, in the process of extracting the target text blocks, the query words in the synonymous word group help the matching model better understand the context and accurately analyze the meaning of the query sentence, so that the matching model can better understand and accurately extract the target text blocks, reducing ambiguity and further improving the accuracy of knowledge extraction and answering. In addition, since the matching model is obtained by training a large model, combining the query words with the text blocks concentrates the attention of the large model on the text blocks related to the query words, which reduces the amount of text the large model must process and further improves knowledge extraction efficiency.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a knowledge extraction method provided by an embodiment of the present application;
FIG. 2 is a block diagram of a computer device according to an embodiment of the present application;
fig. 3 is a block diagram of a knowledge extraction device according to an embodiment of the present application.
Detailed Description
Embodiments of the present application will be described in detail below with reference to the accompanying drawings.
Other advantages and effects of the present application will become apparent to those skilled in the art from the following disclosure, which describes embodiments of the present application with reference to specific examples. It will be apparent that the described embodiments are only some, but not all, of the embodiments of the application. The application may also be practiced or carried out in other, different embodiments, and the details of the present description may be modified or varied in various ways without departing from the spirit and scope of the present application. It should be noted that the following embodiments and the features in the embodiments may be combined with each other without conflict. All other embodiments obtained by those skilled in the art based on the embodiments of the application without making any inventive effort are within the scope of the application.
In an embodiment of the present invention, a knowledge extraction method is provided, as shown in fig. 1, and the method includes:
step S101: receiving a query sentence of query knowledge, dividing the query sentence into a plurality of first text blocks, and extracting a query word in the query sentence;
step S102: matching the query words with a pre-stored query word data set to obtain synonymous word groups of the query words, wherein the synonymous word groups comprise the query words with the same semantic meaning as the query words;
step S103: dividing knowledge text data for query into a plurality of second text blocks;
step S104: respectively carrying out similarity matching on the query words and the first text blocks in the synonymous word groups and each second text block through a matching model, determining the second text blocks with the similarity meeting a preset threshold as target text blocks, and extracting the target text blocks, wherein the matching model is obtained by training a large model;
step S105: and integrating each extracted target text block into a knowledge response of the query statement.
As can be seen from the flow shown in fig. 1, the embodiment of the present invention matches the knowledge text against the synonymous word group of the question word in the question sentence in the form of text blocks, ensuring that knowledge extraction remains focused on and accurate with respect to the question words in the synonymous word group, which is beneficial to improving the accuracy of knowledge extraction. Meanwhile, in the process of extracting the target text blocks, the query words in the synonymous word group help the matching model better understand the context and accurately analyze the meaning of the query sentence, so that the matching model can better understand and accurately extract the target text blocks, reducing ambiguity and improving the accuracy of knowledge extraction and answering. In addition, since the matching model is obtained by training a large model, combining the query words with the text blocks concentrates the attention of the large model on the text blocks related to the query words, which reduces the amount of text the large model must process and further improves knowledge extraction efficiency.
In specific implementation, the knowledge extraction method can be used in various application scenarios that require the extraction of professional knowledge; for example, knowledge of a specified field or a specified topic can be extracted in a targeted manner.
Specifically, in order to improve the accuracy of dividing the knowledge text data into text blocks and to avoid semantic ambiguity, it is proposed to divide the knowledge text data for query into a plurality of second text blocks by topic, in the following manner (a library-level sketch is given after the steps):
identifying different topics in the knowledge text data for the query, wherein each topic includes a plurality of words, each word generated by one topic;
calculating, for each document in the knowledge text data for query, a topic probability distribution for each document and a word probability distribution for each topic, wherein the topic probability distribution includes the probabilities of different topics appearing in the document, and the word probability distribution includes the probabilities of different words being generated by the topics;
determining topics with probability greater than a first probability threshold as topics appearing in the document according to the topic probability distribution; determining, from the word probability distribution, words having probabilities greater than a second probability threshold as generated by the topic;
for each topic that appears in each document, the words generated by each topic are divided into a plurality of second text blocks by semantic unit.
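Before turning to the formulas, the above steps can be illustrated at the library level. The following is a minimal sketch, assuming the gensim library; the toy corpus, topic count and the two probability thresholds are illustrative assumptions, not values prescribed by this embodiment.

```python
from gensim import corpora
from gensim.models import LdaModel

# Toy corpus: each document is a list of tokens (assumed for illustration)
texts = [["fine", "particulate", "matter", "pm2.5"],
         ["pm2.5", "respiratory", "health", "cardiovascular"]]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]
lda = LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)

FIRST_THRESHOLD, SECOND_THRESHOLD = 0.3, 0.01   # assumed threshold values
for m, bow in enumerate(corpus):
    # topics whose probability exceeds the first threshold appear in document m
    topics = [k for k, p in lda.get_document_topics(bow) if p > FIRST_THRESHOLD]
    for k in topics:
        # words whose generation probability exceeds the second threshold
        words = [dictionary[t] for t, p in lda.get_topic_terms(k)
                 if p > SECOND_THRESHOLD]
        print(m, k, words)  # words to be divided into second text blocks
```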
In practice, the topic probability distribution can be calculated by the following formula:

$$\theta_{m,k} = \frac{n_{m,k} + \alpha_k}{\sum_{k'=1}^{K}\left(n_{m,k'} + \alpha_{k'}\right)}$$

The word probability distribution can be calculated by the following formula:

$$\varphi_{k,t} = \frac{n_{k,t} + \beta_t}{\sum_{t'=1}^{V}\left(n_{k,t'} + \beta_{t'}\right)}$$

where $\theta_{m,k}$ is the probability of the kth topic for the mth document, $n_{m,k}$ is the total number of words belonging to the kth topic in the mth document, $\alpha_k$ is the kth value of the topic prior parameter vector, $\varphi_{k,t}$ is the probability that the kth topic generates the tth word, $n_{k,t}$ is the total number of times the kth topic generates the tth word in all documents, $\beta_t$ is the tth value of the word prior parameter vector, V is the total number of words, and K is the total number of topics.
In particular, in the process of dividing the second text blocks based on topics, the knowledge text data can be divided into blocks by topic using a topic modeling method (such as Latent Dirichlet Allocation, LDA). This method is capable of identifying the different subjects or topics in the text data. LDA assumes that each word in a document is generated by exactly one topic, and that each topic is characterized by a probability distribution over a set of words.
First, the knowledge text data is converted into a mathematical representation suitable for topic modeling, such as a bag-of-words model or TF-IDF (Term Frequency-Inverse Document Frequency). Taking the bag-of-words model as an example, the bag-of-words vector of the text data is represented as

$$d = (x_1, x_2, \ldots, x_V)$$

where $x_i$ represents the number of occurrences of the ith word in the text.
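As an illustration, a bag-of-words representation can be built, for example, with scikit-learn; the toy corpus below is an assumption for illustration only.

```python
from sklearn.feature_extraction.text import CountVectorizer

# A minimal bag-of-words sketch over a two-document toy corpus
docs = ["fine particulate matter is also called PM2.5",
        "PM2.5 can penetrate deep into the respiratory tract"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)    # X[i, j] = count of word j in document i
print(vectorizer.get_feature_names_out())
print(X.toarray())
```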
next, the number of topics to be divided is determined, and the number of topics to be extracted is determined to be selected according to domain knowledge or through experiments. A corresponding word probability distribution is generated for each topic and a corresponding topic probability distribution is generated for each document. Assuming that K topics and M documents are provided, the expression of the topic probability distribution and the word probability distribution is as follows:
wherein,probability of the kth topic for the mth document,/->For the total number of words belonging to the kth topic in the mth document,/for the total number of words belonging to the kth topic in the mth document>For the kth value of the topic a priori parameter vector, is->Probability of generating the t-th word for the kth topic,/->Generating a total number of t words in all documents for the kth topic,/for the kth topic>The t-th value of the prior parameter vector of the words, and V is the total number of the words.
Then, the topic probability distribution and the word probability distribution are solved by using a Gibbs Sampling algorithm.
(1) Initially, each word in the text is randomly assigned a topic $z_i$.
(2) For each document, the total number of words belonging to the kth topic ($n_{m,k}$) is calculated, together with the total number of times the kth topic generates the tth word across all documents ($n_{k,t}$).
(3) The topic assignment of the current word is removed, and the probability of the current word being assigned to each topic is evaluated according to the topics of the other words in the document:

$$P\left(z_i = k \mid \mathbf{z}_{\neg i}, \mathbf{w}\right) \propto \frac{n_{m,k}^{\neg i} + \alpha_k}{\sum_{k'=1}^{K}\left(n_{m,k'}^{\neg i} + \alpha_{k'}\right)} \cdot \frac{n_{k,t}^{\neg i} + \beta_t}{\sum_{t'=1}^{V}\left(n_{k,t'}^{\neg i} + \beta_{t'}\right)}$$

where $z_i$ is the topic corresponding to the ith word, $\mathbf{z}_{\neg i}$ denotes the topic assignments of all words other than the ith word, $\mathbf{w}$ is the distribution of the whole word sequence, and the superscript $\neg i$ indicates counts computed with the ith word removed.
(4) After obtaining the topic probability distribution of the current word, a new topic is sampled for the word according to the probability distribution. For example, topics having a probability greater than a second probability threshold are sampled as topics generating the word. Similarly, topics with probability greater than a first probability threshold are sampled as topics appearing in the document, and then the topics appearing in each document and words generated by each topic are determined.
(5) Returning to step (2), the topic of the next word is continuously updated until $\theta_m$ (the topic distribution of the mth document) and $\varphi_k$ (the word distribution of the kth topic) converge.
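The five steps above can be realized in a compact collapsed Gibbs sampler. The following is a minimal sketch, assuming symmetric priors, a fixed iteration count in place of an explicit convergence test, and a toy corpus of word ids; none of these values come from the patent.

```python
import numpy as np

def gibbs_lda(docs, K, V, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA; docs is a list of word-id lists."""
    rng = np.random.default_rng(seed)
    M = len(docs)
    n_mk = np.zeros((M, K))            # doc-topic counts
    n_kt = np.zeros((K, V))            # topic-word counts
    n_k = np.zeros(K)                  # words per topic
    z = []                             # step (1): random initial topic per word
    for m, doc in enumerate(docs):
        zm = rng.integers(K, size=len(doc))
        z.append(zm)
        for t, k in zip(doc, zm):
            n_mk[m, k] += 1; n_kt[k, t] += 1; n_k[k] += 1
    for _ in range(iters):             # steps (2)-(5): resample every word
        for m, doc in enumerate(docs):
            for i, t in enumerate(doc):
                k = z[m][i]            # step (3): remove current assignment
                n_mk[m, k] -= 1; n_kt[k, t] -= 1; n_k[k] -= 1
                p = (n_mk[m] + alpha) * (n_kt[:, t] + beta) / (n_k + V * beta)
                p /= p.sum()
                k = rng.choice(K, p=p) # step (4): sample a new topic
                z[m][i] = k
                n_mk[m, k] += 1; n_kt[k, t] += 1; n_k[k] += 1
    theta = (n_mk + alpha) / (n_mk + alpha).sum(axis=1, keepdims=True)
    phi = (n_kt + beta) / (n_kt + beta).sum(axis=1, keepdims=True)
    return theta, phi

# Toy usage with word ids 0..4 (V=5) and K=2 topics
theta, phi = gibbs_lda([[0, 1, 2, 0], [3, 4, 3, 2]], K=2, V=5)
```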
When the method is implemented, after the words generated by each topic are obtained, the words generated by each topic can be divided into a plurality of second text blocks directly according to semantic units. In order to improve the precision and accuracy of text block division, it is further proposed to divide the second text blocks within each topic based on the knowledge hierarchy: for example, the words generated by each topic are divided into different knowledge levels according to the concept range of the knowledge, wherein the knowledge levels include a knowledge definition principle level, a multi-domain knowledge cross extension level and a knowledge application level; within each knowledge level, the words of that level are divided into a plurality of second text blocks according to semantic units.
In the implementation, the second text block is a text data block divided from the knowledge text data. The semantic unit may be a text unit having complete semantics such as a noun, and a second text block may be a noun.
In particular, in order to improve the precision and accuracy of text block division, a method of dividing knowledge text data for query into a plurality of second text blocks directly based on knowledge hierarchy is also proposed, for example,
dividing knowledge text data for query into different knowledge layers according to the concept range of knowledge, wherein the knowledge layers comprise a knowledge definition principle layer, a multi-domain knowledge cross expansion layer and a knowledge application layer;
in each knowledge level, dividing the data of each knowledge level into a plurality of second text blocks according to semantic units. The semantic unit may be a text unit with complete semantics, such as a sentence, a paragraph, a noun, etc., and a second text block may be a sentence, a paragraph, or a noun.
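As an illustration, splitting the data of a knowledge level into sentence-level second text blocks can be sketched as follows; the delimiter set (Chinese and Western sentence punctuation) is an assumption, not a rule prescribed by this embodiment.

```python
import re

def split_semantic_units(text: str) -> list[str]:
    """Split text into sentence-level blocks at assumed sentence delimiters."""
    units = re.split(r"(?<=[。！？.!?])\s*", text)
    return [u.strip() for u in units if u.strip()]

blocks = split_semantic_units(
    "Fine particulate matter is also called PM2.5. "
    "It can remain suspended in the air for a long time.")
print(blocks)  # two sentence-level second text blocks
```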
In specific implementation, the knowledge hierarchy divides the knowledge text data into different knowledge concept ranges; for example, the knowledge hierarchy may include a knowledge definition principle level, a multi-domain knowledge cross expansion level and a knowledge application level. The knowledge definition principle level covers the basic concepts of the knowledge, including its definition, principle and the like. The multi-domain knowledge cross expansion level covers knowledge content that cross-expands with other knowledge, i.e., cross-expansion knowledge of at least two kinds of knowledge, which may also be called the medium-level concept of the knowledge. The knowledge application level covers knowledge content of the knowledge in its application field, which may also be called the high-level concept of the knowledge, for example application-related knowledge in an application field, the latest updated knowledge, and the like.
For example, taking knowledge text data of PM2.5 as an example, first, knowledge of the following knowledge hierarchy may be divided:
basic concept: fine particulate matter is also called fine particles or PM2.5. Definition of fine particulate matter: particles in ambient air with an aerodynamic equivalent diameter less than or equal to 2.5 microns; they can remain suspended in the air for a long time, and the higher their concentration in the air, the more serious the air pollution.
Medium-level concept: fine particulate matter (PM 2.5) has an important influence on human health and the environment. Due to its small size, PM2.5 can penetrate deep into the deepest parts of the respiratory tract and even into the blood circulation. Prolonged exposure to high concentrations of PM2.5 is associated with health problems such as respiratory diseases, cardiovascular diseases, and cancer. In addition, PM2.5 can also affect visibility, atmospheric transparency, and precipitation patterns, affecting the climate.
Advanced concepts: in order to monitor and control the effects of fine particulate matter (PM 2.5), air quality monitoring networks have been established in many countries. The monitoring station periodically measures the concentration of PM2.5 and reports the data to the government and the public. Government and environmental organizations take measures to reduce PM2.5 emissions, such as enhancing vehicle exhaust emissions standards, promoting clean energy use, and improving industrial processes. In addition, some research is also exploring ways to reduce PM2.5 pollution both indoors and outdoors using air purification techniques, architectural designs, and the like.
Based on semantic unit division, each concept level or knowledge level is divided into a plurality of second text blocks, for example, taking the basic concept level as an example, the following second text blocks may be divided, and each second text block is represented by "/" separated content:
fine particulate matter/also known as/fines, fine particles, PM 2.5/fine particulate matter definition: the term/ambient air/aerodynamic equivalent diameter/less than or equal to/2.5 microns/particulate/it can/longer time/suspension/in/air/its in air/concentration/higher/representative/more severe/air pollution.
In the implementation, the process of dividing the knowledge hierarchy and dividing the second text blocks based on the knowledge hierarchy can be realized through a trained large language model, which may be a model based on the Transformer architecture. Text data including knowledge content of different knowledge levels is collected; the data is labeled, associating each text sample with an appropriate knowledge level tag to represent its location in the knowledge hierarchy. The text samples are input into the large model for training, so that the trained large model has the capability of dividing text blocks.
In the implementation, the division of the query sentence into a plurality of first text blocks can be realized based on semantic units alone, based on knowledge levels plus semantic units, or based on topics plus knowledge levels plus semantic units.
In the specific implementation, in the process of matching the target text, an existing similarity calculation method may be adopted, but in order to better understand and accurately extract the target text block to reduce ambiguity, in this embodiment, it is proposed that the query word in the synonymous phrase and the first text block are respectively subjected to similarity matching with each of the second text blocks through a matching model based on a Triplet network by the following steps, and the second text block with similarity meeting a preset threshold value is determined as the target text block:
inputting the query words and the second text blocks in the synonymous word groups into a trained Triplet network, and calculating a first distance between each query word and each second text block through the Triplet network to obtain a plurality of first distances;
mapping a plurality of first distances into a feature space in order from small to large;
determining a second text block corresponding to a first distance smaller than a preset distance threshold value as a target text block in the feature space;
inputting the first text blocks and the second text blocks into a trained Triplet network, and calculating a second distance between each first text block and each second text block through the Triplet network to obtain a plurality of second distances;
mapping a plurality of the second distances into a feature space in order from small to large;
and determining a second text block corresponding to a second distance smaller than a preset distance threshold value as a target text block in the feature space.
In the implementation, the calculation of the first distance and the second distance can be realized by using methods such as Euclidean distance and cosine distance.
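For illustration, both distance options can be computed between two feature vectors as follows; this is a minimal sketch and the feature dimension is an assumed value.

```python
import torch

a, b = torch.randn(64), torch.randn(64)  # two encoded feature vectors
euclidean = torch.dist(a, b, p=2)        # Euclidean distance
cosine_dist = 1 - torch.nn.functional.cosine_similarity(a, b, dim=0)
```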
In specific implementation, the process of calculating the similarity between the query words and the second text blocks, and between the first text blocks and the second text blocks, can be realized by a trained Triplet network. The Triplet network contains three sub-networks, each of which processes one input text sample; the three sub-networks process the "anchor", "positive" and "negative" samples, respectively. The input text is encoded by the sub-network and a feature vector is calculated. Triples (anchor, positive, negative) are constructed from the data set, where the "anchor" sample is the sample to be measured, the "positive" sample is a sample similar to the "anchor" sample, and the "negative" sample is a sample dissimilar to the "anchor" sample. The distance between the "anchor" sample and the "positive" sample, and the distance between the "anchor" sample and the "negative" sample, are calculated; Euclidean distance, cosine distance, and the like may be used. Training of the Triplet network uses the ternary (triplet) loss as the objective function, which requires the distance between the "anchor" sample and the "positive" sample to be less than the distance between the "anchor" sample and the "negative" sample. During training, the Triplet network adjusts its weights by minimizing the ternary loss, so that similar samples are drawn closer together and dissimilar samples are pushed farther apart. Through this training process, the trained Triplet network learns a representation that maps similar text samples to closer positions in the feature space and dissimilar text samples to positions farther apart.
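The following is a minimal PyTorch sketch of such a Triplet network, with a single shared-weight encoder and the ternary (triplet) loss; the encoder architecture, vocabulary size, dimensions and margin are illustrative assumptions, not the patent's prescribed configuration.

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Shared sub-network: embeds token ids and mean-pools to a feature vector."""
    def __init__(self, vocab_size=10000, embed_dim=128, feat_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.proj = nn.Linear(embed_dim, feat_dim)

    def forward(self, token_ids):            # (batch, seq_len)
        return self.proj(self.embed(token_ids).mean(dim=1))

encoder = TextEncoder()                      # one encoder shared by all three inputs
triplet_loss = nn.TripletMarginLoss(margin=1.0, p=2)  # Euclidean distance

# Toy batches of token ids for anchor / positive / negative samples
anchor = encoder(torch.randint(0, 10000, (8, 16)))
positive = encoder(torch.randint(0, 10000, (8, 16)))
negative = encoder(torch.randint(0, 10000, (8, 16)))
loss = triplet_loss(anchor, positive, negative)
loss.backward()                              # adjust weights by minimizing the loss
```

At inference time, the first and second distances described above can be obtained by encoding the query words, first text blocks and second text blocks with the trained encoder and thresholding the pairwise distances.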
In the specific implementation, after the target text blocks are extracted, the knowledge response can be output conveniently and accurately. For example, the extracted target text blocks are input into a trained selection-generative model (Transformer), which connects the target text blocks into a knowledge sequence according to their semantics, thereby obtaining the knowledge response of the query sentence.
In the implementation, the selection-generative model (Transformer) takes the knowledge extraction results of a plurality of target text blocks as its input and connects them in series according to their semantics to form the output knowledge text sequence. In training the selection-generative model, the generation target can be set as the expected coherent knowledge text. The difference between the generated text and the target text can be measured using an appropriate loss function (e.g., cross-entropy loss), and the model parameters can be optimized by back propagation.
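The following is a minimal sketch of one such training step, assuming a generic PyTorch Transformer with teacher forcing; the vocabulary size, dimensions and toy batches are illustrative assumptions.

```python
import torch
import torch.nn as nn

vocab, d_model = 8000, 256                    # assumed sizes
embed = nn.Embedding(vocab, d_model)
model = nn.Transformer(d_model=d_model, nhead=8,
                       num_encoder_layers=2, num_decoder_layers=2,
                       batch_first=True)
out_head = nn.Linear(d_model, vocab)
loss_fn = nn.CrossEntropyLoss()

# src: target text blocks concatenated into one input token sequence
src = torch.randint(0, vocab, (4, 64))
tgt = torch.randint(0, vocab, (4, 32))        # expected coherent knowledge text
tgt_in, tgt_out = tgt[:, :-1], tgt[:, 1:]     # teacher-forcing shift
mask = nn.Transformer.generate_square_subsequent_mask(tgt_in.size(1))

hidden = model(embed(src), embed(tgt_in), tgt_mask=mask)
logits = out_head(hidden)                     # (batch, seq, vocab)
loss = loss_fn(logits.reshape(-1, vocab), tgt_out.reshape(-1))
loss.backward()                               # back propagation optimizes parameters
```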
In specific implementation, the specific process of implementing the knowledge extraction method is described in detail below, and the process includes the following steps:
step one: a query library (i.e., the query data set described above) is constructed. And collecting related query words based on the historical knowledge query questions input by the user, and associating the query words with the same semantics to form synonymous phrases.
For example, a user asks: "What is PM2.5?" Here "what" is a question word. Query words with the same semantics as "what" include "define", "concept", and the like; that is, "what = define = concept" forms a synonymous word group.
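A minimal sketch of how such a query library lookup might be organized follows; the groups shown are illustrative assumptions.

```python
# Hypothetical synonymous word groups collected from historical questions
SYNONYM_GROUPS = [
    {"what", "define", "definition", "concept", "meaning"},
    {"why", "reason", "cause"},
]

def lookup_synonym_group(query_word: str) -> set[str]:
    """Return the synonymous word group containing the query word, if any."""
    for group in SYNONYM_GROUPS:
        if query_word in group:
            return group
    return {query_word}

print(lookup_synonym_group("what"))  # {'what', 'define', 'definition', ...}
```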
Step two: the knowledge text data for query is partitioned into blocks. Two blocking methods are adopted. First, a knowledge hierarchy blocking method is applied to the knowledge, dividing the text into different knowledge levels: basic concepts (knowledge definitions and principles), medium-level concepts (multi-domain knowledge cross expansion) and high-level concepts (application-related and latest knowledge). The knowledge text data is divided accordingly.
Firstly, adopting a knowledge hierarchy partitioning method:
basic concept: fine particulate matter is also called fine particles or PM2.5. Definition of fine particulate matter: particles in ambient air with an aerodynamic equivalent diameter less than or equal to 2.5 microns; they can remain suspended in the air for a long time, and the higher their concentration in the air, the more serious the air pollution.
Medium-level concept: fine particulate matter (PM 2.5) has an important influence on human health and the environment. Due to its small size, PM2.5 can penetrate deep into the deepest parts of the respiratory tract and even into the blood circulation. Prolonged exposure to high concentrations of PM2.5 is associated with health problems such as respiratory diseases, cardiovascular diseases, and cancer. In addition, PM2.5 can also affect visibility, atmospheric transparency, and precipitation patterns, affecting the climate.
Advanced concepts: in order to monitor and control the effects of fine particulate matter (PM 2.5), air quality monitoring networks have been established in many countries. The monitoring station periodically measures the concentration of PM2.5 and reports the data to the government and the public. Government and environmental organizations take measures to reduce PM2.5 emissions, such as enhancing vehicle exhaust emissions standards, promoting clean energy use, and improving industrial processes. In addition, some research is also exploring ways to reduce PM2.5 pollution both indoors and outdoors using air purification techniques, architectural designs, and the like.
Based on semantic blocking, taking basic concepts as examples:
fine particulate matter/also known as/fines, fine particles, PM 2.5/fine particulate matter definition: the term/ambient air/aerodynamic equivalent diameter/less than or equal to/2.5 microns/particulate/it can/longer time/suspension/in/air/its in air/concentration/higher/representative/more severe/air pollution.
Both methods are implemented by training a large model. The model architecture is based on the Transformer architecture. Text data including content of different levels is collected; the data is labeled, associating each text sample with an appropriate level tag to represent its location in the knowledge hierarchy. The samples are input into the model for training, so that the large model acquires the capability of dividing text blocks.
Step three: the user question sentences are partitioned. And partitioning the questions (namely the query sentences of the query knowledge) input by the user, and carrying out association matching on the query words in the questions and the query word stock to determine corresponding synonymous word groups.
For example, the user's question is divided into the blocks what/is/PM2.5, in which "what" is a query word. The query word is matched against the query library to obtain the synonymous word group corresponding to it, for example "what = define = concept" for "what".
Step four: large-model text block extraction. Using a large model (namely the matching model), the synonymous query words of the user question and the segmented question blocks are matched against the text blocks of the knowledge to be extracted, and the text blocks with the highest similarity among them (namely the target text blocks) are obtained.
Similarity is calculated using a Triplet network. The Triplet network contains three sub-networks, each of which processes one input text sample; the three sub-networks process the "anchor", "positive" and "negative" samples, respectively. The input text is encoded by the sub-network and a feature vector is calculated. Triples (anchor, positive, negative) are constructed from the data set, where the "anchor" sample is the sample to be measured, the "positive" sample is a sample similar to the "anchor" sample, and the "negative" sample is a sample dissimilar to the "anchor" sample. The distance between the "anchor" sample and the "positive" sample, and the distance between the "anchor" sample and the "negative" sample, are calculated; Euclidean distance, cosine distance, and the like may be used. Training uses the ternary (triplet) loss as the objective function, which requires the distance between the "anchor" sample and the "positive" sample to be less than the distance between the "anchor" sample and the "negative" sample. During training, the network adjusts its weights by minimizing the ternary loss, so that similar samples are drawn closer together and dissimilar samples are pushed farther apart. Through this training process, the Triplet network learns a representation that maps similar text samples to closer positions in the feature space and dissimilar text samples to positions farther apart.
Step five: and (5) integrating related text block results. And inputting the extracted text blocks into the large model again, and integrating knowledge extraction results of different text blocks by the large model to generate a final knowledge extraction result.
Using the selection-generative model (Transformer), the knowledge extraction results of a plurality of text blocks are taken as the model's input and connected in series to form an input sequence. The generation target is set as the expected coherent knowledge text. The difference between the generated text and the target text is measured using an appropriate loss function (e.g., cross-entropy loss), and the model parameters are optimized by back propagation.
In this embodiment, a computer device is provided, as shown in fig. 2, including a memory 201, a processor 202, and a computer program stored on the memory and executable on the processor, where the processor implements any of the knowledge extraction methods described above when executing the computer program.
In particular, the computer device may be a computer terminal, a server or similar computing means.
In the present embodiment, a computer-readable storage medium storing a computer program that performs any of the knowledge extraction methods described above is provided.
In particular, computer-readable storage media, including both permanent and non-permanent, removable and non-removable media, may be used to implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer-readable storage media include, but are not limited to, phase-change memory (PRAM), static Random Access Memory (SRAM), dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), read Only Memory (ROM), electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), digital Versatile Disks (DVD) or other optical storage, magnetic cassettes, magnetic tape disk storage or other magnetic storage devices, or any other non-transmission medium, which can be used to store information that can be accessed by a computing device. Computer-readable storage media, as defined herein, does not include transitory computer-readable media (transmission media), such as modulated data signals and carrier waves.
Based on the same inventive concept, the embodiment of the invention also provides a knowledge extraction device, as described in the following embodiment. Since the principle of the knowledge extraction device for solving the problem is similar to that of the knowledge extraction method, the implementation of the knowledge extraction device can refer to the implementation of the knowledge extraction method, and the repetition is not repeated. As used below, the term "unit" or "module" may be a combination of software and/or hardware that implements the intended function. While the means described in the following embodiments are preferably implemented in software, implementation in hardware, or a combination of software and hardware, is also possible and contemplated.
Fig. 3 is a block diagram of a knowledge extraction device according to an embodiment of the invention, and as shown in fig. 3, the device includes:
a data receiving module 301, configured to receive a query sentence of a query knowledge, divide the query sentence into a plurality of first text blocks, and extract a query word in the query sentence;
the matching module 302 is configured to match the query word with a pre-stored query word data set to obtain a synonymous word group of the query word, where the synonymous word group includes query words with the same semantics as the query word;
a data partitioning module 303, configured to partition knowledge text data for query into a plurality of second text blocks;
the extracting module 304 is configured to perform similarity matching on the query word and the first text block in the synonymous phrase and each second text block through a matching model, determine the second text block with similarity meeting a preset threshold as a target text block, and extract the target text block, where the matching model is obtained by training a large model;
and the integrating module 305 is configured to integrate the extracted target text blocks into knowledge responses of the query sentences.
In one embodiment, a data partitioning module includes:
a topic identification unit for identifying different topics in knowledge text data for query, wherein each topic comprises a plurality of words, each word being generated by one topic;
a calculation unit configured to calculate, for each document in knowledge text data for query, a topic probability distribution for each document and a word probability distribution for each topic, wherein the topic probability distribution includes probabilities of different topics appearing in the document, the word probability distribution includes probabilities of different words generated by the topics;
a determining unit, configured to determine, according to the topic probability distribution, topics having a probability greater than a first probability threshold as topics appearing in the document; determining, from the word probability distribution, words having probabilities greater than a second probability threshold as generated by the topic;
and the first block dividing unit is used for dividing the word generated by each theme into a plurality of second text blocks according to the semantic unit for each theme appearing in each document.
In one embodiment, the first partitioning unit is configured to partition words generated by each topic into different knowledge layers according to a concept scope of knowledge, where the knowledge layers include a knowledge definition principle layer, a multi-domain knowledge cross extension layer, and a knowledge application layer; in each knowledge level, the words of each knowledge level are divided into a plurality of second text blocks according to semantic units.
In an embodiment, the calculation unit is configured to calculate the topic probability distribution by:

$$\theta_{m,k} = \frac{n_{m,k} + \alpha_k}{\sum_{k'=1}^{K}\left(n_{m,k'} + \alpha_{k'}\right)}$$

and the word probability distribution by:

$$\varphi_{k,t} = \frac{n_{k,t} + \beta_t}{\sum_{t'=1}^{V}\left(n_{k,t'} + \beta_{t'}\right)}$$

where $\theta_{m,k}$ is the probability of the kth topic for the mth document, $n_{m,k}$ is the total number of words belonging to the kth topic in the mth document, $\alpha_k$ is the kth value of the topic prior parameter vector, $\varphi_{k,t}$ is the probability that the kth topic generates the tth word, $n_{k,t}$ is the total number of times the kth topic generates the tth word in all documents, $\beta_t$ is the tth value of the word prior parameter vector, V is the total number of words, and K is the total number of topics.
In one embodiment, a data partitioning module includes:
the second block dividing unit is used for dividing knowledge text data for query into different knowledge layers according to the concept range of knowledge, wherein the knowledge layers comprise a knowledge definition principle layer, a multi-domain knowledge cross expansion layer and a knowledge application layer; in each knowledge level, dividing the data of each knowledge level into a plurality of second text blocks according to semantic units.
In one embodiment, the matching module is configured to input the query words and the second text blocks in the synonymous word group into a trained Triplet network, and calculate a first distance between each query word and each second text block through the Triplet network to obtain a plurality of first distances; mapping a plurality of first distances into a feature space in order from small to large; determining a second text block corresponding to a first distance smaller than a preset distance threshold value as a target text block in the feature space; inputting the first text blocks and the second text blocks into a trained Triplet network, and calculating a second distance between each first text block and each second text block through the Triplet network to obtain a plurality of second distances; mapping a plurality of the second distances into a feature space in order from small to large; and determining a second text block corresponding to a second distance smaller than a preset distance threshold value as a target text block in the feature space.
In one embodiment, the integrating module is configured to input each extracted target text block into a trained selection-generative model (Transformer), which connects the target text blocks in series according to their semantics to form a knowledge sequence, so as to obtain the knowledge response of the query sentence.
The embodiment of the invention achieves the following technical effects: matching the knowledge text against the synonymous word group of the question word in the form of text blocks ensures that knowledge extraction remains focused on and accurate with respect to the question words in the synonymous word group, improving the accuracy of knowledge extraction. Meanwhile, in the process of extracting text blocks, the query words in the synonymous word group help the matching model better understand the context and accurately analyze the meaning of the query sentence, so that the matching model can better understand and accurately extract the target text blocks and the text blocks contextually associated with them, reducing ambiguity and further improving the accuracy of knowledge extraction and answering. In addition, since the matching model is obtained by training a large model, combining the query words with the text blocks concentrates the attention of the large model on the text blocks related to the query words, which reduces the amount of text the large model must process and further improves knowledge extraction efficiency.
It will be apparent to those skilled in the art that the modules or steps of the embodiments of the invention described above may be implemented in a general purpose computing device, they may be concentrated on a single computing device, or distributed across a network of computing devices, they may alternatively be implemented in program code executable by computing devices, so that they may be stored in a storage device for execution by computing devices, and in some cases, the steps shown or described may be performed in a different order than what is shown or described, or they may be separately fabricated into individual integrated circuit modules, or a plurality of modules or steps in them may be fabricated into a single integrated circuit module. Thus, embodiments of the invention are not limited to any specific combination of hardware and software.
The above description is only of the preferred embodiments of the present invention and is not intended to limit the present invention, and various modifications and variations can be made to the embodiments of the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A knowledge extraction method, comprising:
receiving a query sentence of query knowledge, dividing the query sentence into a plurality of first text blocks, and extracting a query word in the query sentence;
matching the query words with a pre-stored query word data set to obtain synonymous word groups of the query words, wherein the synonymous word groups comprise the query words with the same semantic meaning as the query words;
dividing knowledge text data for query into a plurality of second text blocks;
respectively carrying out similarity matching on the query words and the first text blocks in the synonymous word groups and each second text block through a matching model, determining the second text blocks with the similarity meeting a preset threshold as target text blocks, and extracting the target text blocks, wherein the matching model is obtained by training a large language model;
and integrating each extracted target text block into a knowledge response of the query statement.
2. The knowledge extraction method of claim 1 wherein dividing knowledge text data for query into a plurality of second text blocks comprises:
identifying different topics in the knowledge text data for the query, wherein each topic includes a plurality of words, each word generated by one topic;
calculating, for each document in the knowledge text data for query, a topic probability distribution for each document and a word probability distribution for each topic, wherein the topic probability distribution includes probabilities of different topics appearing in the document, the word probability distribution including probabilities of different words generated by the topics;
determining topics with probability greater than a first probability threshold as topics appearing in the document according to the topic probability distribution; determining, from the word probability distribution, words having probabilities greater than a second probability threshold as generated by the topic;
for each topic that appears in each document, the words generated by each topic are divided into a plurality of second text blocks by semantic unit.
3. The knowledge extraction method of claim 2, wherein dividing the word generated by each topic into a plurality of second text blocks by semantic unit comprises:
dividing words generated by each topic into different knowledge layers according to the concept range of knowledge, wherein the knowledge layers comprise a knowledge definition principle layer, a multi-domain knowledge cross expansion layer and a knowledge application layer;
in each knowledge level, the words of each knowledge level are divided into a plurality of second text blocks according to semantic units.
4. The knowledge extraction method of claim 2 wherein calculating a topic probability distribution and a word probability distribution for each document comprises:
the topic probability distribution is calculated by the following formula:

$$\theta_{m,k} = \frac{n_{m,k} + \alpha_k}{\sum_{k'=1}^{K}\left(n_{m,k'} + \alpha_{k'}\right)}$$

the word probability distribution is calculated by the following formula:

$$\varphi_{k,t} = \frac{n_{k,t} + \beta_t}{\sum_{t'=1}^{V}\left(n_{k,t'} + \beta_{t'}\right)}$$

where $\theta_{m,k}$ is the probability of the kth topic for the mth document, $n_{m,k}$ is the total number of words belonging to the kth topic in the mth document, $\alpha_k$ is the kth value of the topic prior parameter vector, $\varphi_{k,t}$ is the probability that the kth topic generates the tth word, $n_{k,t}$ is the total number of times the kth topic generates the tth word in all documents, $\beta_t$ is the tth value of the word prior parameter vector, V is the total number of words, and K is the total number of topics.
5. The knowledge extraction method of claim 1, wherein dividing the knowledge text data for query into a plurality of second text blocks comprises:
dividing the knowledge text data for query into different knowledge layers according to the concept range of the knowledge, wherein the knowledge layers comprise a knowledge definition and principle layer, a multi-domain knowledge cross-expansion layer, and a knowledge application layer;
and in each knowledge layer, dividing the data of the layer into a plurality of second text blocks according to semantic units.
6. The knowledge extraction method of claim 1, wherein performing similarity matching between each second text block and both the query words in the synonymous word groups and the first text blocks through a matching model, and determining second text blocks whose similarity meets a preset threshold as target text blocks, comprises:
inputting the query words in the synonymous word groups and the second text blocks into a trained Triplet network, and calculating a first distance between each query word and each second text block through the Triplet network to obtain a plurality of first distances;
mapping the plurality of first distances into a feature space in ascending order;
determining, in the feature space, the second text blocks corresponding to first distances smaller than a preset distance threshold as target text blocks;
inputting the first text blocks and the second text blocks into the trained Triplet network, and calculating a second distance between each first text block and each second text block through the Triplet network to obtain a plurality of second distances;
mapping the plurality of second distances into the feature space in ascending order;
and determining, in the feature space, the second text blocks corresponding to second distances smaller than the preset distance threshold as target text blocks.
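For illustration only: an inference-time sketch of the claimed retrieval, with an untrained Triplet-style encoder and a toy character-hash featuriser standing in for a real text encoder. Claim 6 runs this twice, once with the synonym-group query words and once with the first text blocks; training such an encoder would typically use a triplet margin loss (e.g. torch.nn.TripletMarginLoss).

```python
import torch
import torch.nn as nn

class TripletEncoder(nn.Module):
    """Embedding tower of a Triplet network (untrained in this sketch)."""
    def __init__(self, in_dim: int = 64, emb_dim: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, emb_dim), nn.ReLU(), nn.Linear(emb_dim, emb_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

def featurise(text: str, dim: int = 64) -> torch.Tensor:
    """Toy character-hash features; a stand-in for learned text embeddings."""
    v = torch.zeros(dim)
    for ch in text:
        v[hash(ch) % dim] += 1.0
    return v

def retrieve(probes, second_blocks, encoder, dist_threshold: float):
    # Embed query-side probes (query words or first text blocks) and blocks.
    q = encoder(torch.stack([featurise(t) for t in probes]))
    b = encoder(torch.stack([featurise(t) for t in second_blocks]))
    d = torch.cdist(q, b)                       # pairwise Euclidean distances
    best = d.min(dim=0).values                  # best distance per second block
    order = torch.argsort(best)                 # ascending: small -> large
    return [second_blocks[int(i)] for i in order
            if best[int(i)] < dist_threshold]   # keep blocks under threshold
```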
7. The knowledge extraction method of claim 1, wherein integrating the extracted target text blocks into a knowledge response to the query sentence comprises:
inputting the extracted target text blocks into a trained selective generative Transformer model, and connecting the target text blocks into a knowledge sequence according to their semantics through the selective generative Transformer model to obtain the knowledge response to the query sentence.
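For illustration only: the claim's selective generative Transformer is stood in for below by an off-the-shelf sequence-to-sequence model via the Hugging Face transformers library; the model name and prompt are examples, not the patent's model.

```python
from transformers import pipeline

def integrate_blocks(target_blocks, model_name: str = "google/flan-t5-small"):
    # Seq2seq generator standing in for the trained selective generative model.
    generator = pipeline("text2text-generation", model=model_name)
    prompt = ("Connect the following knowledge fragments into one coherent "
              "answer:\n" + "\n".join(f"- {b}" for b in target_blocks))
    return generator(prompt, max_new_tokens=128)[0]["generated_text"]
```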
8. A knowledge extraction device, comprising:
the data receiving module is used for receiving a query sentence for querying knowledge, dividing the query sentence into a plurality of first text blocks, and extracting query words from the query sentence;
the matching module is used for matching the query words against a pre-stored query word data set to obtain synonymous word groups of the query words, wherein each synonymous word group comprises words having the same semantics as the corresponding query word;
the data partitioning module is used for dividing the knowledge text data for query into a plurality of second text blocks;
the extraction module is used for performing similarity matching, through a matching model, between each second text block and both the query words in the synonymous word groups and the first text blocks, determining second text blocks whose similarity meets a preset threshold as target text blocks, and extracting the target text blocks, wherein the matching model is obtained by training a large language model;
and the integration module is used for integrating the extracted target text blocks into a knowledge response to the query sentence.
9. A computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the knowledge extraction method of any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed, implements the knowledge extraction method of any one of claims 1 to 7.
CN202311352348.2A 2023-10-19 2023-10-19 Knowledge extraction method, knowledge extraction device, computer equipment and storage medium Active CN117094396B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311352348.2A CN117094396B (en) 2023-10-19 2023-10-19 Knowledge extraction method, knowledge extraction device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN117094396A 2023-11-21
CN117094396B 2024-01-23

Family

ID=88777655

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311352348.2A Active CN117094396B (en) 2023-10-19 2023-10-19 Knowledge extraction method, knowledge extraction device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117094396B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110516055A * 2019-08-16 2019-11-29 Northwestern Polytechnical University A cross-platform intelligent question-answering implementation method for teaching tasks combining BERT
CN112686025A * 2021-01-27 2021-04-20 Zhejiang Gongshang University Method for generating distractors for Chinese multiple-choice questions based on free text
CN113312922A * 2021-04-14 2021-08-27 The 28th Research Institute of China Electronics Technology Group Corporation Improved chapter-level triple information extraction method
CN116150311A * 2022-08-16 2023-05-23 Mashang Consumer Finance Co., Ltd. Training method of text matching model, intention recognition method and device
CN116166782A * 2023-02-07 2023-05-26 Shandong Inspur Science Research Institute Co., Ltd. Intelligent question-answering method based on deep learning
WO2023093574A1 * 2021-11-25 2023-06-01 Beijing University of Posts and Telecommunications News event search method and system based on multi-level image-text semantic alignment model
CN116822530A * 2023-01-10 2023-09-29 Hangzhou Dianzi University Knowledge-graph-based question-answer pair generation method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
SHIYANG LI et al.: "Graph Reasoning for Question Answering with Triplet Retrieval", Association for Computational Linguistics, pages 3366 *

Also Published As

Publication number Publication date
CN117094396B (en) 2024-01-23

Similar Documents

Publication Publication Date Title
CN109284363B (en) Question answering method and device, electronic equipment and storage medium
CN109472033B (en) Method and system for extracting entity relationship in text, storage medium and electronic equipment
CN110442718B (en) Statement processing method and device, server and storage medium
CN104933164B (en) In internet mass data name entity between relationship extracting method and its system
CN111124487B (en) Code clone detection method and device and electronic equipment
CN112182230B (en) Text data classification method and device based on deep learning
CN114625858A (en) Intelligent government affair question-answer replying method and device based on neural network
Yang et al. Place deduplication with embeddings
Garrido-Munoz et al. A holistic approach for image-to-graph: application to optical music recognition
Purwandari et al. Twitter-based classification for integrated source data of weather observations
CN117114112B (en) Vertical field data integration method, device, equipment and medium based on large model
CN117094396B (en) Knowledge extraction method, knowledge extraction device, computer equipment and storage medium
CN115730221A (en) False news identification method, device, equipment and medium based on traceability reasoning
CN111126053A (en) Information processing method and related equipment
CN111339446B (en) Interest point mining method and device, electronic equipment and storage medium
CN114662496A (en) Information identification method, device, equipment, storage medium and product
Gong Analysis of internet public opinion popularity trend based on a deep neural network
CN105808522A (en) Method and apparatus for semantic association
CN116227601B (en) Verb tense-based generalization causal network construction method, equipment and medium
Hou et al. Automatic Classification of Basic Nursing Teaching Resources Based on the Fusion of Multiple Neural Networks.
CN114647744B (en) Architecture modeling method and device
Tian Construction of Computer English Corpus Assisted by Internet of Things Information Perception and Interaction Technology
Li Question and Answer Techniques for Financial Audits in Universities Based on Deep Learning
Banu S. Graph-Based Rumor Detection on Social Media Using Posts and Reactions
Jiang et al. A Feature Vector Representation Approach for Short Text Based on RNNLM and Pooling Computation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant