WO2021063086A1 - 3-tuple extraction method, device, apparatus, and storage medium - Google Patents

3-tuple extraction method, device, apparatus, and storage medium Download PDF

Info

Publication number
WO2021063086A1
Authority
WO
WIPO (PCT)
Prior art keywords
word
probability
type
vector
triple
Prior art date
Application number
PCT/CN2020/103209
Other languages
French (fr)
Chinese (zh)
Inventor
戴泽辉
Original Assignee
北京国双科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京国双科技有限公司 filed Critical 北京国双科技有限公司
Publication of WO2021063086A1 publication Critical patent/WO2021063086A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/28 Databases characterised by their database models, e.g. relational or object models
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F 40/295 Named entity recognition

Definitions

  • the present invention relates to the field of electronic information technology, and more specifically, to a triple extraction method, device, equipment and storage medium.
  • the extraction of triples refers to the extraction of subjects, objects, and their relationships from unstructured text based on rules.
  • for example, if the unstructured text is "Zhang Xiaoming married Li Xiaohong in 2018", then for rule A [person-husband-person],
  • the triple can be extracted from the unstructured text as: Zhang Xiaoming-husband-Li Xiaohong.
  • the method of extracting triples from unstructured text can apply a multi-step process method, for example, a two-step method can be used to extract triples.
  • the specific steps can be: the first step is to extract the entities contained in the text; the second step is to classify and associate the extracted entities, and finally determine the relationship between the entities.
  • the present invention provides a triple extraction method, device, equipment, and storage medium that overcomes the above problems or at least partially solves the above problems, as follows:
  • a method for extracting triples including:
  • the word vector is input into the preset model to obtain the recognition result corresponding to each word vector;
  • the recognition result includes the first-type probability of the word vector; the first-type probability of any word vector represents the probability that the word corresponding to the word vector belongs to an object contained in a triple relationship;
  • if any word satisfies a preset condition, the word is determined to be a target word, and the preset condition includes that the probability that the word belongs to any object in a triple relationship is greater than the first preset threshold;
  • the target word pairs belonging to the objects in the same triple relationship are formed into triples according to the relationship between the objects in the triple relationship.
  • the word vector includes at least one of the following: a word meaning vector and a part-of-speech vector.
  • the objects included in the triple relationship include objects of the first type and objects of the second type, and the recognition result further includes:
  • the second-type probability of any word vector represents the probability that the word corresponding to the word vector belongs to a first-type object and the probability that it belongs to a second-type object.
  • the preset conditions also include:
  • the probability that the word belongs to the first type of object or the second type of object is greater than the second preset threshold.
  • the preset model includes an encoder and a decoder
  • the encoder is used to obtain a feature vector from the word vector
  • the decoder includes a first decoding module and a second decoding module
  • the first decoding module is configured to determine the first type probability according to the feature vector
  • the second decoding module is configured to determine the second type probability according to the feature vector
  • the first decoding module includes 2*n sigmoid functions, where n is the number of triples learned by the preset model, and the second decoding module includes 2 sigmoid functions.
  • the training process of the preset model includes:
  • the recognition result includes the first type probability and the second type probability of the sample word vector;
  • the sample word vector includes the word vector of each word in the sample text;
  • the sample first-type probability and the sample second-type probability are determined according to the triples marked in the sample text.
  • a triple extraction device including:
  • the word vector acquiring unit is used to acquire the word vector of each word composing the text to be recognized;
  • the model prediction unit is used to input the word vector into a preset model to obtain the recognition result corresponding to each word vector;
  • the recognition result includes the first-type probability of the word vector; the first-type probability of any word vector represents the probability that the word corresponding to the word vector belongs to an object contained in a triple relationship;
  • the target word determining unit is configured to determine that any word meeting a preset condition is a target word, and the preset condition includes that the probability that the word belongs to any object in a triple relationship is greater than the first preset threshold;
  • the triple determination unit is used to group the target word pairs belonging to the objects in the same triple relationship into triples according to the relationship between the objects in the triple relationship.
  • a triple extraction device including: a memory and a processor
  • the memory is used to store programs
  • the processor is configured to execute the program to implement each step of the triple extraction method described above.
  • the word vector of each word in the text to be recognized is input into a preset model, and the recognition result corresponding to each word vector is obtained.
  • the recognition result corresponding to each word vector represents the probability that the word corresponding to the word vector belongs to the object contained in the triple relationship.
  • when the probability that any word belongs to any object is greater than the first preset threshold, that word is a target word, and any pair of target words is formed into a triple according to the corresponding relationship of the pair. Because any pair of target words includes two words belonging to the objects contained in the same triple relationship, the relationship of the target word pair is the relationship between the objects to which the pair belongs in that triple relationship.
  • the technical solution provided by this application can output the probability that each word belongs to all the objects included in the triple relationship learned by the model based on the preset model, and further determine the rules according to the output of the model to obtain the information in the text to be recognized. All triples, that is, there is no need to use the model in multiple steps.
  • the output of the model serves as the basis for the rule determination. Therefore, the accumulation of model errors is avoided and the accuracy of the results can be improved.
  • FIG. 1 is a schematic flowchart of a method for extracting triples according to an embodiment of the application
  • FIG. 3 is a schematic structural diagram of a preset model provided by an embodiment of this application.
  • FIG. 4 is a schematic diagram of a training process of a preset model provided by an embodiment of this application.
  • FIG. 5 is a schematic structural diagram of a triple extraction device provided by an embodiment of this application.
  • FIG. 6 is a schematic structural diagram of a triple extraction device provided by an embodiment of the application.
  • the triple extraction method provided in the embodiments of the present application can be applied to smart devices, such as computers, tablets, or smart phones, or to a server preset with a text processing system.
  • Fig. 1 is a schematic flow chart of a method for extracting triples according to an embodiment of the present application.
  • the method may specifically include:
  • the word vector includes the word vector of each word constituting the text to be recognized.
  • the text to be recognized is an unstructured text that requires triple extraction.
  • the text to be recognized may include at least one sentence, each sentence is composed of words, and may contain punctuation marks. After word segmentation is performed on the text to be recognized, all the words that make up the text to be recognized can be obtained.
  • the word vector obtained in this step is the word vector generated by mapping each word constituting the text to be recognized to the vector space.
  • word segmentation processing refers to splitting a sentence of training text based on preset word segmentation standards, and removing punctuation marks.
  • for example, the text W to be recognized is: "Zhang Xiaoming's newlywed wife Li Xiaohong's favorite novel is Mr. Qian Zhongshu's Besieged City".
  • the words composing the text W to be recognized include: "Zhang Xiaoming", "的", "newlywed wife", "Li Xiaohong", "favorite", "novel", "is", "Qian Zhongshu", "Mr.", "Besieged City".
  • word segmentation is an existing technology in the field of natural language processing.
  • existing tool software such as Harbin Institute of Technology LTP or jieba
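As a minimal illustration of the segmentation step, the sketch below uses forward maximum matching against a toy dictionary; a real system would rely on tools such as LTP or jieba, and the dictionary entries here are assumptions covering only the running example.

```python
# Toy forward-maximum-matching segmenter (illustrative only; a real
# system would use LTP or jieba). The dictionary is a hand-made
# assumption covering the running example.
WORD_DICT = {"张小明", "的", "新婚妻子", "李小红", "最喜欢", "小说", "是",
             "钱钟书", "先生", "围城"}
MAX_WORD_LEN = 4

def segment(text):
    """Greedily match the longest dictionary word at each position."""
    words, i = [], 0
    while i < len(text):
        for length in range(min(MAX_WORD_LEN, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in WORD_DICT:
                words.append(candidate)
                i += length
                break
    return words
```

Greedy maximum matching is a deliberately simple stand-in; dedicated segmenters use statistical models and much larger dictionaries.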
  • S102 Input a word vector into a preset model, and obtain a recognition result corresponding to each word vector.
  • each triple relationship is a rule, for example, the above rule A [person-husband-person] can be used as a triple relationship.
  • the preset model can learn multiple triple relationships, and each triple relationship can include two objects.
  • the above-mentioned triple relationship A [person-husband-person] includes two objects that are both "persons". Assuming that the number of pre-configured triple relationships that require model learning is n, the number of objects is 2*n.
  • each triple relationship can include the first type of object and the second type of object, and this method records the first object in each triple relationship as the first type of object, and the last object in each triple relationship An object is recorded as an object of the second type, and the object includes n objects of the first type and n objects of the second type.
  • the recognition result corresponding to any word vector output by the model includes the first-type probability of the word vector, and the first-type probability represents the probability that the word generating the word vector belongs to the object. Since the number of objects is 2*n, the probability of the first type includes 2*n probabilities, and each probability represents the probability that the word generating the word vector belongs to the object.
  • the recognition result corresponding to w1 may include: the probability p11 that the word generating w1 belongs to the first type of object in s1, the probability p12 of belonging to the second type of object in s1, the probability p21 of belonging to the first type of object in s2, the probability p22 of belonging to the second type of object in s2, and so on, up to pn1 and pn2 for sn.
  • the aforementioned probability of the first type can be predicted by the 2*n sigmoid functions in the preset model.
  • the preset condition includes that the probability that the word belongs to any object is greater than the first preset threshold.
  • the first preset threshold value is a threshold value set when the rule is determined according to the output of the model.
  • the size of the threshold can be set to 0.5; that is, when the first-type probability corresponding to the word vector generated by any word indicates that the probability that the word belongs to an object is greater than the first preset threshold, it is determined that the word belongs to that object, and the word is taken as a target word.
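The selection rule of this step can be sketched as follows; the dict layout mapping each word to its 2*n object probabilities is an assumption made for illustration, not the patent's data structure:

```python
# Hedged sketch: a word becomes a target word when its probability for
# any object exceeds the first preset threshold (0.5 in the text).
FIRST_THRESHOLD = 0.5

def find_target_words(first_type_probs):
    """Return {word: [indices of objects whose probability exceeds threshold]}."""
    targets = {}
    for word, probs in first_type_probs.items():
        hits = [i for i, p in enumerate(probs) if p > FIRST_THRESHOLD]
        if hits:
            targets[word] = hits
    return targets
```

Note that a word may exceed the threshold for several objects at once, which is exactly the multi-membership property the application relies on.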
  • suppose the text W to be recognized is "Zhang Xiaoming's newlywed wife Li Xiaohong's favorite novel is Mr. Qian Zhongshu's Besieged City", in which the first-type probability of the word vector generated by "Zhang Xiaoming" indicates that
  • the probability of "Zhang Xiaoming" belonging to the first type of object in the triple relationship [person-husband-person] is greater than 0.5.
  • the probability that "Li Xiaohong" belongs to the second type of object in the triple relationship [person-husband-person] is greater than 0.5; that is, the words "Zhang Xiaoming" and "Li Xiaohong" both meet the preset conditions and are determined to be target words.
  • any target word determined in this step can belong to multiple objects.
  • the probability that "Zhang Xiaoming” belongs to the first-type object in the triple relationship [person-husband-person] is greater than 0.5
  • the first-type probability also indicates that the probability that "Zhang Xiaoming" belongs to the second type of object in the triple relationship [person-wife-person] is greater than 0.5; therefore, the target word "Zhang Xiaoming" belongs to two objects.
  • S104 Combine the target word pairs belonging to the objects in the same triple relationship into triples according to the relationship between the objects in that triple relationship.
  • a pair of target words includes two target words belonging to two objects in the same triple relationship.
  • the objects in any triple relationship include objects of the first type and objects of the second type, and the relationship between the two objects can be represented by the relationship label in the triple.
  • the corresponding relationship of the target word pair is also the relationship indicated by the relationship label in the triple relationship.
  • for example, in the triple relationship C [book-author-person], the first object "book" is the first type of object,
  • the second object "person" is the second type of object,
  • and the relationship between the two objects is that the "person" is the "author" of the "book".
  • suppose the probability that the word "Besieged City" belongs to the first type of object is greater than 0.5,
  • and the probability that the word "Qian Zhongshu" belongs to the second type of object is greater than 0.5.
  • since there may be multiple target words belonging to an object, the following situation may occur: there are multiple target words belonging to the first type of object in a triple relationship, and at the same time there are multiple target words belonging to the second type of object in that triple relationship.
  • the target word pair can be selected according to the position of the target word in the text to be recognized.
  • the two closest target words belonging to the objects in the same triple relationship are regarded as a pair of target words.
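A minimal sketch of this nearest-position pairing, assuming target words are given as (word, position) tuples within one triple relationship (an illustrative representation, not the patent's data structure):

```python
# Hedged sketch: pair each first-type target word with the second-type
# target word closest to it by position in the text to be recognized.
def pair_targets(first_positions, second_positions):
    """Pair (word, position) tuples of first/second-type target words."""
    pairs = []
    for fw, fpos in first_positions:
        # choose the second-type target word at minimal positional distance
        best = min(second_positions, key=lambda s: abs(s[1] - fpos))
        pairs.append((fw, best[0]))
    return pairs
```

A production system might additionally prevent one word from being paired twice; the text only specifies the nearest-distance criterion, so the sketch stops there.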
  • the word vector of each word in the text to be recognized is input into a preset model, and the recognition result corresponding to each word vector is obtained.
  • the recognition result corresponding to each word vector represents the probability that the word corresponding to the word vector belongs to an object contained in a triple relationship; when the probability of any word belonging to any object is greater than the first preset threshold, that word is a target word, and any pair of target words is formed into a triple according to the corresponding relationship of the pair. Because any pair of target words includes two target words that belong to the objects included in the same triple relationship, the relationship of the target word pair is the relationship between the objects to which the pair belongs in that triple relationship.
  • the model can output the probability that a word belongs to all objects included in the triple relationship learned by the model, and further rule judgments based on the output of the model can obtain all triples in the text to be recognized, namely There is no need to use the model in multiple steps.
  • the output of the model serves as the basis for the rule determination. Therefore, the accumulation of model errors is avoided and the accuracy of the results can be improved.
  • any target word can belong to multiple objects, that is, any target word can belong to objects in multiple triple relationships; thus, without multiple models, the multiple triples to which any word belongs can be extracted. It can be seen that the triple extraction method disclosed in this application greatly improves the efficiency of triple extraction.
  • the inventor found, in the process of implementing the invention, that in order to improve the accuracy of triple extraction, a model based on LSTM-CRF (Long Short-Term Memory - Conditional Random Field) can be used.
  • however, with that approach each word can only belong to one triple relationship; that is, when extracting triples through the LSTM-CRF model, any word can belong to only one triple.
  • if the words contained in the text to be recognized belong to two or more triples, multiple LSTM-CRF models must be established at the same time to extract triples for the different triple relationships, which results in low triple extraction efficiency.
  • for the triple relationship A [person-husband-person], the triple can be extracted from the text W to be recognized as: [Zhang Xiaoming-husband-Li Xiaohong].
  • for the triple relationship B [person-wife-person], the triple can be extracted from the text W to be recognized as: [Li Xiaohong-wife-Zhang Xiaoming]. It can be seen that "Zhang Xiaoming" and "Li Xiaohong" belong to different triple relationships at the same time. Therefore, when a triple extraction system is established in advance based on the LSTM-CRF model, two LSTM-CRF models need to be trained to extract triples for triple relationship A and triple relationship B respectively, resulting in low execution efficiency.
  • in contrast, in the triple extraction method provided by this application, the model can be configured with multiple triple relationships as needed during the learning process, and any target word may belong to multiple objects at the same time.
  • the probability that "Zhang Xiaoming” belongs to the first-type object in the triple relationship A [person-husband-person] is greater than 0.5 .
  • the probability that "Zhang Xiaoming” belongs to the second-type object in the triple relationship B [person-wife-person] is also greater than 0.5. Therefore, "Zhang Xiaoming” can belong to the two triples [Zhang Xiaoming-husband-Li Xiaohong] and [Li Xiaohong-wife-Zhang Xiaoming] determined by the triple relationship A and the triple relationship B at the same time.
  • the word vector obtained in S101 may include a word meaning vector and a part-of-speech vector.
  • the word meaning vector of a word is the vector obtained by mapping the word meaning to the vector space, and this vector can represent the word meaning information of the word.
  • the part-of-speech vector of a word is the vector obtained by mapping the part-of-speech to the vector space, and the vector can represent the part-of-speech information of the word.
  • the word meaning vector of each word can be obtained by looking it up in the word vector mapping set (i.e., the word vectors).
  • Word vector is a set of word sense vector mapping correspondences generated by word vector training using the corpus of the field to which the text to be recognized belongs.
  • Word vector can map words to a low-dimensional vector space, and can express the similar relationship between the word meanings of various words through the relationship between the word meaning vectors.
  • the low-frequency words in the corpus can be marked as UNK. UNK has a unique vector expression in the word vector, and its dimension is consistent with the dimension of the word meaning vector corresponding to other words.
  • suppose the k words that make up the text are e1, e2, ..., ek; the word meaning vectors generated by all words can then be obtained through the word vectors, which are h1, h2, ..., hk.
  • for a low-frequency word not contained in the mapping, the word meaning vector obtained according to the word vectors is the UNK vector.
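The lookup-with-UNK-fallback described above can be sketched as follows; the tiny 3-dimensional vectors and the words in the table are illustrative assumptions, not a real trained embedding set:

```python
# Hedged sketch of word-meaning-vector lookup with an UNK fallback:
# low-frequency words absent from the mapping share one UNK vector
# whose dimension matches the other word meaning vectors.
UNK = [0.0, 0.0, 0.0]
WORD_VECTORS = {
    "张小明": [0.1, 0.3, -0.2],
    "小说":   [0.4, -0.1, 0.2],
}

def lookup(word):
    """Return the word-meaning vector, or the shared UNK vector."""
    return WORD_VECTORS.get(word, UNK)
```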
  • the part-of-speech information is the part-of-speech to which the meaning of each word in the text to be recognized belongs.
  • the part-of-speech vector can be expressed as a random vector of a certain dimension. For example, for a total of 30 parts of speech [A1, A2, ..., A30], the part-of-speech vector a1 can represent A1, a2 can represent A2, ..., and a30 can represent A30.
  • the dimensions of the part-of-speech vectors a1, a2, ..., a30 can be preset, and each component is a randomly generated decimal close to 0.
  • then the word meaning vector and the part-of-speech vector generated from each word are input to the preset model, and step S102 is executed.
  • the word vector obtained in S101 may also only include the word meaning vector, that is, the word meaning vector of each word in the text to be recognized is directly input as the word vector to the preset model, and step 102 is executed.
  • FIG. 2 is a schematic flowchart of another implementation manner of the triple extraction method provided by an embodiment of the application. as follows:
  • the word vector includes a word vector generated by each word constituting the text to be recognized.
  • the word vector includes a word meaning vector and a part-of-speech vector.
  • S202 Input the word vector into the preset model, and obtain the recognition result corresponding to each word vector.
  • the recognition result includes the first-type probability of the word vector, and also includes the second-type probability of the word vector.
  • FIG. 3 shows a schematic structural diagram of a preset model, and the preset model includes an encoder and a decoder.
  • the encoder can be a two-way LSTM model.
  • suppose the text E to be recognized is composed of e1, e2, ..., ek.
  • for each word ej, the generated word meaning vector hj and part-of-speech vector aj are input to the encoder, which then encodes them.
  • the feature vector Ej is obtained by concatenating the word meaning vector and the part-of-speech vector of the word.
  • the dimension of the word meaning vector is 100
  • the dimension of the part-of-speech vector is 20
  • the dimension of the feature vector obtained by the encoder is 120. It should be noted that if the number of words in the text to be recognized is m, inputting the m word vectors into the encoder yields m 120-dimensional feature vectors, which are arranged to form an m*120 vector matrix; the matrix can be extended to a specific length by padding with zeros.
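A sketch of this concatenate-and-zero-pad step; the fixed target length of 8 rows is an assumption (the text does not specify it), and the function accepts any vector dimensions, not only the 100- and 20-dimensional vectors of the example:

```python
# Hedged sketch: concatenate each word's meaning vector and
# part-of-speech vector into one row, then extend the matrix with
# all-zero rows to a fixed length, as described in the text.
FIXED_LEN = 8  # assumed padding length for illustration

def build_feature_matrix(meaning_vecs, pos_vecs):
    rows = [m + p for m, p in zip(meaning_vecs, pos_vecs)]
    dim = len(rows[0])
    while len(rows) < FIXED_LEN:
        rows.append([0.0] * dim)   # zero-padding
    return rows
```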
  • the preset model can learn n triple relationships, and each triple relationship can include two objects. Since the first object in each triple relationship is the first type of object, and the last object in each triple relationship is the second type of object, the number of objects of the first type is n, and the number of objects of the second type is n.
  • the second-type probability of any word vector represents the probability that the word generating the word vector belongs to the first-type object and the probability that it belongs to the second-type object, that is, the second-type probability of any word includes two probability values.
  • the decoder in the model includes a first decoding module and a second decoding module.
  • the first decoding module includes 2*n sigmoid functions, which are used to determine the probability of the first type according to the feature vector.
  • the probability of the first type may refer to the description in step S102. As shown in Figure 3, taking the feature vector Ej as an example, the process of determining the probability of the first type is:
  • the probability that the word corresponding to the feature vector belongs to the first type of object in the i-th triple relationship is output through the (2i-1)-th sigmoid function, and the probability that it belongs to the second type of object in the i-th triple relationship is output through the 2i-th sigmoid function (i.e., sig2i).
  • the second decoding module includes two sigmoid functions, which are used to determine the second type of probability according to the feature vector. As shown in Figure 3, taking the feature vector Ej as an example, the process of determining the second type of probability is:
  • the first sigmoid function (ie, sig1) is used to output the probability that the word corresponding to the feature vector belongs to the first type of object in any triple relationship.
  • the second sigmoid function (ie, sig2) is used to output the probability that the word corresponding to the feature vector belongs to the second type of object in any triple relationship.
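The two decoder heads can be sketched as simple linear-plus-sigmoid units over the feature vector; the weight vectors passed in below are arbitrary stand-ins for illustration, not trained parameters:

```python
import math

# Hedged sketch of the decoder heads: 2*n sigmoid units produce the
# first-type probabilities and 2 sigmoid units produce the second-type
# probabilities, each modeled here as dot-product-plus-sigmoid.
def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def decode(feature_vec, first_weights, second_weights):
    """first_weights: 2*n weight vectors; second_weights: 2 weight vectors."""
    dot = lambda w, v: sum(wi * vi for wi, vi in zip(w, v))
    first_type = [sigmoid(dot(w, feature_vec)) for w in first_weights]
    second_type = [sigmoid(dot(w, feature_vec)) for w in second_weights]
    return first_type, second_type
```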
  • the preset condition includes that the probability of the word belonging to the first type of object or the probability of belonging to the second type of object is greater than the second preset threshold.
  • the second preset threshold is a threshold set when the rule is determined according to the output of the model; generally, it can be set to 0.5. That is, when the first-type probability corresponding to the word vector generated by any word indicates that the probability of the word belonging to an object is greater than the first preset threshold, and the second-type probability corresponding to that word vector indicates that the probability of the word belonging to the first type of object or the second type of object is greater than the second preset threshold, the word is determined to be the target word.
  • the recognition result corresponding to w1 may include the first-type probabilities, which are: the probability p11 that the word generating w1 belongs to the first type of object in s1, the probability p12 of belonging to the second type of object in s1, the probability p21 of belonging to the first type of object in s2, the probability p22 of belonging to the second type of object in s2, ..., the probability pn1 of belonging to the first type of object in sn, and the probability pn2 of belonging to the second type of object in sn.
  • the recognition result corresponding to w1 also includes the second-type probabilities, respectively: the probability p1 that the word generating w1 belongs to any first-type object and the probability p2 that the word generating w1 belongs to any second-type object.
  • p1 is greater than 0.5 and p11 is greater than 0.5, it is determined that the word generating w1 belongs to the first type of object in s1, and the word is determined as the target word. If p1 is greater than 0.5 and p11 is less than or equal to 0.5, it is determined that the word generating w1 does not belong to the first type of object in s1. If p1 is less than or equal to 0.5 and p11 is less than or equal to 0.5, it is determined that the word generating w1 does not belong to the first type of object in s1.
  • similarly, if p2 is greater than 0.5 and the first-type probability pr2 of belonging to the second type of object in sr is greater than 0.5, it is determined that the word generating w1 belongs to the second type of object in sr, and the word is determined as the target word.
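The combined decision rule of this embodiment can be sketched as follows, assuming both thresholds are 0.5 as in the text: a word counts as a first-type object in relationship s_i only if both its second-type probability p1 and its per-relationship probability p_i1 clear their thresholds, and symmetrically for second-type objects.

```python
# Hedged sketch of the two-probability decision rule: both the
# second-type probability and the per-relationship first-type
# probability must exceed their thresholds (both 0.5 here).
THRESHOLD = 0.5

def is_first_type_object(p1, p_i1, threshold=THRESHOLD):
    return p1 > threshold and p_i1 > threshold

def is_second_type_object(p2, p_i2, threshold=THRESHOLD):
    return p2 > threshold and p_i2 > threshold
```

Requiring both signals acts as a cross-check between the two decoder heads and filters out words that only one head weakly supports.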
  • S204 Combine the target word pairs belonging to the objects in the same triple relationship into triples according to the relationship between the objects in that triple relationship.
  • Figure 4 shows a schematic diagram of the training process of a preset model, including:
  • S401 Input the sample word vector into the preset model to be trained, and obtain the recognition result output by the preset model.
  • the sample word vector includes the word vector of the word in the sample text.
  • the sample text may include multiple unstructured text fragments having a triple relationship, and the number of sample text fragments may be several thousand to several hundred thousand. It should be noted that, in order to better train the model, sample texts can be obtained according to the field of the text to be recognized. For example, if the field of the text to be recognized is financial, the text belonging to the financial field can be obtained as sample text.
  • the method for obtaining the sample word vector of the word can refer to the above method for obtaining the word vector of the word in the text to be recognized.
  • the recognition result of the sample word vector output by the preset model includes the first type probability and the second type probability of the sample word vector. Specifically, for the meaning of the first type of probability and the second type of probability, reference may be made to the introduction of S202 above. This is not repeated in the embodiment of the application.
  • S402 Use the difference between the predicted first-type probability and the sample first-type probability in the labeling information, and the difference between the predicted second-type probability and the sample second-type probability in the labeling information, to update the parameters of the preset model.
  • the sample text F is: "Zhang Xiaoming married Li Xiaohong in 2018".
  • the triples [Zhang Xiaoming-husband-Li Xiaohong] and [Li Xiaohong-wife-Zhang Xiaoming] are manually marked.
  • the sample first-type probability of a word represents the probability that the word belongs to each object contained in the triple relationships;
  • the sample second-type probability of a word represents the probability that the word belongs to a first-type object and the probability that it belongs to a second-type object, respectively.
  • the objects include all objects contained in the pre-configured n triple relationships that require model learning.
  • the n triple relationships may include all possible triple relationships in the text to be recognized. It is understandable that if the field of the text to be recognized differs, the possible triple relationships will differ; therefore, the n triple relationships can be pre-configured according to the field of the text to be recognized.
  • the labeling information refers to the sample first-type probability and sample second-type probability of each word in the sample text obtained according to the manually labeled triples.
  • the method for obtaining the probability of the first type of the sample and the probability of the second type of the sample is the prior art, and the embodiment of the present application only briefly introduces this with the following examples.
  • the preset triple relationships that the model needs to learn are A: [person-husband-person], B: [person-wife-person], and C: [book-author-person].
  • the sample first-type probability of each word in the sample text F can be determined by whether the word belongs to an object contained in the marked triples. Take "Zhang Xiaoming" as an example.
  • the word belongs to the first-type object in A and the second-type object in B. Therefore, in the sample first-type probability of the word, the probability that it belongs to the first-type object in A is 1, the probability that it belongs to the second-type object in A is 0, the probability that it belongs to the first-type object in B is 0, the probability that it belongs to the second-type object in B is 1, the probability that it belongs to the first-type object in C is 0, and the probability that it belongs to the second-type object in C is 0.
  • the word "marry" does not belong to any object in the above-mentioned marked triples, so each probability in the sample first-type probability of the word is 0.
  • the sample second-type probability of each word in the sample text F can be determined by whether the word belongs to any first-type object in the marked triples and whether it belongs to any second-type object in the marked triples. Taking "Zhang Xiaoming" as an example, the word belongs to the first-type object in A, so the probability that the word belongs to any first-type object is 1; and because it belongs to the second-type object in B, the probability that the word belongs to any second-type object is 1.
  • in other words, the above method for determining the sample second-type probability is: if a word belongs to a first-type object in any marked triple, it is determined that the word belongs to a first-type object in some triple relationship; if a word belongs to a second-type object in any marked triple, it is determined that the word belongs to a second-type object in some triple relationship.
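The labeling rules above can be sketched as follows (an illustrative reconstruction, not the patent's code; the slot ordering and function names are assumptions):

```python
# Build the per-word labels described above: a 2*n-dimensional sample
# first-type probability (one slot per first-/second-type object of each
# relation) and a 2-dimensional sample second-type probability
# ("belongs to any first-type object", "belongs to any second-type object").
RELATIONS = ["A: person-husband-person", "B: person-wife-person", "C: book-author-person"]

def label_word(word, marked_triples):
    """marked_triples: list of (relation_index, first_type_word, second_type_word)."""
    n = len(RELATIONS)
    first_type = [0.0] * (2 * n)  # slots: [A-first, A-second, B-first, B-second, ...]
    second_type = [0.0, 0.0]
    for rel, w1, w2 in marked_triples:
        if word == w1:                     # word fills the first-type slot of this relation
            first_type[2 * rel] = 1.0
            second_type[0] = 1.0
        if word == w2:                     # word fills the second-type slot of this relation
            first_type[2 * rel + 1] = 1.0
            second_type[1] = 1.0
    return first_type, second_type

# Manually marked triples for sample text F:
# [Zhang Xiaoming-husband-Li Xiaohong] and [Li Xiaohong-wife-Zhang Xiaoming]
MARKED = [(0, "Zhang Xiaoming", "Li Xiaohong"), (1, "Li Xiaohong", "Zhang Xiaoming")]
```

For "Zhang Xiaoming" this yields the first-type label [1, 0, 0, 1, 0, 0] and second-type label [1, 1], matching the example values above; for "marry" all entries are 0.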
  • the preset model in the embodiment of the present application can output the first-type probability and the second-type probability of each word.
  • the first type of probability represents the probability that the word belongs to an object included in the triple
  • the second type of probability represents the probability that the word belongs to any first type of object and the probability that the word belongs to any second type of object. Therefore, the two types of probabilities can verify each other to determine the target word, which further improves the accuracy of the triple extraction method.
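The mutual verification just described might be combined as in the sketch below (the thresholds, slot layout, and exact combination rule are assumptions; the text only states that the two probability types can verify each other):

```python
# A word is accepted as a target word only when a first-type probability
# slot exceeds the first threshold AND the matching "any first-type" /
# "any second-type" probability exceeds the second threshold.
FIRST_THRESHOLD = 0.5
SECOND_THRESHOLD = 0.5

def is_target_word(first_type_probs, second_type_probs):
    """first_type_probs: 2*n values, alternating (relation-i first-type,
    relation-i second-type) slots; second_type_probs: [P(any first-type
    object), P(any second-type object)]."""
    for slot, p in enumerate(first_type_probs):
        if p <= FIRST_THRESHOLD:
            continue
        verifier = second_type_probs[slot % 2]  # cross-check with the matching type
        if verifier > SECOND_THRESHOLD:
            return True
    return False
```

A word with a high first-type probability but low "any object" probability is thus rejected, which is the cross-checking effect described above.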
  • the embodiment of the application also provides a triple extraction device.
  • the triple extraction device provided by the embodiment of the application will be described below.
  • the triple extraction device described below and the triple extraction method described above may be referred to in correspondence with each other.
  • FIG. 5 shows a schematic structural diagram of a triple extraction device provided by an embodiment of the present application.
  • the device may include:
  • the word vector obtaining unit 501 is configured to obtain the word vector of each word composing the text to be recognized;
  • the model prediction unit 502 is configured to input the word vectors into a preset model to obtain the recognition result corresponding to each word vector; the recognition result includes the first-type probability of the word vector, and the first-type probability of any word vector represents the probability that the word corresponding to the word vector belongs to an object contained in a triple relationship;
  • the target word determining unit 503 is configured to determine, when any word meets a preset condition, that the word is a target word, where the preset condition includes that the probability that the word belongs to any object in a triple relationship is greater than a first preset threshold;
  • the triple determination unit 504 is configured to group target word pairs belonging to objects in the same triple into a triple according to the relationship between the objects in the triple.
  • the word vector includes at least one of the following: a word meaning vector and a part-of-speech vector.
  • the objects included in the triple relationship include objects of the first type and objects of the second type, and the recognition result further includes:
  • the second-type probability of the word vector, where the second-type probability of any word vector represents the probability that the word corresponding to the word vector belongs to a first-type object and the probability that it belongs to a second-type object.
  • the preset conditions also include:
  • the probability that the word belongs to the first type of object or the second type of object is greater than the second preset threshold.
  • the preset model includes an encoder and a decoder
  • the encoder is used to obtain a feature vector from the word vector
  • the decoder includes a first decoding module and a second decoding module
  • the first decoding module is configured to determine the first type probability according to the feature vector
  • the second decoding module is configured to determine the second type probability according to the feature vector
  • the first decoding module includes 2*n sigmoid functions, where n is the number of triple relationships learned by the preset model, and the second decoding module includes 2 sigmoid functions.
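As a hedged sketch of the decoder structure just described (the feature dimension, the random weights, and the stand-in "encoder" output are assumptions; only the 2*n-plus-2 sigmoid layout comes from the text):

```python
import math
import random

# Illustrative sketch: for each word's feature vector produced by the
# encoder, the first decoding module applies 2*n sigmoid functions (one
# per object slot of the n triple relationships) and the second decoding
# module applies 2 (any first-type object / any second-type object).
random.seed(0)
n = 3  # number of triple relationships the preset model learns
d = 8  # assumed dimension of the feature vector

W1 = [[random.gauss(0, 1) for _ in range(2 * n)] for _ in range(d)]  # first decoding module
W2 = [[random.gauss(0, 1) for _ in range(2)] for _ in range(d)]      # second decoding module

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def decode(feature):
    """Map one word's feature vector to its first- and second-type probabilities."""
    first_type = [sigmoid(sum(feature[i] * W1[i][j] for i in range(d)))
                  for j in range(2 * n)]
    second_type = [sigmoid(sum(feature[i] * W2[i][j] for i in range(d)))
                   for j in range(2)]
    return first_type, second_type
```

Each output lies in (0, 1), so every slot can be compared directly against the preset thresholds.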
  • the device further includes: a preset model training module, configured to train the preset model, and specifically configured to:
  • input the sample word vector into the preset model to be trained to obtain the output recognition result; the recognition result includes the first-type probability and the second-type probability of the sample word vector;
  • the sample word vector includes the word vectors of the words in the sample text;
  • the sample first-type probability and the sample second-type probability are determined according to the triples marked in the sample text.
  • the triple extraction device includes a processor and a memory.
  • the word vector acquisition unit 501, the model prediction unit 502, the target word determination unit 503, and the triple determination unit 504 are all stored in the memory as program units and executed by the processor.
  • the above-mentioned program units stored in the memory implement the corresponding functions.
  • the processor contains the kernel, and the kernel calls the corresponding program unit from the memory.
  • One or more kernels can be set, and the accuracy of triple extraction can be improved by adjusting kernel parameters.
  • the embodiment of the present invention provides a storage medium on which a program is stored, and when the program is executed by a processor, the triplet extraction method is implemented.
  • the embodiment of the present invention provides a processor, the processor is used to run a program, wherein the triple extraction method is executed when the program is running.
  • the embodiment of the present application also provides a triplet extraction device.
  • FIG. 6 shows a schematic structural diagram of the triple extraction device 60.
  • the device includes at least one processor 601 and a memory connected to the processor.
  • the device herein can be a server, a PC, a tablet (PAD), a mobile phone, etc.
  • This application also provides a computer program product which, when executed on a data processing device, is adapted to execute a program initialized with the following method steps:
  • obtain the word vector of each word composing the text to be recognized;
  • the word vector is input into the preset model to obtain the recognition result corresponding to each word vector;
  • the recognition result includes the first-type probability of the word vector; the first-type probability of any word vector represents the probability that the word corresponding to the word vector belongs to an object contained in a triple relationship;
  • when any word satisfies a preset condition, determine that the word is a target word, where the preset condition includes that the probability that the word belongs to any object in a triple relationship is greater than a first preset threshold;
  • the target word pairs belonging to the objects in the same triple relationship are formed into triples according to the relationship between the objects in the triple relationship.
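The final grouping step might look like this sketch (the data layout and names are assumptions; the source does not prescribe an implementation):

```python
# Pair target words that fill the first-type and second-type object slots
# of the same triple relationship, using that relationship's relation name.
RELATIONS = [("person", "husband", "person"), ("person", "wife", "person")]

def form_triples(targets):
    """targets: {relation_index: {"first": [words...], "second": [words...]}}"""
    triples = []
    for rel_idx, slots in targets.items():
        relation_name = RELATIONS[rel_idx][1]
        for w1 in slots.get("first", []):
            for w2 in slots.get("second", []):
                triples.append((w1, relation_name, w2))
    return triples
```

For example, `form_triples({0: {"first": ["Zhang Xiaoming"], "second": ["Li Xiaohong"]}})` yields `[("Zhang Xiaoming", "husband", "Li Xiaohong")]`.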
  • the word vector includes at least one of the following: a word meaning vector and a part-of-speech vector.
  • the objects included in the triple relationship include objects of the first type and objects of the second type, and the recognition result further includes:
  • the second-type probability of the word vector, where the second-type probability of any word vector represents the probability that the word corresponding to the word vector belongs to a first-type object and the probability that it belongs to a second-type object.
  • the preset conditions also include:
  • the probability that the word belongs to the first type of object or the second type of object is greater than the second preset threshold.
  • the preset model includes an encoder and a decoder
  • the encoder is used to obtain a feature vector from the word vector
  • the decoder includes a first decoding module and a second decoding module
  • the first decoding module is configured to determine the first type probability according to the feature vector
  • the second decoding module is configured to determine the second type probability according to the feature vector
  • the first decoding module includes 2*n sigmoid functions, where n is the number of triples learned by the preset model, and the second decoding module includes 2 sigmoid functions.
  • the training process of the preset model includes:
  • input the sample word vector into the preset model to be trained to obtain the output recognition result; the recognition result includes the first-type probability and the second-type probability of the sample word vector;
  • the sample word vector includes the word vectors of the words in the sample text;
  • the sample first-type probability and the sample second-type probability are determined according to the triples marked in the sample text.
  • the device includes one or more processors (CPUs), memory, and buses.
  • the device may also include input/output interfaces, network interfaces, and so on.
  • the memory may include non-persistent memory, random access memory (RAM), and/or non-volatile memory in computer-readable media, such as read-only memory (ROM) or flash memory (flash RAM); the memory includes at least one memory chip.
  • the memory is an example of a computer-readable medium.
  • Computer-readable media include permanent and non-permanent, removable and non-removable media, and information storage can be realized by any method or technology.
  • the information can be computer-readable instructions, data structures, program modules, or other data.
  • Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, CD-ROM, digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission media that can be used to store information accessible by computing devices. As defined herein, computer-readable media do not include transitory media, such as modulated data signals and carrier waves.
  • this application can be provided as a method, a system, or a computer program product. Therefore, this application may adopt the form of a complete hardware embodiment, a complete software embodiment, or an embodiment combining software and hardware. Moreover, this application may adopt the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program codes.

Abstract

The present application discloses a 3-tuple extraction method, a device, an apparatus, and a storage medium. The method comprises: inputting respective word vectors of words in a text to undergo identification into a pre-determined model, and obtaining an identification result corresponding to each of the word vectors, each of the identification results of the word vectors representing the probability of a word that corresponds to the word vector being associated with objects comprised in a 3-tuple relationship; if the probability of any word being associated with any object is greater than a first pre-determined threshold, determining the word as a target word; and forming a 3-tuple from any target word pair in target words according to a relationship corresponding to the target word pair. Since any pair of target words comprise two words associated with objects comprised in the same 3-tuple relationship, a relationship of the pair of target words is a relationship of the objects associated with the pair of target words in the 3-tuple relationship. In this way, the technical solution provided by the present application achieves identification of all 3-tuples in a text without using models in multiple steps.

Description

Method, device, equipment and storage medium for extracting triples
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on September 30, 2019, with application number 201910942438.4 and invention title "A triple extraction method, device, equipment, and storage medium", the entire contents of which are incorporated herein by reference.
Technical Field
The present invention relates to the field of electronic information technology, and more specifically, to a triple extraction method, device, equipment, and storage medium.
Background
With the development and popularization of Internet technology, the Internet has become an indispensable part of most people's daily lives. For the large amount of unstructured text on the Internet, a knowledge graph can be established by extracting the triples in the text, which is of great significance and value for downstream tasks such as retrieval, recommendation, and query.
Triple extraction refers to extracting, according to rules, the subjects and individuals contained in unstructured text and the relationships between them. For example, if the unstructured text is "Zhang Xiaoming married Li Xiaohong in 2018", then for rule A: [person-husband-person], the triple extracted from the unstructured text is: Zhang Xiaoming-husband-Li Xiaohong. At present, methods for extracting triples from unstructured text can apply a multi-step procedural approach, for example, a two-step method. The specific steps can be: first, extract the entities contained in the text; second, classify and associate the extracted entities, and finally determine the relationships between the entities. For example, one can first extract the entities from the above unstructured text, namely the person names "Zhang Xiaoming" and "Li Xiaohong", and then classify and associate these names to determine that "Zhang Xiaoming" is the "husband" of "Li Xiaohong", thereby obtaining a triple: Zhang Xiaoming-husband-Li Xiaohong.
Because the triples are extracted in multiple steps, and both entity extraction and relationship determination are completed by models, the errors of the model used in each step accumulate and information cannot be shared between the steps. Therefore, the accuracy of the triple extraction results obtained by the existing multi-step procedural method described above is low.
Summary of the Invention
In view of the above problems, the present invention provides a triple extraction method, device, equipment, and storage medium that overcome the above problems or at least partially solve the above problems, as follows:
A triple extraction method, including:
Obtain the word vector of each word composing the text to be recognized;
Input the word vectors into a preset model to obtain the recognition result corresponding to each word vector; the recognition result includes the first-type probability of the word vector, and the first-type probability of any word vector represents the probability that the word corresponding to the word vector belongs to an object contained in a triple relationship;
When any word satisfies a preset condition, determine that the word is a target word, where the preset condition includes that the probability that the word belongs to any object in a triple relationship is greater than a first preset threshold;
Group the target word pairs that belong to objects in the same triple relationship into triples according to the relationships between the objects in that triple relationship.
Optionally, the word vector includes at least one of the following: a word meaning vector and a part-of-speech vector.
Optionally, the objects included in the triple relationship include first-type objects and second-type objects, and the recognition result further includes:
The second-type probability of the word vector, where the second-type probability of any word vector represents the probability that the word corresponding to the word vector belongs to a first-type object and the probability that it belongs to a second-type object.
Optionally, the preset condition further includes:
The probability that the word belongs to a first-type object or a second-type object is greater than a second preset threshold.
Optionally, the preset model includes an encoder and a decoder;
The encoder is used to obtain a feature vector from the word vector;
The decoder includes a first decoding module and a second decoding module;
The first decoding module is configured to determine the first-type probability according to the feature vector, and the second decoding module is configured to determine the second-type probability according to the feature vector.
Optionally, the first decoding module includes 2*n sigmoid functions, where n is the number of triple relationships learned by the preset model, and the second decoding module includes 2 sigmoid functions.
Optionally, the training process of the preset model includes:
Input the sample word vectors into the preset model to be trained to obtain the output recognition result; the recognition result includes the first-type probability and the second-type probability of the sample word vector; the sample word vector includes the word vectors of the words in the sample text;
Use the difference between the sample first-type probability in the labeling information and the first-type probability, and the difference between the sample second-type probability in the labeling information and the second-type probability, to obtain the parameters of the preset model; the sample first-type probability and the sample second-type probability are determined according to the triples marked in the sample text.
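The text only specifies that the parameters are obtained from the "difference" between the predicted and labeled probabilities; one common concrete choice for such a difference, shown here purely as an assumption, is binary cross-entropy summed over both outputs:

```python
import math

# Binary cross-entropy between a predicted probability p and a 0/1 label y,
# summed over the first-type (2*n values) and second-type (2 values) outputs.
def bce(p, y, eps=1e-7):
    p = min(max(p, eps), 1 - eps)  # clamp for numerical safety
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def training_loss(pred_first, label_first, pred_second, label_second):
    loss = sum(bce(p, y) for p, y in zip(pred_first, label_first))
    loss += sum(bce(p, y) for p, y in zip(pred_second, label_second))
    return loss
```

A perfect prediction gives a near-zero loss, while a confident wrong prediction is penalized heavily; minimizing this loss over the labeled samples yields the model parameters.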
A triple extraction device, including:
A word vector obtaining unit, configured to obtain the word vector of each word composing the text to be recognized;
A model prediction unit, configured to input the word vectors into a preset model to obtain the recognition result corresponding to each word vector; the recognition result includes the first-type probability of the word vector, and the first-type probability of any word vector represents the probability that the word corresponding to the word vector belongs to an object contained in a triple relationship;
A target word determining unit, configured to determine, when any word satisfies a preset condition, that the word is a target word, where the preset condition includes that the probability that the word belongs to any object in a triple relationship is greater than a first preset threshold;
A triple determination unit, configured to group the target word pairs that belong to objects in the same triple relationship into triples according to the relationships between the objects in that triple relationship.
A triple extraction device, including: a memory and a processor;
The memory is used to store a program;
The processor is configured to execute the program to implement each step of the triple extraction method described above.
A storage medium having a program stored thereon, wherein, when the program is executed by a processor, each step of the triple extraction method described above is implemented.
With the above technical solution, in the triple extraction method provided by the present invention, the word vector of each word in the text to be recognized is input into a preset model to obtain the recognition result corresponding to each word vector. The recognition result corresponding to each word vector represents the probability that the word corresponding to the word vector belongs to an object contained in a triple relationship. When the probability that any word belongs to any object is greater than a first preset threshold, the word is determined to be a target word, and any pair of target words is formed into a triple according to the relationship corresponding to that pair. Because any pair of target words includes two words belonging to objects contained in the same triple relationship, the relationship of the target word pair is the relationship, in that triple relationship, between the objects to which the pair belongs. It can be seen that, based on the preset model, the technical solution provided by this application can output the probability that each word belongs to each of the objects included in the triple relationships learned by the model, and all the triples in the text to be recognized can then be obtained by rule determination based on the model output, i.e., without using models in multiple steps. In addition, the output of the model serves as the basis for the rule determination; therefore, the accumulation of model errors is avoided and the accuracy of the results can be improved.
The above description is only an overview of the technical solution of the present invention. In order that the technical means of the present invention can be understood more clearly and implemented in accordance with the content of the specification, and in order to make the above and other objectives, features, and advantages of the present invention more obvious and understandable, specific embodiments of the present invention are set forth below.
Description of the Drawings
FIG. 1 is a schematic flowchart of the triple extraction method provided by an embodiment of the application;
FIG. 2 is a schematic flowchart of another implementation of the triple extraction method provided by an embodiment of the application;
FIG. 3 is a schematic structural diagram of a preset model provided by an embodiment of the application;
FIG. 4 is a schematic diagram of a training process of a preset model provided by an embodiment of the application;
FIG. 5 is a schematic structural diagram of a triple extraction device provided by an embodiment of the application;
FIG. 6 is a schematic structural diagram of a triple extraction device provided by an embodiment of the application.
Detailed Description
The triple extraction method provided in the embodiments of the present application can be applied to smart devices, such as computers, tablets, or smart phones, or can be applied to a server preset with a text processing system.
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the present disclosure, it should be understood that the present disclosure can be implemented in various forms and should not be limited by the embodiments set forth herein. On the contrary, these embodiments are provided to enable a more thorough understanding of the present disclosure and to fully convey the scope of the present disclosure to those skilled in the art.
FIG. 1 is a schematic flowchart of the triple extraction method provided by an embodiment of the present application. The method may specifically include:
S101: Obtain word vectors.
Specifically, the word vectors include the word vector of each word constituting the text to be recognized, where the text to be recognized is unstructured text from which triples need to be extracted. Generally, the text to be recognized may include at least one sentence; each sentence is composed of words and may contain punctuation marks. After word segmentation is performed on the text to be recognized, all the words constituting the text to be recognized can be obtained. The word vectors obtained in this step are the word vectors generated by mapping each word constituting the text to be recognized into a vector space.
Word segmentation refers to splitting a sentence of text based on preset segmentation standards and removing punctuation marks. For example, the text W to be recognized is: "张小明的新婚妻子李小红最喜欢的小说是钱钟书先生的《围城》" ("Zhang Xiaoming's newlywed wife Li Xiaohong's favorite novel is Mr. Qian Zhongshu's Besieged City"). After word segmentation, the words constituting the text W include: "张小明", "的", "新婚妻子", "李小红", "最", "喜欢", "的", "小说", "是", "钱钟书", "先生", "的", "围城". It should be noted that word segmentation is an existing technique in the field of natural language processing; for example, existing tool software (such as HIT LTP or jieba) can be used to segment text sentences, and this process is not described in detail here.
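As the paragraph notes, existing tools such as HIT LTP or jieba would be used in practice; the toy forward-maximum-matching segmenter below (with a hand-built dictionary, purely an illustrative assumption) only demonstrates the idea of splitting the example sentence into words and dropping punctuation:

```python
# Toy forward-maximum-matching word segmentation over a tiny hand-made
# dictionary; real systems would use a trained segmenter instead.
DICT = {"张小明", "的", "新婚妻子", "李小红", "最", "喜欢", "小说", "是",
        "钱钟书", "先生", "围城"}
MAX_LEN = max(len(w) for w in DICT)
PUNCT = set("《》，。！？、")

def segment(text):
    words, i = [], 0
    while i < len(text):
        if text[i] in PUNCT:          # remove punctuation marks
            i += 1
            continue
        for length in range(min(MAX_LEN, len(text) - i), 0, -1):
            cand = text[i:i + length]
            if length == 1 or cand in DICT:  # fall back to a single character
                words.append(cand)
                i += length
                break
    return words
```

On the example sentence this reproduces the word list given above, including the repeated "的" tokens and the removal of "《" and "》".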
S102: Input the word vectors into the preset model to obtain the recognition result corresponding to each word vector.
Specifically, each triple relationship is a rule; for example, the above rule A [person-husband-person] can serve as a triple relationship. The preset model can learn multiple triple relationships, and each triple relationship can include two objects. For example, the above triple relationship A [person-husband-person] includes two objects, both of which are "person". If the number of pre-configured triple relationships that the model needs to learn is n, then the number of objects is 2*n.
Each triple relationship can include a first-type object and a second-type object. In this method, the former object in each triple relationship is recorded as the first-type object, and the latter object in each triple relationship is recorded as the second-type object, so the objects include n first-type objects and n second-type objects.
The recognition result corresponding to any word vector output by the model includes the first-type probability of the word vector, which represents the probabilities that the word generating the word vector belongs to the objects. Since the number of objects is 2*n, the first-type probability includes 2*n probabilities, each representing the probability that the word generating the word vector belongs to one object.
例如，模型学习的三元组关系为n个，分别为s1、s2、...、sn，词向量为m个，分别为w1、w2、...、wm。则，w1对应的识别结果可以包括：生成w1的词分别属于s1中第一类对象的概率p11、属于s1中第二类对象的概率p12，属于s2中第一类对象的概率p21，属于s2中第二类对象的概率p22，...，属于sn中第一类对象的概率pn1、属于sn中第二类对象的概率pn2。For example, suppose the model learns n triple relationships s1, s2, ..., sn, and there are m word vectors w1, w2, ..., wm. Then the recognition result corresponding to w1 may include: the probability p11 that the word generating w1 belongs to the first-type object of s1, the probability p12 that it belongs to the second-type object of s1, the probability p21 that it belongs to the first-type object of s2, the probability p22 that it belongs to the second-type object of s2, ..., the probability pn1 that it belongs to the first-type object of sn, and the probability pn2 that it belongs to the second-type object of sn.
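The p11, p12, ..., pn1, pn2 indexing convention can be sketched as follows, assuming the 2*n sigmoid outputs are stored flat in relation order (the storage layout is an assumption for illustration):

```python
def object_probs(first_type_probs, i):
    """Given the flat list of 2*n first-type probabilities and a 1-based
    relation index i, return (p_i1, p_i2): the probabilities that the word
    is the first-type / second-type object of triple relationship s_i."""
    return first_type_probs[2 * (i - 1)], first_type_probs[2 * (i - 1) + 1]

# n = 2 relations, made-up probabilities for one word vector w1
probs_w1 = [0.91, 0.05, 0.12, 0.88]
assert object_probs(probs_w1, 1) == (0.91, 0.05)  # (p11, p12)
assert object_probs(probs_w1, 2) == (0.12, 0.88)  # (p21, p22)
```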
需要说明的是,上述第一类概率可以由预设模型中的2*n个sigmoid函数预测得到。It should be noted that the aforementioned probability of the first type can be predicted by the 2*n sigmoid functions in the preset model.
S103:在任意一个词满足预设条件的情况下,确定该词为目标词。S103: When any word satisfies the preset condition, determine that the word is the target word.
具体地，对于任意一个词，预设条件包括该词属于任意一个对象的概率大于第一预设阈值。其中，第一预设阈值为依据模型的输出进行规则判定时设置的阈值，一般地，可以设该阈值的大小为0.5。即，当任意一个词生成的词向量对应的第一类概率表征该词属于对象的概率大于第一预设阈值时，判定该词属于该对象，因此将该词作为一个目标词。Specifically, for any word, the preset condition includes that the probability of the word belonging to any object is greater than a first preset threshold. The first preset threshold is the threshold used when making rule-based decisions on the model's output; generally, it can be set to 0.5. That is, when the first-type probability corresponding to the word vector generated by any word indicates that the probability of the word belonging to an object is greater than the first preset threshold, the word is determined to belong to that object and is therefore taken as a target word.
例如，待识别文本W为：“张小明的新婚妻子李小红最喜欢的小说是钱钟书先生的《围城》”，其中生成词“张小明”的词向量对应的第一类概率中，表征“张小明”属于三元组关系[人物-丈夫-人物]中第一类对象的概率大于0.5。生成词“李小红”的词向量对应的第一类概率中，表征“李小红”属于三元组关系[人物-丈夫-人物]中第二类对象的概率大于0.5。即，词“张小明”和“李小红”均满足预设条件，确定其为目标词。For example, suppose the text W to be recognized is "张小明的新婚妻子李小红最喜欢的小说是钱钟书先生的《围城》". In the first-type probability corresponding to the word vector of the word "Zhang Xiaoming", the probability that "Zhang Xiaoming" belongs to the first-type object of the triple relationship [person-husband-person] is greater than 0.5; in the first-type probability corresponding to the word vector of the word "Li Xiaohong", the probability that "Li Xiaohong" belongs to the second-type object of the triple relationship [person-husband-person] is greater than 0.5. That is, the words "Zhang Xiaoming" and "Li Xiaohong" both satisfy the preset condition and are determined to be target words.
可以理解的是，本步骤确定的任意一个目标词可以属于多个对象。例如，生成词“张小明”的词向量对应的第一类概率中，表征“张小明”属于三元组关系[人物-丈夫-人物]中第一类对象的概率大于0.5，并且该第一类概率中表征“张小明”属于三元组关系[人物-妻子-人物]中第二类对象的概率也大于0.5。所以，目标词“张小明”属于两个对象。It should be understood that any target word determined in this step may belong to multiple objects. For example, in the first-type probability corresponding to the word vector of the word "Zhang Xiaoming", the probability that "Zhang Xiaoming" belongs to the first-type object of the triple relationship [person-husband-person] is greater than 0.5, and in the same first-type probability, the probability that "Zhang Xiaoming" belongs to the second-type object of the triple relationship [person-wife-person] is also greater than 0.5. Therefore, the target word "Zhang Xiaoming" belongs to two objects.
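The rule of S103 can be sketched as follows, again assuming the 2*n first-type probabilities are stored flat in relation order and the threshold is 0.5:

```python
THRESHOLD = 0.5  # the first preset threshold from the text

def matched_objects(first_type_probs):
    """Return every (relation_index, object_type) slot whose probability
    exceeds the threshold. The word is a target word if the list is
    non-empty; it may match several slots at once, as in the
    "Zhang Xiaoming" example (overlapping triples)."""
    hits = []
    for i in range(len(first_type_probs) // 2):
        if first_type_probs[2 * i] > THRESHOLD:
            hits.append((i + 1, 1))   # first-type object of relation i+1
        if first_type_probs[2 * i + 1] > THRESHOLD:
            hits.append((i + 1, 2))   # second-type object of relation i+1
    return hits

# toy probabilities: first object of relation 1, second object of relation 2
assert matched_objects([0.9, 0.1, 0.2, 0.7]) == [(1, 1), (2, 2)]
assert matched_objects([0.2, 0.3, 0.4, 0.1]) == []  # not a target word
```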
S104:将属于同一个三元组中对象的目标词对,按照该三元组中对象之间的关系,组成三元组。S104: Combine the target word pairs belonging to the objects in the same triplet to form a triplet according to the relationship between the objects in the triplet.
具体地,一对目标词对中包括属于同一三元组关系中两个对象的两个目标词。任一三元组关系中的对象包括第一类对象和第二类对象,该两个对象之间的关系可以由三元组中的关系标签表示。则,该对目标词对对应的关系也为该三元组关系中的关系标签所表示的关系。Specifically, a pair of target words includes two target words belonging to two objects in the same triple relationship. The objects in any triple relationship include objects of the first type and objects of the second type, and the relationship between the two objects can be represented by the relationship label in the triple. Then, the corresponding relationship of the target word pair is also the relationship indicated by the relationship label in the triple relationship.
例如，三元组关系C[书籍-作者-人物]中，第一个对象“书籍”为第一类对象，第二个对象“人物”为第二类对象，该第一类对象和第二类对象的关系为“人物”为“书籍”的“作者”。在待识别文本W中，表征“围城”属于对象（即三元组关系C中第一类对象“书籍”）的概率大于0.5，并且表征“钱钟书”属于对象（即三元组关系C中第二类对象“人物”）的概率大于0.5。所以“围城”和“钱钟书”为一对目标词对，该目标词对的关系由三元组关系C中的关系标签表示，即“钱钟书”为“围城”的“作者”。进一步，由“钱钟书”、“围城”按照关系“作者”组成一条三元组[“围城”-作者-“钱钟书”]。For example, in the triple relationship C [book-author-person], the first object "book" is the first-type object and the second object "person" is the second-type object; the relationship between the first-type object and the second-type object is that the "person" is the "author" of the "book". In the text W to be recognized, the probability that "围城" (Besieged City) belongs to an object (namely the first-type object "book" of the triple relationship C) is greater than 0.5, and the probability that "钱钟书" (Qian Zhongshu) belongs to an object (namely the second-type object "person" of the triple relationship C) is greater than 0.5. Therefore, "Besieged City" and "Qian Zhongshu" form a target word pair, whose relationship is represented by the relationship label in the triple relationship C, i.e. "Qian Zhongshu" is the "author" of "Besieged City". Further, "Qian Zhongshu" and "Besieged City" are combined according to the relationship "author" into one triple ["Besieged City"-author-"Qian Zhongshu"].
需要说明的是，在所有目标词中，针对任一对象，可能存在多个目标词属于该对象，所以可能出现以下情况：存在多个目标词属于一个三元组关系中的第一类对象，同时存在多个目标词属于该三元组关系中的第二类对象。在此种情况下，可以按照目标词在待识别文本中的位置选择目标词对，一般地，将位置最接近的两个属于同一三元组关系中对象的目标词作为一对目标词对。It should be noted that, among all target words, multiple target words may belong to the same object, so the following situation may occur: multiple target words belong to the first-type object of one triple relationship while, at the same time, multiple target words belong to the second-type object of that triple relationship. In this case, target word pairs can be selected according to the positions of the target words in the text to be recognized; generally, the two positionally closest target words belonging to the objects of the same triple relationship are taken as one target word pair.
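The nearest-position pairing described above can be sketched as follows (positions are token indices; this simple greedy version and the sample words "小说A"/"作者B" are illustrative assumptions, not part of the patent):

```python
def pair_nearest(firsts, seconds):
    """firsts / seconds: lists of (word, position) that matched the
    first-type / second-type object of one triple relationship.
    Pair each first-type word with the positionally closest
    second-type word."""
    return [(w1, min(seconds, key=lambda s: abs(s[1] - p1))[0])
            for w1, p1 in firsts]

# two "book" candidates and two "person" candidates in one sentence:
firsts = [("围城", 12), ("小说A", 3)]    # "小说A" is a hypothetical title
seconds = [("钱钟书", 9), ("作者B", 2)]  # "作者B" is a hypothetical name
assert pair_nearest(firsts, seconds) == [("围城", "钱钟书"), ("小说A", "作者B")]
```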
借由上述技术方案，本发明提供的三元组抽取方法中，将待识别文本中的各个词的词向量输入预设模型，得到每一个词向量对应的识别结果。每一词向量对应的识别结果表征该词向量所对应的词属于三元组中所包含对象的概率，在任意一个词属于任意一个对象的概率大于第一预设阈值的情况下，确定该词为目标词，将目标词中的任意一对目标词对，按照该对目标词对应的关系，组成三元组。因为任意一对目标词对包括两个属于同一三元组关系包含的对象的目标词，目标词对的关系为该对目标词对所属的对象在三元组关系中的关系。可见本申请提供的技术方案中，模型可以输出词属于模型学习的三元组关系中包括的所有对象的概率，进一步依据模型的输出进行规则判定即可得到待识别文本中的所有三元组，即无需多步使用模型。并且，模型的输出作为规则判定的基础，因此避免了模型误差的累计，能够提高结果的准确性。With the above technical solution, in the triple extraction method provided by the present invention, the word vectors of the words in the text to be recognized are input into a preset model to obtain the recognition result corresponding to each word vector. The recognition result corresponding to each word vector represents the probability that the word corresponding to the word vector belongs to an object contained in a triple. When the probability that any word belongs to any object is greater than the first preset threshold, the word is determined to be a target word, and each pair of target words is combined into a triple according to the relationship corresponding to that pair. Because any target word pair includes two target words belonging to the objects of the same triple relationship, the relationship of the target word pair is the relationship, within that triple relationship, between the objects to which the pair belongs. It can be seen that in the technical solution provided by this application, the model can output the probability that a word belongs to every object included in the triple relationships learned by the model, and all triples in the text to be recognized can then be obtained by rule-based decisions on the model's output, i.e. without using the model in multiple steps. Moreover, since the model's output serves as the basis for the rule decisions, the accumulation of model errors is avoided and the accuracy of the results can be improved.
进一步，由于模型学习的三元组关系可以根据需要配置为多组，所以任一目标词可以属于多个对象，即，任一目标词可以属于多个三元组关系中的对象，由此无需多个模型即可抽取任一生成词向量的词所属的多个三元组。可见本申请公开的三元组抽取方法大大提高了三元组抽取效率。Further, since the triple relationships learned by the model can be configured into multiple groups as needed, any target word can belong to multiple objects, i.e. to objects in multiple triple relationships; thus, the multiple triples to which the word generating any word vector belongs can be extracted without multiple models. It can be seen that the triple extraction method disclosed in this application greatly improves the efficiency of triple extraction.
例如，发明人在实现本发明创造的过程中发现，为了提高三元组抽取的准确度，可以基于LSTM-CRF（Long Short-Term Memory-Conditional Random Field，长短期记忆网络-条件随机场算法）模型预先建立三元组抽取系统，以待识别文本为输入，通过模型直接抽取文本数据中的三元组，实现端到端的三元组抽取。从而避免了多步流程化方法中的错误累加，提高了抽取结果的准确率。For example, the inventor found in the process of making the present invention that, to improve the accuracy of triple extraction, a triple extraction system can be pre-built based on an LSTM-CRF (Long Short-Term Memory - Conditional Random Field) model: the text to be recognized is taken as input, and the triples in the text data are extracted directly by the model, achieving end-to-end triple extraction. This avoids the accumulation of errors in multi-step pipeline methods and improves the accuracy of the extraction results.
但是，由于LSTM-CRF模型中的解码器CRF只能进行单一的解码，所以在抽取三元组的过程中，每个词只能属于一个三元组关系，即通过LSTM-CRF模型抽取三元组时，任一词只能属于一个三元组。当待识别文本中包含的词属于两个及以上三元组时，必须同时建立多个LSTM-CRF模型分别针对不同的三元组关系进行三元组抽取，因此导致三元组的抽取效率低。However, since the CRF decoder in the LSTM-CRF model can only perform a single decoding, during triple extraction each word can belong to only one triple relationship; that is, when extracting triples with the LSTM-CRF model, any word can belong to only one triple. When a word contained in the text to be recognized belongs to two or more triples, multiple LSTM-CRF models must be built simultaneously to extract triples for the different triple relationships, which results in low triple extraction efficiency.
如上所述的待识别文本W中,针对三元组关系A:[人物-丈夫-人物],可以从该待识别文本W中抽取三元组为:[张小明-丈夫-李小红]。针对三元组关系B:[人物-妻子-人物],可以从该待识别文本W中抽取三元组为:[李小红-妻子-张小明]。可见“张小明”和“李小红”两个人名同时属于不同的三元组关系。因此,基于LSTM-CRF模型预先建立三元组抽取系统时,需要训练两个LSTM-CRF模型,分别针对三元组关系A和三元组关系B进行三元组的抽取。由此导致执行效率较低。In the text W to be recognized as described above, for the triple relationship A: [person-husband-person], the triples can be extracted from the text W to be recognized as: [Zhang Xiaoming-husband-Li Xiaohong]. For the triple relationship B: [person-wife-person], the triple can be extracted from the text W to be recognized as: [Li Xiaohong-wife-Zhang Xiaoming]. It can be seen that the names of "Zhang Xiaoming" and "Li Xiaohong" belong to different triple relationships at the same time. Therefore, when a triple extraction system is established in advance based on the LSTM-CRF model, two LSTM-CRF models need to be trained to extract triples for triple relationship A and triple relationship B respectively. As a result, the execution efficiency is low.
针对上述技术问题，本申请提供的三元组抽取方法中的模型在学习过程中可以根据需要配置为多个三元组关系，并且任一个目标词可能同时属于多个对象。例如，待识别文本W中，生成词“张小明”的词向量对应的第一类概率中，表征“张小明”属于三元组关系A[人物-丈夫-人物]中第一类对象的概率大于0.5。同时，生成词“张小明”的词向量对应的第一类概率中，表征“张小明”属于三元组关系B[人物-妻子-人物]中第二类对象的概率也大于0.5。由此“张小明”可以同时属于由三元组关系A和三元组关系B确定的两个三元组[张小明-丈夫-李小红]以及[李小红-妻子-张小明]。In view of the above technical problems, in the triple extraction method provided by this application, the model can be configured with multiple triple relationships as needed during learning, and any target word may belong to multiple objects at the same time. For example, in the text W to be recognized, in the first-type probability corresponding to the word vector of the word "Zhang Xiaoming", the probability that "Zhang Xiaoming" belongs to the first-type object of the triple relationship A [person-husband-person] is greater than 0.5; at the same time, the probability that "Zhang Xiaoming" belongs to the second-type object of the triple relationship B [person-wife-person] is also greater than 0.5. Therefore, "Zhang Xiaoming" can simultaneously belong to the two triples determined by triple relationships A and B: [Zhang Xiaoming-husband-Li Xiaohong] and [Li Xiaohong-wife-Zhang Xiaoming].
显然,针对上述具有交替、重叠关系的三元组抽取时,本申请提供的技术方案具有准确度高且效率高的优点。Obviously, the technical solution provided in this application has the advantages of high accuracy and high efficiency for the extraction of the above-mentioned triples with alternating and overlapping relationships.
需要说明的是，上述S101获取的词向量可以包括词义向量和词性向量。其中，一个词的词义向量为其词义映射至向量空间得到的向量，该向量可以表征该词的词义信息。一个词的词性向量为其词性映射至向量空间得到的向量，该向量可以表征该词的词性信息。It should be noted that the word vector obtained in S101 may include a word sense vector and a part-of-speech vector. The word sense vector of a word is the vector obtained by mapping its meaning into a vector space, and it can represent the semantic information of the word. The part-of-speech vector of a word is the vector obtained by mapping its part of speech into a vector space, and it can represent the part-of-speech information of the word.
其中,可以通过查找词向量映射集合(即Word vector)获取每一词的词义向量。Word vector为使用待识别文本所属领域的语料库进行词向量训练生成的词义向量映射对应关系集合。Word vector可以将词映射到一个低维的向量空间中,可以通过词义向量之间的关系表达各个词的词义之间的相似关系。对于语料库中的低频词,可以记为UNK,UNK在word vector中具有唯一的向量表达,其维度与其它词对应的词义向量维度一致。Among them, the word meaning vector of each word can be obtained by searching the word vector mapping set (ie Word vector). Word vector is a set of word sense vector mapping correspondences generated by word vector training using the corpus of the field to which the text to be recognized belongs. Word vector can map words to a low-dimensional vector space, and can express the similar relationship between the word meanings of various words through the relationship between the word meaning vectors. The low-frequency words in the corpus can be marked as UNK. UNK has a unique vector expression in the word vector, and its dimension is consistent with the dimension of the word meaning vector corresponding to other words.
例如，待识别文本E分词处理后得到组成文本的k个词分别为e1、e2、...、ek，通过Word vector可以获得所有词生成的词义向量，分别为h1、h2、...、hk。For example, after segmenting the text E to be recognized, the k words composing the text are e1, e2, ..., ek, and the word sense vectors generated for all the words, namely h1, h2, ..., hk, can be obtained through the word vector mapping set.
需要说明的是,当训练文本的组成词为低频词时,根据Word vector可以获得其生成的词义向量为UNK。It should be noted that when the constituent words of the training text are low-frequency words, the generated word meaning vector can be obtained as UNK according to the Word vector.
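A minimal sketch of the word-vector lookup with the shared UNK fallback (the two-dimensional toy vectors and the vocabulary are assumptions for illustration):

```python
word_vector_map = {           # toy word vector mapping set (Word vector)
    "UNK": [0.0, 0.0],        # shared vector for low-frequency / OOV words
    "围城": [0.31, -0.12],
    "钱钟书": [0.25, 0.40],
}

def sense_vector(word):
    # low-frequency words fall back to UNK, whose dimension matches
    # the sense vectors of all other words
    return word_vector_map.get(word, word_vector_map["UNK"])

assert sense_vector("围城") == [0.31, -0.12]
assert sense_vector("某低频词") == [0.0, 0.0]
```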
显然，同一词在不同文本中表达的词义不同，其词性也可能不同，因此词性也可以影响该词的三元组关系识别结果。所以可以获取待识别文本中各个词的词性信息。词性信息为每个词在待识别文本中表达的词义所属的词性。获取方法可以采用一定维度的随机向量表达，例如对于共计30种词性[A1,A2,…,A30]，可以用词性向量a1表示A1，词性向量a2表示A2，...，词性向量a30表示A30。其中词性向量a1、a2、...、a30的维度可以预先设置，其中的每一个维度都是一个随机生成的接近于0的小数。Obviously, the same word may express different meanings in different texts, and its part of speech may also differ, so the part of speech can also affect the triple-relationship recognition result of the word. Therefore, the part-of-speech information of each word in the text to be recognized can be obtained; the part-of-speech information is the part of speech of the meaning each word expresses in the text to be recognized. The part-of-speech vectors can be expressed as random vectors of a certain dimension. For example, for a total of 30 parts of speech [A1, A2, ..., A30], the part-of-speech vector a1 can represent A1, a2 can represent A2, ..., and a30 can represent A30. The dimension of the part-of-speech vectors a1, a2, ..., a30 can be preset, and each component is a randomly generated decimal close to 0.
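The random part-of-speech vectors can be initialized as follows (3 tags and dimension 4 stand in for the 30 tags and the dimension of 20 mentioned later; the value range near zero is an assumption):

```python
import random

random.seed(42)
POS_TAGS = ["A1", "A2", "A3"]   # stand-ins for the 30 part-of-speech tags
POS_DIM = 4                     # stand-in for the POS vector dimension

# each tag gets one fixed random vector of small values close to zero
pos_vector_map = {tag: [random.uniform(-0.01, 0.01) for _ in range(POS_DIM)]
                  for tag in POS_TAGS}

assert len(pos_vector_map) == len(POS_TAGS)
assert all(len(v) == POS_DIM for v in pos_vector_map.values())
assert all(abs(x) <= 0.01 for v in pos_vector_map.values() for x in v)
```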
进一步,针对任一词,将其生成的词义向量以及词性向量输入至预设模型,执行步骤102。Further, for any word, the word meaning vector and the part-of-speech vector generated therefrom are input to the preset model, and step 102 is executed.
可选地,上述S101获取的词向量还可以只包括词义向量,即,将待识别文本中各个词的词义向量直接作为词向量输入至预设模型,执行步骤102。Optionally, the word vector obtained in S101 may also only include the word meaning vector, that is, the word meaning vector of each word in the text to be recognized is directly input as the word vector to the preset model, and step 102 is executed.
图2为本申请实施例提供的三元组抽取方法的另一种实现方式的流程示意图。如下:FIG. 2 is a schematic flowchart of another implementation manner of the triple extraction method provided by an embodiment of the application. as follows:
S201:获取词向量。S201: Obtain a word vector.
其中，词向量包括组成待识别文本的各个词生成的词向量，如上述实施例所述，该词向量包括词义向量和词性向量。获取方法可以参考上述实施例，在此不做赘述。The word vectors include the word vector generated for each word composing the text to be recognized; as described in the above embodiment, a word vector includes a word sense vector and a part-of-speech vector. For the acquisition method, refer to the above embodiment, which is not repeated here.
S202:将词向量输入预设模型,得到每一个词向量对应的识别结果。其中,识别结果包括该词向量的第一类概率,还包括该词向量的第二类概率。S202: Input the word vector into the preset model, and obtain the recognition result corresponding to each word vector. Wherein, the recognition result includes the first-type probability of the word vector, and also includes the second-type probability of the word vector.
具体地,图3示出了预设模型的结构示意图,预设模型包括编码器和解码器。Specifically, FIG. 3 shows a schematic structural diagram of a preset model, and the preset model includes an encoder and a decoder.
如图3所示，编码器可以为双向LSTM模型。待识别文本E由e1、e2、...、ek组成，针对其中任一词ej，将其生成的词义向量hj和词性向量aj输入至编码器，经过编码器将词义向量和词性向量编码得到特征向量Ej。该特征向量由该词的词义向量和词性向量拼接得到。As shown in Figure 3, the encoder can be a bidirectional LSTM model. The text E to be recognized is composed of e1, e2, ..., ek; for any word ej, its word sense vector hj and part-of-speech vector aj are input into the encoder, and the encoder encodes them into a feature vector Ej. The feature vector is obtained by concatenating the word sense vector and the part-of-speech vector of the word.
例如，词义向量的维度为100，词性向量的维度为20，经过编码器编码得到的特征向量的维度为120。需要说明的是，若待识别文本中词个数为m，将m个词向量输入至编码器可以得到m个120维的特征向量，将这些向量排列组成一个m*120的向量矩阵，可以根据需要通过补0将该矩阵扩充至特定长度。For example, if the dimension of the word sense vector is 100 and the dimension of the part-of-speech vector is 20, the dimension of the feature vector obtained by the encoder is 120. It should be noted that if the number of words in the text to be recognized is m, inputting the m word vectors into the encoder yields m 120-dimensional feature vectors; these vectors are arranged into an m*120 matrix, which can be extended to a specific length by zero-padding as needed.
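The concatenation and zero-padding steps can be sketched as follows (dimensions 3 + 2 = 5 stand in for 100 + 20 = 120; the BiLSTM encoding itself is omitted, so this only illustrates the input layout):

```python
def concat_and_pad(sense_vecs, pos_vecs, target_len):
    """Concatenate each word's sense vector with its POS vector into one
    feature row, then zero-pad the sequence to target_len rows."""
    feats = [s + p for s, p in zip(sense_vecs, pos_vecs)]
    dim = len(feats[0])
    feats += [[0.0] * dim] * (target_len - len(feats))
    return feats

sense = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]   # two words, sense dim 3
pos = [[0.01, 0.02], [0.03, 0.04]]           # POS dim 2
m = concat_and_pad(sense, pos, 4)
assert m[0] == [0.1, 0.2, 0.3, 0.01, 0.02]   # concatenated feature, dim 5
assert m[2] == [0.0] * 5 and len(m) == 4     # zero-padded to target length
```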
由上述各实施例可知,预设模型可以学习n个三元组关系,每个三元组关系可以包括两个对象。由于每一三元组关系中前一个对象为第一类对象,每一三元组关系中后一个对象为第二类对象,所以第一类对象个数为n,第二类对象个数为n。任一词向量的第二类概率表征生成该词向量的词属于第一类对象的概率和属于第二类对象的概率,即任一词的第二类概率包括两个概率值。It can be seen from the foregoing embodiments that the preset model can learn n triple relationships, and each triple relationship can include two objects. Since the first object in each triple relationship is the first type of object, and the last object in each triple relationship is the second type of object, the number of objects of the first type is n, and the number of objects of the second type is n. The second-type probability of any word vector represents the probability that the word generating the word vector belongs to the first-type object and the probability that it belongs to the second-type object, that is, the second-type probability of any word includes two probability values.
模型中的解码器包括第一解码模块和第二解码模块。The decoder in the model includes a first decoding module and a second decoding module.
其中,第一解码模块包括2*n个sigmoid函数,用于依据特征向量,确定第一类概率。第一类概率可以参照步骤S102中所述。由图3所示,以特征向量Ej为例,确定第一类概率的过程为:Among them, the first decoding module includes 2*n sigmoid functions, which are used to determine the probability of the first type according to the feature vector. The probability of the first type may refer to the description in step S102. As shown in Figure 3, taking the feature vector Ej as an example, the process of determining the probability of the first type is:
依据特征向量Ej，通过第2i-1（i=1、2、...、n）个sigmoid函数（即sig2i-1）输出该特征向量对应的词属于第i个三元组关系中的第一类对象的概率；依据特征向量Ej，通过第2i个sigmoid函数（即sig2i）输出该特征向量对应的词属于第i个三元组关系中的第二类对象的概率。Based on the feature vector Ej, the (2i-1)-th sigmoid function (i.e., sig2i-1, i = 1, 2, ..., n) outputs the probability that the word corresponding to the feature vector belongs to the first-type object of the i-th triple relationship, and the 2i-th sigmoid function (i.e., sig2i) outputs the probability that the word belongs to the second-type object of the i-th triple relationship.
第二解码模块包括2个sigmoid函数,用于依据特征向量,确定第二类概率。由图3所示,以特征向量Ej为例,确定第二类概率的过程为:The second decoding module includes two sigmoid functions, which are used to determine the second type of probability according to the feature vector. As shown in Figure 3, taking the feature vector Ej as an example, the process of determining the second type of probability is:
依据特征向量Ej通过第1个sigmoid函数(即sig1)输出该特征向量对应的词属于任一三元组关系中的第一类对象的概率。依据特征向量Ej通过第2个sigmoid函数(即sig2)输出该特征向量对应的词属于任一三元组关系中第二类对象的概率。According to the feature vector Ej, the first sigmoid function (ie, sig1) is used to output the probability that the word corresponding to the feature vector belongs to the first type of object in any triple relationship. According to the feature vector Ej, the second sigmoid function (ie, sig2) is used to output the probability that the word corresponding to the feature vector belongs to the second type of object in any triple relationship.
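Both decoding modules reduce to banks of sigmoid units applied to the feature vector. A toy pure-Python sketch (the weights are made-up constants standing in for trained parameters, not the patent's actual model):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_unit(feature, weights, bias):
    # one sigmoid unit: sigmoid(w . feature + b)
    return sigmoid(sum(f * w for f, w in zip(feature, weights)) + bias)

def decode(feature, first_module, second_module):
    """first_module: 2*n (weights, bias) pairs -> the per-relation
    first-type probabilities (sig1..sig2n of the first decoding module).
    second_module: 2 (weights, bias) pairs -> the 'any first-type object'
    and 'any second-type object' probabilities (second decoding module)."""
    first_type = [sigmoid_unit(feature, w, b) for w, b in first_module]
    second_type = [sigmoid_unit(feature, w, b) for w, b in second_module]
    return first_type, second_type

feat = [0.5, -0.2]                              # toy feature vector Ej
first = [([1.0, 0.0], 0.0), ([0.0, 1.0], 0.0)]  # n = 1 relation -> 2 units
second = [([2.0, 0.0], 0.0), ([0.0, 2.0], 0.0)]
p_first, p_second = decode(feat, first, second)
assert len(p_first) == 2 and len(p_second) == 2
assert all(0.0 < p < 1.0 for p in p_first + p_second)
```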
S203:在任意一个词满足预设条件的情况下,确定该词为目标词。S203: When any word satisfies the preset condition, determine that the word is the target word.
其中,对于任一词,预设条件包括该词属于第一类对象的概率或属于第二类对象的概率大于第二预设阈值。其中,第二预设阈值为依据模型的输出进行规则判定时设置的阈值,一般地,可以设该阈值的大小为0.5。即,当任意一个词生成的词向量对应的第一类概率表征该词属于对象的概率大于第一预设阈值,或当任意一个词生成的词向量对应的第二类概率表征该词属于第一类对象或第二类对象的概率大于第二预设阈值的情况下,确定该词为目标词。For any word, the preset condition includes that the probability of the word belonging to the first type of object or the probability of belonging to the second type of object is greater than the second preset threshold. Wherein, the second preset threshold value is a threshold value set when the rule is determined according to the output of the model, and generally, the size of the threshold value can be set to 0.5. That is, when the first-type probability corresponding to the word vector generated by any word indicates that the probability that the word belongs to the object is greater than the first preset threshold, or when the second-type probability corresponding to the word vector generated by any one word indicates that the word belongs to the first When the probability of the object of the first type or the object of the second type is greater than the second preset threshold, the word is determined to be the target word.
由于,第一类概率和第二类概率可以互相验证,则举例对本步骤确定目标词的具体方法进行说明,如下:Since the probability of the first type and the probability of the second type can be mutually verified, an example is given to illustrate the specific method of determining the target word in this step, as follows:
例如，模型学习的三元组关系为n个，分别为s1、s2、...、sn，词向量为m个，分别为w1、w2、...、wm。则，w1对应的识别结果可以包括第一类概率，分别为：生成w1的词属于s1中第一类对象的概率p11、属于s1中第二类对象的概率p12，属于s2中第一类对象的概率p21，属于s2中第二类对象的概率p22，...，属于sn中第一类对象的概率pn1、属于sn中第二类对象的概率pn2。w1对应的识别结果还包括第二类概率，分别为：生成w1的词属于任一第一类对象的概率p1和生成w1的词属于任一第二类对象的概率p2。For example, suppose the model learns n triple relationships s1, s2, ..., sn, and there are m word vectors w1, w2, ..., wm. Then the recognition result corresponding to w1 may include the first-type probabilities: the probability p11 that the word generating w1 belongs to the first-type object of s1, the probability p12 that it belongs to the second-type object of s1, the probability p21 that it belongs to the first-type object of s2, the probability p22 that it belongs to the second-type object of s2, ..., the probability pn1 that it belongs to the first-type object of sn, and the probability pn2 that it belongs to the second-type object of sn. The recognition result corresponding to w1 also includes the second-type probabilities: the probability p1 that the word generating w1 belongs to any first-type object and the probability p2 that it belongs to any second-type object.
其中,若p1大于0.5且p11大于0.5的情况下,确定生成w1的词属于s1中第一类对象,确定该词为目标词。若p1大于0.5且p11小于等于0.5的情况下,确定生成w1的词不属于s1中第一类对象。若p1小于等于0.5且p11小于等于0.5的情况下,确定生成w1的词不属于s1中第一类对象。Among them, if p1 is greater than 0.5 and p11 is greater than 0.5, it is determined that the word generating w1 belongs to the first type of object in s1, and the word is determined as the target word. If p1 is greater than 0.5 and p11 is less than or equal to 0.5, it is determined that the word generating w1 does not belong to the first type of object in s1. If p1 is less than or equal to 0.5 and p11 is less than or equal to 0.5, it is determined that the word generating w1 does not belong to the first type of object in s1.
可以理解的是，当p1大于0.5时，若任一pr1大于0.5，则确定生成w1的词属于sr中第一类对象，确定该词为目标词。当p2大于0.5时，若任一pr2大于0.5，则确定生成w1的词属于sr中第二类对象，确定该词为目标词。It should be understood that when p1 is greater than 0.5, if any pr1 is greater than 0.5, it is determined that the word generating w1 belongs to the first-type object of sr, and the word is determined to be a target word. When p2 is greater than 0.5, if any pr2 is greater than 0.5, it is determined that the word generating w1 belongs to the second-type object of sr, and the word is determined to be a target word.
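One plausible reading of the p1/p11 examples is that the second-type probability gates the per-relation first-type probability, both against the same threshold; a sketch under that assumption:

```python
SECOND_THRESHOLD = 0.5  # the second preset threshold from the text

def belongs_to_object(p_any, p_relation, threshold=SECOND_THRESHOLD):
    """p_any: second-type probability that the word is that object type in
    ANY relation (e.g. p1); p_relation: first-type probability for one
    specific relation (e.g. pr1). Both must exceed the threshold."""
    return p_any > threshold and p_relation > threshold

assert belongs_to_object(0.8, 0.7)      # p1 > 0.5 and p11 > 0.5 -> target word
assert not belongs_to_object(0.8, 0.4)  # p1 > 0.5 but p11 <= 0.5
assert not belongs_to_object(0.3, 0.7)  # p1 <= 0.5
```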
S204:将属于同一个三元组中对象的目标词对,按照该三元组中对象之间的关系,组成三元组。S204: Combine the target word pairs belonging to the objects in the same triplet to form a triplet according to the relationship between the objects in the triplet.
本步骤可以参考上述S104，本实施例在此不做赘述。For this step, refer to S104 above, which is not described in detail in this embodiment.
针对上述实施例所介绍的三元组抽取方法的实施方式,进一步对预设模型的训练过程进行介绍,图4示出了一种预设模型的训练过程示意图,包括:For the implementation of the triple extraction method introduced in the above embodiment, the training process of the preset model is further introduced. Figure 4 shows a schematic diagram of the training process of a preset model, including:
S401:将样本词向量输入待训练的预设模型,得到预设模型输出的识别结果。S401: Input the sample word vector into the preset model to be trained, and obtain the recognition result output by the preset model.
其中,样本词向量包括样本文本中的词的词向量。样本文本可以包括多条具有三元组关系的非结构化文本片段,样本文本片段的数量可以是几千至十几万。需要说明的是,为了更好的训练模型,可以根据待识别文本的所属领域获取样本文本,例如,待识别文本的所属领域为金融类,则可以获取属于金融类领域的文本为样本文本。Among them, the sample word vector includes the word vector of the word in the sample text. The sample text may include multiple unstructured text fragments having a triple relationship, and the number of sample text fragments may be several thousand to several hundred thousand. It should be noted that, in order to better train the model, sample texts can be obtained according to the field of the text to be recognized. For example, if the field of the text to be recognized is financial, the text belonging to the financial field can be obtained as sample text.
需要说明的是,针对任一样本文本中的词,获取该词的样本词向量的方法可以参考上述获取待识别文本中词的词向量的方法。预设模型输出该样本词向量的识别结果包括该样本词向量的第一类概率和第二类概率。具体地,第一类概率和第二类概率的含义可以参考上述S202的介绍。本申请实施例对此不做赘述。It should be noted that, for any word in the sample text, the method for obtaining the sample word vector of the word can refer to the above method for obtaining the word vector of the word in the text to be recognized. The recognition result of the sample word vector output by the preset model includes the first type probability and the second type probability of the sample word vector. Specifically, for the meaning of the first type of probability and the second type of probability, reference may be made to the introduction of S202 above. This is not repeated in the embodiment of the application.
S402:使用标记信息中的样本第一类概率与第一类概率的差异,以及标记信息中的样本第二类概率与第二类概率的差异,得到预设模型的参数。S402: Use the difference between the first-type probability and the first-type probability of the sample in the marking information and the difference between the second-type probability and the second-type probability of the sample in the marking information to obtain the parameters of the preset model.
具体地，针对任一段样本文本，由人工标记该段样本文本中包括的所有三元组。例如样本文本F为：“张小明于2018年迎娶了李小红”，针对该样本文本，人工对其标记三元组[张小明-丈夫-李小红]和[李小红-妻子-张小明]。Specifically, for any piece of sample text, all the triples contained in it are manually annotated. For example, the sample text F is: "张小明于2018年迎娶了李小红" ("Zhang Xiaoming married Li Xiaohong in 2018"); for this sample text, the triples [Zhang Xiaoming-husband-Li Xiaohong] and [Li Xiaohong-wife-Zhang Xiaoming] are manually annotated.
一个词的样本第一类概率表征该词属于三元组包含的对象的概率，该词的样本第二类概率表征该词分别属于第一类对象和第二类对象的概率。The sample first-type probability of a word represents the probability that the word belongs to an object contained in a triple; the sample second-type probability of the word represents the probabilities that the word belongs to a first-type object and to a second-type object, respectively.
其中，对象包括所有预先配置需要模型学习的n个三元组关系中包含的对象。该n个三元组关系可以包括待识别文本中所有可能包含的三元组关系。可以理解的是，待识别文本所属领域不同，则其中可能包括的三元组关系会有所不同，所以可以根据待识别文本所属领域预先配置n个三元组关系。The objects include all the objects contained in the n pre-configured triple relationships that the model needs to learn. The n triple relationships may include all triple relationships possibly contained in the text to be recognized. It should be understood that texts from different fields may contain different triple relationships, so the n triple relationships can be pre-configured according to the field of the text to be recognized.
标记信息指的是根据人工标记的三元组得到的样本文本中各个词的样本第一类概率和样本第二类概率。获取该样本第一类概率和样本第二类概率的方法为现有技术,本申请实施例仅以下述的实例对此进行简单介绍。The labeling information refers to the sample first-type probability and sample second-type probability of each word in the sample text obtained according to the manually labeled triples. The method for obtaining the probability of the first type of the sample and the probability of the second type of the sample is the prior art, and the embodiment of the present application only briefly introduces this with the following examples.
以上述样本文本F为例,预设的该模型需要学习的三元组关系为A:[人物-丈夫-人物]、B:[人物-妻子-人物]和C:[书籍-作者-人物]。根据标记的三元组[张小明-丈夫-李小红]和[李小红-妻子-张小明]。Taking the above sample text F as an example, the preset triple relationship that the model needs to learn is A: [person-husband-person], B: [person-wife-person] and C: [book-author-person] . According to the labeled triples [Zhang Xiaoming-husband-Li Xiaohong] and [Li Xiaohong-wife-Zhang Xiaoming].
可以通过样本文本F中的每个词是否属于标记的三元组中包含的对象来确定该词的样本第一类概率。以“张小明”为例,该词属于A中的第一类对象,属于B中的第二类对象。所以该词对应的样本第一类概率中表征属于A中的第一类对象的概率为1,表征属于A中的第二类对象的概率为0,表征属于B中的第一类对象的概率为0,表征属于B中的第二类对象的概率为1,表征属于C中的第一类对象的概率为0,表征属于C中的第二类对象的概率为0。再例如,词“迎娶”不属于上述标记的三元组中任一对象,所以该词的样本第一类概率中的各个概率均为0。The sample first-type probability of the word can be determined by whether each word in the sample text F belongs to the object contained in the marked triplet. Take "Zhang Xiaoming" as an example. The word belongs to the first type of object in A and the second type of object in B. Therefore, in the first-type probability of the word corresponding to the sample, the probability that it belongs to the first-type object in A is 1, the probability that it belongs to the second-type object in A is 0, and the probability that it belongs to the first-type object in B If it is 0, the probability that it belongs to the second type of object in B is 1, the probability that it is the first type of object in C is 0, and the probability that it is the second type of object in C is 0. For another example, the word "marry" does not belong to any object in the above-mentioned marked triplet, so each probability in the first-type probability of the sample of the word is 0.
In addition, the sample second-type probability of each word in sample text F can be determined by whether the word belongs to any first-type object in the labeled triples and whether it belongs to any second-type object in the labeled triples. Taking "Zhang Xiaoming" as an example: the word is the first-type object in A, so the probability that it belongs to any first-type object is 1; and because it is also the second-type object in B, the probability that it belongs to any second-type object is 1.
It should be noted that the above method of determining the sample second-type probability is: if a word belongs to the first-type object of any labeled triple, the word is determined to belong to the first-type object of some triple relationship; if it belongs to the second-type object of any labeled triple, it is determined to belong to the second-type object of some triple relationship.
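As an illustrative sketch only (not part of the application; the relation list, the triple layout, and the helper name `sample_labels` are assumptions introduced here), the labeling scheme described above can be expressed as follows, producing a 2*n-dimensional sample first-type label vector and a 2-dimensional sample second-type label vector per word:

```python
# Hypothetical sketch of the labeling scheme; names and layout are assumptions.
# Relations: A [person-husband-person], B [person-wife-person], C [book-author-person].
RELATIONS = ["A", "B", "C"]
# Labeled triples as (first-type object, relation, second-type object).
TRIPLES = [("Zhang Xiaoming", "A", "Li Xiaohong"),
           ("Li Xiaohong", "B", "Zhang Xiaoming")]

def sample_labels(word):
    """Return (first-type labels of length 2*n, second-type labels of length 2)."""
    n = len(RELATIONS)
    first = [0.0] * (2 * n)   # slots: [A-obj1, A-obj2, B-obj1, B-obj2, C-obj1, C-obj2]
    second = [0.0, 0.0]       # [is any first-type object, is any second-type object]
    for subj, rel, obj in TRIPLES:
        i = RELATIONS.index(rel)
        if word == subj:      # word is the first-type object of this triple
            first[2 * i] = 1.0
            second[0] = 1.0
        if word == obj:       # word is the second-type object of this triple
            first[2 * i + 1] = 1.0
            second[1] = 1.0
    return first, second
```

With the two labeled triples above, `sample_labels("Zhang Xiaoming")` gives first-type labels [1, 0, 0, 1, 0, 0] and second-type labels [1, 1], matching the worked example, while a word such as "marry" gets all zeros.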
Using the difference between the sample first-type probability in the labeling information and the predicted first-type probability, and the difference between the sample second-type probability and the predicted second-type probability, the parameters of the preset model are determined iteratively; after multiple parameter updates, the trained model is obtained.
As described above, the preset model in the embodiments of the present application can output the first-type probability and the second-type probability of each word, where the first-type probability represents the probability that the word belongs to an object contained in a triple, and the second-type probability represents the probability that the word belongs to any first-type object and the probability that it belongs to any second-type object. The two types of probabilities can therefore verify each other when determining target words, further improving the accuracy of the triple extraction method.
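To make the mutual verification concrete, here is a minimal sketch (the threshold values, the even/odd slot convention for first-/second-type objects, and the function name are illustrative assumptions, not the application's specification):

```python
def is_target_word(first_probs, second_probs, t1=0.5, t2=0.5):
    """A word is a target word only if some per-relation object probability
    exceeds the first threshold AND the corresponding aggregate probability
    (any first- or any second-type object) exceeds the second threshold,
    so the two outputs verify each other."""
    # first_probs: 2*n per-relation object probabilities
    # second_probs: [P(any first-type object), P(any second-type object)]
    for i, p in enumerate(first_probs):
        agg = second_probs[0] if i % 2 == 0 else second_probs[1]
        if p > t1 and agg > t2:
            return True
    return False
```

For example, a high per-relation score that the second-type output does not corroborate is rejected, which is the accuracy benefit described above.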
An embodiment of the present application further provides a triple extraction apparatus, described below; the triple extraction apparatus described below and the triple extraction method described above may be referred to correspondingly.
Referring to FIG. 5, which shows a schematic structural diagram of a triple extraction apparatus provided by an embodiment of the present application, the apparatus may include:
a word vector obtaining unit 501, configured to obtain the word vector of each word composing the text to be recognized;
a model prediction unit 502, configured to input the word vectors into a preset model to obtain a recognition result corresponding to each word vector, the recognition result including the first-type probability of the word vector, where the first-type probability of any word vector represents the probability that the word corresponding to the word vector belongs to an object contained in a triple;
a target word determining unit 503, configured to determine any word that satisfies a preset condition as a target word, the preset condition including that the probability that the word belongs to any object in a triple is greater than a first preset threshold; and
a triple determining unit 504, configured to compose triples from target word pairs that belong to objects in the same triple, according to the relationship between the objects in that triple.
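How a unit such as 504 might compose triples can be sketched as follows (the data layout, the function name, and the predicate strings are assumptions; the application does not prescribe an implementation):

```python
def compose_triples(targets_by_relation, predicates):
    """targets_by_relation maps a relation id to (first-type words, second-type words);
    predicates maps a relation id to its predicate string, e.g. "husband".
    Every first-type/second-type target word pair of a relation forms a triple."""
    triples = []
    for rel, (subjects, objects) in targets_by_relation.items():
        for s in subjects:
            for o in objects:
                triples.append((s, predicates[rel], o))
    return triples
```

For instance, pairing the target words of relation A would yield ("Zhang Xiaoming", "husband", "Li Xiaohong").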
Optionally, the word vector includes at least one of a word-sense vector and a part-of-speech vector.
Optionally, the objects contained in a triple relationship include first-type objects and second-type objects, and the recognition result further includes:
the second-type probability of the word vector, where the second-type probability of any word vector represents the probability that the word corresponding to the word vector belongs to a first-type object and the probability that it belongs to a second-type object.
Optionally, the preset condition further includes:
the probability that the word belongs to a first-type object, or the probability that it belongs to a second-type object, being greater than a second preset threshold.
Optionally, the preset model includes an encoder and a decoder;
the encoder is configured to obtain a feature vector from the word vectors;
the decoder includes a first decoding module and a second decoding module;
wherein the first decoding module is configured to determine the first-type probability according to the feature vector, and the second decoding module is configured to determine the second-type probability according to the feature vector.
Optionally, the first decoding module includes 2*n sigmoid functions, where n is the number of triple relationships learned by the preset model, and the second decoding module includes 2 sigmoid functions.
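A minimal sketch of such a decoder head follows (plain linear-plus-sigmoid layers; the weight layout, the omission of biases, and the function name are simplifying assumptions, not the application's architecture):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def decode(feature, W1, W2, n):
    """First decoding module: 2*n sigmoid outputs (per-relation object probabilities).
    Second decoding module: 2 sigmoid outputs (any first-/second-type object).
    W1 is a list of 2*n weight rows, W2 a list of 2 weight rows; biases omitted."""
    dot = lambda w, f: sum(wi * fi for wi, fi in zip(w, f))
    first = [sigmoid(dot(w, feature)) for w in W1]   # length 2*n
    second = [sigmoid(dot(w, feature)) for w in W2]  # length 2
    assert len(first) == 2 * n
    return first, second
```

With zero weights, every sigmoid output is 0.5, which illustrates the output shapes: 2*n first-type probabilities and 2 second-type probabilities per feature vector.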
Optionally, the apparatus further includes a preset model training module, configured to train the preset model, and specifically configured to:
input sample word vectors into the preset model to be trained to obtain output recognition results, the recognition results including the first-type probability and the second-type probability of each sample word vector, where the sample word vectors include the word vectors of the words in a sample text; and
obtain the parameters of the preset model using the difference between the sample first-type probability in the labeling information and the predicted first-type probability, and the difference between the sample second-type probability in the labeling information and the predicted second-type probability, where the sample first-type probability and the sample second-type probability are determined according to the triples labeled in the sample text.
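The "difference" driving the parameter updates is not specified further; a common concrete choice for multi-sigmoid outputs is binary cross-entropy, sketched here under that assumption (function names are hypothetical):

```python
import math

def label_loss(pred_first, sample_first, pred_second, sample_second, eps=1e-12):
    """Sum of binary cross-entropies between predicted probabilities and the
    sample (label) probabilities; gradients of this loss would drive the
    iterative parameter updates described above."""
    def bce(preds, labels):
        return -sum(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
                    for p, y in zip(preds, labels))
    return bce(pred_first, sample_first) + bce(pred_second, sample_second)
```

Predictions matching the labels give a near-zero loss, while confident wrong predictions are penalized heavily, which is the behavior the iterative update relies on.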
The triple extraction apparatus includes a processor and a memory. The word vector obtaining unit 501, the model prediction unit 502, the target word determining unit 503, the triple determining unit 504, and so on are all stored in the memory as program units, and the processor executes the above program units stored in the memory to implement the corresponding functions.
The processor contains a kernel, and the kernel retrieves the corresponding program unit from the memory. One or more kernels may be provided, and the accuracy of triple extraction can be improved by adjusting kernel parameters.
An embodiment of the present invention provides a storage medium on which a program is stored; when the program is executed by a processor, the triple extraction method is implemented.
An embodiment of the present invention provides a processor configured to run a program, wherein the triple extraction method is executed when the program runs.
An embodiment of the present application further provides a triple extraction device. Referring to FIG. 6, which shows a schematic structural diagram of the triple extraction device 60, the device includes at least one processor 601, at least one memory 602 connected to the processor, and a bus 603, where the processor 601 and the memory 602 communicate with each other through the bus 603, and the processor is configured to call the program instructions in the memory to execute the above triple extraction method. The device herein may be a server, a PC, a PAD, a mobile phone, or the like.
The present application also provides a computer program product which, when executed on a data processing device, is adapted to execute a program initialized with the following method steps:
obtaining the word vector of each word composing the text to be recognized;
inputting the word vectors into a preset model to obtain a recognition result corresponding to each word vector, the recognition result including the first-type probability of the word vector, where the first-type probability of any word vector represents the probability that the word corresponding to the word vector belongs to an object contained in a triple relationship;
when any word satisfies a preset condition, determining the word as a target word, the preset condition including that the probability that the word belongs to any object in a triple relationship is greater than a first preset threshold; and
composing triples from target word pairs that belong to objects in the same triple relationship, according to the relationship between the objects in that triple relationship.
Optionally, the word vector includes at least one of a word-sense vector and a part-of-speech vector.
Optionally, the objects contained in a triple relationship include first-type objects and second-type objects, and the recognition result further includes:
the second-type probability of the word vector, where the second-type probability of any word vector represents the probability that the word corresponding to the word vector belongs to a first-type object and the probability that it belongs to a second-type object.
Optionally, the preset condition further includes:
the probability that the word belongs to a first-type object, or the probability that it belongs to a second-type object, being greater than a second preset threshold.
Optionally, the preset model includes an encoder and a decoder;
the encoder is configured to obtain a feature vector from the word vectors;
the decoder includes a first decoding module and a second decoding module;
wherein the first decoding module is configured to determine the first-type probability according to the feature vector, and the second decoding module is configured to determine the second-type probability according to the feature vector.
Optionally, the first decoding module includes 2*n sigmoid functions, where n is the number of triple relationships learned by the preset model, and the second decoding module includes 2 sigmoid functions.
Optionally, the training process of the preset model includes:
inputting sample word vectors into the preset model to be trained to obtain output recognition results, the recognition results including the first-type probability and the second-type probability of each sample word vector, where the sample word vectors include the word vectors of the words in a sample text; and
obtaining the parameters of the preset model using the difference between the sample first-type probability in the labeling information and the predicted first-type probability, and the difference between the sample second-type probability in the labeling information and the predicted second-type probability, where the sample first-type probability and the sample second-type probability are determined according to the triples labeled in the sample text.
The present application is described with reference to flowcharts and/or block diagrams of the method, device (system), and computer program product according to the embodiments of the present application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks therein, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
In a typical configuration, the device includes one or more processors (CPUs), a memory, and a bus. The device may also include an input/output interface, a network interface, and so on.
The memory may include forms of computer-readable media such as non-persistent memory, random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM); the memory includes at least one memory chip. The memory is an example of a computer-readable medium.
Computer-readable media include persistent and non-persistent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission media that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory media, such as modulated data signals and carrier waves.
It should also be noted that the terms "comprise", "include", or any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "comprising a..." does not exclude the existence of other identical elements in the process, method, article, or device that includes the element.
Those skilled in the art should understand that the embodiments of the present application may be provided as a method, a system, or a computer program product. Therefore, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
The above are merely embodiments of the present application and are not intended to limit it. Various modifications and changes may be made to the present application by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present application shall fall within the scope of the claims of the present application.

Claims (10)

  1. A triple extraction method, characterized in that it comprises:
    obtaining the word vector of each word composing the text to be recognized;
    inputting the word vectors into a preset model to obtain a recognition result corresponding to each word vector, the recognition result including the first-type probability of the word vector, where the first-type probability of any word vector represents the probability that the word corresponding to the word vector belongs to an object contained in a triple relationship;
    when any word satisfies a preset condition, determining the word as a target word, the preset condition including that the probability that the word belongs to any object in a triple relationship is greater than a first preset threshold; and
    composing triples from target word pairs that belong to objects in the same triple relationship, according to the relationship between the objects in that triple relationship.
  2. The triple extraction method according to claim 1, wherein the word vector includes at least one of a word-sense vector and a part-of-speech vector.
  3. The triple extraction method according to claim 1, wherein the objects contained in the triple relationship include first-type objects and second-type objects, and the recognition result further includes:
    the second-type probability of the word vector, where the second-type probability of any word vector represents the probability that the word corresponding to the word vector belongs to a first-type object and the probability that it belongs to a second-type object.
  4. The triple extraction method according to claim 3, wherein the preset condition further includes:
    the probability that the word belongs to a first-type object, or the probability that it belongs to a second-type object, being greater than a second preset threshold.
  5. The triple extraction method according to claim 4, wherein the preset model includes an encoder and a decoder;
    the encoder is configured to obtain a feature vector from the word vectors;
    the decoder includes a first decoding module and a second decoding module;
    wherein the first decoding module is configured to determine the first-type probability according to the feature vector, and the second decoding module is configured to determine the second-type probability according to the feature vector.
  6. The triple extraction method according to claim 5, wherein the first decoding module includes 2*n sigmoid functions, where n is the number of triple relationships learned by the preset model, and the second decoding module includes 2 sigmoid functions.
  7. The triple extraction method according to claim 5 or 6, wherein the training process of the preset model includes:
    inputting sample word vectors into the preset model to be trained to obtain output recognition results, the recognition results including the first-type probability and the second-type probability of each sample word vector, where the sample word vectors include the word vectors of the words in a sample text; and
    obtaining the parameters of the preset model using the difference between the sample first-type probability in the labeling information and the predicted first-type probability, and the difference between the sample second-type probability in the labeling information and the predicted second-type probability, where the sample first-type probability and the sample second-type probability are determined according to the triples labeled in the sample text.
  8. A triple extraction apparatus, characterized in that it comprises:
    a word vector obtaining unit, configured to obtain the word vector of each word composing the text to be recognized;
    a model prediction unit, configured to input the word vectors into a preset model to obtain a recognition result corresponding to each word vector, the recognition result including the first-type probability of the word vector, where the first-type probability of any word vector represents the probability that the word corresponding to the word vector belongs to an object contained in a triple relationship;
    a target word determining unit, configured to determine any word that satisfies a preset condition as a target word, the preset condition including that the probability that the word belongs to any object in a triple relationship is greater than a first preset threshold; and
    a triple determining unit, configured to compose triples from target word pairs that belong to objects in the same triple relationship, according to the relationship between the objects in that triple relationship.
  9. A triple extraction device, characterized in that it comprises: a memory and a processor;
    the memory being configured to store a program; and
    the processor being configured to execute the program to implement the steps of the triple extraction method according to any one of claims 1 to 7.
  10. A storage medium on which a program is stored, characterized in that, when the program is executed by a processor, the steps of the triple extraction method according to any one of claims 1 to 7 are implemented.
PCT/CN2020/103209 2019-09-30 2020-07-21 3-tuple extraction method, device, apparatus, and storage medium WO2021063086A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910942438.4 2019-09-30
CN201910942438.4A CN112668332A (en) 2019-09-30 2019-09-30 Triple extraction method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
WO2021063086A1 true WO2021063086A1 (en) 2021-04-08

Family

ID=75336763

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/103209 WO2021063086A1 (en) 2019-09-30 2020-07-21 3-tuple extraction method, device, apparatus, and storage medium

Country Status (2)

Country Link
CN (1) CN112668332A (en)
WO (1) WO2021063086A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104866498A (en) * 2014-02-24 2015-08-26 华为技术有限公司 Information processing method and device
CN108021705A (en) * 2017-12-27 2018-05-11 中科鼎富(北京)科技发展有限公司 A kind of answer generation method and device
CN110196913A (en) * 2019-05-23 2019-09-03 北京邮电大学 Multiple entity relationship joint abstracting method and device based on text generation formula
WO2019173085A1 (en) * 2018-03-06 2019-09-12 Microsoft Technology Licensing, Llc Intelligent knowledge-learning and question-answering

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107783960B (en) * 2017-10-23 2021-07-23 百度在线网络技术(北京)有限公司 Method, device and equipment for extracting information
CN108280062A (en) * 2018-01-19 2018-07-13 北京邮电大学 Entity based on deep learning and entity-relationship recognition method and device


Also Published As

Publication number Publication date
CN112668332A (en) 2021-04-16


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20871994

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20871994

Country of ref document: EP

Kind code of ref document: A1