CN112668332A - Triple extraction method, device, equipment and storage medium - Google Patents

Triple extraction method, device, equipment and storage medium

Info

Publication number
CN112668332A
Authority
CN
China
Prior art keywords
word
probability
class
triple
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910942438.4A
Other languages
Chinese (zh)
Inventor
戴泽辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Gridsum Technology Co Ltd
Original Assignee
Beijing Gridsum Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Gridsum Technology Co Ltd filed Critical Beijing Gridsum Technology Co Ltd
Priority to CN201910942438.4A priority Critical patent/CN112668332A/en
Priority to PCT/CN2020/103209 priority patent/WO2021063086A1/en
Publication of CN112668332A publication Critical patent/CN112668332A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28Databases characterised by their database models, e.g. relational or object models
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Machine Translation (AREA)

Abstract

The application discloses a triple extraction method, apparatus, device and storage medium. Word vectors of all the words in a text to be recognized are input into a preset model, and a recognition result corresponding to each word vector is obtained. The recognition result corresponding to each word vector represents the probability that the word corresponding to the word vector belongs to an object contained in a triple relation. When the probability that any word belongs to any object is greater than a first preset threshold, the word is determined to be a target word, and any pair of target words among the target words forms a triple according to the relation corresponding to that pair. Because any pair of target words comprises two words belonging to objects contained in the same triple relation, the relation of the target word pair is the relation, within that triple relation, between the objects to which the pair belongs. According to the technical scheme provided by the application, all triples in the text to be recognized can be obtained without using a model in multiple steps.

Description

Triple extraction method, device, equipment and storage medium
Technical Field
The present invention relates to the field of electronic information technology, and in particular to a method, an apparatus, a device and a storage medium for extracting triples.
Background
With the development and popularization of Internet technology, networks have become an indispensable part of most people's daily lives. For the large amount of unstructured text on the Internet, a knowledge graph can be established by extracting the triples it contains, which is of great significance and value for downstream tasks such as retrieval, recommendation and query.
Triple extraction refers to extracting, according to a rule, the subjects and objects contained in unstructured text and the relations between them. For example, given the unstructured text "Zhang Xiaoming married Li Xiaohong in 2018" and rule A: [person-husband-person], a triple can be extracted from the unstructured text: [Zhang Xiaoming-husband-Li Xiaohong]. Currently, triples can be extracted from unstructured text by a multi-step pipeline method, for example a two-step method. The specific steps may be: first, extract the entities contained in the text; second, classify and associate the extracted entities, and finally determine the relations between them. For example, the entities, namely the names "Zhang Xiaoming" and "Li Xiaohong", can be extracted from the unstructured text, and the names are then classified and associated to determine that "Zhang Xiaoming" is the "husband" of "Li Xiaohong", thereby obtaining the triple: [Zhang Xiaoming-husband-Li Xiaohong].
Because the triples are extracted in multiple steps, entity extraction and relation determination are each completed by a model, the errors of the model used in each step accumulate, and information cannot be shared between the steps, the accuracy of the triple extraction results obtained by the conventional multi-step pipeline method is low.
Disclosure of Invention
In view of the above problems, the present invention provides a method, an apparatus, a device and a storage medium for extracting triples which overcome, or at least partially solve, the above problems, as follows:
a method of triplet extraction comprising:
acquiring word vectors of all words forming a text to be recognized;
inputting the word vectors into a preset model to obtain a recognition result corresponding to each word vector; the recognition result comprises a first-class probability of the word vector, wherein the first-class probability of any word vector represents the probability that the word corresponding to the word vector belongs to an object contained in a triple relation;
determining that any word is a target word under the condition that the word meets a preset condition, wherein the preset condition comprises that the probability that the word belongs to any object in the triple relation is greater than a first preset threshold value;
and forming a triple by using the target word pairs belonging to the objects in the same triple relation according to the relation between the objects in the triple relation.
Optionally, the word vector comprises at least one of: word sense vectors and part-of-speech vectors.
Optionally, the objects included in the triple relationship include a first class object and a second class object, and the recognition result further includes:
the second-class probability of the word vector, wherein the second-class probability of any word vector represents the probability that the word corresponding to the word vector belongs to the first class of objects and the probability that it belongs to the second class of objects.
Optionally, the preset conditions further include:
the probability that the word belongs to the first class of objects or the probability that the word belongs to the second class of objects is greater than a second preset threshold.
Optionally, the preset model comprises an encoder and a decoder;
the encoder is used for obtaining a feature vector from the word vector;
the decoder comprises a first decoding module and a second decoding module;
the first decoding module is configured to determine the first class probability according to the feature vector, and the second decoding module is configured to determine the second class probability according to the feature vector.
Optionally, the first decoding module includes 2 × n sigmoid functions, where n is the number of triples learned by the preset model, and the second decoding module includes 2 sigmoid functions.
Optionally, the training process of the preset model includes:
inputting a sample word vector into a preset model to be trained to obtain an output recognition result, wherein the recognition result comprises a first class probability and a second class probability of the sample word vector; the sample word vector comprises a word vector of words in a sample text;
and obtaining the parameters of the preset model by using the difference between the first-class probability and the sample first-class probability in the annotation information, and the difference between the second-class probability and the sample second-class probability in the annotation information, wherein the sample first-class probability and the sample second-class probability are determined according to the triples marked in the sample text.
A triplet extraction device comprising:
the word vector acquiring unit is used for acquiring word vectors of all words forming the text to be recognized;
the model prediction unit is used for inputting the word vectors into a preset model to obtain a recognition result corresponding to each word vector; the recognition result comprises a first-class probability of the word vector, wherein the first-class probability of any word vector represents the probability that the word corresponding to the word vector belongs to an object contained in a triple relation;
the target word determining unit is used for determining that any word is the target word under the condition that the word meets a preset condition, wherein the preset condition comprises that the probability that the word belongs to any object in the triple relation is larger than a first preset threshold value;
and the triple determining unit is used for forming a triple according to the relation between the objects in the triple relation by using the target word pairs belonging to the objects in the same triple relation.
A triplet extraction device comprising: a memory and a processor;
the memory is used for storing programs;
the processor is configured to execute the program to implement the steps of the triplet extraction method as described above.
A storage medium having a program stored thereon, wherein the program, when executed by a processor, implements the steps of the triplet extraction method as described above.
By means of the above technical scheme, in the triple extraction method provided by the invention, the word vectors of all the words in the text to be recognized are input into the preset model, and the recognition result corresponding to each word vector is obtained. The recognition result corresponding to each word vector represents the probability that the word corresponding to the word vector belongs to an object contained in a triple relation; when the probability that any word belongs to any object is greater than a first preset threshold, the word is determined to be a target word, and any pair of target words among the target words forms a triple according to the relation corresponding to that pair. Because any pair of target words comprises two words belonging to objects contained in the same triple relation, the relation of the target word pair is the relation, within that triple relation, between the objects to which the pair belongs. According to the technical scheme provided by the application, the preset model can output the probability that each word belongs to every object contained in the triple relations learned by the model, and rule judgment is then performed on the model output, so that all triples in the text to be recognized can be obtained; that is, the model does not need to be used in multiple steps. Moreover, because the output of a single model is used as the basis for the rule judgment, the accumulation of model errors is avoided, and the accuracy of the result is improved.
The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
fig. 1 is a schematic flowchart of a triple extraction method according to an embodiment of the present application;
fig. 2 is a schematic flowchart of another implementation manner of a triple extraction method according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of a default model according to an embodiment of the present disclosure;
fig. 4 is a schematic diagram of a training process of a preset model according to an embodiment of the present disclosure;
fig. 5 is a schematic structural diagram of a triple extracting apparatus according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of a triple extraction equipment according to an embodiment of the present application.
Detailed Description
The triple extraction method provided by the embodiment of the application can be applied to intelligent equipment such as a computer, a tablet or a smart phone, or can be applied to a server with a preset text processing system.
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
Fig. 1 is a schematic flow chart of a triple extraction method provided in an embodiment of the present application, where the method may specifically include:
S101: acquiring word vectors.
Specifically, the word vector includes word vectors of respective words constituting the text to be recognized. The text to be recognized is an unstructured text which needs triple extraction. In general, the text to be recognized may comprise at least one sentence, each sentence consisting of words and possibly containing punctuation marks. After the word segmentation processing is carried out on the text to be recognized, all words forming the text to be recognized can be obtained. The word vector obtained in this step is a word vector generated by mapping each word constituting the text to be recognized to a vector space.
Word segmentation means splitting the sentences of a text according to a preset segmentation standard and removing punctuation marks. For example, the text W to be recognized is: "The favorite novel of Zhang Xiaoming's newly-married wife Li Xiaohong is Mr. Qian Zhongshu's Fortress Besieged". After word segmentation, the words forming the text W to be recognized include: "Zhang Xiaoming", "'s", "newly-married wife", "Li Xiaohong", "most", "likes", "'s", "novel", "is", "Qian Zhongshu", "Mr.", "'s", "Fortress Besieged". It should be noted that word segmentation is prior art in the field of natural language processing; for example, existing tool software (e.g., HIT's LTP or jieba) may be used to segment text sentences, and this process is not described further in this disclosure.
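Word segmentation itself is not part of the claimed scheme; as a toy illustration only (real systems would use LTP or jieba, and the vocabulary below is invented for the example), a forward-maximum-matching segmenter can be sketched in Python as follows:

```python
def fmm_segment(text, vocab, max_len=4):
    """Forward-maximum-matching segmentation (a toy stand-in for tools
    such as LTP or jieba): at each position, take the longest dictionary
    word that matches, falling back to a single character."""
    words, i = [], 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if length == 1 or piece in vocab:
                words.append(piece)
                i += length
                break
    return words

# Illustrative vocabulary covering part of the example text W.
vocab = {"张小明", "新婚", "妻子", "李小红"}
segments = fmm_segment("张小明新婚妻子李小红", vocab)
```

Running this yields the word list ["张小明", "新婚", "妻子", "李小红"]; production segmenters additionally handle ambiguity and out-of-vocabulary names, which this sketch does not.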
S102: and inputting the word vectors into a preset model to obtain a recognition result corresponding to each word vector.
Specifically, each triple relation is a rule; for example, rule A [person-husband-person] is a triple relation. The preset model may learn a plurality of triple relations, each of which includes two objects. For example, the two objects contained in the triple relation A [person-husband-person] are both "person". If the number of pre-configured triple relations the model needs to learn is n, the number of objects is 2 × n.
Each triple relation may include a first class object and a second class object: the method records the former object in each triple relation as the first-class object and the latter object as the second-class object, so the objects comprise n first-class objects and n second-class objects.
The recognition result corresponding to any word vector output by the model comprises a first-class probability of the word vector, which represents the probability that the word generating the word vector belongs to an object. Since the number of objects is 2 × n, the first-class probability includes 2 × n probabilities, each of which characterizes the probability that the word generating the word vector belongs to one object.
For example, suppose the model learns n triple relations, s1, s2, …, sn, and the text to be recognized yields m word vectors, w1, w2, …, wm. Then the recognition result corresponding to w1 may include: the probability p11 that the word generating w1 belongs to the first class of objects in s1, the probability p12 that it belongs to the second class of objects in s1, the probability p21 that it belongs to the first class of objects in s2, the probability p22 that it belongs to the second class of objects in s2, …, the probability pn1 that it belongs to the first class of objects in sn, and the probability pn2 that it belongs to the second class of objects in sn.
It should be noted that the first class probability may be predicted by 2 × n sigmoid functions in a preset model.
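As a minimal sketch of this decoding step (the feature dimension, weights and function names below are illustrative, not taken from the application), the first decoding module amounts to 2 × n independent sigmoid units applied to a word's feature vector:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def first_class_probabilities(feature, weights, biases):
    """Apply 2*n independent sigmoid units to one word's feature vector.

    feature: list of d floats (encoder output for one word)
    weights: 2*n weight vectors, each of length d (illustrative values)
    biases:  2*n floats
    Entry 2*i is the probability for the first-class object of relation i,
    entry 2*i + 1 the probability for its second-class object.
    """
    scores = [sum(wj * fj for wj, fj in zip(w, feature)) + b
              for w, b in zip(weights, biases)]
    return [sigmoid(s) for s in scores]

# Tiny example: d = 2 features, n = 1 relation -> 2 sigmoid units.
probs = first_class_probabilities([1.0, 2.0],
                                  [[0.5, 0.5], [-0.5, -0.5]],
                                  [0.0, 0.0])
```

Because each unit is an independent sigmoid rather than a softmax across objects, several probabilities can exceed the threshold simultaneously, which is what lets one word belong to objects in several triple relations.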
S103: and under the condition that any word meets the preset condition, determining the word as a target word.
Specifically, for any word, the preset condition includes that the probability that the word belongs to any object is greater than a first preset threshold. The first preset threshold is the threshold set for rule judgment on the model output; generally, it may be set to 0.5. That is, when the probability, represented by the first-class probability corresponding to the word vector generated by any word, that the word belongs to an object is greater than the first preset threshold, the word is determined to belong to that object, and the word is taken as a target word.
For example, the text W to be recognized is: "The favorite novel of Zhang Xiaoming's newly-married wife Li Xiaohong is Mr. Qian Zhongshu's Fortress Besieged". In the first-class probability corresponding to the word vector generating the word "Zhang Xiaoming", the probability that "Zhang Xiaoming" belongs to the first class of objects in the triple relation [person-husband-person] is greater than 0.5. In the first-class probability corresponding to the word vector generating the word "Li Xiaohong", the probability that "Li Xiaohong" belongs to the second class of objects in the triple relation [person-husband-person] is greater than 0.5. That is, both the words "Zhang Xiaoming" and "Li Xiaohong" satisfy the preset condition and are determined to be target words.
It can be understood that any target word determined in this step may belong to a plurality of objects. For example, in the first-class probability corresponding to the word vector generating the word "Zhang Xiaoming", the probability that "Zhang Xiaoming" belongs to the first class of objects in the triple relation [person-husband-person] is greater than 0.5, and the probability that "Zhang Xiaoming" belongs to the second class of objects in the triple relation [person-wife-person] is also greater than 0.5. Therefore, the target word "Zhang Xiaoming" belongs to two objects.
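The thresholding in S103 can be sketched as follows (a simplified illustration; the probability values are invented inputs, and as noted above a single word may qualify for several objects at once):

```python
def select_target_words(first_class_probs, threshold=0.5):
    """Return {word_index: [object indices]} for every word whose
    probability of belonging to at least one object exceeds the
    threshold; a word may hit several objects simultaneously."""
    targets = {}
    for idx, probs in enumerate(first_class_probs):
        hits = [obj for obj, p in enumerate(probs) if p > threshold]
        if hits:
            targets[idx] = hits
    return targets

# Example: 4 words, n = 2 relations -> 4 object slots per word.
# Word 0 exceeds the threshold for object 0 (relation A, first class)
# AND object 3 (relation B, second class) at the same time.
probs = [[0.9, 0.1, 0.1, 0.8],
         [0.1, 0.1, 0.1, 0.1],
         [0.2, 0.1, 0.1, 0.1],
         [0.1, 0.7, 0.6, 0.1]]
targets = select_target_words(probs)
```

Here words 0 and 3 become target words and the others are discarded, mirroring how "Zhang Xiaoming" and "Li Xiaohong" are selected in the example.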
S104: and forming the triples by the target word pairs belonging to the objects in the same triple according to the relation between the objects in the triples.
Specifically, a pair of target words comprises two target words belonging to the two objects in the same triple relation. The objects in any triple relation include a first-class object and a second-class object, and the relation between the two objects can be represented by a relation label in the triple. The relation corresponding to the pair of target words is then the relation represented by the relation label in that triple relation.
For example, in the triple relation C [book-author-person], the first object "book" is a first-class object, the second object "person" is a second-class object, and the relation between them is that the "person" is the "author" of the "book". In the text W to be recognized, the probability that "Fortress Besieged" belongs to an object (namely the first-class object "book" in triple relation C) is greater than 0.5, and the probability that "Qian Zhongshu" belongs to an object (namely the second-class object "person" in triple relation C) is greater than 0.5. So "Fortress Besieged" and "Qian Zhongshu" are a pair of target words whose relation is represented by the relation label in triple relation C, i.e., "Qian Zhongshu" is the "author" of "Fortress Besieged". Accordingly, the triple [Fortress Besieged-author-Qian Zhongshu] is formed from "Fortress Besieged" and "Qian Zhongshu" according to the relation "author".
It should be noted that, among all the target words, a plurality of target words may belong to the same object, so the following situation may occur: several target words belong to the first class of objects in a triple relation, and several target words belong to the second class of objects in the same triple relation. In this case, the target word pairs can be selected according to the positions of the target words in the text to be recognized; generally, the two target words whose positions are closest and which belong to the objects in the same triple relation are taken as a pair.
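The nearest-position pairing described above can be sketched as follows (an illustrative interpretation only; the application does not prescribe a tie-breaking rule, and the positions and relation label below are invented for the example):

```python
def pair_targets(first_positions, second_positions, relation_label):
    """Pair each first-class target word with the nearest second-class
    target word of the same triple relation (positions are word indices
    in the text), yielding (subject_pos, relation, object_pos) triples."""
    triples = []
    for i in first_positions:
        if not second_positions:
            break
        j = min(second_positions, key=lambda p: abs(p - i))
        triples.append((i, relation_label, j))
    return triples

# Two first-class candidates at positions 0 and 10, one second-class
# candidate at position 3: each subject pairs with the closest object.
pairs = pair_targets([0, 10], [3], "husband")
```

This produces [(0, "husband", 3), (10, "husband", 3)]; a production system would likely add constraints (e.g. one-to-one matching) that the patent leaves open.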
By means of the above technical scheme, in the triple extraction method provided by the invention, the word vectors of all the words in the text to be recognized are input into the preset model, and the recognition result corresponding to each word vector is obtained. The recognition result corresponding to each word vector represents the probability that the word corresponding to the word vector belongs to an object contained in a triple relation; when the probability that any word belongs to any object is greater than a first preset threshold, the word is determined to be a target word, and any pair of target words among the target words forms a triple according to the relation corresponding to that pair. Because any pair of target words comprises two target words belonging to objects contained in the same triple relation, the relation of the target word pair is the relation, within that triple relation, between the objects to which the pair belongs. According to the technical scheme provided by the application, the model can output the probability that each word belongs to every object contained in the triple relations learned by the model, and rule judgment is then performed on the model output, so that all triples in the text to be recognized can be obtained; that is, the model does not need to be used in multiple steps. Moreover, because the output of a single model is used as the basis for the rule judgment, the accumulation of model errors is avoided, and the accuracy of the result is improved.
Further, since the triple relations learned by the model can be configured in multiple groups as needed, any target word may belong to multiple objects, that is, to objects in multiple triple relations; thus the multiple triples to which any word belongs can be extracted without multiple models. Therefore, the triple extraction method disclosed in the application greatly improves triple extraction efficiency.
In the process of implementing the present invention, the inventor found that, in order to improve the accuracy of triple extraction, a triple extraction system can be pre-established based on an LSTM-CRF (Long Short-Term Memory - Conditional Random Field) model: with the text to be recognized as input, the triples in the text data are extracted directly by the model, realizing end-to-end triple extraction. This avoids the error accumulation of the multi-step pipeline method and improves the accuracy of the extraction result.
However, since the CRF decoder in the LSTM-CRF model can only decode once, each word can belong to only one triple relation during extraction; that is, through one LSTM-CRF model, any word can belong to only one triple. When a word contained in the text to be recognized belongs to two or more triples, multiple LSTM-CRF models must be established at the same time to extract triples separately for the different triple relations, so the extraction efficiency of triples is low.
In the text W to be recognized described above, for triple relation A: [person-husband-person], a triple can be extracted from the text W: [Zhang Xiaoming-husband-Li Xiaohong]. For triple relation B: [person-wife-person], a triple can be extracted from the text W: [Li Xiaohong-wife-Zhang Xiaoming]. It can be seen that the two names "Zhang Xiaoming" and "Li Xiaohong" each belong to different triple relations at the same time. Therefore, when a triple extraction system is pre-established based on the LSTM-CRF model, two LSTM-CRF models need to be trained to extract triples separately for triple relation A and triple relation B, resulting in low execution efficiency.
In view of the above technical problems, the model in the triple extraction method provided by the present application can be configured with multiple triple relations as needed during learning, and any target word can belong to multiple objects at the same time. For example, in the text W to be recognized, among the first-class probabilities corresponding to the word vector generating the word "Zhang Xiaoming", the probability that "Zhang Xiaoming" belongs to the first class of objects in triple relation A [person-husband-person] is greater than 0.5. Meanwhile, the probability that "Zhang Xiaoming" belongs to the second class of objects in triple relation B [person-wife-person] is also greater than 0.5. Thus "Zhang Xiaoming" can belong to both the triple [Zhang Xiaoming-husband-Li Xiaohong] and the triple [Li Xiaohong-wife-Zhang Xiaoming], determined by triple relation A and triple relation B respectively.
Obviously, when extracting triples with overlapping relations, the technical scheme provided by the application offers both high accuracy and high efficiency.
It should be noted that the word vector obtained in S101 may include a word sense vector and a part-of-speech vector. The word sense vector of a word is obtained by mapping the sense of the word to a vector space, and represents the semantic information of the word. The part-of-speech vector of a word is obtained by mapping the part of speech of the word to a vector space, and represents the part-of-speech information of the word.
The word sense vector of each word can be obtained by looking up a word vector mapping set. The word vector mapping set is a set of word-to-sense-vector correspondences generated by performing word vector training on a corpus of the field to which the text to be recognized belongs. It maps words into a low-dimensional vector space, so that the similarity between the senses of words can be expressed through the relations between their sense vectors. A low-frequency word in the corpus can be denoted UNK; UNK has a unique vector expression in the mapping set, with the same dimension as the sense vectors of the other words.
For example, after word segmentation of the text E to be recognized, the k words forming the text are e1, e2, …, ek, and the word sense vectors generated by these words, h1, h2, …, hk, can be obtained through the word vector mapping set.
It should be noted that, when a constituent word of the text is a low-frequency word, its word sense vector is obtained from the mapping set as the UNK vector.
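The UNK fallback can be sketched as a dictionary lookup (the vectors below are invented for the example; a real mapping set would come from word vector training on a domain corpus):

```python
def sense_vector(word, word_vectors):
    """Look up a word's sense vector; low-frequency words absent from
    the mapping set fall back to the shared UNK vector, which has the
    same dimension as every other sense vector."""
    return word_vectors.get(word, word_vectors["UNK"])

# Invented 3-dimensional vectors for the example.
word_vectors = {
    "UNK":  [0.0, 0.0, 0.0],
    "wife": [0.2, -0.1, 0.4],
}
v = sense_vector("wife", word_vectors)
u = sense_vector("xyzzy", word_vectors)  # not in the set -> UNK vector
```

Because the UNK vector shares the dimension of the other sense vectors, low-frequency words can still be concatenated with their part-of-speech vectors and fed to the encoder unchanged.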
Obviously, the sense expressed by the same word differs across texts, and the part of speech of the word may differ as well, so part of speech may also influence the result of identifying the triple relations of a word. The part-of-speech information of each word in the text to be recognized can therefore be acquired; it is the part of speech to which the sense each word expresses in the text belongs. One acquisition method uses a random vector expression of a certain dimension: for example, for 30 parts of speech [A1, A2, …, A30], A1 can be represented by a part-of-speech vector a1, A2 by a part-of-speech vector a2, …, and A30 by a part-of-speech vector a30. The dimensions of the part-of-speech vectors a1, a2, …, a30 can be preset, and each component is a randomly generated decimal close to 0.
Further, for any word, the word sense vector and the part-of-speech vector generated by the word are input to a preset model, and step 102 is executed.
Optionally, the word vector obtained in S101 may also include only a word sense vector; that is, the word sense vector of each word in the text to be recognized is directly input to the preset model as its word vector, and step 102 is executed.
Fig. 2 is a schematic flowchart of another implementation manner of the triple extraction method according to the embodiment of the present application, as follows:
S201: acquiring word vectors.
The word vectors include the word vector generated by each word constituting the text to be recognized, and each word vector includes a word sense vector and a part-of-speech vector as described in the above embodiment. The obtaining method may refer to the foregoing embodiment and is not repeated here.
S202: inputting the word vectors into a preset model to obtain a recognition result corresponding to each word vector. The recognition result includes both the first class probability and the second class probability of the word vector.
Specifically, fig. 3 shows a schematic structural diagram of a preset model, which includes an encoder and a decoder.
As shown in fig. 3, the encoder may be a bidirectional LSTM model. The text E to be recognized is composed of the words e1, e2, ..., ek. For any word ej, the word sense vector hj and the part-of-speech vector aj generated for it are input to the encoder, which encodes them to obtain a feature vector Ej. The feature vector is obtained by splicing the word sense vector and the part-of-speech vector of the word.
For example, if the sense vector has a dimension of 100 and the part-of-speech vector has a dimension of 20, the dimension of the feature vector obtained by the encoder is 120. It should be noted that, if the number of words in the text to be recognized is m, the m word vectors are input to the encoder to obtain m 120-dimensional feature vectors; these vectors are arranged to form an m × 120 vector matrix, and the matrix can be expanded to a specific length by zero-padding as needed.
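The splicing and padding bookkeeping described above can be sketched as follows (the bidirectional LSTM encoding itself is omitted; plain Python lists stand in for tensors, and all names are illustrative):

```python
def feature_vector(sense_vec, pos_vec):
    """Splice the word sense vector and the part-of-speech vector.
    With a 100-dim sense vector and a 20-dim POS vector this yields
    a 120-dim feature vector."""
    return sense_vec + pos_vec

def pad_matrix(rows, target_len):
    """Zero-pad the list of feature vectors to a fixed length, expanding
    the m x 120 matrix to a specific length by adding 0s as needed."""
    dim = len(rows[0])
    return rows + [[0.0] * dim] * (target_len - len(rows))

h = [0.1] * 100   # word sense vector (dimension 100)
a = [0.2] * 20    # part-of-speech vector (dimension 20)
e = feature_vector(h, a)
print(len(e))     # 120

matrix = pad_matrix([e, e], target_len=5)
print(len(matrix), len(matrix[0]))  # 5 120
```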
As can be seen from the above embodiments, the preset model may learn n triple relationships, each of which includes two objects. Since the former object in each triple relationship is the first-class object and the latter object is the second-class object, there are n first-class objects and n second-class objects. The second class probability of any word vector represents the probability that the word generating the word vector belongs to a first-class object and the probability that it belongs to a second-class object; that is, the second class probability of any word includes two probability values.
The decoder in the model includes a first decoding module and a second decoding module.
The first decoding module comprises 2 × n sigmoid functions and is used to determine the first class probability from the feature vector. The first class probability is as described in step S102. As shown in fig. 3, taking the feature vector Ej as an example, the process of determining the first class probability is:
according to the feature vector Ej, the (2i−1)-th sigmoid function (i.e., sig2i−1, for i = 1, 2, ..., n) outputs the probability that the word corresponding to the feature vector belongs to the first-class object in the i-th triple relationship, and the (2i)-th sigmoid function (i.e., sig2i) outputs the probability that the word belongs to the second-class object in the i-th triple relationship.
The second decoding module comprises 2 sigmoid functions for determining the second class probability from the feature vector. As shown in fig. 3, taking the feature vector Ej as an example, the process of determining the second class probability is:
according to the feature vector Ej, the 1st sigmoid function (i.e., sig1) outputs the probability that the word corresponding to the feature vector belongs to the first-class object in any triple relationship, and the 2nd sigmoid function (i.e., sig2) outputs the probability that the word belongs to the second-class object in any triple relationship.
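A toy sketch of the two decoding modules, assuming each sigmoid unit is a dot product of the feature vector with a weight vector followed by the logistic function (the weight values, dimensions, and names here are illustrative, not trained parameters):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def decode(feature_vec, first_weights, second_weights):
    """first_weights holds 2*n weight vectors (the 2*n sigmoid units of
    the first decoding module); second_weights holds 2 weight vectors
    (the second decoding module). Each output is a probability in (0, 1)."""
    dot = lambda w, v: sum(wi * vi for wi, vi in zip(w, v))
    first_class = [sigmoid(dot(w, feature_vec)) for w in first_weights]
    second_class = [sigmoid(dot(w, feature_vec)) for w in second_weights]
    return first_class, second_class

n = 3                        # number of triple relationships (illustrative)
dim = 4                      # toy feature-vector dimension
ej = [0.5, -0.2, 0.1, 0.3]   # toy feature vector Ej
first_w = [[0.1 * (i + 1)] * dim for i in range(2 * n)]  # toy weights
second_w = [[0.05] * dim, [-0.05] * dim]

p_first, p_second = decode(ej, first_w, second_w)
print(len(p_first), len(p_second))  # 6 2
```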
S203: and under the condition that any word meets the preset condition, determining the word as a target word.
For any word, the preset condition further includes that the probability that the word belongs to the first class of objects, or the probability that the word belongs to the second class of objects, is greater than a second preset threshold. The second preset threshold is a threshold set for the rule-based determination performed on the model output, and may generally be set to 0.5. That is, a word is determined to be a target word when the probability, represented by the first class probability corresponding to the word vector generated by the word, that the word belongs to an object is greater than the first preset threshold, and the probability, represented by the second class probability, that the word belongs to the first-class object or the second-class object is greater than the second preset threshold.
Since the first class probability and the second class probability can verify each other, a specific method for determining the target word in this step is described as follows:
for example, suppose the model learns n triple relationships s1, s2, ..., sn, and the text to be recognized generates m word vectors w1, w2, ..., wm. The recognition result corresponding to w1 includes the first class probabilities: the probability p11 that the word generating w1 belongs to the first-class object in s1, the probability p12 that it belongs to the second-class object in s1, the probability p21 that it belongs to the first-class object in s2, the probability p22 that it belongs to the second-class object in s2, ..., the probability pn1 that it belongs to the first-class object in sn, and the probability pn2 that it belongs to the second-class object in sn. The recognition result corresponding to w1 further includes the second class probabilities: the probability p1 that the word generating w1 belongs to any first-class object, and the probability p2 that it belongs to any second-class object.
If p1 is greater than 0.5 and p11 is greater than 0.5, the word generating w1 is determined to belong to the first-class object in s1, and the word is determined to be a target word. If p1 is greater than 0.5 but p11 is less than or equal to 0.5, it is determined that the word generating w1 does not belong to the first-class object in s1. If p1 is less than or equal to 0.5 and p11 is less than or equal to 0.5, it is likewise determined that the word generating w1 does not belong to the first-class object in s1.
It is understood that when p1 is greater than 0.5, if any pr1 is greater than 0.5, the word generating w1 is determined to belong to the first-class object in sr and is determined to be a target word; when p2 is greater than 0.5, if any pr2 is greater than 0.5, the word generating w1 is determined to belong to the second-class object in sr and is determined to be a target word.
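The mutual-verification rule above can be sketched as follows (function and variable names are illustrative):

```python
def is_target_word(p_second, p_first, threshold=0.5):
    """Mutual verification of the two probability types for one word.

    p_second: (p1, p2), the probability the word belongs to any first-class
              object and any second-class object, respectively.
    p_first:  list of (pr1, pr2) pairs, one per triple relationship r.

    Returns the list of (relation_index, object_class) slots the word is
    assigned to; the word is a target word iff the list is non-empty.
    """
    p1, p2 = p_second
    hits = []
    for r, (pr1, pr2) in enumerate(p_first):
        if p1 > threshold and pr1 > threshold:
            hits.append((r, 1))   # first-class object in relation r
        if p2 > threshold and pr2 > threshold:
            hits.append((r, 2))   # second-class object in relation r
    return hits

# p1 > 0.5 and p11 > 0.5 -> first-class object in relation 0
print(is_target_word((0.9, 0.1), [(0.8, 0.2), (0.3, 0.4)]))  # [(0, 1)]
# p1 > 0.5 but no pr1 > 0.5 -> not a target word
print(is_target_word((0.9, 0.1), [(0.4, 0.2)]))              # []
```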
S204: forming triples from the target word pairs belonging to the objects in the same triple relationship, according to the relationship between the objects in the triple relationship.
In this step, reference may be made to the corresponding step in the foregoing embodiment, which is not described herein again.
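A sketch of this pairing step, assuming target words have already been assigned to (relation, object-class) slots as in the preceding example (all names are illustrative):

```python
def build_triples(target_hits, relation_names):
    """Pair target words that belong to the first-class and second-class
    object of the same triple relationship.

    target_hits: {word: [(relation_index, object_class), ...]}
    relation_names: list of relation labels, one per triple relationship.
    """
    triples = []
    for r, rel in enumerate(relation_names):
        firsts = [w for w, hits in target_hits.items() if (r, 1) in hits]
        seconds = [w for w, hits in target_hits.items() if (r, 2) in hits]
        for a in firsts:
            for b in seconds:
                triples.append((a, rel, b))
    return triples

hits = {
    "Zhang Xiaoming": [(0, 1), (1, 2)],
    "Li Xiaohong":    [(0, 2), (1, 1)],
}
print(build_triples(hits, ["husband", "wife"]))
# [('Zhang Xiaoming', 'husband', 'Li Xiaohong'),
#  ('Li Xiaohong', 'wife', 'Zhang Xiaoming')]
```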
With respect to the implementation manners of the triple extraction method described in the foregoing embodiments, the training process of the preset model is further described below. Fig. 4 shows a schematic diagram of this training process, which includes:
S401: inputting the sample word vector into a preset model to be trained to obtain a recognition result output by the preset model.
Wherein the sample word vector comprises the word vectors of the words in the sample text. The sample text may include a plurality of unstructured text fragments having triple relationships, and the number of sample text fragments may range from thousands to hundreds of thousands. It should be noted that, for better training of the model, the sample text may be obtained according to the field to which the text to be recognized belongs; for example, if the text to be recognized belongs to the financial field, texts belonging to the financial field may be obtained as sample texts.
It should be noted that, for a word in any sample text, the method for obtaining a sample word vector of the word may refer to the above method for obtaining a word vector of a word in a text to be recognized. The preset model outputs the recognition result of the sample word vector, which comprises the first class probability and the second class probability of the sample word vector. Specifically, the meaning of the first class probability and the second class probability can refer to the introduction of S202 described above. This is not described in detail in the embodiments of the present application.
S402: and obtaining the parameters of the preset model by using the difference between the first class probability and the first class probability of the sample in the marking information and the difference between the second class probability and the second class probability of the sample in the marking information.
Specifically, for any piece of sample text, all triples included in it are manually marked. For example, the sample text F is: "Zhang Xiaoming welcomed Li Xiaohong in 2018", for which the triples [Zhang Xiaoming - husband - Li Xiaohong] and [Li Xiaohong - wife - Zhang Xiaoming] are manually marked.
The sample first class probability of a word characterizes the probability that the word belongs to each object included in the triple relationships, and the sample second class probability of the word characterizes the probabilities that the word belongs to the first-class objects and to the second-class objects, respectively.
The objects comprise all objects contained in n pre-configured triple relations needing model learning. The n triple relationships may include all possible triple relationships contained in the text to be recognized. It can be understood that, if the text to be recognized belongs to different fields, the triple relationships possibly included therein may be different, so that n triple relationships may be preconfigured according to the field to which the text to be recognized belongs.
The marking information refers to the sample first class probability and the sample second class probability of each word in the sample text, obtained according to the manually marked triples. The method for obtaining the sample first class probability and the sample second class probability is known in the art, and the embodiment of the present application briefly describes it with the following examples.
Taking the sample text F as an example, the preset triple relationships that the model needs to learn are A: [person - husband - person], B: [person - wife - person], and C: [book - author - person], and the marked triples are [Zhang Xiaoming - husband - Li Xiaohong] and [Li Xiaohong - wife - Zhang Xiaoming].
The sample first class probability of each word in sample text F may be determined by whether the word belongs to an object contained in a marked triple. Taking "Zhang Xiaoming" as an example, the word belongs to the first-class object in A and to the second-class object in B. Therefore, in the sample first class probability corresponding to the word, the probability of belonging to the first-class object in A is 1, the probability of belonging to the second-class object in A is 0, the probability of belonging to the first-class object in B is 0, the probability of belonging to the second-class object in B is 1, and the probabilities of belonging to the first-class object and the second-class object in C are both 0. For another example, the word "welcomed" does not belong to any object in the marked triples, so every probability in the sample first class probability of that word is 0.
In addition, the sample second class probability of each word in the sample text F may be determined by whether the word belongs to any first-class object in the marked triples and whether it belongs to any second-class object in the marked triples. Taking "Zhang Xiaoming" as an example, the word belongs to the first-class object in A, so the probability that it belongs to any first-class object is 1; and because the word belongs to the second-class object in B, the probability that it belongs to any second-class object is also 1.
It should be noted that the method for determining the sample second class probability is: if the word belongs to the first-class object in any marked triple, it is determined that the word belongs to the first-class object in some triple relationship; and if the word belongs to the second-class object in any marked triple, it is determined that the word belongs to the second-class object in some triple relationship.
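A sketch of constructing this marking information for one word from the manually marked triples (function and variable names are illustrative; the probabilities are written as 0.0/1.0 labels):

```python
def make_labels(word, marked_triples, relations):
    """Build the sample first class and second class probabilities
    (i.e., the training labels) for one word.

    marked_triples: list of (first_obj, relation, second_obj) tuples.
    relations: ordered list of the n preconfigured relation labels.
    """
    first = []
    for rel in relations:
        # probability of being the first-class object in this relation
        first.append(1.0 if any(t[0] == word and t[1] == rel
                                for t in marked_triples) else 0.0)
        # probability of being the second-class object in this relation
        first.append(1.0 if any(t[2] == word and t[1] == rel
                                for t in marked_triples) else 0.0)
    second = [
        1.0 if any(t[0] == word for t in marked_triples) else 0.0,
        1.0 if any(t[2] == word for t in marked_triples) else 0.0,
    ]
    return first, second

marked = [("Zhang Xiaoming", "husband", "Li Xiaohong"),
          ("Li Xiaohong", "wife", "Zhang Xiaoming")]
rels = ["husband", "wife", "author"]  # relations A, B, C from the example

print(make_labels("Zhang Xiaoming", marked, rels))
# ([1.0, 0.0, 0.0, 1.0, 0.0, 0.0], [1.0, 1.0])
```

This reproduces the example above: "Zhang Xiaoming" is the first-class object in A and the second-class object in B, and a word appearing in no marked triple gets all-zero labels.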
The parameters of the preset model are then determined iteratively by using the difference between the first class probability and the sample first class probability and the difference between the second class probability and the sample second class probability in the marking information, and the trained model is obtained after the parameters are updated multiple times.
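The patent does not name a specific loss function; assuming a typical choice for sigmoid outputs (an assumption, not stated in the source), the "difference" between the predicted and sample probabilities could be measured by binary cross-entropy:

```python
import math

def bce_loss(predicted, labels, eps=1e-7):
    """Binary cross-entropy between the model's predicted probabilities
    and the marked sample probabilities, averaged over all outputs."""
    total = 0.0
    for p, y in zip(predicted, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp for numerical safety
        total += -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
    return total / len(predicted)

# A confident, correct prediction yields a small loss...
print(bce_loss([0.99, 0.01], [1.0, 0.0]))
# ...while a confidently wrong prediction yields a large loss.
print(bce_loss([0.01, 0.99], [1.0, 0.0]))
```

Minimizing this quantity over the sample texts, by gradient descent over many parameter updates, is the standard way such sigmoid-output models are trained.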
As described above, the preset model in the embodiment of the present application may output the first class probability and the second class probability of each word. The first class probability represents the probability that the word belongs to the objects contained in the triple relationships, and the second class probability represents the probability that the word belongs to any first-class object and the probability that it belongs to any second-class object. Therefore, the two types of probabilities can verify each other in determining the target words, further improving the accuracy of the triple extraction method.
The embodiments of the present application further provide a triple extraction device, which is described below, and the triple extraction device described below and the triple extraction method described above may be referred to in a corresponding manner.
Referring to fig. 5, a schematic structural diagram of a triplet extracting apparatus according to an embodiment of the present application is shown, and as shown in fig. 5, the apparatus may include:
a word vector obtaining unit 501, configured to obtain a word vector of each word constituting a text to be recognized;
the model prediction unit 502 is configured to input the word vectors into a preset model to obtain a recognition result corresponding to each word vector; the recognition result includes a first class probability of the word vector, wherein the first class probability of any word vector characterizes the probability that the word corresponding to the word vector belongs to an object contained in the triple relationships;
a target word determining unit 503, configured to determine that a word is a target word when any word meets a preset condition, where the preset condition includes that a probability that the word belongs to any object in a triplet is greater than a first preset threshold;
the triple determining unit 504 is configured to form a triple from a target word pair belonging to an object in the same triple according to a relationship between objects in the triple.
Optionally, the word vector comprises at least one of: word sense vectors and part-of-speech vectors.
Optionally, the objects included in the triple relationship include a first-class object and a second-class object, and the recognition result further includes:
a second class probability of the word vector, wherein the second class probability of any word vector characterizes the probability that the word corresponding to the word vector belongs to the first-class object and the probability that the word belongs to the second-class object.
Optionally, the preset conditions further include:
the probability that the word belongs to the first class of objects or the probability that the word belongs to the second class of objects is greater than a second preset threshold.
Optionally, the preset model comprises an encoder and a decoder;
the encoder is used for obtaining a feature vector from the word vector;
the decoder comprises a first decoding module and a second decoding module;
the first decoding module is configured to determine the first class probability according to the feature vector, and the second decoding module is configured to determine the second class probability according to the feature vector.
Optionally, the first decoding module includes 2 × n sigmoid functions, where n is the number of triplet relationships learned by the preset model, and the second decoding module includes 2 sigmoid functions.
Optionally, the apparatus further comprises: the preset model training module is used for training a preset model, and specifically can be used for:
inputting a sample word vector into a preset model to be trained to obtain an output recognition result, wherein the recognition result comprises a first class probability and a second class probability of the sample word vector; the sample word vector comprises a word vector of words in a sample text;
and obtaining parameters of the preset model by using the difference between the first class probability and the first class probability of the sample in the marking information and the difference between the second class probability and the second class probability of the sample in the marking information, wherein the first class probability and the second class probability of the sample are determined according to the marked triples in the sample text.
The triple extracting device comprises a processor and a memory, wherein the word vector acquiring unit 501, the model predicting unit 502, the target word determining unit 503, the triple determining unit 504 and the like are all stored in the memory as program units, and the processor executes the program units stored in the memory to realize corresponding functions.
The processor comprises a kernel, and the kernel calls the corresponding program unit from the memory. One or more kernels can be set, and the accuracy of triple extraction is improved by adjusting kernel parameters.
An embodiment of the present invention provides a storage medium on which a program is stored, where the program, when executed by a processor, implements the triple extraction method.
The embodiment of the invention provides a processor, which is used for running a program, wherein the triple extraction method is executed when the program runs.
The embodiment of the present application further provides a triple extracting apparatus, please refer to fig. 6, which shows a schematic structural diagram of the triple extracting apparatus (60), where the apparatus includes at least one processor 601, at least one memory 602 connected to the processor, and a bus 603; the processor 601 and the memory 602 complete communication with each other through the bus 603; the processor is used for calling the program instructions in the memory to execute the triple extraction method. The device herein may be a server, a PC, a PAD, a mobile phone, etc.
The present application further provides a computer program product adapted to perform a program for initializing the following method steps when executed on a data processing device:
acquiring word vectors of all words forming a text to be recognized;
inputting the word vectors into a preset model to obtain a recognition result corresponding to each word vector; the recognition result comprises a first class probability of the word vector, wherein the first class probability of any word vector characterizes the probability that the word corresponding to the word vector belongs to an object contained in a triple relationship;
determining that any word is a target word under the condition that the word meets a preset condition, wherein the preset condition comprises that the probability that the word belongs to any object in the triple relation is greater than a first preset threshold value;
and forming a triple by using the target word pairs belonging to the objects in the same triple relation according to the relation between the objects in the triple relation.
Optionally, the word vector comprises at least one of: word sense vectors and part-of-speech vectors.
Optionally, the objects included in the triple relationship include a first-class object and a second-class object, and the recognition result further includes:
a second class probability of the word vector, wherein the second class probability of any word vector characterizes the probability that the word corresponding to the word vector belongs to the first-class object and the probability that the word belongs to the second-class object.
Optionally, the preset conditions further include:
the probability that the word belongs to the first class of objects or the probability that the word belongs to the second class of objects is greater than a second preset threshold.
Optionally, the preset model comprises an encoder and a decoder;
the encoder is used for obtaining a feature vector from the word vector;
the decoder comprises a first decoding module and a second decoding module;
the first decoding module is configured to determine the first class probability according to the feature vector, and the second decoding module is configured to determine the second class probability according to the feature vector.
Optionally, the first decoding module includes 2 × n sigmoid functions, where n is the number of triples learned by the preset model, and the second decoding module includes 2 sigmoid functions.
Optionally, the training process of the preset model includes:
inputting a sample word vector into a preset model to be trained to obtain an output recognition result, wherein the recognition result comprises a first class probability and a second class probability of the sample word vector; the sample word vector comprises a word vector of words in a sample text;
and obtaining parameters of the preset model by using the difference between the first class probability and the first class probability of the sample in the marking information and the difference between the second class probability and the second class probability of the sample in the marking information, wherein the first class probability and the second class probability of the sample are determined according to the marked triples in the sample text.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a device includes one or more processors (CPUs), memory, and a bus. The device may also include input/output interfaces, network interfaces, and the like.
The memory may include volatile memory in a computer readable medium, Random Access Memory (RAM) and/or nonvolatile memory such as Read Only Memory (ROM) or flash memory (flash RAM), and the memory includes at least one memory chip. The memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, computer readable media do not include transitory computer readable media such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (10)

1. A method of triplet extraction comprising:
acquiring word vectors of all words forming a text to be recognized;
inputting the word vectors into a preset model to obtain a recognition result corresponding to each word vector; the recognition result comprises a first class probability of the word vector, wherein the first class probability of any word vector characterizes the probability that the word corresponding to the word vector belongs to an object contained in a triple relationship;
determining that any word is a target word under the condition that the word meets a preset condition, wherein the preset condition comprises that the probability that the word belongs to any object in the triple relation is greater than a first preset threshold value;
and forming a triple by using the target word pairs belonging to the objects in the same triple relation according to the relation between the objects in the triple relation.
2. The triplet extraction method of claim 1 wherein the word vector comprises at least one of: word sense vectors and part-of-speech vectors.
3. The triple extraction method according to claim 1, wherein the objects included in the triple relationship include a first-class object and a second-class object, and the recognition result further includes:
a second class probability of the word vector, wherein the second class probability of any word vector characterizes the probability that the word corresponding to the word vector belongs to the first-class object and the probability that the word belongs to the second-class object.
4. A triplet extraction method as claimed in claim 3, characterised in that said preset conditions further comprise:
the probability that the word belongs to the first class of objects or the probability that the word belongs to the second class of objects is greater than a second preset threshold.
5. The triplet extraction method according to claim 4, characterised in that the preset model comprises an encoder and a decoder;
the encoder is used for obtaining a feature vector from the word vector;
the decoder comprises a first decoding module and a second decoding module;
the first decoding module is configured to determine the first class probability according to the feature vector, and the second decoding module is configured to determine the second class probability according to the feature vector.
6. The triplet extraction method according to claim 5, characterised in that the first decoding module comprises 2 x n sigmoid functions, where n is the number of triplet relations learned by the preset model and the second decoding module comprises 2 sigmoid functions.
7. The triplet extraction method according to claim 5 or 6, characterized in that the training process of the preset model comprises:
inputting a sample word vector into a preset model to be trained to obtain an output recognition result, wherein the recognition result comprises a first class probability and a second class probability of the sample word vector; the sample word vector comprises a word vector of words in a sample text;
and obtaining parameters of the preset model by using the difference between the first class probability and the first class probability of the sample in the marking information and the difference between the second class probability and the second class probability of the sample in the marking information, wherein the first class probability and the second class probability of the sample are determined according to the marked triples in the sample text.
8. A triplet extraction device, comprising:
the word vector acquiring unit is used for acquiring word vectors of all words forming the text to be recognized;
the model prediction unit is used for inputting the word vectors into a preset model to obtain a recognition result corresponding to each word vector; the recognition result comprises a first class probability of the word vector, wherein the first class probability of any word vector characterizes the probability that the word corresponding to the word vector belongs to an object contained in a triple relationship;
the target word determining unit is used for determining that any word is the target word under the condition that the word meets a preset condition, wherein the preset condition comprises that the probability that the word belongs to any object in the triple relation is larger than a first preset threshold value;
and the triple determining unit is used for forming a triple from the target words belonging to the objects of the same triple relation, according to the relation between the objects in that triple relation.
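The pipeline of claim 8 can be sketched end to end as follows: words whose per-object probability exceeds the first preset threshold become target words, and target words tagged with the two objects of the same triple relation are combined into a triple. The threshold value, the subject/object role names, and the pairing-in-order rule are illustrative assumptions, not fixed by the claim.

```python
THRESHOLD = 0.5  # assumed value for the "first preset threshold"

def extract_triples(words, probs):
    """probs[i] maps (relation, role) -> probability for words[i],
    where role is 'subject' or 'object'."""
    # Target word determination: keep words above the threshold.
    targets = {}
    for word, p in zip(words, probs):
        for (relation, role), prob in p.items():
            if prob > THRESHOLD:
                targets.setdefault((relation, role), []).append(word)
    # Triple determination: pair subject/object target words that
    # belong to the same triple relation.
    triples = []
    for (relation, role), subjects in targets.items():
        if role != "subject":
            continue
        objects = targets.get((relation, "object"), [])
        for s, o in zip(subjects, objects):
            triples.append((s, relation, o))
    return triples

words = ["Paris", "is", "capital", "of", "France"]
probs = [
    {("capital_of", "subject"): 0.9},
    {}, {}, {},
    {("capital_of", "object"): 0.8},
]
print(extract_triples(words, probs))  # [('Paris', 'capital_of', 'France')]
```

Note how the relation label itself comes from which sigmoid output fired, so no separate relation classifier is needed at extraction time.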
9. A triplet extraction device, characterized in that it comprises: a memory and a processor;
the memory is used for storing programs;
the processor is configured to execute the program to implement the steps of the triplet extraction method according to any one of claims 1 to 7.
10. A storage medium having a program stored thereon, the program, when executed by a processor, implementing the steps of the triplet extraction method as claimed in any one of claims 1 to 7.
CN201910942438.4A 2019-09-30 2019-09-30 Triple extraction method, device, equipment and storage medium Pending CN112668332A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910942438.4A CN112668332A (en) 2019-09-30 2019-09-30 Triple extraction method, device, equipment and storage medium
PCT/CN2020/103209 WO2021063086A1 (en) 2019-09-30 2020-07-21 3-tuple extraction method, device, apparatus, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910942438.4A CN112668332A (en) 2019-09-30 2019-09-30 Triple extraction method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN112668332A true CN112668332A (en) 2021-04-16

Family

ID=75336763

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910942438.4A Pending CN112668332A (en) 2019-09-30 2019-09-30 Triple extraction method, device, equipment and storage medium

Country Status (2)

Country Link
CN (1) CN112668332A (en)
WO (1) WO2021063086A1 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108021705A (en) * 2017-12-27 2018-05-11 中科鼎富(北京)科技发展有限公司 A kind of answer generation method and device
CN108280062A (en) * 2018-01-19 2018-07-13 北京邮电大学 Entity based on deep learning and entity-relationship recognition method and device
US20190122145A1 (en) * 2017-10-23 2019-04-25 Baidu Online Network Technology (Beijing) Co., Ltd. Method, apparatus and device for extracting information

Family Cites Families (3)

Publication number Priority date Publication date Assignee Title
CN104866498A (en) * 2014-02-24 2015-08-26 华为技术有限公司 Information processing method and device
CN110309271A (en) * 2018-03-06 2019-10-08 微软技术许可有限责任公司 Intelligent knowledge study and question and answer technology
CN110196913A (en) * 2019-05-23 2019-09-03 北京邮电大学 Multiple entity relationship joint abstracting method and device based on text generation formula

Patent Citations (3)

Publication number Priority date Publication date Assignee Title
US20190122145A1 (en) * 2017-10-23 2019-04-25 Baidu Online Network Technology (Beijing) Co., Ltd. Method, apparatus and device for extracting information
CN108021705A (en) * 2017-12-27 2018-05-11 中科鼎富(北京)科技发展有限公司 A kind of answer generation method and device
CN108280062A (en) * 2018-01-19 2018-07-13 北京邮电大学 Entity based on deep learning and entity-relationship recognition method and device

Non-Patent Citations (3)

Title
BOWEN YU 等: "Joint Extraction of Entities and Relations Based on a Novel Decomposition Strategy", pages 1 - 8, Retrieved from the Internet <URL:《https://arxiv.org/pdf/1909.04273v1.pdf》> *
ZHEPEI WEI 等: "A Novel Hierarchical Binary Tagging Framework for Joint Extraction of Entities and Relations", pages 2 - 5, Retrieved from the Internet <URL:《https://arxiv.org/pdf/1909.03227v1.pdf》> *
LI XINGYA: "A distantly supervised relation extraction method incorporating a gating mechanism", Acta Scientiarum Naturalium Universitatis Pekinensis (Journal of Peking University, Natural Science Edition), vol. 56, no. 01, pages 39 - 44 *

Also Published As

Publication number Publication date
WO2021063086A1 (en) 2021-04-08

Similar Documents

Publication Publication Date Title
CN108389125B (en) Overdue risk prediction method and device for credit application
CN110674188A (en) Feature extraction method, device and equipment
CN112825114A (en) Semantic recognition method and device, electronic equipment and storage medium
CN109597982B (en) Abstract text recognition method and device
CN111709225B (en) Event causal relationship discriminating method, device and computer readable storage medium
CN115146068B (en) Method, device, equipment and storage medium for extracting relation triples
CN113239702A (en) Intention recognition method and device and electronic equipment
CN113761868A (en) Text processing method and device, electronic equipment and readable storage medium
CN113743101A (en) Text error correction method and device, electronic equipment and computer storage medium
CN112906361A (en) Text data labeling method and device, electronic equipment and storage medium
CN112241458A (en) Text knowledge structuring processing method, device, equipment and readable storage medium
CN112667803A (en) Text emotion classification method and device
CN113723077A (en) Sentence vector generation method and device based on bidirectional characterization model and computer equipment
CN113626576A (en) Method and device for extracting relational characteristics in remote supervision, terminal and storage medium
CN112329454A (en) Language identification method and device, electronic equipment and readable storage medium
CN112949320A (en) Sequence labeling method, device, equipment and medium based on conditional random field
CN110765393A (en) Method and device for identifying harmful URL (uniform resource locator) based on vectorization and logistic regression
CN113704466B (en) Text multi-label classification method and device based on iterative network and electronic equipment
CN116127316A (en) Model training method, text abstract generating method and related equipment
CN112528674B (en) Text processing method, training device, training equipment and training equipment for model and storage medium
CN112668332A (en) Triple extraction method, device, equipment and storage medium
CN111159397B (en) Text classification method and device and server
CN112199954A (en) Disease entity matching method and device based on voice semantics and computer equipment
CN112417260A (en) Localized recommendation method and device and storage medium
CN116756596B (en) Text clustering model training method, text clustering device and related equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination