WO2021063086A1 - 3-tuple extraction method, device, apparatus, and storage medium - Google Patents

3-tuple extraction method, device, apparatus, and storage medium Download PDF

Info

Publication number
WO2021063086A1
Authority
WO
WIPO (PCT)
Prior art keywords
word
probability
type
vector
triple
Prior art date
Application number
PCT/CN2020/103209
Other languages
French (fr)
Chinese (zh)
Inventor
戴泽辉
Original Assignee
北京国双科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京国双科技有限公司 filed Critical 北京国双科技有限公司
Publication of WO2021063086A1 publication Critical patent/WO2021063086A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/28 Databases characterised by their database models, e.g. relational or object models
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F 40/295 Named entity recognition

Definitions

  • the present invention relates to the field of electronic information technology, and more specifically, to a triple extraction method, device, equipment and storage medium.
  • the extraction of triples refers to the extraction of subjects, objects, and their relationships from unstructured text based on rules.
  • for example, if the unstructured text is "Zhang Xiaoming married Li Xiaohong in 2018", then for rule A [person-husband-person],
  • the triple can be extracted from the unstructured text as: Zhang Xiaoming-husband-Li Xiaohong.
  • the method of extracting triples from unstructured text can apply a multi-step process method, for example, a two-step method can be used to extract triples.
  • the specific steps can be: the first step is to extract the entities contained in the text; the second step is to classify and associate the extracted entities, and finally determine the relationship between the entities.
  • the present invention provides a triple extraction method, device, equipment, and storage medium that overcomes the above problems or at least partially solves the above problems, as follows:
  • a method for extracting triples including:
  • the word vector is input into the preset model to obtain the recognition result corresponding to each word vector;
  • the recognition result includes the first-type probability of the word vector; the first-type probability of any word vector represents the probability that the word corresponding to the word vector belongs to an object contained in a triple relationship;
  • if any word satisfies a preset condition, the word is determined to be a target word, and the preset condition includes that the probability that the word belongs to any object in a triple relationship is greater than the first preset threshold;
  • the target word pairs belonging to the objects in the same triple relationship are formed into triples according to the relationship between the objects in the triple relationship.
  • the word vector includes at least one of the following: a word meaning vector and a part-of-speech vector.
  • the objects included in the triple relationship include objects of the first type and objects of the second type, and the recognition result further includes:
  • the second-type probability of any word vector represents the probability that the word corresponding to the word vector belongs to a first-type object and the probability that it belongs to a second-type object.
  • the preset conditions also include:
  • the probability that the word belongs to the first type of object or the second type of object is greater than the second preset threshold.
  • the preset model includes an encoder and a decoder
  • the encoder is used to obtain a feature vector from the word vector
  • the decoder includes a first decoding module and a second decoding module
  • the first decoding module is configured to determine the first type probability according to the feature vector
  • the second decoding module is configured to determine the second type probability according to the feature vector
  • the first decoding module includes 2*n sigmoid functions, where n is the number of triples learned by the preset model, and the second decoding module includes 2 sigmoid functions.
  • the training process of the preset model includes:
  • the recognition result includes the first type probability and the second type probability of the sample word vector;
  • the sample word vector includes the word vector of each word in the sample text;
  • the sample first-type probability and the sample second-type probability are determined according to the triples marked in the sample text.
  • a triple extraction device including:
  • the word vector acquiring unit is used to acquire the word vector of each word composing the text to be recognized;
  • the model prediction unit is used to input the word vector into a preset model to obtain the recognition result corresponding to each word vector;
  • the recognition result includes the first-type probability of the word vector; the first-type probability of any word vector represents the probability that the word corresponding to the word vector belongs to an object contained in a triple relationship;
  • the target word determining unit is configured to determine that any word meeting a preset condition is a target word, and the preset condition includes that the probability that the word belongs to any object in a triple relationship is greater than the first preset threshold;
  • the triple determination unit is used to group the target word pairs belonging to the objects in the same triple relationship into triples according to the relationship between the objects in the triple relationship.
  • a triple extraction device including: a memory and a processor
  • the memory is used to store programs
  • the processor is configured to execute the program to implement each step of the triple extraction method described above.
  • the word vector of each word in the text to be recognized is input into a preset model, and the recognition result corresponding to each word vector is obtained.
  • the recognition result corresponding to each word vector represents the probability that the word corresponding to the word vector belongs to the object contained in the triple relationship.
  • when the probability that any word belongs to any object is greater than the first preset threshold, that word is a target word, and any pair of target words is formed into a triple according to the corresponding relationship of the pair. Because any pair of target words includes two words belonging to the objects contained in the same triple relationship, the relationship of the target word pair is the relationship between the objects to which the pair belongs in that triple relationship.
  • the technical solution provided by this application can output the probability that each word belongs to all the objects included in the triple relationship learned by the model based on the preset model, and further determine the rules according to the output of the model to obtain the information in the text to be recognized. All triples, that is, there is no need to use the model in multiple steps.
  • the output of the model serves as the basis for the rule determination. Therefore, the accumulation of model errors is avoided and the accuracy of the results can be improved.
  • FIG. 1 is a schematic flowchart of a method for extracting triples according to an embodiment of the application
  • FIG. 3 is a schematic structural diagram of a preset model provided by an embodiment of this application.
  • FIG. 4 is a schematic diagram of a training process of a preset model provided by an embodiment of this application.
  • FIG. 5 is a schematic structural diagram of a triple extraction device provided by an embodiment of this application.
  • FIG. 6 is a schematic structural diagram of a triple extraction device provided by an embodiment of the application.
  • the triple extraction method provided in the embodiments of the present application can be applied to smart devices, such as computers, tablets, or smart phones, or to a server preset with a text processing system.
  • Fig. 1 is a schematic flow chart of a method for extracting triples according to an embodiment of the present application.
  • the method may specifically include:
  • the word vector includes the word vector of each word constituting the text to be recognized.
  • the text to be recognized is an unstructured text that requires triple extraction.
  • the text to be recognized may include at least one sentence, each sentence is composed of words, and may contain punctuation marks. After word segmentation is performed on the text to be recognized, all the words that make up the text to be recognized can be obtained.
  • the word vector obtained in this step is the word vector generated by mapping each word constituting the text to be recognized to the vector space.
  • word segmentation processing refers to splitting a sentence of training text based on preset word segmentation standards, and removing punctuation marks.
  • for example, the text W to be recognized is: "Zhang Xiaoming's newlywed wife Li Xiaohong's favorite novel is Mr. Qian Zhongshu's Besieged City".
  • the words composing the text W to be recognized include: "Zhang Xiaoming", "的", "newlywed wife", "Li Xiaohong", "favorite", "novel", "is", "Qian Zhongshu", "Mr.", "Besieged City".
  • word segmentation is an existing technology in the field of natural language processing.
  • existing tool software such as Harbin Institute of Technology LTP or jieba
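As a minimal illustration of the segmentation step, the sketch below uses forward maximum matching against a toy dictionary; a real system would rely on tools such as LTP or jieba, and the dictionary entries here are assumptions covering only the running example.

```python
# Toy forward-maximum-matching segmenter (illustrative only; a real
# system would use LTP or jieba). The dictionary is a hand-made
# assumption covering the running example.
WORD_DICT = {"张小明", "的", "新婚妻子", "李小红", "最喜欢", "小说", "是",
             "钱钟书", "先生", "围城"}
MAX_WORD_LEN = 4

def segment(text):
    """Greedily match the longest dictionary word at each position."""
    words, i = [], 0
    while i < len(text):
        for length in range(min(MAX_WORD_LEN, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in WORD_DICT:
                words.append(candidate)
                i += length
                break
    return words
```

Greedy maximum matching is a deliberately simple stand-in; dedicated segmenters use statistical models and much larger dictionaries.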
  • S102 Input a word vector into a preset model, and obtain a recognition result corresponding to each word vector.
  • each triple relationship is a rule, for example, the above rule A [person-husband-person] can be used as a triple relationship.
  • the preset model can learn multiple triple relationships, and each triple relationship can include two objects.
  • the above-mentioned triple relationship A [person-husband-person] includes two objects that are both "persons". Assuming that the number of pre-configured triple relationships that require model learning is n, the number of objects is 2*n.
  • each triple relationship can include the first type of object and the second type of object, and this method records the first object in each triple relationship as the first type of object, and the last object in each triple relationship An object is recorded as an object of the second type, and the object includes n objects of the first type and n objects of the second type.
  • the recognition result corresponding to any word vector output by the model includes the first-type probability of the word vector, and the first-type probability represents the probability that the word generating the word vector belongs to the object. Since the number of objects is 2*n, the probability of the first type includes 2*n probabilities, and each probability represents the probability that the word generating the word vector belongs to the object.
  • the recognition result corresponding to w1 may include: the probability p11 that the word generating w1 belongs to the first type of object in s1, the probability p12 of belonging to the second type of object in s1, the probability p21 of belonging to the first type of object in s2, the probability p22 of belonging to the second type of object in s2, and so on, up to pn1 and pn2 for sn.
  • the aforementioned probability of the first type can be predicted by the 2*n sigmoid functions in the preset model.
  • the preset condition includes that the probability that the word belongs to any object is greater than the first preset threshold.
  • the first preset threshold value is a threshold value set when the rule is determined according to the output of the model.
  • the size of the threshold can be set to 0.5; that is, when the first-type probability corresponding to the word vector generated by any word indicates that the probability that the word belongs to an object is greater than the first preset threshold, it is determined that the word belongs to that object, and the word is taken as a target word.
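The selection rule of this step can be sketched as follows; the dict layout mapping each word to its 2*n object probabilities is an assumption made for illustration, not the patent's data structure:

```python
# Hedged sketch: a word becomes a target word when its probability for
# any object exceeds the first preset threshold (0.5 in the text).
FIRST_THRESHOLD = 0.5

def find_target_words(first_type_probs):
    """Return {word: [indices of objects whose probability exceeds threshold]}."""
    targets = {}
    for word, probs in first_type_probs.items():
        hits = [i for i, p in enumerate(probs) if p > FIRST_THRESHOLD]
        if hits:
            targets[word] = hits
    return targets
```

Note that a word may exceed the threshold for several objects at once, which is exactly the multi-membership property the application relies on.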
  • suppose the text W to be recognized is "Zhang Xiaoming's newlywed wife Li Xiaohong's favorite novel is Mr. Qian Zhongshu's Besieged City", in which the first-type probability of the word vector generated by "Zhang Xiaoming" indicates that
  • the probability of "Zhang Xiaoming" belonging to the first type of object in the triple relationship [person-husband-person] is greater than 0.5.
  • the probability that "Li Xiaohong" belongs to the second type of object in the triple relationship [person-husband-person] is greater than 0.5; that is, the words "Zhang Xiaoming" and "Li Xiaohong" both meet the preset conditions and are determined to be target words.
  • any target word determined in this step can belong to multiple objects.
  • the probability that "Zhang Xiaoming” belongs to the first-type object in the triple relationship [person-husband-person] is greater than 0.5
  • the first-type probability also indicates that the probability that "Zhang Xiaoming" belongs to the second type of object in the triple relationship [person-wife-person] is greater than 0.5; therefore, the target word "Zhang Xiaoming" belongs to two objects.
  • S104 Combine the target word pairs belonging to the objects in the same triple relationship into triples according to the relationship between the objects in that triple relationship.
  • a pair of target words includes two target words belonging to two objects in the same triple relationship.
  • the objects in any triple relationship include objects of the first type and objects of the second type, and the relationship between the two objects can be represented by the relationship label in the triple.
  • the corresponding relationship of the target word pair is also the relationship indicated by the relationship label in the triple relationship.
  • for example, in the triple relationship C [book-author-person], the first object "book" is the first type of object,
  • the second object "person" is the second type of object,
  • and the relationship between the two objects is that the "person" is the "author" of the "book".
  • suppose the probability that the word "Besieged City" belongs to the first type of object is greater than 0.5,
  • and the probability that the word "Qian Zhongshu" belongs to the second type of object is greater than 0.5.
  • since there may be multiple target words belonging to an object, the following situation may occur: there are multiple target words belonging to the first type of object in a triple relationship, and at the same time there are multiple target words belonging to the second type of object in that triple relationship.
  • the target word pair can be selected according to the position of the target word in the text to be recognized.
  • the two closest target words belonging to the objects in the same triple relationship are regarded as a pair of target words.
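A minimal sketch of this nearest-position pairing, assuming target words are given as (word, position) tuples within one triple relationship (an illustrative representation, not the patent's data structure):

```python
# Hedged sketch: pair each first-type target word with the second-type
# target word closest to it by position in the text to be recognized.
def pair_targets(first_positions, second_positions):
    """Pair (word, position) tuples of first/second-type target words."""
    pairs = []
    for fw, fpos in first_positions:
        # choose the second-type target word at minimal positional distance
        best = min(second_positions, key=lambda s: abs(s[1] - fpos))
        pairs.append((fw, best[0]))
    return pairs
```

A production system might additionally prevent one word from being paired twice; the text only specifies the nearest-distance criterion, so the sketch stops there.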
  • the word vector of each word in the text to be recognized is input into a preset model, and the recognition result corresponding to each word vector is obtained.
  • the recognition result corresponding to each word vector represents the probability that the word corresponding to the word vector belongs to an object contained in a triple relationship; when the probability of any word belonging to any object is greater than the first preset threshold, that word is a target word, and any pair of target words is formed into a triple according to the corresponding relationship of the pair. Because any pair of target words includes two target words that belong to the objects included in the same triple relationship, the relationship of the target word pair is the relationship between the objects to which the pair belongs in that triple relationship.
  • the model can output the probability that a word belongs to all objects included in the triple relationship learned by the model, and further rule judgments based on the output of the model can obtain all triples in the text to be recognized, namely There is no need to use the model in multiple steps.
  • the output of the model serves as the basis for the rule determination. Therefore, the accumulation of model errors is avoided and the accuracy of the results can be improved.
  • any target word can belong to multiple objects, that is, any target word can belong to objects in multiple triple relationships; thus, without multiple models, the multiple triples to which any word belongs can be extracted. It can be seen that the triple extraction method disclosed in this application greatly improves the efficiency of triple extraction.
  • the inventor found, in the process of implementing the invention, that in order to improve the accuracy of triple extraction, a model based on LSTM-CRF (Long Short-Term Memory - Conditional Random Field) can be used.
  • however, with that approach each word can only belong to one triple relationship; that is, when extracting triples through the LSTM-CRF model, any word can belong to only one triple.
  • if the words contained in the text to be recognized belong to two or more triples, multiple LSTM-CRF models must be established at the same time to extract triples for the different triple relationships, which results in low triple extraction efficiency.
  • for the triple relationship A [person-husband-person], the triple can be extracted from the text W to be recognized as: [Zhang Xiaoming-husband-Li Xiaohong].
  • for the triple relationship B [person-wife-person], the triple can be extracted from the text W to be recognized as: [Li Xiaohong-wife-Zhang Xiaoming]. It can be seen that "Zhang Xiaoming" and "Li Xiaohong" belong to different triple relationships at the same time. Therefore, when a triple extraction system is established in advance based on the LSTM-CRF model, two LSTM-CRF models need to be trained to extract triples for triple relationship A and triple relationship B respectively, resulting in low execution efficiency.
  • in contrast, in the triple extraction method provided by this application, the model can be configured with multiple triple relationships as needed during the learning process, and any target word may belong to multiple objects at the same time.
  • the probability that "Zhang Xiaoming” belongs to the first-type object in the triple relationship A [person-husband-person] is greater than 0.5 .
  • the probability that "Zhang Xiaoming” belongs to the second-type object in the triple relationship B [person-wife-person] is also greater than 0.5. Therefore, "Zhang Xiaoming” can belong to the two triples [Zhang Xiaoming-husband-Li Xiaohong] and [Li Xiaohong-wife-Zhang Xiaoming] determined by the triple relationship A and the triple relationship B at the same time.
  • the word vector obtained in S101 may include a word meaning vector and a part-of-speech vector.
  • the word meaning vector of a word is the vector obtained by mapping the word meaning to the vector space, and this vector can represent the word meaning information of the word.
  • the part-of-speech vector of a word is the vector obtained by mapping the part-of-speech to the vector space, and the vector can represent the part-of-speech information of the word.
  • the word meaning vector of each word can be obtained by looking it up in the word vector mapping set (i.e., the word vectors).
  • Word vector is a set of word sense vector mapping correspondences generated by word vector training using the corpus of the field to which the text to be recognized belongs.
  • Word vector can map words to a low-dimensional vector space, and can express the similar relationship between the word meanings of various words through the relationship between the word meaning vectors.
  • the low-frequency words in the corpus can be marked as UNK. UNK has a unique vector expression in the word vector, and its dimension is consistent with the dimension of the word meaning vector corresponding to other words.
  • suppose the k words that make up the text are e1, e2, ..., ek; the word meaning vectors generated by all words can then be obtained through the word vectors, which are h1, h2, ..., hk.
  • for a low-frequency word not contained in the mapping, the word meaning vector obtained according to the word vectors is the UNK vector.
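The lookup-with-UNK-fallback described above can be sketched as follows; the tiny 3-dimensional vectors and the words in the table are illustrative assumptions, not a real trained embedding set:

```python
# Hedged sketch of word-meaning-vector lookup with an UNK fallback:
# low-frequency words absent from the mapping share one UNK vector
# whose dimension matches the other word meaning vectors.
UNK = [0.0, 0.0, 0.0]
WORD_VECTORS = {
    "张小明": [0.1, 0.3, -0.2],
    "小说":   [0.4, -0.1, 0.2],
}

def lookup(word):
    """Return the word-meaning vector, or the shared UNK vector."""
    return WORD_VECTORS.get(word, UNK)
```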
  • the part-of-speech information is the part-of-speech to which the meaning of each word in the text to be recognized belongs.
  • the part-of-speech vector can be expressed as a random vector of a certain dimension. For example, for a total of 30 parts of speech [A1, A2, ..., A30], the part-of-speech vector a1 can represent A1, a2 can represent A2, ..., and a30 can represent A30.
  • the dimensions of the part-of-speech vectors a1, a2, ..., a30 can be preset, and each component is a randomly generated decimal close to 0.
  • then the word meaning vector and the part-of-speech vector generated from each word are input to the preset model, and step S102 is executed.
  • the word vector obtained in S101 may also only include the word meaning vector, that is, the word meaning vector of each word in the text to be recognized is directly input as the word vector to the preset model, and step 102 is executed.
  • FIG. 2 is a schematic flowchart of another implementation manner of the triple extraction method provided by an embodiment of the application. as follows:
  • the word vector includes a word vector generated by each word constituting the text to be recognized.
  • the word vector includes a word meaning vector and a part-of-speech vector.
  • S202 Input the word vector into the preset model, and obtain the recognition result corresponding to each word vector.
  • the recognition result includes the first-type probability of the word vector, and also includes the second-type probability of the word vector.
  • FIG. 3 shows a schematic structural diagram of a preset model, and the preset model includes an encoder and a decoder.
  • the encoder can be a two-way LSTM model.
  • suppose the text E to be recognized is composed of e1, e2, ..., ek.
  • for each word ej, the generated word meaning vector hj and part-of-speech vector aj are input to the encoder, which then encodes them.
  • the feature vector Ej is obtained by concatenating the word meaning vector and the part-of-speech vector of the word.
  • the dimension of the word meaning vector is 100
  • the dimension of the part-of-speech vector is 20
  • the dimension of the feature vector obtained by the encoder is 120. It should be noted that if the number of words in the text to be recognized is m, inputting the m word vectors into the encoder yields m 120-dimensional feature vectors, which are arranged to form an m*120 vector matrix; the matrix can be extended to a specific length by padding with zeros.
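A sketch of this concatenate-and-zero-pad step; the fixed target length of 8 rows is an assumption (the text does not specify it), and the function accepts any vector dimensions, not only the 100- and 20-dimensional vectors of the example:

```python
# Hedged sketch: concatenate each word's meaning vector and
# part-of-speech vector into one row, then extend the matrix with
# all-zero rows to a fixed length, as described in the text.
FIXED_LEN = 8  # assumed padding length for illustration

def build_feature_matrix(meaning_vecs, pos_vecs):
    rows = [m + p for m, p in zip(meaning_vecs, pos_vecs)]
    dim = len(rows[0])
    while len(rows) < FIXED_LEN:
        rows.append([0.0] * dim)   # zero-padding
    return rows
```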
  • the preset model can learn n triple relationships, and each triple relationship can include two objects. Since the first object in each triple relationship is the first type of object, and the last object in each triple relationship is the second type of object, the number of objects of the first type is n, and the number of objects of the second type is n.
  • the second-type probability of any word vector represents the probability that the word generating the word vector belongs to the first-type object and the probability that it belongs to the second-type object, that is, the second-type probability of any word includes two probability values.
  • the decoder in the model includes a first decoding module and a second decoding module.
  • the first decoding module includes 2*n sigmoid functions, which are used to determine the probability of the first type according to the feature vector.
  • the probability of the first type may refer to the description in step S102. As shown in Figure 3, taking the feature vector Ej as an example, the process of determining the probability of the first type is:
  • the probability that the word corresponding to the feature vector belongs to the first type of object in the i-th triple relationship is output through the (2i-1)-th sigmoid function, and the probability that it belongs to the second type of object in the i-th triple relationship is output through the 2i-th sigmoid function (i.e., sig2i).
  • the second decoding module includes two sigmoid functions, which are used to determine the second type of probability according to the feature vector. As shown in Figure 3, taking the feature vector Ej as an example, the process of determining the second type of probability is:
  • the first sigmoid function (ie, sig1) is used to output the probability that the word corresponding to the feature vector belongs to the first type of object in any triple relationship.
  • the second sigmoid function (ie, sig2) is used to output the probability that the word corresponding to the feature vector belongs to the second type of object in any triple relationship.
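The two decoder heads can be sketched as simple linear-plus-sigmoid units over the feature vector; the weight vectors passed in below are arbitrary stand-ins for illustration, not trained parameters:

```python
import math

# Hedged sketch of the decoder heads: 2*n sigmoid units produce the
# first-type probabilities and 2 sigmoid units produce the second-type
# probabilities, each modeled here as dot-product-plus-sigmoid.
def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def decode(feature_vec, first_weights, second_weights):
    """first_weights: 2*n weight vectors; second_weights: 2 weight vectors."""
    dot = lambda w, v: sum(wi * vi for wi, vi in zip(w, v))
    first_type = [sigmoid(dot(w, feature_vec)) for w in first_weights]
    second_type = [sigmoid(dot(w, feature_vec)) for w in second_weights]
    return first_type, second_type
```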
  • the preset condition includes that the probability of the word belonging to the first type of object or the probability of belonging to the second type of object is greater than the second preset threshold.
  • the second preset threshold is a threshold set when the rule is determined according to the output of the model; generally, it can be set to 0.5. That is, when the first-type probability corresponding to the word vector generated by any word indicates that the probability of the word belonging to an object is greater than the first preset threshold, and the second-type probability corresponding to that word vector indicates that the probability of the word belonging to the first type of object or the second type of object is greater than the second preset threshold, the word is determined to be the target word.
  • the recognition result corresponding to w1 may include the first-type probabilities, which are: the probability p11 that the word generating w1 belongs to the first type of object in s1, the probability p12 of belonging to the second type of object in s1, the probability p21 of belonging to the first type of object in s2, the probability p22 of belonging to the second type of object in s2, ..., the probability pn1 of belonging to the first type of object in sn, and the probability pn2 of belonging to the second type of object in sn.
  • the recognition result corresponding to w1 also includes the second-type probabilities, respectively: the probability p1 that the word generating w1 belongs to any first-type object and the probability p2 that the word generating w1 belongs to any second-type object.
  • p1 is greater than 0.5 and p11 is greater than 0.5, it is determined that the word generating w1 belongs to the first type of object in s1, and the word is determined as the target word. If p1 is greater than 0.5 and p11 is less than or equal to 0.5, it is determined that the word generating w1 does not belong to the first type of object in s1. If p1 is less than or equal to 0.5 and p11 is less than or equal to 0.5, it is determined that the word generating w1 does not belong to the first type of object in s1.
  • similarly, if p2 is greater than 0.5 and the first-type probability pr2 of belonging to the second type of object in sr is greater than 0.5, it is determined that the word generating w1 belongs to the second type of object in sr, and the word is determined as the target word.
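The combined decision rule of this embodiment can be sketched as follows, assuming both thresholds are 0.5 as in the text: a word counts as a first-type object in relationship s_i only if both its second-type probability p1 and its per-relationship probability p_i1 clear their thresholds, and symmetrically for second-type objects.

```python
# Hedged sketch of the two-probability decision rule: both the
# second-type probability and the per-relationship first-type
# probability must exceed their thresholds (both 0.5 here).
THRESHOLD = 0.5

def is_first_type_object(p1, p_i1, threshold=THRESHOLD):
    return p1 > threshold and p_i1 > threshold

def is_second_type_object(p2, p_i2, threshold=THRESHOLD):
    return p2 > threshold and p_i2 > threshold
```

Requiring both signals acts as a cross-check between the two decoder heads and filters out words that only one head weakly supports.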
  • S204 Combine the target word pairs belonging to the objects in the same triple relationship into triples according to the relationship between the objects in that triple relationship.
  • Figure 4 shows a schematic diagram of the training process of a preset model, including:
  • S401 Input the sample word vector into the preset model to be trained, and obtain the recognition result output by the preset model.
  • the sample word vector includes the word vector of the word in the sample text.
  • the sample text may include multiple unstructured text fragments having a triple relationship, and the number of sample text fragments may be several thousand to several hundred thousand. It should be noted that, in order to better train the model, sample texts can be obtained according to the field of the text to be recognized. For example, if the field of the text to be recognized is financial, the text belonging to the financial field can be obtained as sample text.
  • the method for obtaining the sample word vector of the word can refer to the above method for obtaining the word vector of the word in the text to be recognized.
  • the recognition result of the sample word vector output by the preset model includes the first type probability and the second type probability of the sample word vector. Specifically, for the meaning of the first type of probability and the second type of probability, reference may be made to the introduction of S202 above. This is not repeated in the embodiment of the application.
  • S402 Use the difference between the predicted first-type probability and the sample first-type probability in the labeling information, and the difference between the predicted second-type probability and the sample second-type probability in the labeling information, to update the parameters of the preset model.
  • the sample text F is: "Zhang Xiaoming married Li Xiaohong in 2018".
  • the triples [Zhang Xiaoming-husband-Li Xiaohong] and [Li Xiaohong-wife-Zhang Xiaoming] are manually marked.
  • the sample first-type probability of a word represents the probability that the word belongs to each object contained in the triple relationships;
  • the sample second-type probability of a word represents the probability that the word belongs to a first-type object and the probability that it belongs to a second-type object, respectively.
  • the objects include all objects contained in the pre-configured n triple relationships that require model learning.
  • the n triple relationships may include all possible triple relationships in the text to be recognized. It is understandable that if the field of the text to be recognized differs, the possible triple relationships will differ; therefore, the n triple relationships can be pre-configured according to the field of the text to be recognized.
  • the labeling information refers to the sample first-type probability and sample second-type probability of each word in the sample text obtained according to the manually labeled triples.
  • the method for obtaining the probability of the first type of the sample and the probability of the second type of the sample is the prior art, and the embodiment of the present application only briefly introduces this with the following examples.
  • the preset triple relationships that the model needs to learn are A: [person-husband-person], B: [person-wife-person], and C: [book-author-person].
  • the sample first-type probability of each word in the sample text F can be determined by whether the word belongs to an object contained in the marked triples. Take "Zhang Xiaoming" as an example.
  • the word belongs to the first-type object in A and the second-type object in B. Therefore, in the sample first-type probability of the word, the probability that it belongs to the first-type object in A is 1, the probability that it belongs to the second-type object in A is 0, the probability that it belongs to the first-type object in B is 0, the probability that it belongs to the second-type object in B is 1, the probability that it belongs to the first-type object in C is 0, and the probability that it belongs to the second-type object in C is 0.
  • the word "marry" does not belong to any object in the above-mentioned marked triples, so each probability in the sample first-type probability of the word is 0.
  • the sample second-type probability of each word in the sample text F can be determined by whether the word belongs to any first-type object in the marked triples and whether it belongs to any second-type object in the marked triples. Taking "Zhang Xiaoming" as an example, the word belongs to the first-type object in A, so the probability that the word belongs to any first-type object is 1; and because it belongs to the second-type object in B, the probability that the word belongs to any second-type object is 1.
  • in other words, the above method for determining the sample second-type probability is: if a word belongs to a first-type object in any marked triple, it is determined that the word belongs to a first-type object in some triple relationship; if a word belongs to a second-type object in any marked triple, it is determined that the word belongs to a second-type object in some triple relationship.
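The labeling rules above can be sketched as follows (an illustrative reconstruction, not the patent's code; the slot ordering and function names are assumptions):

```python
# Build the per-word labels described above: a 2*n-dimensional sample
# first-type probability (one slot per first-/second-type object of each
# relation) and a 2-dimensional sample second-type probability
# ("belongs to any first-type object", "belongs to any second-type object").
RELATIONS = ["A: person-husband-person", "B: person-wife-person", "C: book-author-person"]

def label_word(word, marked_triples):
    """marked_triples: list of (relation_index, first_type_word, second_type_word)."""
    n = len(RELATIONS)
    first_type = [0.0] * (2 * n)  # slots: [A-first, A-second, B-first, B-second, ...]
    second_type = [0.0, 0.0]
    for rel, w1, w2 in marked_triples:
        if word == w1:                     # word fills the first-type slot of this relation
            first_type[2 * rel] = 1.0
            second_type[0] = 1.0
        if word == w2:                     # word fills the second-type slot of this relation
            first_type[2 * rel + 1] = 1.0
            second_type[1] = 1.0
    return first_type, second_type

# Manually marked triples for sample text F:
# [Zhang Xiaoming-husband-Li Xiaohong] and [Li Xiaohong-wife-Zhang Xiaoming]
MARKED = [(0, "Zhang Xiaoming", "Li Xiaohong"), (1, "Li Xiaohong", "Zhang Xiaoming")]
```

For "Zhang Xiaoming" this yields the first-type label [1, 0, 0, 1, 0, 0] and second-type label [1, 1], matching the example values above; for "marry" all entries are 0.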
  • the preset model in the embodiment of the present application can output the first-type probability and the second-type probability of each word.
  • the first type of probability represents the probability that the word belongs to an object included in the triple
  • the second type of probability represents the probability that the word belongs to any first type of object and the probability that the word belongs to any second type of object. Therefore, the two types of probabilities can verify each other to determine the target word, which further improves the accuracy of the triple extraction method.
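The mutual verification just described might be combined as in the sketch below (the thresholds, slot layout, and exact combination rule are assumptions; the text only states that the two probability types can verify each other):

```python
# A word is accepted as a target word only when a first-type probability
# slot exceeds the first threshold AND the matching "any first-type" /
# "any second-type" probability exceeds the second threshold.
FIRST_THRESHOLD = 0.5
SECOND_THRESHOLD = 0.5

def is_target_word(first_type_probs, second_type_probs):
    """first_type_probs: 2*n values, alternating (relation-i first-type,
    relation-i second-type) slots; second_type_probs: [P(any first-type
    object), P(any second-type object)]."""
    for slot, p in enumerate(first_type_probs):
        if p <= FIRST_THRESHOLD:
            continue
        verifier = second_type_probs[slot % 2]  # cross-check with the matching type
        if verifier > SECOND_THRESHOLD:
            return True
    return False
```

A word with a high first-type probability but low "any object" probability is thus rejected, which is the cross-checking effect described above.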
  • the embodiment of the application also provides a triple extraction device.
  • the triple extraction device provided by the embodiment of the application will be described below.
  • the triple extraction device described below and the triple extraction method described above may be referred to in correspondence with each other.
  • FIG. 5 shows a schematic structural diagram of a triple extraction device provided by an embodiment of the present application.
  • the device may include:
  • the word vector obtaining unit 501 is configured to obtain the word vector of each word composing the text to be recognized;
  • the model prediction unit 502 is configured to input the word vectors into a preset model to obtain the recognition result corresponding to each word vector; the recognition result includes the first-type probability of the word vector, and the first-type probability of any word vector represents the probability that the word corresponding to the word vector belongs to an object contained in a triple relationship;
  • the target word determining unit 503 is configured to determine, when any word meets a preset condition, that the word is a target word, where the preset condition includes that the probability that the word belongs to any object in a triple relationship is greater than a first preset threshold;
  • the triple determination unit 504 is configured to group target word pairs belonging to objects in the same triple into a triple according to the relationship between the objects in the triple.
  • the word vector includes at least one of the following: a word meaning vector and a part-of-speech vector.
  • the objects included in the triple relationship include objects of the first type and objects of the second type, and the recognition result further includes:
  • the second-type probability of the word vector, where the second-type probability of any word vector represents the probability that the word corresponding to the word vector belongs to a first-type object and the probability that it belongs to a second-type object.
  • the preset conditions also include:
  • the probability that the word belongs to the first type of object or the second type of object is greater than the second preset threshold.
  • the preset model includes an encoder and a decoder
  • the encoder is used to obtain a feature vector from the word vector
  • the decoder includes a first decoding module and a second decoding module
  • the first decoding module is configured to determine the first type probability according to the feature vector
  • the second decoding module is configured to determine the second type probability according to the feature vector
  • the first decoding module includes 2*n sigmoid functions, where n is the number of triple relationships learned by the preset model, and the second decoding module includes 2 sigmoid functions.
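As a hedged sketch of the decoder structure just described (the feature dimension, the random weights, and the stand-in "encoder" output are assumptions; only the 2*n-plus-2 sigmoid layout comes from the text):

```python
import math
import random

# Illustrative sketch: for each word's feature vector produced by the
# encoder, the first decoding module applies 2*n sigmoid functions (one
# per object slot of the n triple relationships) and the second decoding
# module applies 2 (any first-type object / any second-type object).
random.seed(0)
n = 3  # number of triple relationships the preset model learns
d = 8  # assumed dimension of the feature vector

W1 = [[random.gauss(0, 1) for _ in range(2 * n)] for _ in range(d)]  # first decoding module
W2 = [[random.gauss(0, 1) for _ in range(2)] for _ in range(d)]      # second decoding module

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def decode(feature):
    """Map one word's feature vector to its first- and second-type probabilities."""
    first_type = [sigmoid(sum(feature[i] * W1[i][j] for i in range(d)))
                  for j in range(2 * n)]
    second_type = [sigmoid(sum(feature[i] * W2[i][j] for i in range(d)))
                   for j in range(2)]
    return first_type, second_type
```

Each output lies in (0, 1), so every slot can be compared directly against the preset thresholds.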
  • the device further includes: a preset model training module, configured to train the preset model, and specifically configured to:
  • input the sample word vector into the preset model to be trained to obtain the output recognition result; the recognition result includes the first-type probability and the second-type probability of the sample word vector;
  • the sample word vector includes the word vectors of the words in the sample text;
  • the sample first-type probability and the sample second-type probability are determined according to the triples marked in the sample text.
  • the triple extraction device includes a processor and a memory.
  • the word vector acquisition unit 501, the model prediction unit 502, the target word determination unit 503, and the triple determination unit 504 are all stored in the memory as program units and executed by the processor.
  • the above-mentioned program units stored in the memory implement the corresponding functions.
  • the processor contains the kernel, and the kernel calls the corresponding program unit from the memory.
  • One or more kernels can be set, and the accuracy of triple extraction can be improved by adjusting kernel parameters.
  • the embodiment of the present invention provides a storage medium on which a program is stored, and when the program is executed by a processor, the triplet extraction method is implemented.
  • the embodiment of the present invention provides a processor, the processor is used to run a program, wherein the triple extraction method is executed when the program is running.
  • the embodiment of the present application also provides a triplet extraction device.
  • FIG. 6 shows a schematic structural diagram of the triple extraction device 60.
  • the device includes at least one processor 601 and a memory connected to the processor.
  • the device herein can be a server, a PC, a tablet (PAD), a mobile phone, etc.
  • This application also provides a computer program product which, when executed on a data processing device, is adapted to execute a program initialized with the following method steps:
  • obtain the word vector of each word composing the text to be recognized;
  • the word vector is input into the preset model to obtain the recognition result corresponding to each word vector;
  • the recognition result includes the first-type probability of the word vector; the first-type probability of any word vector represents the probability that the word corresponding to the word vector belongs to an object contained in a triple relationship;
  • when any word satisfies a preset condition, determine that the word is a target word, where the preset condition includes that the probability that the word belongs to any object in a triple relationship is greater than a first preset threshold;
  • the target word pairs belonging to the objects in the same triple relationship are formed into triples according to the relationship between the objects in the triple relationship.
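The final grouping step might look like this sketch (the data layout and names are assumptions; the source does not prescribe an implementation):

```python
# Pair target words that fill the first-type and second-type object slots
# of the same triple relationship, using that relationship's relation name.
RELATIONS = [("person", "husband", "person"), ("person", "wife", "person")]

def form_triples(targets):
    """targets: {relation_index: {"first": [words...], "second": [words...]}}"""
    triples = []
    for rel_idx, slots in targets.items():
        relation_name = RELATIONS[rel_idx][1]
        for w1 in slots.get("first", []):
            for w2 in slots.get("second", []):
                triples.append((w1, relation_name, w2))
    return triples
```

For example, `form_triples({0: {"first": ["Zhang Xiaoming"], "second": ["Li Xiaohong"]}})` yields `[("Zhang Xiaoming", "husband", "Li Xiaohong")]`.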
  • the word vector includes at least one of the following: a word meaning vector and a part-of-speech vector.
  • the objects included in the triple relationship include objects of the first type and objects of the second type, and the recognition result further includes:
  • the second-type probability of the word vector, where the second-type probability of any word vector represents the probability that the word corresponding to the word vector belongs to a first-type object and the probability that it belongs to a second-type object.
  • the preset conditions also include:
  • the probability that the word belongs to the first type of object or the second type of object is greater than the second preset threshold.
  • the preset model includes an encoder and a decoder
  • the encoder is used to obtain a feature vector from the word vector
  • the decoder includes a first decoding module and a second decoding module
  • the first decoding module is configured to determine the first type probability according to the feature vector
  • the second decoding module is configured to determine the second type probability according to the feature vector
  • the first decoding module includes 2*n sigmoid functions, where n is the number of triples learned by the preset model, and the second decoding module includes 2 sigmoid functions.
  • the training process of the preset model includes:
  • input the sample word vector into the preset model to be trained to obtain the output recognition result; the recognition result includes the first-type probability and the second-type probability of the sample word vector;
  • the sample word vector includes the word vectors of the words in the sample text;
  • the sample first-type probability and the sample second-type probability are determined according to the triples marked in the sample text.
  • the device includes one or more processors (CPUs), memory, and buses.
  • the device may also include input/output interfaces, network interfaces, and so on.
  • the memory may include non-persistent memory, random access memory (RAM), and/or non-volatile memory in computer-readable media, such as read-only memory (ROM) or flash memory (flash RAM); the memory includes at least one memory chip.
  • the memory is an example of a computer-readable medium.
  • Computer-readable media include permanent and non-permanent, removable and non-removable media, and information storage can be realized by any method or technology.
  • the information can be computer-readable instructions, data structures, program modules, or other data.
  • Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, CD-ROM, digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission media that can be used to store information accessible by computing devices. As defined herein, computer-readable media do not include transitory media, such as modulated data signals and carrier waves.
  • this application can be provided as a method, a system, or a computer program product. Therefore, this application may adopt the form of a complete hardware embodiment, a complete software embodiment, or an embodiment combining software and hardware. Moreover, this application may adopt the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program codes.

Abstract

The present application discloses a 3-tuple extraction method, a device, an apparatus, and a storage medium. The method comprises: inputting respective word vectors of words in a text to undergo identification into a pre-determined model, and obtaining an identification result corresponding to each of the word vectors, each of the identification results of the word vectors representing the probability of a word that corresponds to the word vector being associated with objects comprised in a 3-tuple relationship; if the probability of any word being associated with any object is greater than a first pre-determined threshold, determining the word as a target word; and forming a 3-tuple from any target word pair in target words according to a relationship corresponding to the target word pair. Since any pair of target words comprise two words associated with objects comprised in the same 3-tuple relationship, a relationship of the pair of target words is a relationship of the objects associated with the pair of target words in the 3-tuple relationship. In this way, the technical solution provided by the present application achieves identification of all 3-tuples in a text without using models in multiple steps.

Description

Method, device, equipment and storage medium for extracting triples
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on September 30, 2019, with application number 201910942438.4 and invention title "A triple extraction method, device, equipment, and storage medium", the entire contents of which are incorporated herein by reference.
Technical Field
The present invention relates to the field of electronic information technology, and more specifically, to a triple extraction method, device, equipment, and storage medium.
Background
With the development and popularization of Internet technology, the Internet has become an indispensable part of most people's daily lives. For the large amount of unstructured text on the Internet, a knowledge graph can be established by extracting the triples in the text, which is of great significance and value for downstream tasks such as retrieval, recommendation, and query.
Triple extraction refers to extracting, according to rules, the subjects and individuals contained in unstructured text and the relationships between them. For example, if the unstructured text is "Zhang Xiaoming married Li Xiaohong in 2018", then for rule A: [person-husband-person], the triple extracted from the unstructured text is: Zhang Xiaoming-husband-Li Xiaohong. At present, methods for extracting triples from unstructured text can apply a multi-step procedural approach, for example, a two-step method. The specific steps can be: first, extract the entities contained in the text; second, classify and associate the extracted entities, and finally determine the relationships between the entities. For example, one can first extract the entities from the above unstructured text, namely the person names "Zhang Xiaoming" and "Li Xiaohong", and then classify and associate these names to determine that "Zhang Xiaoming" is the "husband" of "Li Xiaohong", thereby obtaining a triple: Zhang Xiaoming-husband-Li Xiaohong.
Because the triples are extracted in multiple steps, and both entity extraction and relationship determination are completed by models, the errors of the model used in each step accumulate and information cannot be shared between the steps. Therefore, the accuracy of the triple extraction results obtained by the existing multi-step procedural method described above is low.
Summary of the Invention
In view of the above problems, the present invention provides a triple extraction method, device, equipment, and storage medium that overcome the above problems or at least partially solve the above problems, as follows:
A triple extraction method, including:
Obtain the word vector of each word composing the text to be recognized;
Input the word vectors into a preset model to obtain the recognition result corresponding to each word vector; the recognition result includes the first-type probability of the word vector, and the first-type probability of any word vector represents the probability that the word corresponding to the word vector belongs to an object contained in a triple relationship;
When any word satisfies a preset condition, determine that the word is a target word, where the preset condition includes that the probability that the word belongs to any object in a triple relationship is greater than a first preset threshold;
Group the target word pairs that belong to objects in the same triple relationship into triples according to the relationships between the objects in that triple relationship.
Optionally, the word vector includes at least one of the following: a word meaning vector and a part-of-speech vector.
Optionally, the objects included in the triple relationship include first-type objects and second-type objects, and the recognition result further includes:
The second-type probability of the word vector, where the second-type probability of any word vector represents the probability that the word corresponding to the word vector belongs to a first-type object and the probability that it belongs to a second-type object.
Optionally, the preset condition further includes:
The probability that the word belongs to a first-type object or a second-type object is greater than a second preset threshold.
Optionally, the preset model includes an encoder and a decoder;
The encoder is used to obtain a feature vector from the word vector;
The decoder includes a first decoding module and a second decoding module;
The first decoding module is configured to determine the first-type probability according to the feature vector, and the second decoding module is configured to determine the second-type probability according to the feature vector.
Optionally, the first decoding module includes 2*n sigmoid functions, where n is the number of triple relationships learned by the preset model, and the second decoding module includes 2 sigmoid functions.
Optionally, the training process of the preset model includes:
Input the sample word vectors into the preset model to be trained to obtain the output recognition result; the recognition result includes the first-type probability and the second-type probability of the sample word vector; the sample word vector includes the word vectors of the words in the sample text;
Use the difference between the sample first-type probability in the labeling information and the first-type probability, and the difference between the sample second-type probability in the labeling information and the second-type probability, to obtain the parameters of the preset model; the sample first-type probability and the sample second-type probability are determined according to the triples marked in the sample text.
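The text only specifies that the parameters are obtained from the "difference" between the predicted and labeled probabilities; one common concrete choice for such a difference, shown here purely as an assumption, is binary cross-entropy summed over both outputs:

```python
import math

# Binary cross-entropy between a predicted probability p and a 0/1 label y,
# summed over the first-type (2*n values) and second-type (2 values) outputs.
def bce(p, y, eps=1e-7):
    p = min(max(p, eps), 1 - eps)  # clamp for numerical safety
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def training_loss(pred_first, label_first, pred_second, label_second):
    loss = sum(bce(p, y) for p, y in zip(pred_first, label_first))
    loss += sum(bce(p, y) for p, y in zip(pred_second, label_second))
    return loss
```

A perfect prediction gives a near-zero loss, while a confident wrong prediction is penalized heavily; minimizing this loss over the labeled samples yields the model parameters.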
A triple extraction device, including:
A word vector obtaining unit, configured to obtain the word vector of each word composing the text to be recognized;
A model prediction unit, configured to input the word vectors into a preset model to obtain the recognition result corresponding to each word vector; the recognition result includes the first-type probability of the word vector, and the first-type probability of any word vector represents the probability that the word corresponding to the word vector belongs to an object contained in a triple relationship;
A target word determining unit, configured to determine, when any word satisfies a preset condition, that the word is a target word, where the preset condition includes that the probability that the word belongs to any object in a triple relationship is greater than a first preset threshold;
A triple determination unit, configured to group the target word pairs that belong to objects in the same triple relationship into triples according to the relationships between the objects in that triple relationship.
A triple extraction device, including: a memory and a processor;
The memory is used to store a program;
The processor is configured to execute the program to implement each step of the triple extraction method described above.
A storage medium having a program stored thereon, wherein, when the program is executed by a processor, each step of the triple extraction method described above is implemented.
With the above technical solution, in the triple extraction method provided by the present invention, the word vector of each word in the text to be recognized is input into a preset model to obtain the recognition result corresponding to each word vector. The recognition result corresponding to each word vector represents the probability that the word corresponding to the word vector belongs to an object contained in a triple relationship. When the probability that any word belongs to any object is greater than a first preset threshold, the word is determined to be a target word, and any pair of target words is formed into a triple according to the relationship corresponding to that pair. Because any pair of target words includes two words belonging to objects contained in the same triple relationship, the relationship of the target word pair is the relationship, in that triple relationship, between the objects to which the pair belongs. It can be seen that, based on the preset model, the technical solution provided by this application can output the probability that each word belongs to each of the objects included in the triple relationships learned by the model, and all the triples in the text to be recognized can then be obtained by rule determination based on the model output, i.e., without using models in multiple steps. In addition, the output of the model serves as the basis for the rule determination; therefore, the accumulation of model errors is avoided and the accuracy of the results can be improved.
The above description is only an overview of the technical solution of the present invention. In order that the technical means of the present invention can be understood more clearly and implemented in accordance with the content of the specification, and in order to make the above and other objectives, features, and advantages of the present invention more obvious and understandable, specific embodiments of the present invention are set forth below.
Description of the Drawings
FIG. 1 is a schematic flowchart of the triple extraction method provided by an embodiment of the application;
FIG. 2 is a schematic flowchart of another implementation of the triple extraction method provided by an embodiment of the application;
FIG. 3 is a schematic structural diagram of a preset model provided by an embodiment of the application;
FIG. 4 is a schematic diagram of a training process of a preset model provided by an embodiment of the application;
FIG. 5 is a schematic structural diagram of a triple extraction device provided by an embodiment of the application;
FIG. 6 is a schematic structural diagram of a triple extraction device provided by an embodiment of the application.
Detailed Description
The triple extraction method provided in the embodiments of the present application can be applied to smart devices, such as computers, tablets, or smart phones, or can be applied to a server preset with a text processing system.
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the present disclosure, it should be understood that the present disclosure can be implemented in various forms and should not be limited by the embodiments set forth herein. On the contrary, these embodiments are provided to enable a more thorough understanding of the present disclosure and to fully convey the scope of the present disclosure to those skilled in the art.
FIG. 1 is a schematic flowchart of the triple extraction method provided by an embodiment of the present application. The method may specifically include:
S101: Obtain word vectors.
Specifically, the word vectors include the word vector of each word constituting the text to be recognized, where the text to be recognized is unstructured text from which triples need to be extracted. Generally, the text to be recognized may include at least one sentence; each sentence is composed of words and may contain punctuation marks. After word segmentation is performed on the text to be recognized, all the words constituting the text to be recognized can be obtained. The word vectors obtained in this step are the word vectors generated by mapping each word constituting the text to be recognized into a vector space.
Word segmentation refers to splitting a sentence of text based on preset segmentation standards and removing punctuation marks. For example, the text W to be recognized is: "张小明的新婚妻子李小红最喜欢的小说是钱钟书先生的《围城》" ("Zhang Xiaoming's newlywed wife Li Xiaohong's favorite novel is Mr. Qian Zhongshu's Besieged City"). After word segmentation, the words constituting the text W include: "张小明", "的", "新婚妻子", "李小红", "最", "喜欢", "的", "小说", "是", "钱钟书", "先生", "的", "围城". It should be noted that word segmentation is an existing technique in the field of natural language processing; for example, existing tool software (such as HIT LTP or jieba) can be used to segment text sentences, and this process is not described in detail here.
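As the paragraph notes, existing tools such as HIT LTP or jieba would be used in practice; the toy forward-maximum-matching segmenter below (with a hand-built dictionary, purely an illustrative assumption) only demonstrates the idea of splitting the example sentence into words and dropping punctuation:

```python
# Toy forward-maximum-matching word segmentation over a tiny hand-made
# dictionary; real systems would use a trained segmenter instead.
DICT = {"张小明", "的", "新婚妻子", "李小红", "最", "喜欢", "小说", "是",
        "钱钟书", "先生", "围城"}
MAX_LEN = max(len(w) for w in DICT)
PUNCT = set("《》，。！？、")

def segment(text):
    words, i = [], 0
    while i < len(text):
        if text[i] in PUNCT:          # remove punctuation marks
            i += 1
            continue
        for length in range(min(MAX_LEN, len(text) - i), 0, -1):
            cand = text[i:i + length]
            if length == 1 or cand in DICT:  # fall back to a single character
                words.append(cand)
                i += length
                break
    return words
```

On the example sentence this reproduces the word list given above, including the repeated "的" tokens and the removal of "《" and "》".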
S102: Input the word vectors into the preset model to obtain the recognition result corresponding to each word vector.
Specifically, each triple relationship is a rule; for example, the above rule A [person-husband-person] can serve as a triple relationship. The preset model can learn multiple triple relationships, and each triple relationship can include two objects. For example, the above triple relationship A [person-husband-person] includes two objects, both of which are "person". If the number of pre-configured triple relationships that the model needs to learn is n, then the number of objects is 2*n.
Each triple relationship can include a first-type object and a second-type object. In this method, the former object in each triple relationship is recorded as the first-type object, and the latter object in each triple relationship is recorded as the second-type object, so the objects include n first-type objects and n second-type objects.
The recognition result corresponding to any word vector output by the model includes the first-type probability of the word vector, which represents the probabilities that the word generating the word vector belongs to the objects. Since the number of objects is 2*n, the first-type probability includes 2*n probabilities, each representing the probability that the word generating the word vector belongs to one object.
例如，模型学习的三元组关系为n个，分别为s1、s2、...、sn，词向量为m个，分别为w1、w2、...、wm。则，w1对应的识别结果可以包括：生成w1的词分别属于s1中第一类对象的概率p11、属于s1中第二类对象的概率p12，属于s2中第一类对象的概率p21，属于s2中第二类对象的概率p22，...，属于sn中第一类对象的概率pn1、属于sn中第二类对象的概率pn2。For example, suppose the model learns n triple relationships s1, s2, ..., sn, and there are m word vectors w1, w2, ..., wm. Then the recognition result corresponding to w1 may include: the probability p11 that the word generating w1 belongs to the first-type object of s1, the probability p12 that it belongs to the second-type object of s1, the probability p21 that it belongs to the first-type object of s2, the probability p22 that it belongs to the second-type object of s2, ..., the probability pn1 that it belongs to the first-type object of sn, and the probability pn2 that it belongs to the second-type object of sn.
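The p11, p12, ..., pn1, pn2 indexing convention can be sketched as follows, assuming the 2*n sigmoid outputs are stored flat in relation order (the storage layout is an assumption for illustration):

```python
def object_probs(first_type_probs, i):
    """Given the flat list of 2*n first-type probabilities and a 1-based
    relation index i, return (p_i1, p_i2): the probabilities that the word
    is the first-type / second-type object of triple relationship s_i."""
    return first_type_probs[2 * (i - 1)], first_type_probs[2 * (i - 1) + 1]

# n = 2 relations, made-up probabilities for one word vector w1
probs_w1 = [0.91, 0.05, 0.12, 0.88]
assert object_probs(probs_w1, 1) == (0.91, 0.05)  # (p11, p12)
assert object_probs(probs_w1, 2) == (0.12, 0.88)  # (p21, p22)
```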
需要说明的是,上述第一类概率可以由预设模型中的2*n个sigmoid函数预测得到。It should be noted that the aforementioned probability of the first type can be predicted by the 2*n sigmoid functions in the preset model.
S103:在任意一个词满足预设条件的情况下,确定该词为目标词。S103: When any word satisfies the preset condition, determine that the word is the target word.
具体地，对于任意一个词，预设条件包括该词属于任意一个对象的概率大于第一预设阈值。其中，第一预设阈值为依据模型的输出进行规则判定时设置的阈值，一般地，可以设该阈值的大小为0.5。即，当任意一个词生成的词向量对应的第一类概率表征该词属于对象的概率大于第一预设阈值时，判定该词属于该对象，因此将该词作为一个目标词。Specifically, for any word, the preset condition includes that the probability of the word belonging to any object is greater than a first preset threshold. The first preset threshold is the threshold used when making rule-based decisions on the model's output; generally, it can be set to 0.5. That is, when the first-type probability corresponding to the word vector generated by any word indicates that the probability of the word belonging to an object is greater than the first preset threshold, the word is determined to belong to that object and is therefore taken as a target word.
例如，待识别文本W为：“张小明的新婚妻子李小红最喜欢的小说是钱钟书先生的《围城》”，其中生成词“张小明”的词向量对应的第一类概率中，表征“张小明”属于三元组关系[人物-丈夫-人物]中第一类对象的概率大于0.5。生成词“李小红”的词向量对应的第一类概率中，表征“李小红”属于三元组关系[人物-丈夫-人物]中第二类对象的概率大于0.5。即，词“张小明”和“李小红”均满足预设条件，确定其为目标词。For example, suppose the text W to be recognized is "张小明的新婚妻子李小红最喜欢的小说是钱钟书先生的《围城》". In the first-type probability corresponding to the word vector of the word "Zhang Xiaoming", the probability that "Zhang Xiaoming" belongs to the first-type object of the triple relationship [person-husband-person] is greater than 0.5; in the first-type probability corresponding to the word vector of the word "Li Xiaohong", the probability that "Li Xiaohong" belongs to the second-type object of the triple relationship [person-husband-person] is greater than 0.5. That is, the words "Zhang Xiaoming" and "Li Xiaohong" both satisfy the preset condition and are determined to be target words.
可以理解的是，本步骤确定的任意一个目标词可以属于多个对象。例如，生成词“张小明”的词向量对应的第一类概率中，表征“张小明”属于三元组关系[人物-丈夫-人物]中第一类对象的概率大于0.5，并且该第一类概率中表征“张小明”属于三元组关系[人物-妻子-人物]中第二类对象的概率也大于0.5。所以，目标词“张小明”属于两个对象。It should be understood that any target word determined in this step may belong to multiple objects. For example, in the first-type probability corresponding to the word vector of the word "Zhang Xiaoming", the probability that "Zhang Xiaoming" belongs to the first-type object of the triple relationship [person-husband-person] is greater than 0.5, and in the same first-type probability, the probability that "Zhang Xiaoming" belongs to the second-type object of the triple relationship [person-wife-person] is also greater than 0.5. Therefore, the target word "Zhang Xiaoming" belongs to two objects.
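The rule of S103 can be sketched as follows, again assuming the 2*n first-type probabilities are stored flat in relation order and the threshold is 0.5:

```python
THRESHOLD = 0.5  # the first preset threshold from the text

def matched_objects(first_type_probs):
    """Return every (relation_index, object_type) slot whose probability
    exceeds the threshold. The word is a target word if the list is
    non-empty; it may match several slots at once, as in the
    "Zhang Xiaoming" example (overlapping triples)."""
    hits = []
    for i in range(len(first_type_probs) // 2):
        if first_type_probs[2 * i] > THRESHOLD:
            hits.append((i + 1, 1))   # first-type object of relation i+1
        if first_type_probs[2 * i + 1] > THRESHOLD:
            hits.append((i + 1, 2))   # second-type object of relation i+1
    return hits

# toy probabilities: first object of relation 1, second object of relation 2
assert matched_objects([0.9, 0.1, 0.2, 0.7]) == [(1, 1), (2, 2)]
assert matched_objects([0.2, 0.3, 0.4, 0.1]) == []  # not a target word
```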
S104:将属于同一个三元组中对象的目标词对,按照该三元组中对象之间的关系,组成三元组。S104: Combine the target word pairs belonging to the objects in the same triplet to form a triplet according to the relationship between the objects in the triplet.
具体地,一对目标词对中包括属于同一三元组关系中两个对象的两个目标词。任一三元组关系中的对象包括第一类对象和第二类对象,该两个对象之间的关系可以由三元组中的关系标签表示。则,该对目标词对对应的关系也为该三元组关系中的关系标签所表示的关系。Specifically, a pair of target words includes two target words belonging to two objects in the same triple relationship. The objects in any triple relationship include objects of the first type and objects of the second type, and the relationship between the two objects can be represented by the relationship label in the triple. Then, the corresponding relationship of the target word pair is also the relationship indicated by the relationship label in the triple relationship.
例如，三元组关系C[书籍-作者-人物]中，第一个对象“书籍”为第一类对象，第二个对象“人物”为第二类对象，该第一类对象和第二类对象的关系为“人物”为“书籍”的“作者”。在待识别文本W中，表征“围城”属于对象（即三元组关系C中第一类对象“书籍”）的概率大于0.5，并且表征“钱钟书”属于对象（即三元组关系C中第二类对象“人物”）的概率大于0.5。所以“围城”和“钱钟书”为一对目标词对，该目标词对的关系由三元组关系C中的关系标签表示，即“钱钟书”为“围城”的“作者”。进一步，由“钱钟书”、“围城”按照关系“作者”组成一条三元组[“围城”-作者-“钱钟书”]。For example, in the triple relationship C [book-author-person], the first object "book" is the first-type object and the second object "person" is the second-type object; the relationship between the first-type object and the second-type object is that the "person" is the "author" of the "book". In the text W to be recognized, the probability that "围城" (Besieged City) belongs to an object (namely the first-type object "book" of the triple relationship C) is greater than 0.5, and the probability that "钱钟书" (Qian Zhongshu) belongs to an object (namely the second-type object "person" of the triple relationship C) is greater than 0.5. Therefore, "Besieged City" and "Qian Zhongshu" form a target word pair, whose relationship is represented by the relationship label in the triple relationship C, i.e. "Qian Zhongshu" is the "author" of "Besieged City". Further, "Qian Zhongshu" and "Besieged City" are combined according to the relationship "author" into one triple ["Besieged City"-author-"Qian Zhongshu"].
需要说明的是，在所有目标词中，针对任一对象，可能存在多个目标词属于该对象，所以可能出现以下情况：存在多个目标词属于一个三元组关系中的第一类对象，同时存在多个目标词属于该三元组关系中的第二类对象。在此种情况下，可以按照目标词在待识别文本中的位置选择目标词对，一般地，将位置最接近的两个属于同一三元组关系中对象的目标词作为一对目标词对。It should be noted that, among all target words, multiple target words may belong to the same object, so the following situation may occur: multiple target words belong to the first-type object of one triple relationship while, at the same time, multiple target words belong to the second-type object of that triple relationship. In this case, target word pairs can be selected according to the positions of the target words in the text to be recognized; generally, the two positionally closest target words belonging to the objects of the same triple relationship are taken as one target word pair.
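The nearest-position pairing described above can be sketched as follows (positions are token indices; this simple greedy version and the sample words "小说A"/"作者B" are illustrative assumptions, not part of the patent):

```python
def pair_nearest(firsts, seconds):
    """firsts / seconds: lists of (word, position) that matched the
    first-type / second-type object of one triple relationship.
    Pair each first-type word with the positionally closest
    second-type word."""
    return [(w1, min(seconds, key=lambda s: abs(s[1] - p1))[0])
            for w1, p1 in firsts]

# two "book" candidates and two "person" candidates in one sentence:
firsts = [("围城", 12), ("小说A", 3)]    # "小说A" is a hypothetical title
seconds = [("钱钟书", 9), ("作者B", 2)]  # "作者B" is a hypothetical name
assert pair_nearest(firsts, seconds) == [("围城", "钱钟书"), ("小说A", "作者B")]
```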
借由上述技术方案，本发明提供的三元组抽取方法中，将待识别文本中的各个词的词向量输入预设模型，得到每一个词向量对应的识别结果。每一词向量对应的识别结果表征该词向量所对应的词属于三元组中所包含对象的概率，在任意一个词属于任意一个对象的概率大于第一预设阈值的情况下，确定该词为目标词，将目标词中的任意一对目标词对，按照该对目标词对应的关系，组成三元组。因为任意一对目标词对包括两个属于同一三元组关系包含的对象的目标词，目标词对的关系为该对目标词对所属的对象在三元组关系中的关系。可见本申请提供的技术方案中，模型可以输出词属于模型学习的三元组关系中包括的所有对象的概率，进一步依据模型的输出进行规则判定即可得到待识别文本中的所有三元组，即无需多步使用模型。并且，模型的输出作为规则判定的基础，因此避免了模型误差的累计，能够提高结果的准确性。With the above technical solution, in the triple extraction method provided by the present invention, the word vectors of the words in the text to be recognized are input into a preset model to obtain the recognition result corresponding to each word vector. The recognition result corresponding to each word vector represents the probability that the word corresponding to the word vector belongs to an object contained in a triple. When the probability that any word belongs to any object is greater than the first preset threshold, the word is determined to be a target word, and each pair of target words is combined into a triple according to the relationship corresponding to that pair. Because any target word pair includes two target words belonging to the objects of the same triple relationship, the relationship of the target word pair is the relationship, within that triple relationship, between the objects to which the pair belongs. It can be seen that in the technical solution provided by this application, the model can output the probability that a word belongs to every object included in the triple relationships learned by the model, and all triples in the text to be recognized can then be obtained by rule-based decisions on the model's output, i.e. without using the model in multiple steps. Moreover, since the model's output serves as the basis for the rule decisions, the accumulation of model errors is avoided and the accuracy of the results can be improved.
进一步，由于模型学习的三元组关系可以根据需要配置为多组，所以任一目标词可以属于多个对象，即，任一目标词可以属于多个三元组关系中的对象，由此无需多个模型即可抽取任一生成词向量的词所属的多个三元组。可见本申请公开的三元组抽取方法大大提高了三元组抽取效率。Further, since the triple relationships learned by the model can be configured into multiple groups as needed, any target word can belong to multiple objects, i.e. to objects in multiple triple relationships; thus, the multiple triples to which the word generating any word vector belongs can be extracted without multiple models. It can be seen that the triple extraction method disclosed in this application greatly improves the efficiency of triple extraction.
例如，发明人在实现本发明创造的过程中发现，为了提高三元组抽取的准确度，可以基于LSTM-CRF（Long Short-Term Memory-Conditional Random Field，长短期记忆网络-条件随机场算法）模型预先建立三元组抽取系统，以待识别文本为输入，通过模型直接抽取文本数据中的三元组，实现端到端的三元组抽取。从而避免了多步流程化方法中的错误累加，提高了抽取结果的准确率。For example, the inventor found in the process of making the present invention that, to improve the accuracy of triple extraction, a triple extraction system can be pre-built based on an LSTM-CRF (Long Short-Term Memory - Conditional Random Field) model: the text to be recognized is taken as input, and the triples in the text data are extracted directly by the model, achieving end-to-end triple extraction. This avoids the accumulation of errors in multi-step pipeline methods and improves the accuracy of the extraction results.
但是，由于LSTM-CRF模型中的解码器CRF只能进行单一的解码，所以在抽取三元组的过程中，每个词只能属于一个三元组关系，即通过LSTM-CRF模型抽取三元组时，任一词只能属于一个三元组。当待识别文本中包含的词属于两个及以上三元组时，必须同时建立多个LSTM-CRF模型分别针对不同的三元组关系进行三元组抽取，因此导致三元组的抽取效率低。However, since the CRF decoder in the LSTM-CRF model can only perform a single decoding, during triple extraction each word can belong to only one triple relationship; that is, when extracting triples with the LSTM-CRF model, any word can belong to only one triple. When a word contained in the text to be recognized belongs to two or more triples, multiple LSTM-CRF models must be built simultaneously to extract triples for the different triple relationships, which results in low triple extraction efficiency.
如上所述的待识别文本W中,针对三元组关系A:[人物-丈夫-人物],可以从该待识别文本W中抽取三元组为:[张小明-丈夫-李小红]。针对三元组关系B:[人物-妻子-人物],可以从该待识别文本W中抽取三元组为:[李小红-妻子-张小明]。可见“张小明”和“李小红”两个人名同时属于不同的三元组关系。因此,基于LSTM-CRF模型预先建立三元组抽取系统时,需要训练两个LSTM-CRF模型,分别针对三元组关系A和三元组关系B进行三元组的抽取。由此导致执行效率较低。In the text W to be recognized as described above, for the triple relationship A: [person-husband-person], the triples can be extracted from the text W to be recognized as: [Zhang Xiaoming-husband-Li Xiaohong]. For the triple relationship B: [person-wife-person], the triple can be extracted from the text W to be recognized as: [Li Xiaohong-wife-Zhang Xiaoming]. It can be seen that the names of "Zhang Xiaoming" and "Li Xiaohong" belong to different triple relationships at the same time. Therefore, when a triple extraction system is established in advance based on the LSTM-CRF model, two LSTM-CRF models need to be trained to extract triples for triple relationship A and triple relationship B respectively. As a result, the execution efficiency is low.
针对上述技术问题，本申请提供的三元组抽取方法中的模型在学习过程中可以根据需要配置为多个三元组关系，并且任一个目标词可能同时属于多个对象。例如，待识别文本W中，生成词“张小明”的词向量对应的第一类概率中，表征“张小明”属于三元组关系A[人物-丈夫-人物]中第一类对象的概率大于0.5。同时，生成词“张小明”的词向量对应的第一类概率中，表征“张小明”属于三元组关系B[人物-妻子-人物]中第二类对象的概率也大于0.5。由此“张小明”可以同时属于由三元组关系A和三元组关系B确定的两个三元组[张小明-丈夫-李小红]以及[李小红-妻子-张小明]。In view of the above technical problems, in the triple extraction method provided by this application, the model can be configured with multiple triple relationships as needed during learning, and any target word may belong to multiple objects at the same time. For example, in the text W to be recognized, in the first-type probability corresponding to the word vector of the word "Zhang Xiaoming", the probability that "Zhang Xiaoming" belongs to the first-type object of the triple relationship A [person-husband-person] is greater than 0.5; at the same time, the probability that "Zhang Xiaoming" belongs to the second-type object of the triple relationship B [person-wife-person] is also greater than 0.5. Therefore, "Zhang Xiaoming" can simultaneously belong to the two triples determined by triple relationships A and B: [Zhang Xiaoming-husband-Li Xiaohong] and [Li Xiaohong-wife-Zhang Xiaoming].
显然,针对上述具有交替、重叠关系的三元组抽取时,本申请提供的技术方案具有准确度高且效率高的优点。Obviously, the technical solution provided in this application has the advantages of high accuracy and high efficiency for the extraction of the above-mentioned triples with alternating and overlapping relationships.
需要说明的是，上述S101获取的词向量可以包括词义向量和词性向量。其中，一个词的词义向量为其词义映射至向量空间得到的向量，该向量可以表征该词的词义信息。一个词的词性向量为其词性映射至向量空间得到的向量，该向量可以表征该词的词性信息。It should be noted that the word vector obtained in S101 may include a word sense vector and a part-of-speech vector. The word sense vector of a word is the vector obtained by mapping its meaning into a vector space, and it can represent the semantic information of the word. The part-of-speech vector of a word is the vector obtained by mapping its part of speech into a vector space, and it can represent the part-of-speech information of the word.
其中,可以通过查找词向量映射集合(即Word vector)获取每一词的词义向量。Word vector为使用待识别文本所属领域的语料库进行词向量训练生成的词义向量映射对应关系集合。Word vector可以将词映射到一个低维的向量空间中,可以通过词义向量之间的关系表达各个词的词义之间的相似关系。对于语料库中的低频词,可以记为UNK,UNK在word vector中具有唯一的向量表达,其维度与其它词对应的词义向量维度一致。Among them, the word meaning vector of each word can be obtained by searching the word vector mapping set (ie Word vector). Word vector is a set of word sense vector mapping correspondences generated by word vector training using the corpus of the field to which the text to be recognized belongs. Word vector can map words to a low-dimensional vector space, and can express the similar relationship between the word meanings of various words through the relationship between the word meaning vectors. The low-frequency words in the corpus can be marked as UNK. UNK has a unique vector expression in the word vector, and its dimension is consistent with the dimension of the word meaning vector corresponding to other words.
例如，待识别文本E分词处理后得到组成文本的k个词分别为e1、e2、...、ek，通过Word vector可以获得所有词生成的词义向量，分别为h1、h2、...、hk。For example, after segmenting the text E to be recognized, the k words composing the text are e1, e2, ..., ek, and the word sense vectors generated for all the words, namely h1, h2, ..., hk, can be obtained through the word vector mapping set.
需要说明的是,当训练文本的组成词为低频词时,根据Word vector可以获得其生成的词义向量为UNK。It should be noted that when the constituent words of the training text are low-frequency words, the generated word meaning vector can be obtained as UNK according to the Word vector.
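A minimal sketch of the word-vector lookup with the shared UNK fallback (the two-dimensional toy vectors and the vocabulary are assumptions for illustration):

```python
word_vector_map = {           # toy word vector mapping set (Word vector)
    "UNK": [0.0, 0.0],        # shared vector for low-frequency / OOV words
    "围城": [0.31, -0.12],
    "钱钟书": [0.25, 0.40],
}

def sense_vector(word):
    # low-frequency words fall back to UNK, whose dimension matches
    # the sense vectors of all other words
    return word_vector_map.get(word, word_vector_map["UNK"])

assert sense_vector("围城") == [0.31, -0.12]
assert sense_vector("某低频词") == [0.0, 0.0]
```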
显然，同一词在不同文本中表达的词义不同，其词性也可能不同，因此词性也可以影响该词的三元组关系识别结果。所以可以获取待识别文本中各个词的词性信息。词性信息为每个词在待识别文本中表达的词义所属的词性。获取方法可以采用一定维度的随机向量表达，例如对于共计30种词性[A1,A2,…,A30]，可以用词性向量a1表示A1，词性向量a2表示A2，...，词性向量a30表示A30。其中词性向量a1、a2、...、a30的维度可以预先设置，其中的每一个维度都是一个随机生成的接近于0的小数。Obviously, the same word may express different meanings in different texts, and its part of speech may also differ, so the part of speech can also affect the triple-relationship recognition result of the word. Therefore, the part-of-speech information of each word in the text to be recognized can be obtained; the part-of-speech information is the part of speech of the meaning each word expresses in the text to be recognized. The part-of-speech vectors can be expressed as random vectors of a certain dimension. For example, for a total of 30 parts of speech [A1, A2, ..., A30], the part-of-speech vector a1 can represent A1, a2 can represent A2, ..., and a30 can represent A30. The dimension of the part-of-speech vectors a1, a2, ..., a30 can be preset, and each component is a randomly generated decimal close to 0.
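The random part-of-speech vectors can be initialized as follows (3 tags and dimension 4 stand in for the 30 tags and the dimension of 20 mentioned later; the value range near zero is an assumption):

```python
import random

random.seed(42)
POS_TAGS = ["A1", "A2", "A3"]   # stand-ins for the 30 part-of-speech tags
POS_DIM = 4                     # stand-in for the POS vector dimension

# each tag gets one fixed random vector of small values close to zero
pos_vector_map = {tag: [random.uniform(-0.01, 0.01) for _ in range(POS_DIM)]
                  for tag in POS_TAGS}

assert len(pos_vector_map) == len(POS_TAGS)
assert all(len(v) == POS_DIM for v in pos_vector_map.values())
assert all(abs(x) <= 0.01 for v in pos_vector_map.values() for x in v)
```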
进一步,针对任一词,将其生成的词义向量以及词性向量输入至预设模型,执行步骤102。Further, for any word, the word meaning vector and the part-of-speech vector generated therefrom are input to the preset model, and step 102 is executed.
可选地,上述S101获取的词向量还可以只包括词义向量,即,将待识别文本中各个词的词义向量直接作为词向量输入至预设模型,执行步骤102。Optionally, the word vector obtained in S101 may also only include the word meaning vector, that is, the word meaning vector of each word in the text to be recognized is directly input as the word vector to the preset model, and step 102 is executed.
图2为本申请实施例提供的三元组抽取方法的另一种实现方式的流程示意图。如下:FIG. 2 is a schematic flowchart of another implementation manner of the triple extraction method provided by an embodiment of the application. as follows:
S201:获取词向量。S201: Obtain a word vector.
其中，词向量包括组成待识别文本的各个词生成的词向量，如上述实施例所述，该词向量包括词义向量和词性向量。获取方法可以参考上述实施例，在此不做赘述。The word vectors include the word vector generated for each word composing the text to be recognized; as described in the above embodiment, a word vector includes a word sense vector and a part-of-speech vector. For the acquisition method, refer to the above embodiment, which is not repeated here.
S202:将词向量输入预设模型,得到每一个词向量对应的识别结果。其中,识别结果包括该词向量的第一类概率,还包括该词向量的第二类概率。S202: Input the word vector into the preset model, and obtain the recognition result corresponding to each word vector. Wherein, the recognition result includes the first-type probability of the word vector, and also includes the second-type probability of the word vector.
具体地,图3示出了预设模型的结构示意图,预设模型包括编码器和解码器。Specifically, FIG. 3 shows a schematic structural diagram of a preset model, and the preset model includes an encoder and a decoder.
如图3所示，编码器可以为双向LSTM模型。待识别文本E由e1、e2、...、ek组成，针对其中任一词ej，将其生成的词义向量hj和词性向量aj输入至编码器，经过编码器将词义向量和词性向量编码得到特征向量Ej。该特征向量由该词的词义向量和词性向量拼接得到。As shown in Figure 3, the encoder can be a bidirectional LSTM model. The text E to be recognized is composed of e1, e2, ..., ek; for any word ej, its word sense vector hj and part-of-speech vector aj are input into the encoder, and the encoder encodes them into a feature vector Ej. The feature vector is obtained by concatenating the word sense vector and the part-of-speech vector of the word.
例如，词义向量的维度为100，词性向量的维度为20，经过编码器编码得到的特征向量的维度为120。需要说明的是，若待识别文本中词个数为m，将m个词向量输入至编码器可以得到m个120维的特征向量，将这些向量排列组成一个m*120的向量矩阵，可以根据需要通过补0将该矩阵扩充至特定长度。For example, if the dimension of the word sense vector is 100 and the dimension of the part-of-speech vector is 20, the dimension of the feature vector obtained by the encoder is 120. It should be noted that if the number of words in the text to be recognized is m, inputting the m word vectors into the encoder yields m 120-dimensional feature vectors; these vectors are arranged into an m*120 matrix, which can be extended to a specific length by zero-padding as needed.
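The concatenation and zero-padding steps can be sketched as follows (dimensions 3 + 2 = 5 stand in for 100 + 20 = 120; the BiLSTM encoding itself is omitted, so this only illustrates the input layout):

```python
def concat_and_pad(sense_vecs, pos_vecs, target_len):
    """Concatenate each word's sense vector with its POS vector into one
    feature row, then zero-pad the sequence to target_len rows."""
    feats = [s + p for s, p in zip(sense_vecs, pos_vecs)]
    dim = len(feats[0])
    feats += [[0.0] * dim] * (target_len - len(feats))
    return feats

sense = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]   # two words, sense dim 3
pos = [[0.01, 0.02], [0.03, 0.04]]           # POS dim 2
m = concat_and_pad(sense, pos, 4)
assert m[0] == [0.1, 0.2, 0.3, 0.01, 0.02]   # concatenated feature, dim 5
assert m[2] == [0.0] * 5 and len(m) == 4     # zero-padded to target length
```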
由上述各实施例可知,预设模型可以学习n个三元组关系,每个三元组关系可以包括两个对象。由于每一三元组关系中前一个对象为第一类对象,每一三元组关系中后一个对象为第二类对象,所以第一类对象个数为n,第二类对象个数为n。任一词向量的第二类概率表征生成该词向量的词属于第一类对象的概率和属于第二类对象的概率,即任一词的第二类概率包括两个概率值。It can be seen from the foregoing embodiments that the preset model can learn n triple relationships, and each triple relationship can include two objects. Since the first object in each triple relationship is the first type of object, and the last object in each triple relationship is the second type of object, the number of objects of the first type is n, and the number of objects of the second type is n. The second-type probability of any word vector represents the probability that the word generating the word vector belongs to the first-type object and the probability that it belongs to the second-type object, that is, the second-type probability of any word includes two probability values.
模型中的解码器包括第一解码模块和第二解码模块。The decoder in the model includes a first decoding module and a second decoding module.
其中,第一解码模块包括2*n个sigmoid函数,用于依据特征向量,确定第一类概率。第一类概率可以参照步骤S102中所述。由图3所示,以特征向量Ej为例,确定第一类概率的过程为:Among them, the first decoding module includes 2*n sigmoid functions, which are used to determine the probability of the first type according to the feature vector. The probability of the first type may refer to the description in step S102. As shown in Figure 3, taking the feature vector Ej as an example, the process of determining the probability of the first type is:
依据特征向量Ej，通过第2i-1（i=1、2、...、n）个sigmoid函数（即sig2i-1）输出该特征向量对应的词属于第i个三元组关系中的第一类对象的概率；依据特征向量Ej，通过第2i个sigmoid函数（即sig2i）输出该特征向量对应的词属于第i个三元组关系中的第二类对象的概率。Based on the feature vector Ej, the (2i-1)-th sigmoid function (i.e., sig2i-1, i = 1, 2, ..., n) outputs the probability that the word corresponding to the feature vector belongs to the first-type object of the i-th triple relationship, and the 2i-th sigmoid function (i.e., sig2i) outputs the probability that the word belongs to the second-type object of the i-th triple relationship.
第二解码模块包括2个sigmoid函数,用于依据特征向量,确定第二类概率。由图3所示,以特征向量Ej为例,确定第二类概率的过程为:The second decoding module includes two sigmoid functions, which are used to determine the second type of probability according to the feature vector. As shown in Figure 3, taking the feature vector Ej as an example, the process of determining the second type of probability is:
依据特征向量Ej通过第1个sigmoid函数(即sig1)输出该特征向量对应的词属于任一三元组关系中的第一类对象的概率。依据特征向量Ej通过第2个sigmoid函数(即sig2)输出该特征向量对应的词属于任一三元组关系中第二类对象的概率。According to the feature vector Ej, the first sigmoid function (ie, sig1) is used to output the probability that the word corresponding to the feature vector belongs to the first type of object in any triple relationship. According to the feature vector Ej, the second sigmoid function (ie, sig2) is used to output the probability that the word corresponding to the feature vector belongs to the second type of object in any triple relationship.
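Both decoding modules reduce to banks of sigmoid units applied to the feature vector. A toy pure-Python sketch (the weights are made-up constants standing in for trained parameters, not the patent's actual model):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_unit(feature, weights, bias):
    # one sigmoid unit: sigmoid(w . feature + b)
    return sigmoid(sum(f * w for f, w in zip(feature, weights)) + bias)

def decode(feature, first_module, second_module):
    """first_module: 2*n (weights, bias) pairs -> the per-relation
    first-type probabilities (sig1..sig2n of the first decoding module).
    second_module: 2 (weights, bias) pairs -> the 'any first-type object'
    and 'any second-type object' probabilities (second decoding module)."""
    first_type = [sigmoid_unit(feature, w, b) for w, b in first_module]
    second_type = [sigmoid_unit(feature, w, b) for w, b in second_module]
    return first_type, second_type

feat = [0.5, -0.2]                              # toy feature vector Ej
first = [([1.0, 0.0], 0.0), ([0.0, 1.0], 0.0)]  # n = 1 relation -> 2 units
second = [([2.0, 0.0], 0.0), ([0.0, 2.0], 0.0)]
p_first, p_second = decode(feat, first, second)
assert len(p_first) == 2 and len(p_second) == 2
assert all(0.0 < p < 1.0 for p in p_first + p_second)
```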
S203:在任意一个词满足预设条件的情况下,确定该词为目标词。S203: When any word satisfies the preset condition, determine that the word is the target word.
其中,对于任一词,预设条件包括该词属于第一类对象的概率或属于第二类对象的概率大于第二预设阈值。其中,第二预设阈值为依据模型的输出进行规则判定时设置的阈值,一般地,可以设该阈值的大小为0.5。即,当任意一个词生成的词向量对应的第一类概率表征该词属于对象的概率大于第一预设阈值,或当任意一个词生成的词向量对应的第二类概率表征该词属于第一类对象或第二类对象的概率大于第二预设阈值的情况下,确定该词为目标词。For any word, the preset condition includes that the probability of the word belonging to the first type of object or the probability of belonging to the second type of object is greater than the second preset threshold. Wherein, the second preset threshold value is a threshold value set when the rule is determined according to the output of the model, and generally, the size of the threshold value can be set to 0.5. That is, when the first-type probability corresponding to the word vector generated by any word indicates that the probability that the word belongs to the object is greater than the first preset threshold, or when the second-type probability corresponding to the word vector generated by any one word indicates that the word belongs to the first When the probability of the object of the first type or the object of the second type is greater than the second preset threshold, the word is determined to be the target word.
由于,第一类概率和第二类概率可以互相验证,则举例对本步骤确定目标词的具体方法进行说明,如下:Since the probability of the first type and the probability of the second type can be mutually verified, an example is given to illustrate the specific method of determining the target word in this step, as follows:
例如，模型学习的三元组关系为n个，分别为s1、s2、...、sn，词向量为m个，分别为w1、w2、...、wm。则，w1对应的识别结果可以包括第一类概率，分别为：生成w1的词属于s1中第一类对象的概率p11、属于s1中第二类对象的概率p12，属于s2中第一类对象的概率p21，属于s2中第二类对象的概率p22，...，属于sn中第一类对象的概率pn1、属于sn中第二类对象的概率pn2。w1对应的识别结果还包括第二类概率，分别为：生成w1的词属于任一第一类对象的概率p1和生成w1的词属于任一第二类对象的概率p2。For example, suppose the model learns n triple relationships s1, s2, ..., sn, and there are m word vectors w1, w2, ..., wm. Then the recognition result corresponding to w1 may include the first-type probabilities: the probability p11 that the word generating w1 belongs to the first-type object of s1, the probability p12 that it belongs to the second-type object of s1, the probability p21 that it belongs to the first-type object of s2, the probability p22 that it belongs to the second-type object of s2, ..., the probability pn1 that it belongs to the first-type object of sn, and the probability pn2 that it belongs to the second-type object of sn. The recognition result corresponding to w1 also includes the second-type probabilities: the probability p1 that the word generating w1 belongs to any first-type object and the probability p2 that it belongs to any second-type object.
其中,若p1大于0.5且p11大于0.5的情况下,确定生成w1的词属于s1中第一类对象,确定该词为目标词。若p1大于0.5且p11小于等于0.5的情况下,确定生成w1的词不属于s1中第一类对象。若p1小于等于0.5且p11小于等于0.5的情况下,确定生成w1的词不属于s1中第一类对象。Among them, if p1 is greater than 0.5 and p11 is greater than 0.5, it is determined that the word generating w1 belongs to the first type of object in s1, and the word is determined as the target word. If p1 is greater than 0.5 and p11 is less than or equal to 0.5, it is determined that the word generating w1 does not belong to the first type of object in s1. If p1 is less than or equal to 0.5 and p11 is less than or equal to 0.5, it is determined that the word generating w1 does not belong to the first type of object in s1.
可以理解的是，当p1大于0.5时，若任一pr1大于0.5，则确定生成w1的词属于sr中第一类对象，确定该词为目标词。当p2大于0.5时，若任一pr2大于0.5，则确定生成w1的词属于sr中第二类对象，确定该词为目标词。It should be understood that when p1 is greater than 0.5, if any pr1 is greater than 0.5, it is determined that the word generating w1 belongs to the first-type object of sr, and the word is determined to be a target word. When p2 is greater than 0.5, if any pr2 is greater than 0.5, it is determined that the word generating w1 belongs to the second-type object of sr, and the word is determined to be a target word.
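One plausible reading of the p1/p11 examples is that the second-type probability gates the per-relation first-type probability, both against the same threshold; a sketch under that assumption:

```python
SECOND_THRESHOLD = 0.5  # the second preset threshold from the text

def belongs_to_object(p_any, p_relation, threshold=SECOND_THRESHOLD):
    """p_any: second-type probability that the word is that object type in
    ANY relation (e.g. p1); p_relation: first-type probability for one
    specific relation (e.g. pr1). Both must exceed the threshold."""
    return p_any > threshold and p_relation > threshold

assert belongs_to_object(0.8, 0.7)      # p1 > 0.5 and p11 > 0.5 -> target word
assert not belongs_to_object(0.8, 0.4)  # p1 > 0.5 but p11 <= 0.5
assert not belongs_to_object(0.3, 0.7)  # p1 <= 0.5
```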
S204:将属于同一个三元组中对象的目标词对,按照该三元组中对象之间的关系,组成三元组。S204: Combine the target word pairs belonging to the objects in the same triplet to form a triplet according to the relationship between the objects in the triplet.
本步骤可以参考上述S104，本实施例在此不做赘述。For this step, refer to S104 above, which is not described in detail in this embodiment.
针对上述实施例所介绍的三元组抽取方法的实施方式,进一步对预设模型的训练过程进行介绍,图4示出了一种预设模型的训练过程示意图,包括:For the implementation of the triple extraction method introduced in the above embodiment, the training process of the preset model is further introduced. Figure 4 shows a schematic diagram of the training process of a preset model, including:
S401:将样本词向量输入待训练的预设模型,得到预设模型输出的识别结果。S401: Input the sample word vector into the preset model to be trained, and obtain the recognition result output by the preset model.
其中,样本词向量包括样本文本中的词的词向量。样本文本可以包括多条具有三元组关系的非结构化文本片段,样本文本片段的数量可以是几千至十几万。需要说明的是,为了更好的训练模型,可以根据待识别文本的所属领域获取样本文本,例如,待识别文本的所属领域为金融类,则可以获取属于金融类领域的文本为样本文本。Among them, the sample word vector includes the word vector of the word in the sample text. The sample text may include multiple unstructured text fragments having a triple relationship, and the number of sample text fragments may be several thousand to several hundred thousand. It should be noted that, in order to better train the model, sample texts can be obtained according to the field of the text to be recognized. For example, if the field of the text to be recognized is financial, the text belonging to the financial field can be obtained as sample text.
需要说明的是,针对任一样本文本中的词,获取该词的样本词向量的方法可以参考上述获取待识别文本中词的词向量的方法。预设模型输出该样本词向量的识别结果包括该样本词向量的第一类概率和第二类概率。具体地,第一类概率和第二类概率的含义可以参考上述S202的介绍。本申请实施例对此不做赘述。It should be noted that, for any word in the sample text, the method for obtaining the sample word vector of the word can refer to the above method for obtaining the word vector of the word in the text to be recognized. The recognition result of the sample word vector output by the preset model includes the first type probability and the second type probability of the sample word vector. Specifically, for the meaning of the first type of probability and the second type of probability, reference may be made to the introduction of S202 above. This is not repeated in the embodiment of the application.
S402:使用标记信息中的样本第一类概率与第一类概率的差异,以及标记信息中的样本第二类概率与第二类概率的差异,得到预设模型的参数。S402: Use the difference between the first-type probability and the first-type probability of the sample in the marking information and the difference between the second-type probability and the second-type probability of the sample in the marking information to obtain the parameters of the preset model.
具体地，针对任一段样本文本，由人工标记该段样本文本中包括的所有三元组。例如样本文本F为：“张小明于2018年迎娶了李小红”，针对该样本文本，人工对其标记三元组[张小明-丈夫-李小红]和[李小红-妻子-张小明]。Specifically, for any piece of sample text, all the triples contained in it are manually annotated. For example, the sample text F is: "张小明于2018年迎娶了李小红" ("Zhang Xiaoming married Li Xiaohong in 2018"); for this sample text, the triples [Zhang Xiaoming-husband-Li Xiaohong] and [Li Xiaohong-wife-Zhang Xiaoming] are manually annotated.
一个词的样本第一类概率表征该词属于三元组包含的对象的概率，该词的样本第二类概率表征该词分别属于第一类对象和第二类对象的概率。The sample first-type probability of a word represents the probability that the word belongs to an object contained in a triple; the sample second-type probability of the word represents the probabilities that the word belongs to a first-type object and to a second-type object, respectively.
其中，对象包括所有预先配置需要模型学习的n个三元组关系中包含的对象。该n个三元组关系可以包括待识别文本中所有可能包含的三元组关系。可以理解的是，待识别文本所属领域不同，则其中可能包括的三元组关系会有所不同，所以可以根据待识别文本所属领域预先配置n个三元组关系。The objects include all the objects contained in the n pre-configured triple relationships that the model needs to learn. The n triple relationships may include all triple relationships possibly contained in the text to be recognized. It should be understood that texts from different fields may contain different triple relationships, so the n triple relationships can be pre-configured according to the field of the text to be recognized.
标记信息指的是根据人工标记的三元组得到的样本文本中各个词的样本第一类概率和样本第二类概率。获取该样本第一类概率和样本第二类概率的方法为现有技术,本申请实施例仅以下述的实例对此进行简单介绍。The labeling information refers to the sample first-type probability and sample second-type probability of each word in the sample text obtained according to the manually labeled triples. The method for obtaining the probability of the first type of the sample and the probability of the second type of the sample is the prior art, and the embodiment of the present application only briefly introduces this with the following examples.
以上述样本文本F为例,预设的该模型需要学习的三元组关系为A:[人物-丈夫-人物]、B:[人物-妻子-人物]和C:[书籍-作者-人物]。根据标记的三元组[张小明-丈夫-李小红]和[李小红-妻子-张小明]。Taking the above sample text F as an example, the preset triple relationship that the model needs to learn is A: [person-husband-person], B: [person-wife-person] and C: [book-author-person] . According to the labeled triples [Zhang Xiaoming-husband-Li Xiaohong] and [Li Xiaohong-wife-Zhang Xiaoming].
可以通过样本文本F中的每个词是否属于标记的三元组中包含的对象来确定该词的样本第一类概率。以“张小明”为例,该词属于A中的第一类对象,属于B中的第二类对象。所以该词对应的样本第一类概率中表征属于A中的第一类对象的概率为1,表征属于A中的第二类对象的概率为0,表征属于B中的第一类对象的概率为0,表征属于B中的第二类对象的概率为1,表征属于C中的第一类对象的概率为0,表征属于C中的第二类对象的概率为0。再例如,词“迎娶”不属于上述标记的三元组中任一对象,所以该词的样本第一类概率中的各个概率均为0。The sample first-type probability of the word can be determined by whether each word in the sample text F belongs to the object contained in the marked triplet. Take "Zhang Xiaoming" as an example. The word belongs to the first type of object in A and the second type of object in B. Therefore, in the first-type probability of the word corresponding to the sample, the probability that it belongs to the first-type object in A is 1, the probability that it belongs to the second-type object in A is 0, and the probability that it belongs to the first-type object in B If it is 0, the probability that it belongs to the second type of object in B is 1, the probability that it is the first type of object in C is 0, and the probability that it is the second type of object in C is 0. For another example, the word "marry" does not belong to any object in the above-mentioned marked triplet, so each probability in the first-type probability of the sample of the word is 0.
In addition, the sample second-type probability of each word in sample text F can be determined by whether the word belongs to any first-type object in the labeled triples and whether it belongs to any second-type object in the labeled triples. Taking "Zhang Xiaoming" as an example: the word is the first-type object in A, so the probability that it belongs to any first-type object is 1; and because it is also the second-type object in B, the probability that it belongs to any second-type object is 1.
It should be noted that the above method of determining the sample second-type probability is: if a word belongs to the first-type object of any labeled triple, the word is determined to belong to the first-type object of some triple relationship; if it belongs to the second-type object of any labeled triple, it is determined to belong to the second-type object of some triple relationship.
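As an illustrative sketch only (not part of the application; the relation list, the triple layout, and the helper name `sample_labels` are assumptions introduced here), the labeling scheme described above can be expressed as follows, producing a 2*n-dimensional sample first-type label vector and a 2-dimensional sample second-type label vector per word:

```python
# Hypothetical sketch of the labeling scheme; names and layout are assumptions.
# Relations: A [person-husband-person], B [person-wife-person], C [book-author-person].
RELATIONS = ["A", "B", "C"]
# Labeled triples as (first-type object, relation, second-type object).
TRIPLES = [("Zhang Xiaoming", "A", "Li Xiaohong"),
           ("Li Xiaohong", "B", "Zhang Xiaoming")]

def sample_labels(word):
    """Return (first-type labels of length 2*n, second-type labels of length 2)."""
    n = len(RELATIONS)
    first = [0.0] * (2 * n)   # slots: [A-obj1, A-obj2, B-obj1, B-obj2, C-obj1, C-obj2]
    second = [0.0, 0.0]       # [is any first-type object, is any second-type object]
    for subj, rel, obj in TRIPLES:
        i = RELATIONS.index(rel)
        if word == subj:      # word is the first-type object of this triple
            first[2 * i] = 1.0
            second[0] = 1.0
        if word == obj:       # word is the second-type object of this triple
            first[2 * i + 1] = 1.0
            second[1] = 1.0
    return first, second
```

With the two labeled triples above, `sample_labels("Zhang Xiaoming")` gives first-type labels [1, 0, 0, 1, 0, 0] and second-type labels [1, 1], matching the worked example, while a word such as "marry" gets all zeros.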
Using the difference between the sample first-type probability in the labeling information and the predicted first-type probability, and the difference between the sample second-type probability and the predicted second-type probability, the parameters of the preset model are determined iteratively; after multiple parameter updates, the trained model is obtained.
As described above, the preset model in the embodiments of the present application can output the first-type probability and the second-type probability of each word, where the first-type probability represents the probability that the word belongs to an object contained in a triple, and the second-type probability represents the probability that the word belongs to any first-type object and the probability that it belongs to any second-type object. The two types of probabilities can therefore verify each other when determining target words, further improving the accuracy of the triple extraction method.
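To make the mutual verification concrete, here is a minimal sketch (the threshold values, the even/odd slot convention for first-/second-type objects, and the function name are illustrative assumptions, not the application's specification):

```python
def is_target_word(first_probs, second_probs, t1=0.5, t2=0.5):
    """A word is a target word only if some per-relation object probability
    exceeds the first threshold AND the corresponding aggregate probability
    (any first- or any second-type object) exceeds the second threshold,
    so the two outputs verify each other."""
    # first_probs: 2*n per-relation object probabilities
    # second_probs: [P(any first-type object), P(any second-type object)]
    for i, p in enumerate(first_probs):
        agg = second_probs[0] if i % 2 == 0 else second_probs[1]
        if p > t1 and agg > t2:
            return True
    return False
```

For example, a high per-relation score that the second-type output does not corroborate is rejected, which is the accuracy benefit described above.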
An embodiment of the present application further provides a triple extraction apparatus, described below; the triple extraction apparatus described below and the triple extraction method described above may be referred to correspondingly.
Referring to FIG. 5, which shows a schematic structural diagram of a triple extraction apparatus provided by an embodiment of the present application, the apparatus may include:
a word vector obtaining unit 501, configured to obtain the word vector of each word composing the text to be recognized;
a model prediction unit 502, configured to input the word vectors into a preset model to obtain a recognition result corresponding to each word vector, the recognition result including the first-type probability of the word vector, where the first-type probability of any word vector represents the probability that the word corresponding to the word vector belongs to an object contained in a triple;
a target word determining unit 503, configured to determine any word that satisfies a preset condition as a target word, the preset condition including that the probability that the word belongs to any object in a triple is greater than a first preset threshold; and
a triple determining unit 504, configured to compose triples from target word pairs that belong to objects in the same triple, according to the relationship between the objects in that triple.
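How a unit such as 504 might compose triples can be sketched as follows (the data layout, the function name, and the predicate strings are assumptions; the application does not prescribe an implementation):

```python
def compose_triples(targets_by_relation, predicates):
    """targets_by_relation maps a relation id to (first-type words, second-type words);
    predicates maps a relation id to its predicate string, e.g. "husband".
    Every first-type/second-type target word pair of a relation forms a triple."""
    triples = []
    for rel, (subjects, objects) in targets_by_relation.items():
        for s in subjects:
            for o in objects:
                triples.append((s, predicates[rel], o))
    return triples
```

For instance, pairing the target words of relation A would yield ("Zhang Xiaoming", "husband", "Li Xiaohong").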
Optionally, the word vector includes at least one of a word-sense vector and a part-of-speech vector.
Optionally, the objects contained in a triple relationship include first-type objects and second-type objects, and the recognition result further includes:
the second-type probability of the word vector, where the second-type probability of any word vector represents the probability that the word corresponding to the word vector belongs to a first-type object and the probability that it belongs to a second-type object.
Optionally, the preset condition further includes:
the probability that the word belongs to a first-type object, or the probability that it belongs to a second-type object, being greater than a second preset threshold.
Optionally, the preset model includes an encoder and a decoder;
the encoder is configured to obtain a feature vector from the word vectors;
the decoder includes a first decoding module and a second decoding module;
wherein the first decoding module is configured to determine the first-type probability according to the feature vector, and the second decoding module is configured to determine the second-type probability according to the feature vector.
Optionally, the first decoding module includes 2*n sigmoid functions, where n is the number of triple relationships learned by the preset model, and the second decoding module includes 2 sigmoid functions.
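A minimal sketch of such a decoder head follows (plain linear-plus-sigmoid layers; the weight layout, the omission of biases, and the function name are simplifying assumptions, not the application's architecture):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def decode(feature, W1, W2, n):
    """First decoding module: 2*n sigmoid outputs (per-relation object probabilities).
    Second decoding module: 2 sigmoid outputs (any first-/second-type object).
    W1 is a list of 2*n weight rows, W2 a list of 2 weight rows; biases omitted."""
    dot = lambda w, f: sum(wi * fi for wi, fi in zip(w, f))
    first = [sigmoid(dot(w, feature)) for w in W1]   # length 2*n
    second = [sigmoid(dot(w, feature)) for w in W2]  # length 2
    assert len(first) == 2 * n
    return first, second
```

With zero weights, every sigmoid output is 0.5, which illustrates the output shapes: 2*n first-type probabilities and 2 second-type probabilities per feature vector.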
Optionally, the apparatus further includes a preset model training module, configured to train the preset model, and specifically configured to:
input sample word vectors into the preset model to be trained to obtain output recognition results, the recognition results including the first-type probability and the second-type probability of each sample word vector, where the sample word vectors include the word vectors of the words in a sample text; and
obtain the parameters of the preset model using the difference between the sample first-type probability in the labeling information and the predicted first-type probability, and the difference between the sample second-type probability in the labeling information and the predicted second-type probability, where the sample first-type probability and the sample second-type probability are determined according to the triples labeled in the sample text.
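The "difference" driving the parameter updates is not specified further; a common concrete choice for multi-sigmoid outputs is binary cross-entropy, sketched here under that assumption (function names are hypothetical):

```python
import math

def label_loss(pred_first, sample_first, pred_second, sample_second, eps=1e-12):
    """Sum of binary cross-entropies between predicted probabilities and the
    sample (label) probabilities; gradients of this loss would drive the
    iterative parameter updates described above."""
    def bce(preds, labels):
        return -sum(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
                    for p, y in zip(preds, labels))
    return bce(pred_first, sample_first) + bce(pred_second, sample_second)
```

Predictions matching the labels give a near-zero loss, while confident wrong predictions are penalized heavily, which is the behavior the iterative update relies on.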
The triple extraction apparatus includes a processor and a memory. The word vector obtaining unit 501, the model prediction unit 502, the target word determining unit 503, the triple determining unit 504, and so on are all stored in the memory as program units, and the processor executes the above program units stored in the memory to implement the corresponding functions.
The processor contains a kernel, and the kernel retrieves the corresponding program unit from the memory. One or more kernels may be provided, and the accuracy of triple extraction can be improved by adjusting kernel parameters.
An embodiment of the present invention provides a storage medium on which a program is stored; when the program is executed by a processor, the triple extraction method is implemented.
An embodiment of the present invention provides a processor configured to run a program, wherein the triple extraction method is executed when the program runs.
An embodiment of the present application further provides a triple extraction device. Referring to FIG. 6, which shows a schematic structural diagram of the triple extraction device 60, the device includes at least one processor 601, at least one memory 602 connected to the processor, and a bus 603, where the processor 601 and the memory 602 communicate with each other through the bus 603, and the processor is configured to call the program instructions in the memory to execute the above triple extraction method. The device herein may be a server, a PC, a PAD, a mobile phone, or the like.
The present application also provides a computer program product which, when executed on a data processing device, is adapted to execute a program initialized with the following method steps:
obtaining the word vector of each word composing the text to be recognized;
inputting the word vectors into a preset model to obtain a recognition result corresponding to each word vector, the recognition result including the first-type probability of the word vector, where the first-type probability of any word vector represents the probability that the word corresponding to the word vector belongs to an object contained in a triple relationship;
when any word satisfies a preset condition, determining the word as a target word, the preset condition including that the probability that the word belongs to any object in a triple relationship is greater than a first preset threshold; and
composing triples from target word pairs that belong to objects in the same triple relationship, according to the relationship between the objects in that triple relationship.
Optionally, the word vector includes at least one of a word-sense vector and a part-of-speech vector.
Optionally, the objects contained in a triple relationship include first-type objects and second-type objects, and the recognition result further includes:
the second-type probability of the word vector, where the second-type probability of any word vector represents the probability that the word corresponding to the word vector belongs to a first-type object and the probability that it belongs to a second-type object.
Optionally, the preset condition further includes:
the probability that the word belongs to a first-type object, or the probability that it belongs to a second-type object, being greater than a second preset threshold.
Optionally, the preset model includes an encoder and a decoder;
the encoder is configured to obtain a feature vector from the word vectors;
the decoder includes a first decoding module and a second decoding module;
wherein the first decoding module is configured to determine the first-type probability according to the feature vector, and the second decoding module is configured to determine the second-type probability according to the feature vector.
Optionally, the first decoding module includes 2*n sigmoid functions, where n is the number of triple relationships learned by the preset model, and the second decoding module includes 2 sigmoid functions.
Optionally, the training process of the preset model includes:
inputting sample word vectors into the preset model to be trained to obtain output recognition results, the recognition results including the first-type probability and the second-type probability of each sample word vector, where the sample word vectors include the word vectors of the words in a sample text; and
obtaining the parameters of the preset model using the difference between the sample first-type probability in the labeling information and the predicted first-type probability, and the difference between the sample second-type probability in the labeling information and the predicted second-type probability, where the sample first-type probability and the sample second-type probability are determined according to the triples labeled in the sample text.
The present application is described with reference to flowcharts and/or block diagrams of the method, device (system), and computer program product according to the embodiments of the present application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks therein, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
In a typical configuration, the device includes one or more processors (CPUs), a memory, and a bus. The device may also include an input/output interface, a network interface, and so on.
The memory may include forms of computer-readable media such as non-persistent memory, random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM); the memory includes at least one memory chip. The memory is an example of a computer-readable medium.
Computer-readable media include persistent and non-persistent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission media that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory media, such as modulated data signals and carrier waves.
It should also be noted that the terms "comprise", "include", or any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "comprising a..." does not exclude the existence of other identical elements in the process, method, article, or device that includes the element.
Those skilled in the art should understand that the embodiments of the present application may be provided as a method, a system, or a computer program product. Therefore, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
The above are merely embodiments of the present application and are not intended to limit it. Various modifications and changes may be made to the present application by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present application shall fall within the scope of the claims of the present application.

Claims (10)

  1. A triple extraction method, characterized in that it comprises:
    obtaining the word vector of each word composing the text to be recognized;
    inputting the word vectors into a preset model to obtain a recognition result corresponding to each word vector, the recognition result including the first-type probability of the word vector, where the first-type probability of any word vector represents the probability that the word corresponding to the word vector belongs to an object contained in a triple relationship;
    when any word satisfies a preset condition, determining the word as a target word, the preset condition including that the probability that the word belongs to any object in a triple relationship is greater than a first preset threshold; and
    composing triples from target word pairs that belong to objects in the same triple relationship, according to the relationship between the objects in that triple relationship.
  2. The triple extraction method according to claim 1, wherein the word vector includes at least one of a word-sense vector and a part-of-speech vector.
  3. The triple extraction method according to claim 1, wherein the objects contained in the triple relationship include first-type objects and second-type objects, and the recognition result further includes:
    the second-type probability of the word vector, where the second-type probability of any word vector represents the probability that the word corresponding to the word vector belongs to a first-type object and the probability that it belongs to a second-type object.
  4. The triple extraction method according to claim 3, wherein the preset condition further includes:
    the probability that the word belongs to a first-type object, or the probability that it belongs to a second-type object, being greater than a second preset threshold.
  5. The triple extraction method according to claim 4, wherein the preset model includes an encoder and a decoder;
    the encoder is configured to obtain a feature vector from the word vectors;
    the decoder includes a first decoding module and a second decoding module;
    wherein the first decoding module is configured to determine the first-type probability according to the feature vector, and the second decoding module is configured to determine the second-type probability according to the feature vector.
  6. The triple extraction method according to claim 5, wherein the first decoding module includes 2*n sigmoid functions, where n is the number of triple relationships learned by the preset model, and the second decoding module includes 2 sigmoid functions.
  7. The triple extraction method according to claim 5 or 6, wherein the training process of the preset model includes:
    inputting sample word vectors into the preset model to be trained to obtain output recognition results, the recognition results including the first-type probability and the second-type probability of each sample word vector, where the sample word vectors include the word vectors of the words in a sample text; and
    obtaining the parameters of the preset model using the difference between the sample first-type probability in the labeling information and the predicted first-type probability, and the difference between the sample second-type probability in the labeling information and the predicted second-type probability, where the sample first-type probability and the sample second-type probability are determined according to the triples labeled in the sample text.
  8. A triple extraction apparatus, characterized in that it comprises:
    a word vector obtaining unit, configured to obtain the word vector of each word composing the text to be recognized;
    a model prediction unit, configured to input the word vectors into a preset model to obtain a recognition result corresponding to each word vector, the recognition result including the first-type probability of the word vector, where the first-type probability of any word vector represents the probability that the word corresponding to the word vector belongs to an object contained in a triple relationship;
    a target word determining unit, configured to determine any word that satisfies a preset condition as a target word, the preset condition including that the probability that the word belongs to any object in a triple relationship is greater than a first preset threshold; and
    a triple determining unit, configured to compose triples from target word pairs that belong to objects in the same triple relationship, according to the relationship between the objects in that triple relationship.
  9. A triple extraction device, characterized in that it comprises: a memory and a processor;
    the memory being configured to store a program; and
    the processor being configured to execute the program to implement the steps of the triple extraction method according to any one of claims 1 to 7.
  10. A storage medium on which a program is stored, characterized in that, when the program is executed by a processor, the steps of the triple extraction method according to any one of claims 1 to 7 are implemented.
PCT/CN2020/103209 2019-09-30 2020-07-21 3-tuple extraction method, device, apparatus, and storage medium WO2021063086A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910942438.4 2019-09-30
CN201910942438.4A CN112668332A (en) 2019-09-30 2019-09-30 Triple extraction method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
WO2021063086A1 true WO2021063086A1 (en) 2021-04-08

Family

ID=75336763

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/103209 WO2021063086A1 (en) 2019-09-30 2020-07-21 3-tuple extraction method, device, apparatus, and storage medium

Country Status (2)

Country Link
CN (1) CN112668332A (en)
WO (1) WO2021063086A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104866498A (en) * 2014-02-24 2015-08-26 华为技术有限公司 Information processing method and device
CN108021705A (en) * 2017-12-27 2018-05-11 中科鼎富(北京)科技发展有限公司 A kind of answer generation method and device
CN110196913A (en) * 2019-05-23 2019-09-03 北京邮电大学 Multiple entity relationship joint abstracting method and device based on text generation formula
WO2019173085A1 (en) * 2018-03-06 2019-09-12 Microsoft Technology Licensing, Llc Intelligent knowledge-learning and question-answering

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107783960B (en) * 2017-10-23 2021-07-23 百度在线网络技术(北京)有限公司 Method, device and equipment for extracting information
CN108280062A (en) * 2018-01-19 2018-07-13 北京邮电大学 Entity based on deep learning and entity-relationship recognition method and device


Also Published As

Publication number Publication date
CN112668332A (en) 2021-04-16


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20871994

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20871994

Country of ref document: EP

Kind code of ref document: A1