CN110032650B - Training sample data generation method and device and electronic equipment - Google Patents

Training sample data generation method and device and electronic equipment

Info

Publication number
CN110032650B
CN110032650B (application CN201910312576.4A)
Authority
CN
China
Prior art keywords
word
matching
target
training sample
statement
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910312576.4A
Other languages
Chinese (zh)
Other versions
CN110032650A (en)
Inventor
郑孙聪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN201910312576.4A priority Critical patent/CN110032650B/en
Publication of CN110032650A publication Critical patent/CN110032650A/en
Application granted granted Critical
Publication of CN110032650B publication Critical patent/CN110032650B/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367Ontology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Abstract

The invention discloses a method and an apparatus for generating training sample data, and an electronic device, wherein the method for generating training sample data comprises the following steps: determining triple data with the same relation in a knowledge graph to obtain a plurality of relation triple sets; acquiring matching sentences in a corpus corresponding to the triple data in the plurality of relation triple sets to obtain a matching sentence set corresponding to each relation triple set; determining feature words of the relation corresponding to each relation triple set according to the matching sentences in the matching sentence set; and acquiring target matching sentences matched with the feature words from the matching sentence set to obtain training sample data. The method filters the matching sentences in the matching sentence sets based on the feature words of the relations in the knowledge graph, avoids introducing a large amount of noise data into the training sample data, improves the quality of the training sample data, and ensures the reliability of the extraction model trained on the training sample data as well as its training speed.

Description

Training sample data generation method and device and electronic equipment
Technical Field
The invention relates to the technical field of computers, in particular to a training sample data generation method and device and electronic equipment.
Background
A knowledge graph, in essence, is a semantic network that describes the relationships between various entities or concepts existing in the real world, with the entities or concepts as nodes of the semantic network and the relationships as edges connecting the nodes. With the rapid development of artificial intelligence technology, knowledge graphs have become important data resources for intelligent tools such as question-answering systems and search engines.
The basic composition unit of the knowledge graph is triple data, generally expressed in the form (head entity, relation, tail entity), where the head entity is a subject and the tail entity is an object, as in the triple data (Zhang San, wife, Li Si). Most triple data is embedded in unstructured sentences, so it is important to construct an effective and reliable extraction model to extract the corresponding triple data from the sentences.
When the extraction model is constructed, training sample data is required to train it, and the quality of the training sample data is crucial to the effectiveness and reliability of the extraction model. Prior-art methods for acquiring training sample data often introduce a large amount of noise data, so that the quality of the training sample data is low; as a result, the extraction model trained on such data has a large error, and its training speed is reduced.
Disclosure of Invention
In order to solve the problems in the prior art, embodiments of the present invention provide a method and an apparatus for generating training sample data, and an electronic device. The technical scheme is as follows:
in one aspect, a method for generating training sample data is provided, where the method includes:
determining triple data with the same relation in the knowledge graph to obtain a plurality of relation triple sets;
acquiring matching sentences corresponding to the triple data in the plurality of relation triple sets in the corpus to obtain a matching sentence set corresponding to each relation triple set;
determining the feature words of the relation corresponding to each relation triple set according to the matching sentences in the matching sentence set;
and acquiring a target matching statement matched with the feature word from the matching statement set to obtain training sample data.
In another aspect, an apparatus for generating training sample data is provided, the apparatus including:
the first determining module is used for determining the triple group data with the same relation in the knowledge graph to obtain a plurality of relation triple group sets;
the first acquisition module is used for acquiring matching sentences corresponding to the triple data in the plurality of relation triple sets in the corpus to obtain a matching sentence set corresponding to each relation triple set;
the second determining module is used for determining the feature words of the relation corresponding to each relation triple set according to the matching sentences in the matching sentence set;
and the second acquisition module is used for acquiring the target matching sentences matched with the feature words from the matching sentence set to obtain training sample data.
In another aspect, an electronic device is provided, which includes a processor and a memory, where the memory stores at least one instruction, at least one program, a code set, or a set of instructions, and the at least one instruction, the at least one program, the code set, or the set of instructions are loaded and executed by the processor to implement the method for generating training sample data.
According to the embodiment of the invention, triple data with the same relation in the knowledge graph are determined to obtain a plurality of relation triple sets, matching sentences corresponding to the triple data in the plurality of triple sets in the corpus are obtained to obtain a matching sentence set corresponding to each relation triple set, feature words corresponding to the relation in each relation triple set are determined according to the matching sentences in the matching sentence set, and target matching sentences matched with the feature words are obtained from the matching sentence set to obtain training sample data, so that the matching sentences in the matching sentence sets are filtered based on the feature words of the relation, a large amount of noise data is prevented from being introduced into the training sample data, the quality of the training sample data is improved, and the reliability of an extraction model obtained based on the training sample data and the training speed of the extraction model are ensured.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings required to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the description below are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without creative efforts.
Fig. 1 is a schematic flowchart of a method for generating training sample data according to an embodiment of the present invention;
fig. 2 is a flowchart illustrating a method for determining feature words of a relationship corresponding to each relationship triplet set according to matching statements in the matching statement set according to an embodiment of the present invention;
fig. 3 is a flowchart illustrating a method for determining the degree of association between the relation corresponding to a word set and a first word in the word set according to first matching sentences including the first word in the matching sentence set according to an embodiment of the present invention;
FIG. 4 is a flowchart illustrating a method for determining a second weight of a first word in the word set according to first matching sentences including the first word in the matching sentence set according to an embodiment of the present invention;
fig. 5 is a schematic flowchart of another method for generating training sample data according to an embodiment of the present invention;
fig. 6 is a schematic flowchart of a method for obtaining training sample data from the matching statement set by obtaining a target matching statement matching the first target feature word according to the embodiment of the present invention;
fig. 7 is a schematic structural diagram of an apparatus for generating training sample data according to an embodiment of the present invention;
fig. 8 is a schematic structural diagram of a second determining module according to an embodiment of the present invention;
FIG. 9 is a schematic structural diagram of a fourth determining module according to an embodiment of the present invention;
fig. 10 is a schematic structural diagram of a second obtaining module according to an embodiment of the present invention;
fig. 11 is a schematic structural diagram of a fifth obtaining module according to an embodiment of the present invention;
fig. 12 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be described in detail with reference to the accompanying drawings.
The knowledge graph, as an important data resource, has triple data as its basic composition unit. Most triple data is extracted from unstructured sentences by an effective extraction model; the error of the extraction model is decisive for the accuracy of the extracted triple data, and the quality of the training sample data determines the effectiveness and reliability of the trained extraction model.
A method for generating training sample data is to use known triple data in the knowledge graph to match sentences in the corpus: if a sentence contains both the subject (i.e., the head entity) and the object (i.e., the tail entity) of the triple data, the sentence and the corresponding triple data form one item of training sample data. For example, (Zhang San, wife, Li Si) is triple data in the knowledge graph that can be matched to two sentences in the corpus: sentence 1, "In xxxx, Zhang San went to City A with the Hong Kong star football team and met Li Si", and sentence 2, "On xx xx, xxxx, Zhang San and Li Si registered their marriage in Las Vegas, USA". According to the existing method for generating training sample data, sentences 1 and 2 both contain Zhang San and Li Si, so both can be used as training sample data for training the extraction model. However, although sentence 1 contains Zhang San and Li Si simultaneously, it contains no description of any couple relationship between them; that is, the existing method for generating training sample data introduces noise data into the training sample data, the quality of the training sample data is low, the extraction model trained on such data has a large error, and its training speed is reduced.
In view of this, an embodiment of the present invention provides a method for generating training sample data, where the method for generating training sample data is applicable to a device for generating training sample data according to an embodiment of the present invention, and the device for generating training sample data may be configured in an electronic device, where the electronic device may be a terminal or a server. The terminal can be a hardware device with various operating systems, such as a smart phone, a desktop computer, a tablet computer, a notebook computer, and the like. The server may comprise a server operating independently, or a distributed server, or a server cluster consisting of a plurality of servers.
Referring to fig. 1, which is a schematic flow chart illustrating a method for generating training sample data according to an embodiment of the present invention, it should be noted that the present specification provides the method operation steps as described in the embodiment or the flow chart, but more or less operation steps may be included based on conventional or non-creative labor. The order of steps recited in the embodiments is merely one manner of performing the steps in a multitude of sequences, and does not represent a unique order of performance. In actual system or product execution, sequential execution or parallel execution (e.g., parallel processor or multi-threaded environment) may be possible according to the embodiments or methods shown in the figures. Specifically, as shown in fig. 1, the method includes:
s101, determining the triple data with the same relation in the knowledge graph to obtain a plurality of relation triple sets.
In the embodiment of the present specification, the representation of the triple data in the knowledge graph is (head entity, relation, tail entity), wherein the head entity is a subject, the tail entity is an object, and the relation represents the relationship between the head entity and the tail entity. For example, "wife" in (Zhang San, wife, Li Si) and "birth place" in (Zhang San, birth place, Hong Kong) are the relations of the two triples.
In the embodiment of the specification, triple data with the same relation in the knowledge graph is grouped into one relation triple set, so that a plurality of relation triple sets can be obtained. For example, (Zhang San, spouse, Li Si) and (Guo Jing, spouse, Huang Rong) have the same relation "spouse" and can thus both be placed in the relation triple set belonging to the relation "spouse".
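To make step S101 concrete, the grouping can be sketched in a few lines of Python (function and variable names here are illustrative, not part of the patent):

```python
from collections import defaultdict

def group_by_relation(triples):
    """Group knowledge-graph triples into relation triple sets, keyed by relation."""
    groups = defaultdict(list)
    for head, relation, tail in triples:
        groups[relation].append((head, relation, tail))
    return dict(groups)

# Toy knowledge graph using the names from the examples above
triples = [
    ("Zhang San", "spouse", "Li Si"),
    ("Guo Jing", "spouse", "Huang Rong"),
    ("Zhang San", "birthplace", "Hong Kong"),
]
relation_sets = group_by_relation(triples)
```

Each key of `relation_sets` is one relation; its value is the corresponding relation triple set.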
S103, obtaining matching sentences corresponding to the triple data in the multiple relation triple sets in the corpus, and obtaining a matching sentence set corresponding to each relation triple set.
In the present embodiment, the corpus refers to a large-scale electronic text library that has been scientifically sampled and processed, storing language material that actually appears in practical language use, such as literature and journal texts. The corpus includes a plurality of sentences, each sentence including at least one word; that is, a sentence consists of one word or a group of syntactically related words.
In the embodiment of the present specification, a matching statement corresponding to triple data is a sentence in the corpus that contains both the head entity and the tail entity of the triple data. For example, the sentence "Zhang San's wife is Li Si" in the corpus contains both the head entity "Zhang San" and the tail entity "Li Si", so it is a matching statement corresponding to (Zhang San, spouse, Li Si). In practical application, each item of triple data may be matched to multiple matching statements in the corpus, that is, each item of triple data corresponds to one matching statement subset, and the matching statement subsets corresponding to the triple data in a relation triple set constitute the matching statement set of that relation triple set.
In some embodiments, before step 103, matching statements corresponding to each triplet data in the knowledge graph may be obtained from the corpus, to obtain a matching statement subset corresponding to each triplet data, and the triplet data and the matching statement subsets are stored in the designated storage space in a one-to-one correspondence manner, so when step 103 is executed, matching statements corresponding to triplet data in the multiple relational triplet sets in the corpus may be obtained from the designated storage space, to obtain a matching statement set corresponding to each relational triplet set.
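The matching in step S103 can be sketched as a simple containment test over the corpus (a real system would use an index rather than a linear scan; the names here are illustrative):

```python
def match_sentences(triple, corpus):
    """Return the matching statement subset of a triple: sentences that
    contain both its head entity and its tail entity."""
    head, _, tail = triple
    return [s for s in corpus if head in s and tail in s]

corpus = [
    "Zhang San and Li Si registered their marriage in Las Vegas.",
    "Zhang San arrived in City A with the football team.",
]
matches = match_sentences(("Zhang San", "spouse", "Li Si"), corpus)
```

Note that this match is purely lexical: a sentence can contain both entities without expressing the relation, which is exactly the noise the feature-word filter later removes.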
And S105, determining the feature words of the relation corresponding to each relation triple set according to the matching sentences in the matching sentence set.
In the embodiment of the present specification, the feature words are words characterizing a relation in the triple data. If the relation is "couple", the feature words for characterizing "couple" may include marriage, marry, wedding, wife, husband, and the like.
In an embodiment of this specification, the determining, according to a matching statement in the matching statement set, a feature word of a relationship corresponding to each relationship triplet set may employ a method shown in fig. 2, where the method may include:
s201, performing word segmentation processing on the matching sentences in the matching sentence set to obtain a word set of a corresponding relation three-tuple set.
Specifically, an existing word segmentation tool may be used to perform word segmentation on each matching statement in the matching statement set, so as to obtain a word set of a relation triplet set corresponding to the matching statement set, where the word segmentation tool may include, but is not limited to, an AnsjSeg tool and an IKAnalyzer tool. In order to improve the effectiveness of the words in the word set, the stop words obtained can be removed by using the stop word bank in the word segmentation processing process.
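A minimal stand-in for step S201 (the patent names Java tools such as AnsjSeg and IKAnalyzer; the naive regex tokenizer below is only an illustrative placeholder):

```python
import re

STOPWORDS = {"the", "and", "in", "a", "of"}  # stand-in stop-word bank

def tokenize(sentence, stopwords=STOPWORDS):
    """Naive word segmentation with stop-word removal; a production
    system would use a real segmenter such as AnsjSeg or IKAnalyzer."""
    words = re.findall(r"[A-Za-z]+", sentence.lower())
    return [w for w in words if w not in stopwords]

def word_set(matching_sentences):
    """Word set of a relation triple set: the distinct words across its matching sentences."""
    words = set()
    for s in matching_sentences:
        words.update(tokenize(s))
    return words
```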
S203, determining a first weight of the words in the word set according to the frequency of the words in the word set.
In the embodiment of the specification, on one hand, the matching statement set of a relation triple set is regarded as a document, the matching statements in the matching statement set are regarded as the document content, and the importance of a word in the matching statements is measured from the perspective of the document.
In some embodiments, when determining the first weight of a word in the word set according to the frequency of occurrence of the word in the word set, the word frequency of the word in the word set and the inverse document word frequency may be determined, and then the product of the word frequency and the inverse document word frequency is calculated and used as the first weight of the word in the word set. The specific calculation formula is as follows:
tf_{j,i} = n_{j,i} / Σ_k n_{k,i}    (1)
idf_i = log( |D| / |{ j : t_i ∈ d_j }| )    (2)
tfidf_{j,i} = tf_{j,i} × idf_i    (3)
wherein tf_{j,i} in formula (1) denotes the normalized word frequency of word i in the word set corresponding to the relation triple set whose relation is j; n_{j,i} denotes the number of times word i appears in the word set corresponding to the relation triple set whose relation is j; and Σ_k n_{k,i} denotes the total number of times word i appears across the word sets corresponding to all relations.
idf_i in formula (2) denotes the inverse document frequency of word i; |D| denotes the total number of documents, which can be taken as the number of relation triple sets, since the matching statement set of each relation triple set is regarded as one document; and |{ j : t_i ∈ d_j }| denotes the number of documents, i.e. word sets, containing word i.
tfidf_{j,i} in formula (3) denotes the first weight of word i in the word set corresponding to the relation triple set whose relation is j.
It should be noted that the foregoing is only one example of determining the first weight of a word in the word set from the perspective of the document. In practical applications, other calculation methods, such as the TextRank algorithm, may also be used to determine the first weight of a word in the word set, which is not limited herein.
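Formulas (1)-(3) can be sketched directly, with each relation's word list playing the role of one document (names are illustrative):

```python
import math
from collections import Counter

def first_weights(word_sets):
    """First weight tfidf_{j,i} per relation j and word i, following
    formulas (1)-(3); word_sets maps relation -> segmented word list (with repeats)."""
    counts = {j: Counter(ws) for j, ws in word_sets.items()}
    num_docs = len(word_sets)  # |D|: number of relation triple sets
    weights = {}
    for j, cnt in counts.items():
        weights[j] = {}
        for word, n_ji in cnt.items():
            total = sum(c[word] for c in counts.values())      # sum_k n_{k,i}
            tf = n_ji / total                                  # formula (1)
            df = sum(1 for c in counts.values() if word in c)  # word sets containing i
            idf = math.log(num_docs / df)                      # formula (2)
            weights[j][word] = tf * idf                        # formula (3)
    return weights

word_sets = {
    "spouse": ["marriage", "wife", "met"],
    "birthplace": ["born", "met"],
}
weights = first_weights(word_sets)
```

A word such as "met" that occurs in every relation's word set gets idf = 0 and hence a first weight of 0.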
S205, according to first matching sentences including a first word in the matching sentence set, determining the degree of association between the relation corresponding to the word set and the first word in the word set, and determining a second weight of the first word in the word set.
In the embodiment of the present specification, on the other hand, the relationship of the three-tuple set of relationships corresponding to the matching statement set is used as a class label of the matching statement in the matching statement set, and the importance of the word in the matching statement is measured from the perspective of the classification task.
In some embodiments, the determining, according to a first matching statement including a first word in the matching statement set, the association degree of the relationship corresponding to the word set and the first word in the word set may adopt a method shown in fig. 3, where the method may include:
s301, selecting a target matching statement set from the matching statement sets.
Wherein, the target matching statement set can be any one matching statement set in all matching statement sets.
S303, determining a first number of the first matching sentences in the target matching sentence set and a second number of second matching sentences which do not include the first words in the target matching sentence set.
In this specification embodiment, the matching sentences in a matching sentence set may be divided into two types: first matching sentences, which include the first word, and second matching sentences, which do not include the first word, where the first word is any one of the words in the corresponding word set. The sum of the first number of first matching sentences and the second number of second matching sentences in the target matching sentence set is the total number of matching sentences contained in the target matching sentence set.
S305, determining a third number of the first matching sentences in the remaining matching sentence set and a fourth number of the second matching sentences in the remaining matching sentence set.
In this embodiment of the present specification, the remaining matching statement sets are matching statement sets other than the target matching statement set, and may be all remaining matching statement sets other than the target matching statement set. The sum of the third number of the first matching sentences and the fourth number of the second matching sentences in the remaining matching sentence set is the total number of the matching sentences contained in the remaining matching sentence set.
S307, calculating the association degree of the corresponding relation of the term set of the target matching statement set and the first term in the term set according to the first quantity, the second quantity, the third quantity and the fourth quantity.
Specifically, according to the first number, the second number, the third number, and the fourth number, the association degree between the relationship corresponding to the term set of the target matching statement set and the first term in the term set is calculated according to the following formula (4):
χ²(w, c_j) = N (AD − CB)² / ( (A + C)(B + D)(A + B)(C + D) )    (4)
wherein w denotes a first word in the word set of the target matching statement set; c_j denotes the relation corresponding to the word set of the target matching statement set; χ²(w, c_j) denotes the degree of association between the first word w and the relation c_j; A denotes the first number of first matching sentences in the target matching statement set; B denotes the third number of first matching sentences in the remaining matching statement sets; C denotes the second number of second matching sentences in the target matching statement set; D denotes the fourth number of second matching sentences in the remaining matching statement sets; and N denotes the total number of matching sentences contained in the target and remaining matching statement sets, i.e., N = A + B + C + D.
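Formula (4) is the standard chi-square statistic over the 2×2 contingency table formed by the counts A, B, C, D; a direct sketch:

```python
def chi_square(A, B, C, D):
    """Degree of association chi^2(w, c_j) per formula (4).
    A: target-set sentences containing w; B: remaining-set sentences containing w;
    C: target-set sentences without w;   D: remaining-set sentences without w."""
    N = A + B + C + D
    denom = (A + C) * (B + D) * (A + B) * (C + D)
    return N * (A * D - C * B) ** 2 / denom if denom else 0.0
```

When w is distributed independently of the relation (e.g. A = B = C = D), the statistic is 0; when w appears in every target-set sentence and nowhere else, it reaches its maximum N.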
In some embodiments, determining the second weight of the first term in the set of terms according to the first matching term including the first term in the set of matching terms may employ a method shown in fig. 4, which may include:
s401, determining a first proportion of a first matching statement in each matching statement set in the matching statement set.
Specifically, the ratio of the number of first matching sentences in each matching sentence set to the total number of matching sentences contained in that matching sentence set is calculated; this ratio is the first proportion.
And S403, determining the second proportion of the first matching statement in all the matching statement sets in each matching statement set.
Specifically, the total number of matching statements included in all the matching statement sets may be obtained first, then each matching statement set is traversed, and when each matching statement set is traversed, a ratio of the number of first matching statements in the matching statement set to the total number of matching statements included in all the matching statement sets is calculated, where the ratio is the corresponding second ratio.
S405, determining the number of relations in the knowledge graph.
Specifically, the number of relation triple sets may be taken as the number of relations in the knowledge graph.
S407, calculating a second weight of the first term in the term set according to the first proportion, the second proportion and the number of the relations in the knowledge graph.
Specifically, according to the first proportion, the second proportion and the number of the relations in the knowledge graph, the second weight of the first term in the term set is calculated according to the following formula (5):
I(w) = Σ_{i=1}^{K} p_i · log( p_i / F(w) )    (5)
wherein I(w) denotes the second weight of the first word w in the word set; p_i denotes the first proportion corresponding to the i-th matching statement set, i.e., the proportion of first matching sentences within that set; F(w) denotes the second proportion; and K denotes the number of relations in the knowledge graph.
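The published image of formula (5) is not reproduced in this text; one plausible reading, using the quantities defined around it (p_i the first proportion of the i-th matching statement set, F(w) the second proportion, K the number of relations), is I(w) = Σ_i p_i · log(p_i / F(w)), sketched here purely as an illustration:

```python
import math

def second_weight(first_proportions, second_proportion):
    """Illustrative second weight I(w): sum over the K relations of
    p_i * log(p_i / F(w)). This is a hypothetical reading of formula (5),
    not a verbatim reproduction of the patent's formula."""
    return sum(p * math.log(p / second_proportion)
               for p in first_proportions if p > 0)
```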
It should be understood that the above only gives two examples of measuring the importance of each word in the word set from the perspective of the classification task, and in practical applications, the importance of each word in the word set may also be measured based on other methods of the classification task.
S207, determining a target word in the word set according to the first weight, the association degree and the second weight, and taking the target word as a feature word of the relation corresponding to the relation triple set.
In some embodiments, when determining the target word in the word set according to the first weight, the association degree, and the second weight, the sum of the first weight, the association degree, and the second weight may be calculated and used as the feature value of a word in the word set, and then words whose feature value satisfies a preset condition may be determined as target words. The preset condition may be that the feature value is greater than a preset feature value threshold, that is, words in the word set whose feature value is greater than the threshold are determined as target words.
In other embodiments, when the target words in the word sets are determined according to the first weight, the association degree and the second weight, the words in the word sets may be sorted in a descending order according to the first weight, the association degree and the second weight, respectively, to obtain three corresponding sorted word sets; and then respectively acquiring the words with the preset number from the three sequencing word sets to obtain three corresponding candidate target word sets, and acquiring an intersection of the three candidate target word sets, wherein the words in the intersection are the target words. The specific numerical values of the top preset number of words corresponding to the three sorted word sets may be set to the same numerical value or may be set to different numerical values.
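The second variant of S207 — intersecting the top-ranked words of the three scores — can be sketched as follows (names and score values are illustrative):

```python
def target_words(first_weight, association, second_weight, top_n=2):
    """Rank the word set by each of the three scores in descending order
    and intersect the top-n words of the three rankings."""
    def top(scores):
        return set(sorted(scores, key=scores.get, reverse=True)[:top_n])
    return top(first_weight) & top(association) & top(second_weight)

first_w  = {"marriage": 0.9, "wife": 0.8, "met": 0.1}
assoc    = {"marriage": 5.0, "wife": 4.0, "met": 0.5}
second_w = {"marriage": 0.7, "wife": 0.6, "met": 0.05}
features = target_words(first_w, assoc, second_w)
```

With top_n = 2 the intersection keeps "marriage" and "wife" and drops the uninformative "met"; as the text notes, the top-n cutoff of each ranking may also be set to different values.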
S107, obtaining target matching sentences matched with the feature words from the matching sentence set to obtain training sample data.
Specifically, it may be determined whether a matching statement of the matching statement set includes a corresponding feature word, and if the matching statement includes the feature word, it is determined that the matching statement matches the feature word, and the matching statement is used as a target matching statement, so that a target matching statement corresponding to triple data in each relationship triple set may be obtained, and all the triple data and the target matching statements corresponding to the triple data may constitute training sample data. Because the target matching statement is a matching statement including the feature words of the corresponding relationship, the target matching statement is a matching statement capable of effectively reflecting the relationship, and therefore matching statements which cannot effectively reflect the corresponding relationship in a matching statement set are filtered out.
For example, the feature words of "couple" determined by the foregoing method of the embodiments of the present specification include (marriage, wife, husband), and the matching sentence set corresponding to the "couple" relation includes the following two matching sentences: matching sentence 1, "In xxxx, Zhang San went to City A with the Hong Kong star football team and met Li Si", and matching sentence 2, "On xx xx, xxxx, Zhang San and Li Si registered their marriage in Las Vegas, USA". Matching sentence 2 contains the feature word "marriage", that is, matching sentence 2 is a target matching sentence. By acquiring matching sentence 2, matching sentence 1 is filtered out and does not appear in the finally obtained training sample data, so that the quality of the training sample data is improved.
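Step S107 and the example above can be sketched together (illustrative names; the sentence data mirrors the example):

```python
def filter_training_samples(triple_to_sentences, feature_words):
    """Keep only matching sentences that contain at least one feature word
    of the triple's relation; each kept (triple, sentence) pair is one training sample."""
    samples = []
    for (head, relation, tail), sentences in triple_to_sentences.items():
        for s in sentences:
            if any(f in s for f in feature_words.get(relation, ())):
                samples.append(((head, relation, tail), s))
    return samples

triple_to_sentences = {
    ("Zhang San", "couple", "Li Si"): [
        "Zhang San went to City A with the football team and met Li Si.",
        "Zhang San and Li Si registered their marriage in Las Vegas.",
    ],
}
feature_words = {"couple": {"marriage", "wife", "husband"}}
samples = filter_training_samples(triple_to_sentences, feature_words)
```

Only the marriage sentence survives; the City-A sentence, which matches the entities but not the relation, is filtered out as noise.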
According to the technical solution of the embodiments of the present invention, the matching sentences in the matching sentence set are filtered based on the feature words of the relationships in the knowledge graph, which avoids introducing a large amount of noise data into the training sample data, improves the quality of the training sample data, and thereby ensures both the reliability of the extraction model obtained based on the training sample data and the training speed of that model.
In addition, when the feature words corresponding to each relationship are determined, the importance of the words in the matching sentences of the relationship's matching sentence set is measured from multiple angles. This ensures the accuracy of each relationship's feature words, makes the subsequent screening of matching sentences based on the feature words more accurate, and improves the quality of the finally obtained training sample data.
In practical applications, the feature words corresponding to the "spouse" relationship may include "acquaintance", but an acquaintance is obviously not necessarily a spouse. To further improve the accuracy of the feature words corresponding to each relationship and thereby ensure the quality of the training sample data, an embodiment of the present invention provides another method for generating training sample data. As shown in fig. 5, the method may include:
S501, determining the triple data with the same relation in the knowledge graph to obtain a plurality of relation triple sets.
S503, obtaining the matching sentences corresponding to the triple data in the plurality of relation triple sets in the corpus, and obtaining the matching sentence sets corresponding to each relation triple set.
S505, determining the feature words of the relationship corresponding to each relation triple set according to the matching sentences in the matching sentence set.
Specifically, the details of steps S501 to S505 may refer to the contents of the corresponding steps in the embodiment of the method shown in fig. 1, and are not repeated herein.
And S507, obtaining a word vector of the feature word.
A word vector of a feature word is a vector representation of the feature word that can describe its semantic characteristics well. In the embodiments of the present specification, a word vector refers to a vector representation of a word constructed based on a word embedding technique. For example, the Word2Vec word embedding method from neural network language models can be used to obtain the word vectors of the feature words. Of course, other methods capable of obtaining word vectors of words may also be adopted, and the present invention is not limited in this respect.
S509, calculating an average word vector of the word vectors of the feature words, and taking the average word vector of the word vectors of the feature words as a first reference vector of a corresponding relationship.
S511, according to the similarity between the word vector of each feature word and the first reference vector, obtaining a first target feature word in the feature words.
In the embodiment of the present specification, the degree to which a feature word belongs to the relationship is measured by the similarity between the word vector of the feature word and the first reference vector of the corresponding relationship. When the similarity between the word vector of a feature word and the first reference vector satisfies a preset condition, it may be determined that the feature word belongs to the relationship, that is, the feature word is determined to be a first target feature word.
In practical application, the similarity between two vectors can be characterized by the distance between them: if the distance between the two vectors is smaller than a preset distance, the similarity between the two vectors satisfies the preset condition. The preset distance can be set according to actual needs; generally, the smaller the preset distance, the higher the accuracy of the acquired first target feature words, and conversely, the larger the preset distance, the lower the accuracy of the acquired first target feature words. Specifically, the distance between two vectors may be a cosine distance, a Euclidean distance, or the like, which is not particularly limited in the present invention.
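Steps S509 to S511 can be sketched as follows; the toy two-dimensional vectors and the cosine-distance threshold of 0.3 are illustrative assumptions, not values from the patent.

```python
import numpy as np

def first_target_feature_words(word_vectors, max_cosine_distance):
    """Average the feature-word vectors into the first reference vector,
    then keep the words whose cosine distance to it is below the
    threshold."""
    vectors = np.stack([np.asarray(v, dtype=float)
                        for v in word_vectors.values()])
    reference = vectors.mean(axis=0)  # first reference vector (S509)
    kept = []
    for word, v in word_vectors.items():
        v = np.asarray(v, dtype=float)
        cos_sim = v @ reference / (np.linalg.norm(v)
                                   * np.linalg.norm(reference))
        # Cosine distance = 1 - cosine similarity (S511).
        if 1.0 - cos_sim < max_cosine_distance:
            kept.append(word)
    return kept
```

With toy vectors, a word like "acquaintance" that points away from the cluster mean fails the distance check and is dropped.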
S513, obtaining the target matching statement matched with the first target feature word from the matching statement set to obtain training sample data.
Specifically, it may be determined whether each matching sentence in the matching sentence set includes a corresponding first target feature word. If a matching sentence includes a first target feature word, the matching sentence is determined to match that word and is taken as a target matching sentence. In this way, the target matching sentences corresponding to the triple data in each relation triple set can be obtained, and all the triple data together with their corresponding target matching sentences constitute the training sample data.
The technical scheme of the embodiment of the invention further improves the accuracy of the feature words corresponding to each relation, further ensures the accuracy of the training sample data and ensures the higher quality of the training sample data.
In practical applications, some niche words that occur with relatively low frequency in actual language use may be omitted from the feature words, for example, "wife". To ensure the comprehensiveness of the feature words and further improve the quality of the training sample data, in this embodiment of the specification, as shown in fig. 6, obtaining the target matching sentences matching the first target feature words from the matching sentence set may include:
S601, obtaining a word vector of the first target feature word.
In the embodiments of the present specification, a word vector refers to a vector representation of a word constructed based on a word embedding technique. For example, the Word2Vec word embedding method from neural network language models can be used to obtain the word vectors of the first target feature words. Of course, other methods capable of obtaining word vectors of words may also be adopted, and the present invention is not limited in this respect.
S603, calculating an average word vector of the word vectors of the first target characteristic words, and taking the average word vector of the word vectors of the first target characteristic words as a second reference vector of the corresponding relation.
S605, obtaining the target word in the word vector library according to the similarity between the word vector in the word vector library and the second reference vector.
The word vector library stores, relatively comprehensively, the words in the corpus and the word vectors corresponding to those words. In this embodiment of the present specification, each word vector in the word vector library may be traversed, the similarity between the second reference vector and each word vector in the library is calculated, the similarities satisfying the preset condition are screened out, and the word vectors in the library corresponding to the screened-out similarities are taken as target word vectors, thereby obtaining the target words corresponding to the target word vectors in the word vector library.
In practical application, the similarity between two vectors can be characterized by the distance between them: if the distance between the two vectors is smaller than a preset distance, the similarity between the two vectors satisfies the preset condition. The preset distance can be set as needed, for example, to 0.3 or 0.5. Specifically, the distance between two vectors may be a cosine distance, a Euclidean distance, or the like, which is not limited in the present invention.
And S607, combining the target words in the word vector library and the first target feature words to obtain second target feature words.
In this embodiment of the present specification, by combining the target words in the word vector library with the first target feature words, the target words obtained from the word vector library expand the first target feature words, yielding the second target feature words.
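Steps S601 to S607 can be sketched as follows; the tiny "vector library", the names, and the 0.3 cutoff are illustrative assumptions, not values from the patent.

```python
import numpy as np

def expand_feature_words(first_target_vectors, vector_library,
                         max_cosine_distance):
    """Average the first-target-feature-word vectors into the second
    reference vector (S603), pull near-synonyms from the library whose
    cosine distance to it is small (S605), and merge them with the
    first target feature words (S607)."""
    reference = np.stack(
        [np.asarray(v, dtype=float)
         for v in first_target_vectors.values()]).mean(axis=0)
    expanded = set(first_target_vectors)  # start from first target words
    for word, v in vector_library.items():
        v = np.asarray(v, dtype=float)
        cos_sim = v @ reference / (np.linalg.norm(v)
                                   * np.linalg.norm(reference))
        if 1.0 - cos_sim < max_cosine_distance:
            expanded.add(word)  # extend with the library's target word
    return expanded
```

This lets a rare near-synonym that never surfaced in the matching sentences still enter the feature word set through its embedding.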
S609, acquiring a target matching statement matched with the second target feature word from the matching statement set to obtain training sample data.
Specifically, it may be determined whether each matching sentence in the matching sentence set includes a corresponding second target feature word. If a matching sentence includes a second target feature word, the matching sentence is determined to match that word and is taken as a target matching sentence. In this way, the target matching sentences corresponding to the triple data in each relation triple set can be obtained, and all the triple data together with their corresponding target matching sentences constitute the training sample data.
According to the technical solution of the embodiment of the present invention, the feature words corresponding to each relationship become more comprehensive, the omission of niche feature words with a low occurrence probability is avoided, and the training sample data obtained based on the expanded feature words is of higher quality.
Corresponding to the methods for generating training sample data provided in the foregoing embodiments, an embodiment of the present invention further provides a device for generating training sample data. Since the device corresponds to the methods provided in the foregoing embodiments, the embodiments of the method described above are also applicable to the device provided in this embodiment and will not be described in detail here.
Referring to fig. 7, a schematic structural diagram of a training sample data generating device according to an embodiment of the present invention is shown, and as shown in fig. 7, the device may include: a first determining module 710, a first obtaining module 720, a second determining module 730, and a second obtaining module 740, wherein,
a first determining module 710, configured to determine triple data having the same relationship in a knowledge graph, to obtain multiple triple sets of relationships;
a first obtaining module 720, configured to obtain matching statements corresponding to triple data in the multiple triple sets of relationships in the corpus, to obtain a matching statement set corresponding to each triple set of relationships;
a second determining module 730, configured to determine, according to the matching statements in the matching statement set, feature words of a relationship corresponding to each relationship triplet set;
a second obtaining module 740, configured to obtain, from the matched sentence set, a target matched sentence matched with the feature word, to obtain training sample data.
Optionally, as shown in fig. 8, the second determining module 730 may include:
a word segmentation processing module 7310, configured to perform word segmentation processing on the matching sentences in the matching sentence set to obtain a word set of a corresponding relation triplet set;
a third determining module 7320, configured to determine a first weight of a word in the word set according to the frequency of occurrence of the word in the word set;
a fourth determining module 7330, configured to determine, according to a first matching statement that includes a first term in the matching statement set, a degree of association between a relationship corresponding to the term set and the first term in the term set, and a second weight of the first term in the term set;
a fifth determining module 7340, configured to determine a target word in the word set according to the first weight, the association degree, and the second weight, and use the target word as a feature word of a relationship corresponding to the relationship triplet set.
Optionally, the third determining module 7320 is specifically configured to: determining the word frequency and the inverse document word frequency of the words in the word set; and calculating the product of the word frequency and the inverse document word frequency, and taking the product as the first weight of the words in the word set.
Alternatively, as shown in fig. 9, the fourth determining module 7330 may include:
a selecting module 7331, configured to select a target matching statement set from the matching statement set;
a sixth determining module 7332, configured to determine a first number of the first matching sentences in the target matching sentence set and a second number of second matching sentences which do not include the first word in the target matching sentence set;
a seventh determining module 7333, configured to determine a third number of the first matching sentences in the remaining matching sentence set and a fourth number of the second matching sentences in the remaining matching sentence set; the remaining matching sentence set is the portion of the matching sentence set other than the target matching sentence set;
a first calculating module 7334, configured to calculate, according to the first number, the second number, the third number, and the fourth number, a degree of association between a relationship corresponding to a term set of the target matching statement set and the first term in the term set.
An eighth determining module 7335, configured to determine a first ratio of the first matching statement in each matching statement set in the matching statement set;
a ninth determining module 7336, configured to determine a second ratio of the first matching statement in each matching statement set to all matching statement sets;
a tenth determining module 7337 for determining the number of relationships in the knowledge-graph;
a second calculating module 7338, configured to calculate a second weight of the first term in the term set according to the first percentage, the second percentage, and the number of relationships in the knowledge-graph.
Optionally, the fifth determining module 7340, when determining the target word in the word set according to the first weight, the association degree, and the second weight, calculates a sum of the first weight, the association degree, and the second weight, and uses the sum as a feature value of the word in the word set; determining the words in the word set of which the characteristic values meet preset conditions as the target words.
In some embodiments, as shown in fig. 10, the second obtaining module 740 may include:
a third obtaining module 7410, configured to obtain a word vector of the feature word;
a third calculating module 7420, configured to calculate an average word vector of the word vectors of the feature words, and use the average word vector of the word vectors of the feature words as a first reference vector of a corresponding relationship;
a fourth obtaining module 7430, configured to obtain a first target feature word in the feature words according to the similarity between the word vector of each feature word and the first reference vector;
a fifth obtaining module 7440, configured to obtain a target matching statement matched with the first target feature word from the matching statement set, so as to obtain training sample data.
In other embodiments, as shown in fig. 11, the fifth obtaining module 7440 may include:
a sixth obtaining module 7441, configured to obtain a word vector of the first target feature word;
a fourth calculating module 7442, configured to calculate an average word vector of the word vectors of the first target feature words, and use the average word vector of the word vectors of the first target feature words as a second reference vector of a corresponding relationship;
a seventh obtaining module 7443, configured to obtain a target word in the word vector library according to a similarity between a word vector in the word vector library and the second reference vector;
a combination module 7444, configured to combine the target word in the word vector library with the first target feature word to obtain a second target feature word;
an eighth obtaining module 7445, configured to obtain, from the matching statement set, a target matching statement matched with the second target feature word, so as to obtain training sample data.
It should be noted that, when the apparatus provided in the foregoing embodiment implements the functions thereof, only the division of the functional modules is illustrated, and in practical applications, the functions may be distributed by different functional modules according to needs, that is, the internal structure of the apparatus may be divided into different functional modules to implement all or part of the functions described above.
According to the embodiment of the present invention, the matching sentences in the matching sentence set are filtered based on the feature words of the relationships in the knowledge graph, so that a large amount of noise data is prevented from being introduced into the training sample data, the quality of the training sample data is improved, and the reliability of the extraction model obtained based on the training sample data and the training speed of that model are further ensured.
In addition, when the feature words corresponding to each relationship are determined, the importance of the words in the matching sentences of the relationship's matching sentence set is measured from multiple angles. This ensures the accuracy of each relationship's feature words, makes the subsequent screening of matching sentences based on the feature words more accurate, and improves the quality of the finally obtained training sample data.
Please refer to fig. 12, which is a schematic structural diagram of an electronic device according to an embodiment of the present invention; the electronic device is configured to implement the method for generating training sample data provided in the foregoing embodiments. The electronic device may be a terminal device such as a PC (personal computer), a mobile phone or a PAD (tablet personal computer), or a service device such as an application server or a cluster server. Referring to fig. 12, the internal structure of the electronic device may include, but is not limited to: a processor, a network interface, and a memory. The processor, the network interface, and the memory in the electronic device may be connected by a bus or in other ways; connection by a bus is taken as an example in fig. 12 of the embodiment of the present specification.
The processor (or CPU) is the computing and control core of the electronic device. The network interface may optionally include a standard wired interface or a wireless interface (e.g., WI-FI, a mobile communication interface, etc.). The memory is the storage device of the electronic device, used for storing programs and data. It is understood that the memory here may be a high-speed RAM device, or may be a non-volatile memory, such as at least one magnetic disk storage device; optionally, it may be at least one storage device located remotely from the processor. The memory provides storage space that stores the operating system of the electronic device, which may include, but is not limited to, a Windows system, a Linux system, an Android system, an iOS system, etc.; the present invention is not limited in this respect. One or more instructions, which may be one or more computer programs (including program code), are also stored in the storage space and are adapted to be loaded and executed by the processor. In this embodiment of the present specification, the processor loads and executes the one or more instructions stored in the memory to implement the method for generating training sample data provided in the above method embodiments.
An embodiment of the present invention further provides a storage medium, where the storage medium may be disposed in an electronic device to store at least one instruction, at least one program, a code set, or a set of instructions related to a method for generating training sample data in the method embodiment, where the at least one instruction, the at least one program, the code set, or the set of instructions may be loaded and executed by a processor of the electronic device to implement the method for generating training sample data provided in the method embodiment.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of additional like elements in a process, method, article, or apparatus comprising the element.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
All the embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from other embodiments. In particular, for the apparatus embodiment, since it is substantially similar to the method embodiment, the description is relatively simple, and reference may be made to the partial description of the method embodiment for relevant points.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the storage medium may be a read-only memory, a magnetic disk or an optical disk.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (10)

1. A method for generating training sample data, the method comprising:
determining triple data with the same relation in the knowledge graph to obtain a plurality of relation triple sets;
acquiring matching sentences corresponding to the triple data in the plurality of relation triple sets in the corpus to obtain a matching sentence set corresponding to each relation triple set;
performing word segmentation processing on the matching sentences in the matching sentence sets to obtain a word set of the corresponding relation triple set;
determining a first weight of a word in the word set according to the frequency of the word in the word set;
determining, according to a first matching statement including a first term in the matching statement set, a degree of association between a relationship corresponding to the term set and the first term in the term set, and a second weight of the first term in the term set;
determining a target word in the word set according to the first weight, the association degree and the second weight, and taking the target word as a feature word of the relation corresponding to the relation triple set;
and acquiring a target matching statement matched with the feature word from the matching statement set to obtain training sample data.
2. The method of generating training sample data according to claim 1, wherein determining the first weight of a word in the word set according to the frequency of occurrence of the word in the word set comprises:
determining the word frequency and the inverse document word frequency of the words in the word set;
and calculating the product of the word frequency and the inverse document word frequency, and taking the product as the first weight of the words in the word set.
3. The method according to claim 1, wherein the determining, according to a first matching statement in the matching statement set that includes a first word, a degree of association between a relationship corresponding to the word set and the first word in the word set includes:
selecting a target matching statement set from the matching statement set;
determining a first number of the first matching sentences in the target matching sentence set and a second number of second matching sentences not including the first words in the target matching sentence set;
determining a third number of the first matching statements in a remaining set of matching statements and a fourth number of the second matching statements in the remaining set of matching statements; the residual matching statement set is a matching statement set except the target matching statement set;
and calculating the association degree of the relation corresponding to the term set of the target matching statement set and the first term in the term set according to the first quantity, the second quantity, the third quantity and the fourth quantity.
4. The method of generating training sample data according to claim 1, wherein said determining a second weight of a first word in the set of words from a first matching statement that includes the first word in the set of matching statements comprises:
determining a first proportion of a first matching statement in each matching statement set in the matching statement set;
determining a second proportion of the first matching statement in all the matching statement sets in each matching statement set;
determining the number of relationships in the knowledge-graph;
and calculating a second weight of the first term in the term set according to the first proportion, the second proportion and the number of the relations in the knowledge graph.
5. The method of generating training sample data according to claim 1, wherein the determining a target word in the word set according to the first weight, the degree of association, and the second weight comprises:
calculating the sum of the first weight, the association degree and the second weight, and taking the sum as a characteristic value of the words in the word set;
determining the words in the word set of which the characteristic values meet preset conditions as the target words.
6. The method according to claim 1, wherein the obtaining of the target matching statement matching the feature word from the matching statement set includes:
obtaining a word vector of the feature word;
calculating an average word vector of the word vectors of the characteristic words, and taking the average word vector of the word vectors of the characteristic words as a first reference vector of a corresponding relation;
acquiring a first target feature word in the feature words according to the similarity between the word vector of each feature word and the first reference vector;
and acquiring a target matching statement matched with the first target feature word from the matching statement set to obtain training sample data.
7. The method according to claim 6, wherein the obtaining the target matching statement matching the first target feature word from the matching statement set to obtain training sample data comprises:
obtaining a word vector of the first target feature word;
calculating an average word vector of the word vectors of the first target characteristic words, and taking the average word vector of the word vectors of the first target characteristic words as a second reference vector of the corresponding relation;
obtaining a target word in the word vector library according to the similarity between the word vector in the word vector library and the second reference vector;
combining the target words in the word vector library with the first target feature words to obtain second target feature words;
and acquiring the target matching sentences matched with the second target characteristic words from the matching sentence set to obtain training sample data.
8. An apparatus for generating training sample data, the apparatus comprising:
the first determining module is used for determining the triple data with the same relation in the knowledge graph to obtain a plurality of relation triple sets;
the first acquisition module is used for acquiring matching sentences corresponding to the triple data in the plurality of relation triple sets in the corpus to obtain a matching sentence set corresponding to each relation triple set;
the second determining module is used for performing word segmentation processing on the matched sentences in the matched sentence set to obtain a word set of the corresponding relation triple set; determining a first weight of a word in the word set according to the frequency of the occurrence of the word in the word set; determining the association degree of the relation corresponding to the word set and the first word in the word set and the second weight of the first word in the word set according to a first matching sentence comprising the first word in the matching sentence set; determining a target word in the word set according to the first weight, the association degree and the second weight, and taking the target word as a feature word of the relation corresponding to the relation triple set;
and the second acquisition module is used for acquiring the target matching sentences matched with the feature words from the matching sentence set to obtain training sample data.
9. An electronic device, comprising: a processor and a memory, said memory having stored therein at least one instruction, at least one program, set of codes, or set of instructions, which is loaded and executed by said processor to implement the method of generating training sample data according to any one of claims 1 to 7.
10. A computer readable storage medium, having at least one instruction or at least one program stored therein, where the at least one instruction or the at least one program is loaded and executed by a processor to implement the method for generating training sample data according to any one of claims 1 to 7.
CN201910312576.4A 2019-04-18 2019-04-18 Training sample data generation method and device and electronic equipment Active CN110032650B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910312576.4A CN110032650B (en) 2019-04-18 2019-04-18 Training sample data generation method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910312576.4A CN110032650B (en) 2019-04-18 2019-04-18 Training sample data generation method and device and electronic equipment

Publications (2)

Publication Number Publication Date
CN110032650A CN110032650A (en) 2019-07-19
CN110032650B true CN110032650B (en) 2022-12-13

Family

ID=67239022

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910312576.4A Active CN110032650B (en) 2019-04-18 2019-04-18 Training sample data generation method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN110032650B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110674637B (en) * 2019-09-06 2023-07-11 腾讯科技(深圳)有限公司 Character relationship recognition model training method, device, equipment and medium
CN111597809B (en) * 2020-06-09 2023-08-08 腾讯科技(深圳)有限公司 Training sample acquisition method, model training method, device and equipment
CN111967761B (en) * 2020-08-14 2024-04-02 国网数字科技控股有限公司 Knowledge graph-based monitoring and early warning method and device and electronic equipment
CN113051374B (en) * 2021-06-02 2021-08-31 北京沃丰时代数据科技有限公司 Text matching optimization method and device
CN114612725B (en) * 2022-03-18 2023-04-25 北京百度网讯科技有限公司 Image processing method, device, equipment and storage medium

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104915420B (en) * 2015-06-10 2019-12-31 百度在线网络技术(北京)有限公司 Knowledge base data processing method and system
US10354188B2 (en) * 2016-08-02 2019-07-16 Microsoft Technology Licensing, Llc Extracting facts from unstructured information
CN105512273A (en) * 2015-12-03 2016-04-20 中山大学 Image retrieval method based on variable-length depth hash learning
CN106294593B (en) * 2016-07-28 2019-04-09 浙江大学 In conjunction with the Relation extraction method of subordinate clause grade remote supervisory and semi-supervised integrated study
US10002129B1 (en) * 2017-02-15 2018-06-19 Wipro Limited System and method for extracting information from unstructured text
KR101983455B1 (en) * 2017-09-21 2019-05-28 숭실대학교산학협력단 Knowledge Base completion method and server
CN107918640A (en) * 2017-10-20 2018-04-17 阿里巴巴集团控股有限公司 Sample determines method and device
CN108446769B (en) * 2018-01-23 2020-12-08 深圳市阿西莫夫科技有限公司 Knowledge graph relation inference method, knowledge graph relation inference device, computer equipment and storage medium
CN109582799B (en) * 2018-06-29 2020-09-22 北京百度网讯科技有限公司 Method and device for determining knowledge sample data set and electronic equipment
CN109062894A (en) * 2018-07-19 2018-12-21 南京源成语义软件科技有限公司 The automatic identification algorithm of Chinese natural language Entity Semantics relationship
CN109284378A (en) * 2018-09-14 2019-01-29 北京邮电大学 A kind of relationship classification method towards knowledge mapping

Also Published As

Publication number Publication date
CN110032650A (en) 2019-07-19

Similar Documents

Publication Publication Date Title
CN110032650B (en) Training sample data generation method and device and electronic equipment
CN111104794B (en) Text similarity matching method based on subject term
CN108804641B (en) Text similarity calculation method, device, equipment and storage medium
CN104750798B (en) Recommendation method and device for application program
CN110457708B (en) Vocabulary mining method and device based on artificial intelligence, server and storage medium
CN111444320A (en) Text retrieval method and device, computer equipment and storage medium
CN111581949B (en) Method and device for disambiguating name of learner, storage medium and terminal
CN105224682B (en) New word discovery method and device
CN110147421B (en) Target entity linking method, device, equipment and storage medium
CN110162630A Text deduplication method, device and equipment
WO2014210387A2 (en) Concept extraction
CN112000783B (en) Patent recommendation method, device and equipment based on text similarity analysis and storage medium
WO2011134141A1 (en) Method of extracting named entity
CN110674301A (en) Emotional tendency prediction method, device and system and storage medium
CN113988157A (en) Semantic retrieval network training method and device, electronic equipment and storage medium
CN110705281B (en) Resume information extraction method based on machine learning
CN114995903A (en) Class label identification method and device based on pre-training language model
CN105164672A (en) Content classification
CN110728135A (en) Text theme indexing method and device, electronic equipment and computer storage medium
CN112506864B (en) File retrieval method, device, electronic equipment and readable storage medium
Skoumas et al. On quantifying qualitative geospatial data: A probabilistic approach
CN113609847A (en) Information extraction method and device, electronic equipment and storage medium
CN110705285B (en) Government affair text subject word library construction method, device, server and readable storage medium
CN112883198A (en) Knowledge graph construction method and device, storage medium and computer equipment
CN117435685A (en) Document retrieval method, document retrieval device, computer equipment, storage medium and product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant