WO2018171499A1 - Information detection method, device and storage medium (一种信息检测方法、设备及存储介质) - Google Patents

Information detection method, device and storage medium (一种信息检测方法、设备及存储介质)

Info

Publication number
WO2018171499A1
WO2018171499A1 (PCT/CN2018/079111, CN2018079111W)
Authority
WO
WIPO (PCT)
Prior art keywords
word
candidate
sentence
vector
target
Prior art date
Application number
PCT/CN2018/079111
Other languages
English (en)
French (fr)
Inventor
李潇
张锋
王策
Original Assignee
腾讯科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司 (Tencent Technology (Shenzhen) Co., Ltd.)
Publication of WO2018171499A1 publication Critical patent/WO2018171499A1/zh

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition

Definitions

  • the present invention relates to the field of computer technologies, and in particular, to an information detecting method, device, and storage medium.
  • the search result may include the result of the superordinate word corresponding to the keyword, or the result of the subordinate word corresponding to the keyword. For example, if the keyword is a tiger, the upper word is an animal; if the keyword is an animal, the lower word corresponding to the keyword can be a tiger or something else. Therefore, how to determine the upper word corresponding to a word is an important step.
  • the embodiment of the present invention provides an information detecting method, device, and storage medium. By analyzing the sentences containing a candidate pair together with the entity word and the candidate upper word in the candidate pair, detection of whether the candidate upper word is an upper word of the entity word is implemented, which improves the detection efficiency of the upper word.
  • an embodiment of the present invention provides an information detection method, which is applied to an information detection device, and the method includes:
  • an embodiment of the present invention further provides an information detecting apparatus, where the apparatus includes a processor and a memory, the memory storing instructions executable by the processor; when the instructions are executed, the processor is configured to:
  • a third aspect of embodiments of the present invention provides a computer readable storage medium storing computer readable instructions that cause at least one processor to perform the method as described above.
  • FIG. 1a is a schematic structural diagram of an implementation environment according to an embodiment of the present invention.
  • FIG. 1b is a schematic flow chart of an information detecting method in an embodiment of the present invention.
  • FIG. 2 is a schematic flow chart of another information detecting method in an embodiment of the present invention.
  • FIG. 3 is a schematic flowchart of step 205 according to an embodiment of the present invention.
  • FIG. 4 is a schematic flowchart of step 206 provided by an embodiment of the present invention.
  • FIG. 5 is a diagram showing an example of an information detecting method according to an embodiment of the present invention.
  • FIG. 6 is a schematic structural diagram of an information detecting apparatus according to an embodiment of the present invention.
  • FIG. 7 is a schematic structural diagram of a determining module according to an embodiment of the present invention.
  • FIG. 8 is a schematic structural diagram of a matrix determining unit according to an embodiment of the present disclosure.
  • FIG. 9 is a schematic structural diagram of a vector generating unit according to an embodiment of the present invention.
  • FIG. 10 is a schematic structural diagram of a detection module according to an embodiment of the present invention.
  • FIG. 11 is a schematic structural diagram of another information detecting apparatus according to an embodiment of the present invention.
  • FIG. 1a is a schematic structural diagram of an implementation environment according to an embodiment of the present invention.
  • the information detection system 100 includes a server 110, a network 120, a terminal device 130, and a user 140.
  • the server 110 includes a processor and a memory, and the method embodiments of the present invention are executed by the processor executing instructions stored in the memory.
  • the server 110 includes a sentence database 111, a word vector database 112, and a superordinate word detecting unit 113.
  • a client 130-1 is installed on the terminal device 130. The client 130-1 provides a search window to the user for the user to input the entity word to be queried.
  • a large number of sentences are stored in the sentence database 111 to form a sentence set; the word vector database 112 stores word vectors corresponding to each word to form a word vector set.
  • the superordinate word detecting unit 113 is configured to select a candidate sentence including the target candidate pair from the sentence set pre-stored in the sentence database 111, and generate a candidate sentence set, where the target candidate pair includes the target entity word and the candidate upper word corresponding to the target entity word;
  • it further determines, according to each candidate sentence in the candidate sentence set and the pre-stored word vector set, a sentence set vector corresponding to the candidate sentence set; obtains, from the word vector set stored in the word vector database 112, a first word vector corresponding to the target entity word and a second word vector corresponding to the candidate upper word; and detects, according to the first word vector, the second word vector, and the sentence set vector, whether the candidate upper word is the upper word of the target entity word. Then, the server 110 transmits the determined upper word as a part of the search result to the client 130-1 on the terminal device 130.
  • the server 110 can be a single server, a server cluster composed of several servers, or a cloud computing service center.
  • the network 120 can connect the server 110 and the terminal device 130 in a wireless or wired form.
  • the terminal device 130 can be a smart terminal, including a smart phone, a tablet computer, a laptop portable computer, and the like.
  • FIG. 1b is a schematic flowchart diagram of an information detection method according to an embodiment of the present invention. As shown in FIG. 1b, the method of the embodiment of the present invention is applied to an information detecting device, and may include the following steps 101-103.
  • the information detecting device selects a candidate sentence including the target candidate pair from the pre-stored sentence set.
  • the pre-stored sentence set may be composed of a corpus set for extracting candidate pairs.
  • the target candidate pair is any one of a plurality of candidate pairs, and each candidate pair can implement detection of the upper word by using the solution introduced in the embodiment of the present invention.
  • the target candidate pair includes a candidate upper word corresponding to the target entity word and the target entity word.
  • the information detecting device first selects a candidate sentence including the target entity word and the candidate upper word from the pre-stored sentence set, and combines the selected candidate sentence into a candidate sentence set, wherein the candidate sentence set is used to detect the target Whether the candidate upper word of the candidate pair is the upper word of the target entity word.
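  • the selection step above can be sketched as follows; the sentences, the target candidate pair, and the token-level matching are illustrative assumptions, not the patent's actual implementation:

```python
def select_candidate_sentences(sentence_set, entity_word, upper_word):
    """Keep only sentences whose tokens contain both the target entity
    word and the candidate upper word."""
    result = []
    for sentence in sentence_set:
        tokens = sentence.split()   # toy whitespace segmentation
        if entity_word in tokens and upper_word in tokens:
            result.append(sentence)
    return result

# Illustrative pre-stored sentence set and target candidate pair (tiger, animal).
sentences = [
    "the tiger is a large animal",
    "a tiger was seen near the village",
    "the animal shelter opened today",
]
candidate_sentence_set = select_candidate_sentences(sentences, "tiger", "animal")
```

Only the first sentence contains both members of the pair, so the candidate sentence set holds a single sentence here.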
  • the candidate upper word may be determined as the upper word of the entity word; for example, if the entity word is a tiger and the candidate upper word is an animal, a tiger can be considered to be an animal, so the animal is the upper word of the tiger. In this way, the target candidate pair can be expressed as (tiger, animal).
  • the entity words include nouns, pronouns, and the like.
  • the so-called upper word is relative to the entity word and refers to a word whose concept is broader. Any attribute of the concept expressed by an entity word, or any classification of it, can be its superordinate word.
  • the candidate upper words of the entity word “flower delivery” may be “flowers”, “courier”, “online shopping”, “flower etiquette”, “flower shop”, and “gift company”.
  • the candidate upper words of the entity word “Wang Fei” may be “singers”, “women”, “mummy”, “daughter”, “Hong Kong”, and “Leo”.
  • the target entity word and the candidate upper word included in the target candidate pair are randomly selected by the information detecting device from the entity word set and the candidate upper word set, respectively.
  • the entity word set is a set including at least one entity word, and the candidate upper word set is a set including at least one candidate upper word. It can be seen that, before the combination, it is not determined whether the candidate upper word is the upper word of the target entity word.
  • the function of detecting the upper word can be realized by performing the following actions.
  • the information detecting device determines a sentence set vector corresponding to the candidate sentence set according to each candidate sentence in the candidate sentence set and the pre-stored word vector set.
  • because the candidate upper words of entity words are detected by means of machine learning, the entity words in natural language need to be represented mathematically; therefore, the word vector of each entity word is preset and stored here.
  • the word vector is a representation of a word in a vector.
  • the information detecting device may compress the sentence matrix of each candidate sentence into an H-dimensional vector by using a Long Short-Term Memory (LSTM) network, where H is the dimensionality of the hidden layer in the LSTM network.
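  • the patent does not specify the internals of the LSTM, so the following is only a minimal sketch of a single generic LSTM cell, with randomly initialized illustrative parameters, that compresses an L*N sentence matrix into an H-dimensional sentence vector by taking the final hidden state:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_compress(sentence_matrix, params, H):
    """Run an LSTM over the rows of an L*N sentence matrix and return
    the final H-dimensional hidden state as the sentence vector."""
    Wf, Wi, Wo, Wc, bf, bi, bo, bc = params
    h = np.zeros(H)
    c = np.zeros(H)
    for x in sentence_matrix:            # one word vector per time step
        z = np.concatenate([h, x])       # [h_{t-1}; x_t]
        f = sigmoid(Wf @ z + bf)         # forget gate
        i = sigmoid(Wi @ z + bi)         # input gate
        o = sigmoid(Wo @ z + bo)         # output gate
        c = f * c + i * np.tanh(Wc @ z + bc)  # cell state update
        h = o * np.tanh(c)               # hidden state
    return h

# Illustrative shapes: L=4 word segments, N=8-dim word vectors, H=5.
rng = np.random.default_rng(0)
L, N, H = 4, 8, 5
params = tuple(rng.normal(size=(H, H + N)) for _ in range(4)) + \
         tuple(np.zeros(H) for _ in range(4))
sentence_matrix = rng.normal(size=(L, N))
sentence_vector = lstm_compress(sentence_matrix, params, H)
```

In practice the gate parameters are learned during the training described later, not random.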
  • the sentence information of the candidate sentence related to the target candidate pair can be embodied by the sentence set vector to improve the accuracy of the upper word detection.
  • the information detecting device acquires, from the set of word vectors, a first word vector corresponding to the target entity word and a second word vector corresponding to the candidate upper word, and detects, according to the first word vector, the second word vector, and the determined sentence set vector corresponding to the candidate sentence set, whether the candidate upper word is a superordinate word of the target entity word.
  • in this way, the information of the target entity word and the candidate upper word is combined, and the information of the candidate sentences including the target entity word and the candidate upper word is also considered, thereby making it possible to determine more accurately whether the candidate upper word is the upper word of the target entity word.
  • the terms first word vector and second word vector are used to distinguish the word vector corresponding to the target entity word from that corresponding to the candidate upper word.
  • the word vector is a word expressed as a vector, and each element of the vector can be a number.
  • for example, “microphone” may be represented as the word vector [0 0 0 1 0 0 0 0 0 0 0 0 0 0 ...], and “Mike” as the word vector [0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 ...].
  • alternatively, the word vector can be expressed as a dense vector such as [0.792, -0.177, -0.17, 0.109, -0.542, ...].
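  • the two representation styles above can be illustrated with a toy vocabulary; the words and all numeric values here are illustrative, not taken from the patent's word vector set:

```python
import numpy as np

# Toy vocabulary; in practice the vocabulary is far larger.
vocab = ["microphone", "mike", "tiger", "animal"]

def one_hot(word, vocab):
    """Sparse representation: a |V|-dimensional vector with a single 1."""
    v = np.zeros(len(vocab))
    v[vocab.index(word)] = 1.0
    return v

# Dense (distributed) representation; the values are illustrative.
dense = {
    "tiger":  np.array([0.792, -0.177, -0.17, 0.109, -0.542]),
    "animal": np.array([0.601,  0.213, -0.33, 0.087, -0.401]),
}

mic = one_hot("microphone", vocab)
```

The dense form is what tools like word2vec produce and is what the later steps of the method operate on.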
  • the information detecting device may classify data including the first word vector, the second word vector, and the sentence set vector by using a classifier. The categories may be divided into a first classification indicating that the candidate upper word is the upper word of the target entity word, and a second classification indicating that the candidate upper word is not the upper word of the target entity word; whether the candidate upper word is the upper word of the target entity word is then determined based on the classification values of the first and second classifications.
  • in the embodiment of the present invention, a candidate sentence including a target candidate pair is first selected from a pre-stored sentence set, and a candidate sentence set is generated according to the selected candidate sentences, where the target candidate pair includes a target entity word and a candidate upper word corresponding to the target entity word; a sentence set vector corresponding to the candidate sentence set is determined according to each candidate sentence in the candidate sentence set and the pre-stored word vector set; a first word vector corresponding to the target entity word and a second word vector corresponding to the candidate upper word are acquired from the word vector set; and, according to the first word vector, the second word vector, and the sentence set vector, it is detected whether the candidate upper word is a superordinate word of the target entity word.
  • this realizes the detection of whether the candidate upper word is the upper word of the entity word and avoids manual extraction of upper word features, which improves the efficiency of upper word detection.
  • the search accuracy of the server is improved by the technical solution of the foregoing embodiment, and the final search result is more in line with the query requirement of the user, thereby preventing the user from inputting the entity word multiple times to obtain the desired search result. Therefore, the resource utilization of the server is improved.
  • FIG. 2 is a schematic flowchart diagram of another method for detecting information according to an embodiment of the present invention. As shown in FIG. 2, the method in the embodiment of the present invention may include the following steps 201-209.
  • the information detecting device extracts a plurality of entity words from the pre-stored sentence set, and forms the extracted plurality of entity words into a set of entity words.
  • the pre-stored sentence set can be used to extract multiple entity words.
  • the information detecting device may obtain multiple entity words from the pre-stored sentence set by using a Named Entity Recognition (NER) technology, where the NER can identify entities such as person names, animal names, place names, and organization names in the pre-stored sentence set, for example, tiger, lion, Shenzhen, etc.
  • the information detecting device extracts, from the pre-stored sentence set, a plurality of candidate upper words satisfying the preset part of speech by using a word segmentation manner.
  • the information detecting device may perform segmentation on each sentence in the pre-stored sentence set according to the current vocabulary dictionary.
  • the information detecting device may adopt, but is not limited to, a word segmentation method based on string matching or a word segmentation method based on statistics, to obtain dozens, thousands, or even more words.
  • the vocabulary dictionary is prepared for word segmentation and includes a plurality of characters, words, and phrases. Further optionally, the vocabulary dictionary can be updated in real time, so that new vocabulary can be added into the vocabulary dictionary; in this way, new words in the pre-stored sentence set are not incorrectly split apart, and the accuracy of the word segmentation is guaranteed.
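  • forward maximum matching is one classic string-matching segmentation method of the kind mentioned above; this toy sketch uses an illustrative dictionary and an unsegmented input string, and is not the patent's specific segmenter:

```python
def forward_max_match(text, dictionary, max_len=8):
    """Greedy forward maximum matching: at each position, take the
    longest dictionary word that matches, falling back to one character."""
    words, i = [], 0
    while i < len(text):
        for j in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + j] in dictionary or j == 1:
                words.append(text[i:i + j])
                i += j
                break
    return words

# Illustrative vocabulary dictionary and an unsegmented input string.
dictionary = {"flower", "delivery", "shop"}
segments = forward_max_match("flowerdeliveryshop", dictionary)
```

A word missing from the dictionary degrades into single characters, which is why keeping the dictionary updated matters for segmentation accuracy.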
  • the preset part of speech may include at least one of a noun and a noun phrase.
  • the candidate upper word may be determined as the upper word of the entity word; for example, if the entity word is a tiger and the candidate upper word is an animal, a tiger is considered to be an animal, and therefore the animal is the upper word of the tiger.
  • the entity words that do not satisfy the preset part of speech are deleted from the extracted plurality of entity words, and the remaining entity words are combined into the entity word set.
  • entity words with parts of speech such as prepositions, adjectives, and adverbs have no determinable corresponding upper words; entity words for which no upper word can be found can thus be excluded by the preset part of speech, which reduces the computational complexity of upper word detection.
  • after the information detecting device extracts the plurality of candidate upper words, it composes the plurality of candidate upper words into the candidate upper word set.
  • the information detecting device combines each entity word in the entity word set with each candidate upper word in the candidate upper word set to generate the candidate pairs.
  • for example, the entity word set shown in Table 1 includes the entity words A1, A2, A3, A4, and A5, and the candidate upper word set shown in Table 2 includes the candidate upper words B1, B2, and B3.
  • the candidate pairs formed from Tables 1 and 2 include A1-B1, A1-B2, A1-B3, A2-B1, A2-B2, A2-B3, A3-B1, A3-B2, A3-B3, A4-B1, A4-B2, A4-B3, A5-B1, A5-B2, and A5-B3. It can be seen that each entity word in the entity word set can be combined with each candidate upper word in the candidate upper word set as a candidate pair, to ensure the completeness of the candidate pairs.
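  • the exhaustive combination of the two sets can be sketched as a Cartesian product:

```python
from itertools import product

entity_words = ["A1", "A2", "A3", "A4", "A5"]   # entity word set (Table 1)
candidate_upper_words = ["B1", "B2", "B3"]      # candidate upper word set (Table 2)

# Every entity word is combined with every candidate upper word,
# giving the full 5 * 3 = 15 candidate pairs listed above.
candidate_pairs = list(product(entity_words, candidate_upper_words))
```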
  • the target candidate pair includes a candidate upper word corresponding to the target entity word and the target entity word.
  • the information detecting device first selects a candidate sentence including the target entity word and the candidate upper word from the pre-stored sentence set, and combines the selected candidate sentence into a candidate sentence set, wherein the candidate sentence set is used to detect the target Whether the candidate upper word of the candidate pair is the upper word of the target entity word.
  • FIG. 3 is a schematic flowchart of a step 205 according to an embodiment of the present invention. As shown in FIG. 3, the step 205 includes a step 2051 and a step 2052.
  • the information detecting device performs word segmentation on each candidate sentence in the candidate sentence set, extracts at least one word segment included in each candidate sentence, and determines, according to the pre-stored word vector set, the word vector corresponding to each of the at least one word segment.
  • the information detecting device may divide each candidate sentence into at least one word segment according to a vocabulary dictionary including a plurality of characters, words, and phrases, and convert each word segment in the candidate sentence into a word vector.
  • the word vector is a vector representation of a word
  • the information detecting device may separately search for a word vector corresponding to each part of the candidate sentence from the pre-stored word vector set.
  • the pre-stored word vector set may be generated by a tool that converts words into vectors (e.g., the word2vec method), which converts each word into a word vector.
  • the information detecting device combines the word vectors corresponding to each word segment according to the order of the word segments in each candidate sentence, and generates a sentence matrix corresponding to each candidate sentence.
  • the sentence matrix is an L*N two-dimensional matrix, where L is the number of word segments and N is the length of the word vector.
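  • building the L*N sentence matrix from the word vectors of an already-segmented candidate sentence can be sketched as follows; the sentence and the 4-dimensional word vectors are illustrative assumptions:

```python
import numpy as np

# A slice of the pre-stored word vector set; 4-dim vectors are illustrative.
word_vectors = {
    "the":    np.array([0.1, 0.0, 0.2, 0.0]),
    "tiger":  np.array([0.7, -0.2, 0.1, 0.5]),
    "is":     np.array([0.0, 0.3, 0.0, 0.1]),
    "an":     np.array([0.2, 0.1, 0.0, 0.0]),
    "animal": np.array([0.6, 0.2, -0.3, 0.4]),
}

def sentence_matrix(segments, word_vectors):
    """Stack the word vector of each segment, in order, into an L*N matrix."""
    return np.stack([word_vectors[w] for w in segments])

M = sentence_matrix(["the", "tiger", "is", "an", "animal"], word_vectors)
```

Here L=5 word segments and N=4, so the resulting matrix is 5*4, preserving word order row by row.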
  • the sentence matrix corresponding to each candidate sentence may be determined according to step 2051 and step 2052. Take a candidate sentence as an example for illustration.
  • the information detecting device generates a sentence set vector corresponding to the candidate sentence set according to the sentence matrix corresponding to each candidate sentence in the candidate sentence set.
  • FIG. 4 is a schematic flowchart of step 206 provided according to an embodiment of the present invention. As shown in FIG. 4, step 206 includes step 2061 and step 2062.
  • the information detecting device performs training and prediction by using the LSTM, and compresses a sentence matrix corresponding to each candidate sentence in the candidate sentence set into a sentence vector corresponding to the candidate sentence.
  • the LSTM in the embodiment of the present invention is used for upper word detection.
  • the information detecting device may compress the L*N-dimensional sentence matrix of each candidate sentence into an H-dimensional sentence vector by using the LSTM, where H is the preset dimensionality of the hidden layer in the LSTM network.
  • a large number of positive and negative candidate pairs are constructed, and the LSTM is trained according to the candidate sentence sets included in the positive and negative candidate pairs, so that the LSTM can learn semantic features contained in the positive and negative candidate pairs, such as sentence patterns, implicit information, global state, and other characteristics.
  • the detection of the target candidate pair can be achieved based on the semantic features contained in the positive and negative candidate pairs that have been obtained.
  • the specific process of the LSTM learning the semantic features included in the positive and negative candidate pairs is as follows: taking a positive candidate pair as an example, each of a large number of positive candidate pairs is input, a candidate sentence set including the positive candidate pair is obtained, and multiple types of semantic features and the feature value corresponding to each feature are extracted from the candidate sentence set; likewise, the same operation is performed on the negative candidate pairs. Based on the principle that the feature values of most positive candidate pairs should approach a preset standard value while those of most negative candidate pairs should stay away from it, the various parameters required in the LSTM for upper word detection are determined.
  • a positive candidate pair includes an entity word and a candidate upper word corresponding to the entity word.
  • the candidate upper word is a superordinate word of the entity word.
  • the negative candidate pair includes an entity word and a candidate upper word corresponding to the entity word.
  • the candidate upper word is not the upper word of the entity word.
  • the information detecting device performs weighted averaging on the sentence vectors corresponding to each candidate sentence in the candidate sentence set, and generates the sentence set vector corresponding to the candidate sentence set. In this way, the sentence information of all candidate sentences including the target candidate pair can be represented by one sentence set vector.
  • the weighting value corresponding to each sentence vector can be freely set by the information detecting device, or set according to the level corresponding to each candidate sentence, for example, a large weighting value for a high-level sentence and a small weighting value for a low-level sentence. Further optionally, the level of each candidate sentence may be determined according to, but not limited to, the length of the candidate sentence and/or the number of occurrences of the target entity word and the candidate upper word of the target candidate pair.
  • the candidate sentence set contains 4 candidate sentences
  • the sentence vectors corresponding to each candidate sentence determined in step 2061 are H1, H2, H3, and H4, respectively, and the weighting value of each sentence vector is 1.
  • the sentence set vector corresponding to the candidate sentence set is then: H_avg = (H1 + H2 + H3 + H4) / 4.
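  • the weighted averaging of step 2062 can be sketched as follows, using illustrative 3-dimensional sentence vectors and the unit weights from the example above:

```python
import numpy as np

# Illustrative 3-dimensional sentence vectors H1..H4 for the 4 candidate
# sentences, each with weighting value 1, as in the example above.
H1, H2, H3, H4 = (np.array(v, dtype=float)
                  for v in ([1, 2, 3], [3, 2, 1], [2, 2, 2], [0, 4, 2]))
weights = np.array([1.0, 1.0, 1.0, 1.0])

vectors = np.stack([H1, H2, H3, H4])
# Weighted average: sum of weight * vector, divided by the total weight.
H_avg = (weights[:, None] * vectors).sum(axis=0) / weights.sum()
```

With all weights equal to 1 this reduces to the plain element-wise mean of the four sentence vectors; unequal weights would emphasize higher-level sentences.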
  • the information detecting device acquires, from the set of word vectors, a first word vector corresponding to the target entity word, and a second word vector of the candidate upper word.
  • the word vector set includes a word vector corresponding to multiple words.
  • the information detecting device combines the first word vector corresponding to the target entity word, the second word vector of the candidate upper word, and the sentence set vector to generate a target vector.
  • the first word vector and the second word vector are used to distinguish a target entity word from a candidate upper word.
  • the acquired first word vector, the second word vector, and the sentence set vector generated in step 206 are combined to generate a target vector. For example, if the first word vector is N1, the second word vector is N2, and the sentence set vector is H_avg, then the target vector is the concatenation T = [N1, N2, H_avg].
  • the preset classifier is used to detect, according to the target vector, whether the candidate upper word is a superordinate word of the target entity word.
  • the information detecting device uses a preset classifier to detect whether the candidate upper word is the upper word of the target entity word.
  • the classifier is set to two categories, which are the first category and the second category, respectively.
  • the first category indicates that the candidate upper word is the upper word of the target entity word; the second category indicates that the candidate upper word is not the upper word of the target entity word.
  • the preset classifier can calculate the classification value corresponding to each category according to the target vector, and determine the detection result according to the classification value.
  • the target vector determined in step 208 is a 1*(N+N+H)-dimensional vector.
  • the classifier includes a first category and a second category, and the preset classifier can multiply the target vector by a parameter matrix of (N+N+H) rows * 2 columns, thereby obtaining the classification value of each category.
  • the parameter matrix of the (N+N+H) row*2 column is obtained by the information detecting device through a plurality of training candidate pairs.
  • if the classification value corresponding to the first category calculated by using the preset classifier is greater than the classification value corresponding to the second category, it is determined that the candidate upper word is a superordinate word of the target entity word; if it is not greater, it is determined that the candidate upper word is not a superordinate word of the target entity word.
  • the parameter matrix of (N+N+H) rows * 2 columns is obtained by training on the positive and negative candidate pairs, so that for most positive candidate pairs the first category corresponds to the larger classification value, and for most negative candidate pairs the second category corresponds to the larger classification value.
  • the preset classifier may include but is not limited to a softmax classifier.
  • the classification value corresponding to each category indicates the probability that the category occurs, and the probabilities of all categories sum to 1. For example, if the probability of the first category is 0.8 and the probability of the second category is 0.2, then since the probability of the first category is greater than that of the second category, it is determined that the candidate upper word is the upper word of the target entity word.
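  • the classification step can be sketched as follows; the dimensions, the randomly generated parameter matrix, and the input vectors are illustrative stand-ins for trained values, not the patent's actual parameters:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())       # numerically stable softmax
    return e / e.sum()

def detect_upper_word(target_vector, W):
    """Score the two categories with a (N+N+H) x 2 parameter matrix W;
    the first column scores 'is the upper word', the second 'is not'."""
    scores = softmax(target_vector @ W)
    return bool(scores[0] > scores[1]), scores

# Illustrative dimensions: N=4-dim word vectors, H=3-dim sentence set vector,
# so the target vector T = [N1; N2; H_avg] is 1*(4+4+3) = 11-dimensional.
rng = np.random.default_rng(1)
n1, n2, h_avg = rng.normal(size=4), rng.normal(size=4), rng.normal(size=3)
target_vector = np.concatenate([n1, n2, h_avg])
W = rng.normal(size=(11, 2))      # stand-in for the trained parameter matrix
is_upper, scores = detect_upper_word(target_vector, W)
```

The softmax output gives the two classification values as probabilities summing to 1, matching the comparison rules described above.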
  • alternatively, the detection result may be determined by judging the classification value corresponding to the first category. For example, if the classification value corresponding to the first category calculated by the preset classifier is greater than a first threshold, it is determined that the candidate upper word is a superordinate word of the target entity word; if it is not greater than the first threshold, it is determined that the candidate upper word is not a superordinate word of the target entity word.
  • for example, the parameter matrix of (N+N+H) rows * 2 columns is obtained by training on the positive and negative candidate pairs such that the classification value corresponding to the first category of most positive candidate pairs is greater than the first threshold, and the classification value corresponding to the first category of most negative candidate pairs is not greater than the first threshold.
  • the detection result may also be determined by judging the classification value corresponding to the second category. For example, if the classification value corresponding to the second category calculated by the preset classifier is greater than a second threshold, it is determined that the candidate upper word is not a superordinate word of the target entity word; if it is not greater than the second threshold, it is determined that the candidate upper word is a superordinate word of the target entity word. For example, the parameter matrix of (N+N+H) rows * 2 columns is obtained by training on the positive and negative candidate pairs such that the classification value corresponding to the second category of most positive candidate pairs is not greater than the second threshold, and the classification value corresponding to the second category of most negative candidate pairs is greater than the second threshold.
  • FIG. 5 is an exemplary diagram of an information detection method according to an embodiment of the present invention.
  • the information detecting device includes a pre-stored sentence set storage module, a word vector storage module, an LSTM module, and a preset classifier module.
  • the pre-stored sentence set storage module is configured to store a large amount of corpus data, and may be used to extract multiple entity words and multiple candidate upper words, select candidate sentences including the target candidate pair, and generate a candidate sentence set according to the selected candidate sentences.
  • the word vector storage module may be configured to store a word vector corresponding to each word generated by the training candidate pair, and may be used to determine a word vector of the entity word, the candidate upper word, the word segmentation in the sentence, and the like.
  • the LSTM module may compress a sentence matrix of each candidate sentence into a sentence vector, and generate a sentence set vector corresponding to the candidate sentence set.
  • the preset classifier module may be configured to combine the first word vector corresponding to the target entity word, the second word vector of the candidate upper word, and the sentence set vector to generate a target vector, and detect whether the candidate upper word is the upper word of the target entity word.
  • the specific implementation process based on FIG. 5 is as follows: first, the input target candidate pair is obtained, the candidate sentences including the target candidate pair are obtained from the pre-stored sentence set storage module in the information detecting device, and these candidate sentences are combined into a candidate sentence set. Then, each candidate sentence in the candidate sentence set is segmented, at least one word segment included in each candidate sentence is extracted, the word vector corresponding to each word segment is determined from the word vector storage module, and the word vectors corresponding to each word segment are combined according to the order of the word segments in the candidate sentence to generate the sentence matrix corresponding to the candidate sentence. Next, the LSTM module compresses the sentence matrix corresponding to each candidate sentence in the candidate sentence set into the sentence vector corresponding to that candidate sentence, and the sentence vectors corresponding to each candidate sentence are weighted and averaged to generate the sentence set vector corresponding to the candidate sentence set. The first word vector corresponding to the target entity word and the second word vector corresponding to the candidate upper word are obtained from the word vector storage module. Finally, the preset classifier module determines the detection result according to the first word vector, the second word vector, and the sentence set vector, where the detection result is whether the candidate upper word is the upper word of the target entity word.
  • in this embodiment of the present invention, candidate sentences including a target candidate pair are first selected from a pre-stored sentence set, and a candidate sentence set is generated from the selected candidate sentences, the target candidate pair including a target entity word and a candidate upper word corresponding to the target entity word; a sentence set vector corresponding to the candidate sentence set is determined according to each candidate sentence in the candidate sentence set and a pre-stored word vector set; a first word vector corresponding to the target entity word and a second word vector corresponding to the candidate upper word are acquired from the word vector set; and, according to the first word vector, the second word vector, and the sentence set vector, whether the candidate upper word is a superordinate word of the target entity word is detected.
  • by analyzing the sentences containing the candidate pair together with the entity word and candidate upper word in the pair, whether the candidate upper word is a superordinate word of the entity word can thus be detected without manually extracting upper-word features, which improves the efficiency of upper-word detection.
  • FIG. 6 is a schematic structural diagram of an information detecting apparatus according to an embodiment of the present invention.
  • the information detecting apparatus 1 of the embodiment of the present invention may include: a generating module 11, a determining module 12, and a detecting module 13.
  • the generating module 11 is configured to select a candidate sentence that includes the target candidate pair from the pre-stored sentence set, and generate a candidate sentence set according to the selected candidate sentence, where the target candidate pair includes the target entity word and the candidate upper word corresponding to the target entity word.
  • the determining module 12 is configured to determine a sentence set vector corresponding to the candidate sentence set according to each candidate sentence in the candidate sentence set and the pre-stored word vector set.
  • FIG. 7 is a schematic structural diagram of a determining module according to an embodiment of the present invention.
  • the determining module 12 includes: a matrix determining unit 121 and a vector generating unit 122 .
  • the matrix determining unit 121 is configured to generate a sentence matrix corresponding to each candidate sentence in the candidate sentence set according to the pre-stored word vector set.
  • FIG. 8 is a schematic structural diagram of a matrix determining unit according to an embodiment of the present invention.
  • the matrix determining unit 121 includes: a word vector determining subunit 1211 and a sentence matrix generating subunit 1212.
  • the word vector determining sub-unit 1211 is configured to perform word segmentation on the candidate sentence, extract at least one participle contained in the candidate sentence, and determine a word vector corresponding to each part of the at least one participle according to the word vector set.
  • the sentence matrix generation sub-unit 1212 is configured to combine the word vectors corresponding to each participle according to the order of the each participle in the candidate sentence to generate a sentence matrix corresponding to the candidate sentence.
  • the sentence matrix corresponding to each candidate sentence may be determined according to the word vector determining subunit 1211 and the sentence matrix generating subunit 1212.
  • the vector generating unit 122 is configured to generate a sentence set vector corresponding to the candidate sentence set according to the sentence matrix corresponding to each candidate sentence in the candidate sentence set.
  • FIG. 9 is a schematic structural diagram of a vector generating unit according to an embodiment of the present invention.
  • the vector generating unit 122 includes: a sentence vector determining sub-unit 1221 and a vector generation sub-unit 1222.
  • the sentence vector determining sub-unit 1221 is configured to compress, according to the temporal recurrent neural network, a sentence matrix corresponding to each of the candidate sentences in the candidate sentence set into a sentence vector corresponding to the candidate sentence.
  • the vector generation sub-unit 1222 is configured to perform weighted averaging on the sentence vectors corresponding to each candidate sentence in the candidate sentence set, and generate a sentence set vector corresponding to the candidate sentence set.
  • the vector generation sub-unit 1222 performs weighted averaging on the sentence vectors corresponding to each candidate sentence in the candidate sentence set, and generates a sentence set vector corresponding to the candidate sentence set. This can represent the sentence information of all candidate sentences including the target candidate pair by a sentence set vector.
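As a concrete illustration, the weighted averaging performed by the vector generation sub-unit can be sketched as follows (a minimal sketch; the function name, the per-sentence weights, and the vector dimensionality are illustrative assumptions, not the patent's implementation):

```python
import numpy as np

def sentence_set_vector(sentence_vectors, weights=None):
    """Combine per-sentence H-dim vectors into one sentence-set vector
    by weighted averaging; with no weights given, all sentences count
    equally and the result is a plain mean."""
    vs = np.asarray(sentence_vectors, dtype=float)   # shape (num_sentences, H)
    if weights is None:
        weights = np.ones(len(vs))                   # equal weights by default
    w = np.asarray(weights, dtype=float)
    return (w[:, None] * vs).sum(axis=0) / w.sum()
```

With four sentence vectors H1..H4 and unit weights, this reduces to (H1 + H2 + H3 + H4) / 4, matching the example given later in the description.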
  • the detecting module 13 is configured to obtain, from the set of word vectors, a first word vector corresponding to the target entity word and a second word vector corresponding to the candidate upper word; according to the first word vector, the second The word vector and the sentence set vector detect whether the candidate upper word is a super word of the target entity word.
  • FIG. 10 is a schematic structural diagram of a detection module according to an embodiment of the present invention.
  • the detection module 13 includes: a word vector obtaining unit 131, a target vector generating unit 132, and a superordinate word detecting unit 133.
  • the word vector obtaining unit 131 is configured to obtain, from the set of word vectors, a first word vector corresponding to the target entity word and a second word vector of the candidate upper word.
  • the target vector generating unit 132 is configured to combine the first word vector, the second word vector, and the sentence set vector to generate a target vector.
  • the superordinate word detecting unit 133 is configured to detect, by using a preset classifier, whether the candidate superordinate word is a superordinate word of the target entity word according to the target vector.
  • the preset classifier includes a first category and a second category, and the first category indicates that the candidate upper word is a super word of the target entity word.
  • the second classification indicates that the candidate upper word is not a super word of the target entity word.
  • the superordinate word detecting unit 133 is configured to: use the preset classifier to calculate, according to the target vector, a classification value corresponding to the first category and a classification value corresponding to the second category. And if the classification value corresponding to the first category is greater than the classification value corresponding to the second category, determining that the candidate upper word is a superordinate word of the target entity word. And if the classification value corresponding to the first category is not greater than the classification value corresponding to the second category, determining that the candidate upper word is not a superordinate word of the target entity word.
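The decision rule of the superordinate word detecting unit reduces to a comparison of the two classification values; a minimal sketch (the function name is illustrative):

```python
def is_superordinate(first_value, second_value):
    """Apply the detection rule: the candidate upper word is accepted as a
    superordinate word only if the first-category classification value is
    strictly greater than the second-category value; ties are rejected,
    since 'not greater' maps to the second category."""
    return first_value > second_value
```

For example, with classification values 0.8 and 0.2 the candidate is accepted; with 0.5 and 0.5 it is rejected.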
  • the generating module 11 of the information detecting device 1 selects a candidate sentence that includes a target candidate pair from the pre-stored sentence set, and generates a candidate sentence set according to the selected candidate sentence, and is further used to: from the pre-stored sentence set Extracting a plurality of entity words, and generating a set of entity words including the plurality of entity words; extracting, by using a word segmentation manner, a plurality of candidate upper words satisfying the preset part of speech, and generating the plurality of candidates a set of candidate upper words of the upper word; combining each entity word in the set of entity words with each candidate upper word in the set of candidate upper words to generate at least one candidate pair; generating at least one candidate pair One of the selected ones is determined as the target candidate pair.
  • the generating module 11 is further configured to: delete an entity word that does not satisfy the preset part of speech from the extracted plurality of entity words to generate the entity word set.
  • if the entity word and the candidate upper word have a hyponym-hypernym relation, the candidate upper word is determined to be a superordinate word of the entity word; for example, if the entity word is "tiger" and the candidate upper word is "animal", a tiger can be considered an animal, so "animal" is a superordinate word of "tiger". Since words such as prepositions, adjectives, and adverbs cannot determine corresponding upper words, participles for which no upper word can be found may be excluded by the preset part of speech, reducing the computation and complexity of upper-word detection.
  • FIG. 11 is a schematic structural diagram of another information detecting apparatus according to an embodiment of the present invention.
  • the information detecting apparatus 1000 may include at least one processor 1001, such as a CPU, at least one network interface 1004, a user interface 1003, a memory 1005, and at least one communication bus 1002.
  • the communication bus 1002 is used to implement connection communication between these components.
  • the user interface 1003 can include a display and a keyboard.
  • the optional user interface 1003 can also include a standard wired interface and a wireless interface.
  • the network interface 1004 can optionally include a standard wired interface, a wireless interface (such as a WI-FI interface).
  • the memory 1005 may be a high speed RAM memory or a non-volatile memory such as at least one disk memory.
  • the memory 1005 can also optionally be at least one storage device located remotely from the aforementioned processor 1001. As shown in FIG. 11, an operating system, a network communication module, a user interface module, and a superordinate word detection application may be included in the memory 1005 as a computer storage medium.
  • the processor 1001 can be used to call the upper word detecting application stored in the memory 1005, and specifically perform the following operations:
  • the processor 1001 also performs:
  • One of the generated at least one candidate pair is determined to be the target candidate pair.
  • the processor 1001 further performs: deleting an entity word that does not satisfy the preset part of speech from the extracted plurality of entity words to generate the entity word set.
  • the processor 1001 determines, according to each candidate sentence in the candidate sentence set and the pre-stored word vector set, a sentence set vector corresponding to the candidate sentence set, and specifically executes:
  • the processor 1001 is configured to generate a sentence matrix corresponding to each candidate sentence in the candidate sentence set according to the pre-stored word vector set, and specifically:
  • the word vectors corresponding to each participle are combined according to the order of the each participle in the candidate sentence, and the sentence matrix corresponding to the candidate sentence is generated.
  • the processor 1001 generates a sentence set vector corresponding to the candidate sentence set according to the sentence matrix corresponding to each candidate sentence in the candidate sentence set, and specifically executes:
  • when detecting, according to the first word vector, the second word vector, and the sentence set vector, whether the candidate upper word is a superordinate word of the target entity word, the processor 1001 specifically performs:
  • the preset classifier is used to detect, according to the target vector, whether the candidate upper word is a superordinate word of the target entity word.
  • the preset classifier includes a first category and a second category, the first category indicating that the candidate upper word is a super word of the target entity word; the second category indicates the The candidate upper word is not the upper word of the target entity word;
  • the processor 1001 performs, by using a preset classifier, whether the candidate upper-level word is the upper-level word of the target entity word according to the target vector, and specifically executes:
  • if the classification value corresponding to the first category is greater than the classification value corresponding to the second category, determining that the candidate upper word is a superordinate word of the target entity word;
  • modules or units in the terminal in the embodiment of the present invention may be combined, divided, and deleted according to actual needs.
  • the storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), or a random access memory (RAM).


Abstract

Embodiments of the present invention disclose an information detection method, device and storage medium. The method includes: selecting, from a pre-stored sentence set, candidate sentences containing a target candidate pair, and generating a candidate sentence set from the selected candidate sentences, the target candidate pair including a target entity word and a candidate hypernym corresponding to the target entity word; determining, according to each candidate sentence in the candidate sentence set and a pre-stored word vector set, a sentence-set vector corresponding to the candidate sentence set; obtaining, from the word vector set, a first word vector corresponding to the target entity word and a second word vector corresponding to the candidate hypernym; and detecting, according to the first word vector, the second word vector and the sentence-set vector, whether the candidate hypernym is a hypernym of the target entity word.

Description

Information detection method, device and storage medium
This application claims priority to Chinese Patent Application No. 201710172589.7, entitled "Hypernym detection method and device", filed with the China Patent Office on March 21, 2017.
Technical Field
The present invention relates to the field of computer technology, and in particular to an information detection method, device and storage medium.
Background of the Invention
With the development of network technology, web search has been continuously improved, and all kinds of information can be obtained from the Internet through web search. For example, a user submits a query keyword, and a website returns search results related to that keyword. The search results may contain results for the hypernym (superordinate word) of the keyword, or for its hyponyms. If the keyword is "tiger", its hypernym is "animal"; if the keyword is "animal", its hyponyms may be "tiger" or other words. How to determine the hypernym of a given word is therefore an important step.
Summary of the Invention
Embodiments of the present invention provide an information detection method, device and storage medium. By analyzing the sentences containing a candidate pair together with the entity word and candidate hypernym in the pair, whether the candidate hypernym is a hypernym of the entity word can be detected, which improves the efficiency of hypernym detection.
In a first aspect, an embodiment of the present invention provides an information detection method, applied to an information detection device, the method including:
selecting, from a pre-stored sentence set, candidate sentences containing a target candidate pair, and generating a candidate sentence set from the selected candidate sentences, the target candidate pair including a target entity word and a candidate hypernym corresponding to the target entity word;
determining, according to each candidate sentence in the candidate sentence set and a pre-stored word vector set, a sentence-set vector corresponding to the candidate sentence set;
obtaining, from the word vector set, a first word vector corresponding to the target entity word and a second word vector corresponding to the candidate hypernym; and
detecting, according to the first word vector, the second word vector and the sentence-set vector, whether the candidate hypernym is a hypernym of the target entity word.
In a second aspect, an embodiment of the present invention further provides an information detection device, including a processor and a memory, the memory storing instructions executable by the processor, the processor being configured, when executing the instructions, to:
select, from a pre-stored sentence set, candidate sentences containing a target candidate pair, and generate a candidate sentence set from the selected candidate sentences, the target candidate pair including a target entity word and a candidate hypernym corresponding to the target entity word;
determine, according to each candidate sentence in the candidate sentence set and a pre-stored word vector set, a sentence-set vector corresponding to the candidate sentence set;
obtain, from the word vector set, a first word vector corresponding to the target entity word and a second word vector corresponding to the candidate hypernym; and
detect, according to the first word vector, the second word vector and the sentence-set vector, whether the candidate hypernym is a hypernym of the target entity word.
In a third aspect, an embodiment of the present invention provides a computer-readable storage medium storing computer-readable instructions that cause at least one processor to perform the method described above.
Brief Description of the Drawings
To describe the technical solutions in the embodiments of the present application more clearly, the following briefly introduces the drawings required in the description of the embodiments. Apparently, the drawings described below are only some embodiments of the present application, and a person of ordinary skill in the art may derive other drawings from them without creative effort. In the drawings:
FIG. 1a is a schematic structural diagram of an implementation environment according to an embodiment of the present invention;
FIG. 1b is a schematic flowchart of an information detection method according to an embodiment of the present invention;
FIG. 2 is a schematic flowchart of another information detection method according to an embodiment of the present invention;
FIG. 3 is a schematic flowchart of step 205 according to an embodiment of the present invention;
FIG. 4 is a schematic flowchart of step 206 according to an embodiment of the present invention;
FIG. 5 is an example diagram of an information detection method according to an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of an information detection device according to an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of a determining module according to an embodiment of the present invention;
FIG. 8 is a schematic structural diagram of a matrix determining unit according to an embodiment of the present invention;
FIG. 9 is a schematic structural diagram of a vector generating unit according to an embodiment of the present invention;
FIG. 10 is a schematic structural diagram of a detection module according to an embodiment of the present invention;
FIG. 11 is a schematic structural diagram of another information detection device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the drawings. Apparently, the described embodiments are only some rather than all of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative effort fall within the protection scope of the present application.
The terms "include" and "have" in the specification, claims and drawings of the present invention, and any variants thereof, are intended to cover non-exclusive inclusion. For example, a process, method, system, product or device including a series of steps or units is not limited to the listed steps or units, but optionally further includes steps or units that are not listed, or optionally further includes other steps or units inherent to the process, method, product or device.
In existing technical solutions, for a candidate pair consisting of an entity word and a candidate hypernym corresponding to that entity word, whether the candidate hypernym is a hypernym of the entity word is determined by manually compiling and extracting features that can characterize hypernyms. This requires considerable domain knowledge and manpower, and lowers the efficiency of hypernym detection.
FIG. 1a is a schematic structural diagram of an implementation environment according to an embodiment of the present invention. As shown in FIG. 1a, an information detection system 100 includes a server 110, a network 120, a terminal device 130 and a user 140. The server 110 includes a processor and a memory, and the method embodiments of the present invention are performed by the processor executing instructions stored in the memory. Specifically, the server 110 includes a sentence database 111, a word vector database 112 and a hypernym detection unit 113. A client 130-1 is installed on the terminal device 130 and provides the user with a search window for entering the entity word to be queried.
In embodiments of the present invention, the sentence database 111 stores a large number of sentences, forming a sentence set, and the word vector database 112 stores a word vector for each word, forming a word vector set. The hypernym detection unit 113 is configured to: select, from the sentence set pre-stored in the sentence database 111, candidate sentences containing a target candidate pair and generate a candidate sentence set, the target candidate pair including a target entity word and a candidate hypernym corresponding to the target entity word; determine, according to each candidate sentence in the candidate sentence set and the pre-stored word vector set, a sentence-set vector corresponding to the candidate sentence set; obtain, from the word vector set stored in the word vector database 112, a first word vector corresponding to the target entity word and a second word vector corresponding to the candidate hypernym; and detect, according to the first word vector, the second word vector and the sentence-set vector, whether the candidate hypernym is a hypernym of the target entity word. The server 110 then sends the determined hypernym, as part of the search results, to the client 130-1 on the terminal device 130, and the client 130-1 presents the search results to the user.
The server 110 may be a single server, a server cluster composed of several servers, or a cloud computing service center. The network 120 connects the server 110 and the terminal device 130 in wired or wireless form. The terminal device 130 may be a smart terminal, including a smartphone, tablet computer, laptop computer, or the like.
The information detection method provided by the embodiments of the present invention is described in detail below with reference to FIG. 1b to FIG. 5.
Referring to FIG. 1b, which is a schematic flowchart of an information detection method according to an embodiment of the present invention. As shown in FIG. 1b, the method of this embodiment, applied to an information detection device, may include the following steps 101 to 104.
101: Select, from a pre-stored sentence set, candidate sentences containing a target candidate pair, and generate a candidate sentence set from the selected candidate sentences.
Specifically, the information detection device selects candidate sentences containing the target candidate pair from the pre-stored sentence set. The pre-stored sentence set may be composed of a corpus from which candidate pairs can be extracted. The target candidate pair is any one of multiple candidate pairs, and hypernym detection can be performed for each candidate pair through the solution described in this embodiment. The target candidate pair includes a target entity word and a candidate hypernym corresponding to the target entity word.
The information detection device first selects, from the pre-stored sentence set, candidate sentences containing both the target entity word and the candidate hypernym, and then combines the selected candidate sentences into a candidate sentence set, which is used to detect whether the candidate hypernym in the target candidate pair is a hypernym of the target entity word.
Further, in embodiments of the present invention, if a hyponym-hypernym relation exists between the entity word and the candidate hypernym, the candidate hypernym is determined to be a hypernym of the entity word. For example, if the entity word is "tiger" and the candidate hypernym is "animal", a tiger can be considered an animal, so "animal" is a hypernym of "tiger". The target candidate pair can then be expressed as (tiger, animal).
In embodiments of the present invention, entity words include nouns, pronouns and the like. A hypernym is defined relative to an entity word and refers to a word with broader conceptual extension; any attribute or any way of categorizing the concept expressed by an entity word can be its hypernym. For example, candidate hypernyms of the entity word "flower delivery" may be "flowers", "delivery", "online shopping", "flower etiquette", "flower shop" and "gift company". Likewise, candidate hypernyms of the entity word "Faye Wong" may be "singer", "woman", "mom", "daughter", "Hong Kong" and "Leo".
Optionally, the target entity word and candidate hypernym contained in the target candidate pair are combined by the information detection device randomly selecting one item each from an entity word set and a candidate hypernym set, where the entity word set contains at least one entity word and the candidate hypernym set contains at least one candidate hypernym. It can be seen that, before the combination, it has not been determined whether the candidate hypernym is a hypernym of the target entity word; in this embodiment, hypernym detection is achieved through the following operations.
102: Determine, according to each candidate sentence in the candidate sentence set and a pre-stored word vector set, a sentence-set vector corresponding to the candidate sentence set.
Specifically, the information detection device determines the sentence-set vector corresponding to the candidate sentence set according to each candidate sentence in the candidate sentence set and the pre-stored word vector set.
In embodiments of the present invention, detecting candidate hypernyms of entity words through machine learning requires turning natural-language entity words into mathematical form, so a word vector is set and stored in advance for each entity word. Optionally, a word vector represents a word as a vector.
Optionally, the information detection device may compress the sentence matrix of each candidate sentence into an H-dimensional vector through a Long Short-Term Memory (LSTM) network, where H is the size of the hidden layer of the LSTM network. In embodiments of the present invention, the sentence information of the candidate sentences related to the target candidate pair can be reflected through the sentence-set vector, improving the accuracy of hypernym detection.
103: Obtain, from the word vector set, a first word vector corresponding to the target entity word and a second word vector corresponding to the candidate hypernym.
104: Detect, according to the first word vector, the second word vector and the sentence-set vector, whether the candidate hypernym is a hypernym of the target entity word.
Specifically, the information detection device obtains, from the word vector set, the first word vector corresponding to the target entity word and the second word vector corresponding to the candidate hypernym, and detects, according to the first word vector, the second word vector and the determined sentence-set vector corresponding to the candidate sentence set, whether the candidate hypernym is a hypernym of the target entity word. This combines the information of the target entity word and candidate hypernym with the information of the candidate sentences containing them, so that whether the candidate hypernym is a hypernym of the target entity word can be determined more accurately.
The first word vector and the second word vector are used to distinguish the word vectors corresponding to the target entity word and the candidate hypernym. Optionally, a word vector represents a word as a vector whose elements may be numbers. For example, "microphone" may be represented as the word vector [0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 ...] and "mic" as [0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 ...]; alternatively, a word vector may be represented as [0.792, -0.177, -0.107, 0.109, -0.542, ...].
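As a toy illustration of a pre-stored word vector set (the words and 4-dimensional values below are made up for illustration; a real set would hold dense vectors trained with a tool such as word2vec):

```python
import numpy as np

# Hypothetical pre-stored word vector set: each word maps to a dense vector.
word_vectors = {
    "tiger":  np.array([0.792, -0.177, -0.107, 0.109]),
    "animal": np.array([0.051, 0.623, -0.542, 0.200]),
}

def lookup(word, dim=4):
    """Fetch the stored vector for a word; unknown words fall back to a
    zero vector (one common convention among several)."""
    return word_vectors.get(word, np.zeros(dim))
```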
Optionally, the information detection device may use a classifier to classify the data comprising the first word vector, the second word vector and the sentence-set vector. The categories may be divided into a first category indicating that the candidate hypernym is a hypernym of the target entity word, and a second category indicating that the candidate hypernym is not a hypernym of the target entity word; whether the candidate hypernym is a hypernym of the target entity word is determined according to the classification values of the first and second categories.
In this embodiment of the present invention, candidate sentences containing a target candidate pair are first selected from a pre-stored sentence set, and a candidate sentence set is generated from the selected candidate sentences, the target candidate pair including a target entity word and a candidate hypernym corresponding to the target entity word; a sentence-set vector corresponding to the candidate sentence set is determined according to each candidate sentence in the candidate sentence set and a pre-stored word vector set; a first word vector corresponding to the target entity word and a second word vector corresponding to the candidate hypernym are obtained from the word vector set; and whether the candidate hypernym is a hypernym of the target entity word is detected according to the first word vector, the second word vector and the sentence-set vector. By analyzing the sentences containing the candidate pair together with the entity word and candidate hypernym in the pair, whether the candidate hypernym is a hypernym of the entity word can be detected without manually extracting hypernym features, which improves the efficiency of hypernym detection.
When the information detection device is a server, the technical solution of the above embodiment improves the server's search accuracy, so that the final search results better match the user's query needs, and the user is saved from entering entity words repeatedly to obtain the desired results, thereby improving the server's resource utilization.
Referring to FIG. 2, which is a schematic flowchart of another information detection method according to an embodiment of the present invention. As shown in FIG. 2, the method of this embodiment may include the following steps 201 to 209.
201: Extract multiple entity words from a pre-stored sentence set, and generate an entity word set containing the multiple entity words.
Specifically, the information detection device extracts multiple entity words from the pre-stored sentence set and combines them into an entity word set. The pre-stored sentence set can be used to extract multiple entity words. Optionally, the information detection device may use Named Entity Recognition (NER) technology to obtain multiple entity words from the pre-stored sentence set; NER can recognize entity words such as person names, animal names, place names and organization names in the sentence set, for example, tiger, lion, Shenzhen.
202: Extract, by word segmentation, multiple candidate hypernyms satisfying a preset part of speech from the pre-stored sentence set, and generate a candidate hypernym set containing the multiple candidate hypernyms.
Specifically, the information detection device extracts, by word segmentation, multiple candidate hypernyms satisfying the preset part of speech from the pre-stored sentence set. Optionally, the information detection device may segment each sentence in the pre-stored sentence set according to a current vocabulary dictionary; for example, it may use, without limitation, string-matching-based or statistics-based segmentation methods to obtain dozens, thousands or more words. The vocabulary dictionary is prepared for word segmentation and contains multiple words, terms and phrases. Further optionally, the vocabulary dictionary may be updated in real time, so that new vocabulary is added to the dictionary and new words in the pre-stored sentence set are not split apart, ensuring segmentation accuracy.
Further optionally, the preset part of speech may include at least one of nouns and noun phrases. In addition, in embodiments of the present invention, if a hyponym-hypernym relation exists between the entity word and the candidate hypernym, the candidate hypernym is determined to be a hypernym of the entity word; for example, if the entity word is "tiger" and the candidate hypernym is "animal", a tiger can be considered an animal, so "animal" is a hypernym of "tiger".
In this embodiment of the present invention, entity words that do not satisfy the preset part of speech are deleted from the extracted entity words, and the remaining entity words are combined into the entity word set. Specifically, since no corresponding hypernym can be determined for words with parts of speech such as prepositions, adjectives and adverbs, entity words for which no hypernym can be found can be excluded through the preset part of speech, reducing the computation and complexity of hypernym detection.
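The part-of-speech filter described above can be sketched as follows, using hypothetical (word, tag) pairs as a segmenter might emit them and assuming "n"/"np" tags for nouns and noun phrases (tag names are illustrative, not from the patent):

```python
PRESET_POS = {"n", "np"}  # preset parts of speech: nouns and noun phrases

def filter_entity_words(tagged_words):
    """Drop words whose part of speech (preposition, adjective, adverb, ...)
    cannot have a hypernym, keeping only words matching the preset tags."""
    return [w for w, tag in tagged_words if tag in PRESET_POS]
```

For example, from [("tiger", "n"), ("beautiful", "a"), ("in", "p"), ("animal", "n")] only "tiger" and "animal" survive the filter.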
Further, after the information detection device extracts the multiple candidate hypernyms, it combines them into a candidate hypernym set.
203: Combine each entity word in the entity word set with each candidate hypernym in the candidate hypernym set to generate at least one candidate pair, and select one of the generated candidate pairs as the target candidate pair.
Specifically, the information detection device combines each entity word in the entity word set with each candidate hypernym in the candidate hypernym set to generate candidate pairs. For example, the entity word set shown in Table 1 contains entity words A1, A2, A3, A4 and A5, and the candidate hypernym set shown in Table 2 contains candidate hypernyms B1, B2 and B3.
Table 1
Entity word set: A1, A2, A3, A4, A5
Table 2
Candidate hypernym set: B1, B2, B3
The candidate pairs formed from Tables 1 and 2 include A1-B1, A1-B2, A1-B3, A2-B1, A2-B2, A2-B3, A3-B1, A3-B2, A3-B3, A4-B1, A4-B2, A4-B3, A5-B1, A5-B2 and A5-B3. It can be seen that each entity word in the entity word set can be combined with each candidate hypernym in the candidate hypernym set into a candidate pair, ensuring the completeness of the candidate pairs.
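The exhaustive pairing of Tables 1 and 2 is a Cartesian product; a minimal sketch:

```python
from itertools import product

entity_words = ["A1", "A2", "A3", "A4", "A5"]    # Table 1
candidate_hypernyms = ["B1", "B2", "B3"]         # Table 2

# Every entity word is paired with every candidate hypernym,
# yielding the 5 x 3 = 15 candidate pairs listed above.
candidate_pairs = list(product(entity_words, candidate_hypernyms))
```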
204: Select, from the pre-stored sentence set, candidate sentences containing the target candidate pair, and generate a candidate sentence set from the selected candidate sentences.
Specifically, the target candidate pair includes the target entity word and the candidate hypernym corresponding to it. The information detection device first selects, from the pre-stored sentence set, candidate sentences containing both the target entity word and the candidate hypernym, and then combines the selected candidate sentences into a candidate sentence set, which is used to detect whether the candidate hypernym in the target candidate pair is a hypernym of the target entity word.
205: Determine, according to a pre-stored word vector set, a sentence matrix corresponding to each candidate sentence in the candidate sentence set.
Specifically, the information detection device determines the sentence matrix for each candidate sentence in the candidate sentence set according to the pre-stored word vector set. Referring also to FIG. 3, a schematic flowchart of step 205; as shown in FIG. 3, step 205 includes steps 2051 and 2052.
2051: Segment each candidate sentence in the candidate sentence set to extract the at least one segmented word it contains, and determine, according to the pre-stored word vector set, the word vector corresponding to each segmented word.
Specifically, the information detection device segments each candidate sentence in the candidate sentence set, extracts the at least one segmented word contained in each candidate sentence, and determines the word vector for each segmented word according to the pre-stored word vector set. Optionally, the information detection device may split each candidate sentence according to an entry dictionary containing multiple words, terms and phrases to obtain at least one segmented word, and convert each segmented word of the candidate sentence into a word vector.
Optionally, a word vector represents a word as a vector; the information detection device may look up, in the pre-stored word vector set, the word vector for each segmented word of the candidate sentence. For example, the pre-stored word vector set may be produced by a word-to-vector tool (such as the word2vec method) that converts a word into a word vector.
2052: Combine, in the order in which the segmented words appear in each candidate sentence, the word vectors corresponding to the segmented words, to generate the sentence matrix corresponding to that candidate sentence.
Specifically, the information detection device combines the word vectors of the segmented words in their order of appearance in the candidate sentence to generate the sentence matrix. The sentence matrix is an L x N two-dimensional matrix, where L is the number of segmented words and N is the length of a word vector.
For each candidate sentence in the candidate sentence set, its sentence matrix can be determined through steps 2051 and 2052. The following takes one candidate sentence as an example.
For instance, the candidate sentence is "a b c"; after segmentation, "word1 word2 word3" is obtained, with word1 = a, word2 = b, word3 = c. The word vector of each segmented word is then looked up in the word vector set: word1 = word embedding1, word2 = word embedding2, word3 = word embedding3. Finally, the sentence matrix is constructed in the order of the segmented words in the candidate sentence, as shown below; since the number L of segmented words is 3 and each word vector is 1 x N dimensional, the sentence matrix is 3 x N:
[ word embedding1 ]
[ word embedding2 ]
[ word embedding3 ]
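Constructing the L x N sentence matrix from looked-up embeddings can be sketched as follows (the toy 2-dimensional vectors are illustrative; a real system would use the pre-stored word vector set):

```python
import numpy as np

def sentence_matrix(tokens, word_vectors, dim):
    """Stack the word vector of each token, in sentence order, into an
    L x N matrix; tokens missing from the set fall back to zero vectors."""
    return np.vstack([word_vectors.get(t, np.zeros(dim)) for t in tokens])

# Toy example mirroring the sentence "a b c" above.
vecs = {"a": np.array([1.0, 0.0]),
        "b": np.array([0.0, 1.0]),
        "c": np.array([1.0, 1.0])}
matrix = sentence_matrix(["a", "b", "c"], vecs, dim=2)   # shape (3, 2)
```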
206: Generate, according to the sentence matrix corresponding to each candidate sentence in the candidate sentence set, the sentence-set vector corresponding to the candidate sentence set.
Specifically, the information detection device generates the sentence-set vector according to the sentence matrices of the candidate sentences in the candidate sentence set. Referring also to FIG. 4, a schematic flowchart of step 206; as shown in FIG. 4, step 206 includes steps 2061 and 2062.
2061: Determine, based on a temporal recurrent neural network and according to the sentence matrix of each candidate sentence in the candidate sentence set, the sentence vector corresponding to each candidate sentence.
Specifically, the information detection device performs training and prediction through an LSTM, compressing the sentence matrix of each candidate sentence in the candidate sentence set into the sentence vector of that candidate sentence. The LSTM in this embodiment is used for hypernym detection. The information detection device may compress the L x N sentence matrix of each candidate sentence into an H-dimensional sentence vector through the LSTM, where H is the preset hidden-layer size of the LSTM network.
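A minimal numpy sketch of this compression step: a single-layer LSTM is run over the rows of the L x N sentence matrix and its final hidden state is kept as the H-dimensional sentence vector. The parameter layout (one weight matrix and bias per gate) is an illustrative assumption; in practice these weights would come from training on positive and negative candidate pairs.

```python
import numpy as np

def lstm_compress(sentence_matrix, params):
    """Compress an L x N sentence matrix into an H-dim sentence vector by
    running an LSTM over its rows and returning the final hidden state.
    `params` maps each gate (i, f, o, g) to a weight matrix W* of shape
    H x (N + H) and a bias b* of shape (H,)."""
    H = params["Wi"].shape[0]
    h, c = np.zeros(H), np.zeros(H)          # hidden state and cell state
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    for x in sentence_matrix:                # one time step per word vector
        z = np.concatenate([x, h])
        i = sigmoid(params["Wi"] @ z + params["bi"])  # input gate
        f = sigmoid(params["Wf"] @ z + params["bf"])  # forget gate
        o = sigmoid(params["Wo"] @ z + params["bo"])  # output gate
        g = np.tanh(params["Wg"] @ z + params["bg"])  # candidate cell state
        c = f * c + i * g
        h = o * np.tanh(c)
    return h
```

Since the output is o * tanh(c) with o in (0, 1), every component of the resulting sentence vector lies strictly inside (-1, 1).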
In a concrete application, a large number of positive and negative candidate pairs are constructed, and the LSTM is trained on the candidate sentence sets containing these pairs, so that it can learn semantic features carried by positive and negative candidate pairs, such as implicit sentence-pattern information and global state. Based on the semantic features learned from the positive and negative candidate pairs, the target candidate pair can be detected.
Optionally, the specific process by which the LSTM learns the semantic features contained in positive and negative candidate pairs is as follows. Taking positive candidate pairs as an example: each of the massive positive candidate pairs is input, the candidate sentence set containing that pair is obtained, and multiple classes of semantic features and their feature values are extracted from the candidate sentence set. The same operation is performed for the negative candidate pairs. The parameters required by the LSTM for hypernym detection are then determined on the principle that the feature values of most positive candidate pairs should be close to a preset standard value, while those of most negative candidate pairs should be far from it.
It should be noted that, when the massive positive and negative candidate pairs are constructed, they contain the same types of information as the target candidate pair in this embodiment. A positive candidate pair includes an entity word and a corresponding candidate hypernym that is a hypernym of the entity word; a negative candidate pair includes an entity word and a corresponding candidate hypernym that is not a hypernym of the entity word.
2062: Perform weighted averaging on the sentence vectors corresponding to the candidate sentences in the candidate sentence set to generate the sentence-set vector corresponding to the candidate sentence set.
Specifically, the information detection device performs weighted averaging on the sentence vector of each candidate sentence in the candidate sentence set to generate the sentence-set vector. In this way, the sentence information of all candidate sentences containing the target candidate pair can be represented by one sentence-set vector.
The weight of each sentence component may be set freely by the information detection device, or set according to the grade of each candidate sentence, for example a larger weight for a higher grade and a smaller weight for a lower grade. Further optionally, the grade of each candidate sentence may be determined according to, without limitation, the sentence's length and the number of occurrences of the target entity word and/or candidate hypernym of the target candidate pair it contains.
For example, if the candidate sentence set contains four candidate sentences whose sentence vectors, determined through step 2061, are H1, H2, H3 and H4, and the weight of each sentence component is 1, the sentence-set vector of the candidate sentence set is:
H_avg = (H1 + H2 + H3 + H4) / 4
207: Obtain, from the word vector set, the first word vector corresponding to the target entity word and the second word vector of the candidate hypernym.
Specifically, the information detection device obtains, from the word vector set, the first word vector corresponding to the target entity word and the second word vector of the candidate hypernym. Optionally, the word vector set contains word vectors corresponding to multiple words.
208: Merge the first word vector corresponding to the target entity word, the second word vector of the candidate hypernym and the sentence-set vector to generate a target vector.
Specifically, the information detection device merges the first word vector corresponding to the target entity word, the second word vector of the candidate hypernym and the sentence-set vector to generate the target vector. The first word vector and the second word vector are used to distinguish the word vectors corresponding to the target entity word and the candidate hypernym.
Further, the obtained first word vector and second word vector are merged with the sentence-set vector generated in step 206 to produce the target vector. For example, if the first word vector is N1, the second word vector is N2 and the sentence-set vector is H_avg, the target vector T is:
T = [N1, N2, H_avg]
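The merge in step 208 is a simple concatenation; a minimal sketch:

```python
import numpy as np

def target_vector(n1, n2, h_avg):
    """Concatenate the entity word vector N1, the candidate hypernym's
    word vector N2 and the sentence-set vector H_avg into
    T = [N1, N2, H_avg], of dimension N + N + H."""
    return np.concatenate([n1, n2, h_avg])
```

With 1 x N word vectors and a 1 x H sentence-set vector, the result is the 1 x (N + N + H) target vector fed to the classifier.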
209: Use a preset classifier to detect, according to the target vector, whether the candidate hypernym is a hypernym of the target entity word.
Specifically, the information detection device uses the preset classifier to detect whether the candidate hypernym is a hypernym of the target entity word. In an optional solution, since this embodiment aims to detect whether the candidate hypernym is a hypernym of the target entity word, the classifier is set to two categories: a first category indicating that the candidate hypernym is a hypernym of the target entity word, and a second category indicating that it is not. The preset classifier can compute, from the target vector, a classification value for each category, and the detection result is determined according to the classification values.
For example, if any word vector is 1 x N dimensional and the sentence-set vector is 1 x H dimensional, the target vector determined in step 208 is 1 x (N + N + H) dimensional. The preset classifier contains the first category and the second category, and can compute the classification value of each category by applying a parameter matrix of (N + N + H) rows x 2 columns to the target vector. The (N + N + H) x 2 parameter matrix is obtained by the information detection device through training on multiple training candidate pairs.
Optionally, if the classification value of the first category computed by the preset classifier is greater than that of the second category, the candidate hypernym is determined to be a hypernym of the target entity word; if the first-category classification value is not greater than the second-category classification value, the candidate hypernym is determined not to be a hypernym of the target entity word. For example, the (N + N + H) x 2 parameter matrix is obtained by training on positive and negative candidate pairs, so that the first-category classification value of most positive candidate pairs is large, and the second-category classification value of most negative candidate pairs is large.
Optionally, the preset classifier may include, without limitation, a softmax classifier. Taking the softmax classifier as an example, the classification value of each category represents the probability of that category, and the probabilities of all categories sum to 1. If the probability of the first category is 0.8 and that of the second category is 0.2, then since the first-category probability is greater, the candidate hypernym is determined to be a hypernym of the target entity word.
Besides comparing the first-category classification value with the second-category classification value as above, the detection result may also be determined by judging the first-category classification value alone. For example, if the first-category classification value computed by the preset classifier is greater than a first threshold, the candidate hypernym is determined to be a hypernym of the target entity word; if it is not greater than the first threshold, the candidate hypernym is determined not to be one. For example, the (N + N + H) x 2 parameter matrix is trained on positive and negative candidate pairs so that the first-category value of most positive candidate pairs is greater than the first threshold, while that of most negative candidate pairs is not greater than the first threshold.
Likewise, the detection result may be determined by judging the second-category classification value. For example, if the second-category classification value computed by the preset classifier is greater than a second threshold, the candidate hypernym is determined not to be a hypernym of the target entity word; if it is not greater than the second threshold, the candidate hypernym is determined to be one. For example, the (N + N + H) x 2 parameter matrix is trained on positive and negative candidate pairs so that the second-category value of most positive candidate pairs is not greater than the second threshold, while that of most negative candidate pairs is greater than the second threshold.
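The two-category scoring described above (parameter matrix followed by softmax normalization) can be sketched as follows; the trained (N + N + H) x 2 parameter matrix is replaced here by a toy matrix for illustration:

```python
import numpy as np

def classify(target_vec, param_matrix):
    """Score both categories by multiplying the target vector with the
    (N + N + H) x 2 parameter matrix, then softmax-normalize so the two
    classification values sum to 1. The candidate is accepted as a
    hypernym when the first-category value is the larger one."""
    logits = target_vec @ param_matrix            # shape (2,)
    exp = np.exp(logits - logits.max())           # numerically stable softmax
    probs = exp / exp.sum()
    return probs, bool(probs[0] > probs[1])
```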
For a better understanding of this embodiment of the present invention, refer also to FIG. 5, an example diagram of an information detection method according to an embodiment of the present invention. As shown in FIG. 5, the information detection device includes a pre-stored sentence set storage module, a word vector storage module, an LSTM module and a preset classifier module.
The pre-stored sentence set storage module is configured to store a large amount of corpus data, and can be used to extract multiple entity words and multiple candidate hypernyms, select candidate sentences containing the target candidate pair, and generate a candidate sentence set from the selected candidate sentences.
The word vector storage module may be configured to store the word vectors, generated from training candidate pairs, corresponding to each word, and can be used to determine the word vectors of entity words, candidate hypernyms, segmented words in sentences, and the like.
The LSTM module may compress the sentence matrix of each candidate sentence into a sentence vector and generate the sentence-set vector corresponding to the candidate sentence set.
The preset classifier module may be configured to merge the first word vector corresponding to the target entity word, the second word vector of the candidate hypernym and the sentence-set vector into a target vector, and to detect whether the candidate hypernym is a hypernym of the entity word.
The concrete implementation process based on FIG. 5 is as follows. First, the input target candidate pair is obtained, candidate sentences containing the target candidate pair are obtained from the pre-stored sentence set storage module in the information detection device, and those candidate sentences are combined into a candidate sentence set. Next, each candidate sentence in the candidate sentence set is segmented to extract the at least one segmented word it contains, the word vector of each segmented word is determined from the word vector storage module, and the word vectors are combined in the order of the segmented words in the candidate sentence to generate that candidate sentence's matrix. Then, based on the LSTM module, the sentence matrix of each candidate sentence in the candidate sentence set is compressed into that sentence's vector; the sentence vectors of the candidate sentences are weighted-averaged to generate the sentence-set vector of the candidate sentence set; and the first word vector of the target entity word and the second word vector of the candidate hypernym are obtained from the word vector storage module. Finally, the preset classifier module determines the detection result according to the first word vector, the second word vector and the sentence-set vector, the detection result being whether the candidate hypernym is a hypernym of the target entity word.
In this embodiment of the present invention, candidate sentences containing a target candidate pair are first selected from a pre-stored sentence set, and a candidate sentence set is generated from the selected candidate sentences, the target candidate pair including a target entity word and a candidate hypernym corresponding to the target entity word; a sentence-set vector corresponding to the candidate sentence set is determined according to each candidate sentence in the candidate sentence set and a pre-stored word vector set; a first word vector corresponding to the target entity word and a second word vector corresponding to the candidate hypernym are obtained from the word vector set; and whether the candidate hypernym is a hypernym of the target entity word is detected according to the first word vector, the second word vector and the sentence-set vector. By analyzing the sentences containing the candidate pair together with the entity word and candidate hypernym in the pair, whether the candidate hypernym is a hypernym of the entity word can be detected without manually extracting hypernym features, which improves the efficiency of hypernym detection.
Referring to FIG. 6, a schematic structural diagram of an information detection device according to an embodiment of the present invention. As shown in FIG. 6, the information detection device 1 of this embodiment may include: a generating module 11, a determining module 12 and a detecting module 13.
The generating module 11 is configured to select, from a pre-stored sentence set, candidate sentences containing a target candidate pair and generate a candidate sentence set from the selected candidate sentences, the target candidate pair including a target entity word and a candidate hypernym corresponding to the target entity word.
The determining module 12 is configured to determine, according to each candidate sentence in the candidate sentence set and a pre-stored word vector set, the sentence-set vector corresponding to the candidate sentence set.
Specifically, referring also to FIG. 7, a schematic structural diagram of a determining module according to an embodiment of the present invention; as shown in FIG. 7, the determining module 12 includes a matrix determining unit 121 and a vector generating unit 122.
The matrix determining unit 121 is configured to generate, according to the pre-stored word vector set, the sentence matrix corresponding to each candidate sentence in the candidate sentence set.
Specifically, referring also to FIG. 8, a schematic structural diagram of a matrix determining unit according to an embodiment of the present invention; as shown in FIG. 8, the matrix determining unit 121 includes a word vector determining sub-unit 1211 and a sentence matrix generating sub-unit 1212.
For each candidate sentence in the candidate sentence set, the following processing is performed:
The word vector determining sub-unit 1211 is configured to segment the candidate sentence, extract the at least one segmented word it contains, and determine, according to the word vector set, the word vector corresponding to each segmented word.
The sentence matrix generating sub-unit 1212 is configured to combine, in the order of the segmented words in the candidate sentence, the word vectors of the segmented words to generate the sentence matrix corresponding to the candidate sentence.
For each candidate sentence in the candidate sentence set, its sentence matrix can be determined through the word vector determining sub-unit 1211 and the sentence matrix generating sub-unit 1212.
The vector generating unit 122 is configured to generate, according to the sentence matrix of each candidate sentence in the candidate sentence set, the sentence-set vector corresponding to the candidate sentence set.
Specifically, referring also to FIG. 9, a schematic structural diagram of a vector generating unit according to an embodiment of the present invention; as shown in FIG. 9, the vector generating unit 122 includes a sentence vector determining sub-unit 1221 and a vector generating sub-unit 1222.
The sentence vector determining sub-unit 1221 is configured to compress, based on a temporal recurrent neural network, the sentence matrix of each candidate sentence in the candidate sentence set into the sentence vector of that candidate sentence.
The vector generating sub-unit 1222 is configured to perform weighted averaging on the sentence vector of each candidate sentence in the candidate sentence set to generate the sentence-set vector corresponding to the candidate sentence set.
Specifically, the vector generating sub-unit 1222 performs weighted averaging on the sentence vectors of the candidate sentences to generate the sentence-set vector, so that the sentence information of all candidate sentences containing the target candidate pair can be represented by one sentence-set vector.
The detecting module 13 is configured to obtain, from the word vector set, the first word vector corresponding to the target entity word and the second word vector corresponding to the candidate hypernym, and to detect, according to the first word vector, the second word vector and the sentence-set vector, whether the candidate hypernym is a hypernym of the target entity word.
Specifically, referring also to FIG. 10, a schematic structural diagram of a detection module according to an embodiment of the present invention; as shown in FIG. 10, the detecting module 13 includes a word vector obtaining unit 131, a target vector generating unit 132 and a hypernym detecting unit 133.
The word vector obtaining unit 131 is configured to obtain, from the word vector set, the first word vector corresponding to the target entity word and the second word vector of the candidate hypernym.
The target vector generating unit 132 is configured to merge the first word vector, the second word vector and the sentence-set vector to generate a target vector.
The hypernym detecting unit 133 is configured to use a preset classifier to detect, according to the target vector, whether the candidate hypernym is a hypernym of the target entity word.
Optionally, the preset classifier contains a first category and a second category, the first category indicating that the candidate hypernym is a hypernym of the target entity word, and the second category indicating that the candidate hypernym is not a hypernym of the target entity word.
The hypernym detecting unit 133 is specifically configured to: use the preset classifier to compute, according to the target vector, the classification value corresponding to the first category and the classification value corresponding to the second category; determine, if the first-category classification value is greater than the second-category classification value, that the candidate hypernym is a hypernym of the target entity word; and determine, if the first-category classification value is not greater than the second-category classification value, that the candidate hypernym is not a hypernym of the target entity word.
Optionally, before selecting candidate sentences containing the target candidate pair from the pre-stored sentence set and generating the candidate sentence set from the selected candidate sentences, the generating module 11 of the information detection device 1 is further configured to: extract multiple entity words from the pre-stored sentence set and generate an entity word set containing them; extract, by word segmentation, multiple candidate hypernyms satisfying the preset part of speech from the pre-stored sentence set and generate a candidate hypernym set containing them; combine each entity word in the entity word set with each candidate hypernym in the candidate hypernym set to generate at least one candidate pair; and select one of the generated candidate pairs as the target candidate pair.
Optionally, the generating module 11 is further configured to delete, from the extracted entity words, entity words that do not satisfy the preset part of speech, so as to generate the entity word set.
In addition, in embodiments of the present invention, if a hyponym-hypernym relation exists between the entity word and the candidate hypernym, the candidate hypernym is determined to be a hypernym of the entity word; for example, if the entity word is "tiger" and the candidate hypernym is "animal", a tiger can be considered an animal, so "animal" is a hypernym of "tiger". Since no corresponding hypernym can be determined for words with parts of speech such as prepositions, adjectives and adverbs, parts of speech for which no hypernym can be found can be excluded through the preset part of speech, reducing the computation and complexity of hypernym detection.
Referring to FIG. 11, a schematic structural diagram of another information detection device according to an embodiment of the present invention. As shown in FIG. 11, the information detection device 1000 may include: at least one processor 1001 such as a CPU, at least one network interface 1004, a user interface 1003, a memory 1005, and at least one communication bus 1002. The communication bus 1002 is used to implement connection and communication between these components. The user interface 1003 may include a display and a keyboard, and optionally may further include standard wired and wireless interfaces. The network interface 1004 may optionally include a standard wired interface and a wireless interface (such as a Wi-Fi interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory, for example at least one disk memory; optionally, it may also be at least one storage device located away from the processor 1001. As shown in FIG. 11, the memory 1005, as a computer storage medium, may include an operating system, a network communication module, a user interface module and a hypernym detection application.
In the information detection device 1000 shown in FIG. 11, the processor 1001 may be configured to call the hypernym detection application stored in the memory 1005 and specifically perform the following operations:
selecting, from a pre-stored sentence set, candidate sentences containing a target candidate pair, and generating a candidate sentence set from the selected candidate sentences, the target candidate pair including a target entity word and a candidate hypernym corresponding to the target entity word;
determining, according to each candidate sentence in the candidate sentence set and a pre-stored word vector set, a sentence-set vector corresponding to the candidate sentence set;
obtaining, from the word vector set, a first word vector corresponding to the target entity word and a second word vector corresponding to the candidate hypernym; and
detecting, according to the first word vector, the second word vector and the sentence-set vector, whether the candidate hypernym is a hypernym of the target entity word.
In one embodiment, the processor 1001 further performs:
extracting multiple entity words from the pre-stored sentence set, and generating an entity word set containing the multiple entity words;
extracting, by word segmentation, multiple candidate hypernyms satisfying a preset part of speech from the pre-stored sentence set, and generating a candidate hypernym set containing the multiple candidate hypernyms;
combining each entity word in the entity word set with each candidate hypernym in the candidate hypernym set to generate at least one candidate pair; and
selecting one of the generated candidate pairs as the target candidate pair.
In one embodiment, the processor 1001 further performs: deleting, from the extracted entity words, entity words that do not satisfy the preset part of speech, so as to generate the entity word set.
In one embodiment, when determining, according to each candidate sentence in the candidate sentence set and the pre-stored word vector set, the sentence-set vector corresponding to the candidate sentence set, the processor 1001 specifically performs:
generating, according to the pre-stored word vector set, the sentence matrix corresponding to each candidate sentence in the candidate sentence set; and
generating, according to the sentence matrix of each candidate sentence in the candidate sentence set, the sentence-set vector corresponding to the candidate sentence set.
In one embodiment, when generating, according to the pre-stored word vector set, the sentence matrix corresponding to each candidate sentence in the candidate sentence set, the processor 1001 specifically performs:
for each candidate sentence in the candidate sentence set, the following processing:
segmenting the candidate sentence, extracting the at least one segmented word it contains, and determining, according to the word vector set, the word vector corresponding to each segmented word; and
combining, in the order of the segmented words in the candidate sentence, the word vectors of the segmented words to generate the sentence matrix corresponding to the candidate sentence.
In one embodiment, when generating, according to the sentence matrix of each candidate sentence in the candidate sentence set, the sentence-set vector corresponding to the candidate sentence set, the processor 1001 specifically performs:
compressing, based on a temporal recurrent neural network, the sentence matrix of each candidate sentence in the candidate sentence set into the sentence vector of that candidate sentence; and
performing weighted averaging on the sentence vector of each candidate sentence in the candidate sentence set to generate the sentence-set vector corresponding to the candidate sentence set.
In one embodiment, when detecting, according to the first word vector, the second word vector and the sentence-set vector, whether the candidate hypernym is a hypernym of the target entity word, the processor 1001 specifically performs:
merging the first word vector, the second word vector and the sentence-set vector to generate a target vector; and
using a preset classifier to detect, according to the target vector, whether the candidate hypernym is a hypernym of the target entity word.
In one embodiment, the preset classifier contains a first category and a second category, the first category indicating that the candidate hypernym is a hypernym of the target entity word, and the second category indicating that the candidate hypernym is not a hypernym of the target entity word;
when using the preset classifier to detect, according to the target vector, whether the candidate hypernym is a hypernym of the target entity word, the processor 1001 specifically performs:
computing, with the preset classifier and according to the target vector, the classification value corresponding to the first category and the classification value corresponding to the second category;
determining, if the classification value corresponding to the first category is greater than the classification value corresponding to the second category, that the candidate hypernym is a hypernym of the target entity word; and
determining, if the classification value corresponding to the first category is not greater than the classification value corresponding to the second category, that the candidate hypernym is not a hypernym of the target entity word.
It should be noted that the operations performed by the processor 1001 described in this embodiment may be specifically implemented according to the methods in the method embodiments shown in FIG. 1b to FIG. 5 above, and are not repeated here.
The steps in the methods of the embodiments of the present invention may be reordered, combined and deleted according to actual needs.
The modules or units in the terminal of the embodiments of the present invention may be combined, divided and deleted according to actual needs.
The terms used in the embodiments of the present invention are for the purpose of describing particular embodiments only and are not intended to limit the present invention. The singular forms "a", "said" and "the" used in the embodiments of the present invention and the appended claims are also intended to include plural forms, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" used herein refers to and includes any or all possible combinations of one or more of the associated listed items. In addition, the terms "first", "second", "third", "fourth" and the like in the specification, claims and drawings of the present invention are used to distinguish different objects rather than to describe a particular order. Furthermore, the terms "include" and "have" and any variants thereof are intended to cover non-exclusive inclusion; for example, a process, method, system, product or device including a series of steps or units is not limited to the listed steps or units, but optionally further includes steps or units that are not listed, or optionally further includes other steps or units inherent to the process, method, product or device.
A person of ordinary skill in the art may understand that all or part of the processes of the methods in the above embodiments may be implemented by a computer program instructing relevant hardware. The program may be stored in a computer-readable storage medium and, when executed, may include the processes of the embodiments of the above methods. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like.
The above disclosure is merely preferred embodiments of the present invention and certainly cannot be used to limit the scope of rights of the present invention. Equivalent changes made according to the claims of the present invention therefore still fall within the scope covered by the present invention.

Claims (17)

  1. 一种信息检测方法,其特征在于,应用于信息检测设备,所述方法包括:
    从预存句子集合中选取包含目标候选对的候选句子,根据选取的候选句子生成候选句子集合,所述目标候选对包括目标实体词和目标实体词对应的候选上位词;
    根据所述候选句子集合中的每个候选句子和预存的词向量集合,确定所述候选句子集合对应的句子集合向量;
    从所述词向量集合中获取所述目标实体词对应的第一词向量和所述候选上位词对应的第二词向量;及,
    根据所述第一词向量、所述第二词向量以及所述句子集合向量,检测所述候选上位词是否为所述目标实体词的上位词。
  2. 根据权利要求1所述的方法,其特征在于,还包括:
    从预存句子集合中提取多个实体词,并生成包含所述多个实体词的实体词集合;
    采用分词方式从所述预存句子集合中提取满足预设词性的多个候选上位词,并生成包含所述多个候选上位词的候选上位词集合;
    将所述实体词集合中的每个实体词与所述候选上位词集合中的每个候选上位词进行组合,生成至少一个候选对;
    从生成的至少一个候选对中选择一个确定为所述目标候选对。
  3. 根据权利要求2所述的方法,其特征在于,还包括:
    从提取出的多个实体词中删除不满足所述预设词性的实体词,以生 成所述实体词集合。
  4. 根据权利要求1-3中任一项所述的方法,其特征在于,所述根据所述候选句子集合中的每个候选句子和预存的词向量集合,确定所述候选句子集合对应的句子集合向量,包括:
    根据预存的词向量集合,生成所述候选句子集合中每个候选句子对应的句子矩阵;
    根据所述候选句子集合中所述每个候选句子对应的句子矩阵,生成所述候选句子集合对应的句子集合向量。
  5. 根据权利要求4所述的方法,其特征在于,所述根据预存的词向量集合,生成所述候选句子集合中每个候选句子对应的句子矩阵,包括:
    针对所述候选句子集合中的每个候选句子,执行如下处理:
    对该候选句子进行分词,提取出该候选句子中包含的至少一个分词,并根据所述词向量集合确定所述至少一个分词中每个分词对应的词向量;
    按照所述每个分词在该候选句子中的排列顺序,将所述每个分词对应的词向量进行组合,生成该候选句子对应的句子矩阵。
  6. The method according to claim 4, wherein the generating, according to the sentence matrix corresponding to each candidate sentence in the candidate sentence set, the sentence set vector corresponding to the candidate sentence set comprises:
    compressing, based on a recurrent neural network, the sentence matrix corresponding to each candidate sentence in the candidate sentence set into a sentence vector corresponding to the candidate sentence;
    performing a weighted average on the sentence vectors corresponding to the candidate sentences in the candidate sentence set, to generate the sentence set vector corresponding to the candidate sentence set.
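The compress-then-average step can be sketched with a minimal recurrent cell. This is an illustration only: a plain tanh cell with random weights stands in for the (e.g. LSTM-style) recurrent network the claim refers to, and the weights are uniform here although the claim allows any weighting.

```python
import numpy as np

def rnn_compress(sentence_matrix: np.ndarray, w_h: np.ndarray, w_x: np.ndarray) -> np.ndarray:
    """Run a minimal recurrent cell over the word vectors (rows) and return
    the final hidden state as the sentence vector."""
    h = np.zeros(w_h.shape[0])
    for x in sentence_matrix:  # one word vector per step, in sentence order
        h = np.tanh(w_h @ h + w_x @ x)
    return h

def sentence_set_vector(matrices, weights, w_h, w_x):
    """Weighted average of the per-sentence vectors -> the sentence set vector."""
    vecs = np.stack([rnn_compress(m, w_h, w_x) for m in matrices])
    weights = np.asarray(weights, dtype=float)
    return (weights[:, None] * vecs).sum(axis=0) / weights.sum()

# Toy run: two sentences of 2-d word vectors, hidden size 3, uniform weights.
rng = np.random.default_rng(42)
w_h, w_x = rng.normal(size=(3, 3)), rng.normal(size=(3, 2))
mats = [rng.normal(size=(4, 2)), rng.normal(size=(6, 2))]
v = sentence_set_vector(mats, [1.0, 1.0], w_h, w_x)
print(v.shape)  # (3,)
```

Note that the sentence set vector has the hidden-state dimension regardless of how many words each candidate sentence contains, which is what lets sentences of different lengths be combined.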
  7. The method according to any one of claims 1 to 3, wherein the detecting, according to the first word vector, the second word vector, and the sentence set vector, whether the candidate hypernym is a hypernym of the target entity word comprises:
    merging the first word vector, the second word vector, and the sentence set vector to generate a target vector;
    detecting, by a preset classifier and according to the target vector, whether the candidate hypernym is a hypernym of the target entity word.
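The merging step can be sketched as follows. Concatenation is one natural reading of "merge"; the claim itself does not fix the operation, so treat this as an assumption.

```python
import numpy as np

def target_vector(first_word_vec, second_word_vec, sentence_set_vec):
    """Merge the three vectors into the single target vector fed to the
    classifier; here 'merge' is read as concatenation."""
    return np.concatenate([first_word_vec, second_word_vec, sentence_set_vec])

# Entity word vector (d=3), candidate hypernym vector (d=3), sentence set vector (d=4).
v = target_vector(np.ones(3), np.zeros(3), np.full(4, 0.5))
print(v.shape)  # (10,)
```

Under this reading, the classifier sees the entity word, the candidate hypernym, and the contextual evidence from the candidate sentences side by side in one fixed-length input.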
  8. The method according to claim 7, wherein the preset classifier comprises a first class and a second class, the first class indicating that the candidate hypernym is a hypernym of the target entity word, and the second class indicating that the candidate hypernym is not a hypernym of the target entity word;
    the detecting, by the preset classifier and according to the target vector, whether the candidate hypernym is a hypernym of the target entity word comprises:
    computing, by the preset classifier and according to the target vector, a classification value corresponding to the first class and a classification value corresponding to the second class;
    if the classification value corresponding to the first class is greater than the classification value corresponding to the second class, determining that the candidate hypernym is a hypernym of the target entity word;
    if the classification value corresponding to the first class is not greater than the classification value corresponding to the second class, determining that the candidate hypernym is not a hypernym of the target entity word.
  9. An information detection device, comprising a processor and a memory, the memory storing instructions executable by the processor, wherein, when executing the instructions, the processor is configured to:
    select, from a pre-stored sentence set, candidate sentences containing a target candidate pair, and generate a candidate sentence set from the selected candidate sentences, the target candidate pair comprising a target entity word and a candidate hypernym corresponding to the target entity word;
    determine, according to each candidate sentence in the candidate sentence set and a pre-stored word vector set, a sentence set vector corresponding to the candidate sentence set;
    obtain, from the word vector set, a first word vector corresponding to the target entity word and a second word vector corresponding to the candidate hypernym; and
    detect, according to the first word vector, the second word vector, and the sentence set vector, whether the candidate hypernym is a hypernym of the target entity word.
  10. The device according to claim 9, wherein, when executing the instructions, the processor is further configured to:
    extract a plurality of entity words from the pre-stored sentence set, and generate an entity word set containing the plurality of entity words;
    extract, by means of word segmentation, a plurality of candidate hypernyms satisfying a preset part of speech from the pre-stored sentence set, and generate a candidate hypernym set containing the plurality of candidate hypernyms;
    combine each entity word in the entity word set with each candidate hypernym in the candidate hypernym set to generate at least one candidate pair;
    select one of the generated at least one candidate pair as the target candidate pair.
  11. The device according to claim 10, wherein, when executing the instructions, the processor is further configured to:
    delete, from the extracted plurality of entity words, entity words that do not satisfy the preset part of speech, to generate the entity word set.
  12. The device according to any one of claims 9 to 11, wherein, when executing the instructions, the processor is further configured to:
    generate, according to the pre-stored word vector set, a sentence matrix corresponding to each candidate sentence in the candidate sentence set;
    generate, according to the sentence matrix corresponding to each candidate sentence in the candidate sentence set, the sentence set vector corresponding to the candidate sentence set.
  13. The device according to claim 12, wherein, when executing the instructions, the processor is further configured to:
    perform the following processing for each candidate sentence in the candidate sentence set:
    segment the candidate sentence, extract at least one segmented word contained in the candidate sentence, and determine, according to the word vector set, a word vector corresponding to each of the at least one segmented word;
    combine the word vectors corresponding to the segmented words in the order in which the segmented words are arranged in the candidate sentence, to generate the sentence matrix corresponding to the candidate sentence.
  14. The device according to claim 12, wherein, when executing the instructions, the processor is further configured to:
    compress, based on a recurrent neural network, the sentence matrix corresponding to each candidate sentence in the candidate sentence set into a sentence vector corresponding to the candidate sentence;
    perform a weighted average on the sentence vectors corresponding to the candidate sentences in the candidate sentence set, to generate the sentence set vector corresponding to the candidate sentence set.
  15. The device according to any one of claims 9 to 11, wherein, when executing the instructions, the processor is further configured to:
    obtain the first word vector and the second word vector from the word vector set;
    merge the first word vector, the second word vector, and the sentence set vector to generate a target vector;
    detect, by a preset classifier and according to the target vector, whether the candidate hypernym is a hypernym of the target entity word.
  16. The device according to claim 15, wherein the preset classifier comprises a first class and a second class, the first class indicating that the candidate hypernym is a hypernym of the target entity word, and the second class indicating that the candidate hypernym is not a hypernym of the target entity word;
    wherein, when executing the instructions, the processor is further configured to:
    compute, by the preset classifier and according to the target vector, a classification value corresponding to the first class and a classification value corresponding to the second class;
    if the classification value corresponding to the first class is greater than the classification value corresponding to the second class, determine that the candidate hypernym is a hypernym of the target entity word;
    if the classification value corresponding to the first class is not greater than the classification value corresponding to the second class, determine that the candidate hypernym is not a hypernym of the target entity word.
  17. A computer-readable storage medium storing computer-readable instructions that cause at least one processor to perform the method according to any one of claims 1 to 8.
PCT/CN2018/079111 2017-03-21 2018-03-15 Information detection method, device, and storage medium WO2018171499A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710172589.7 2017-03-21
CN201710172589.7A CN108304366B (zh) 2017-03-21 2017-03-21 Hypernym detection method and device

Publications (1)

Publication Number Publication Date
WO2018171499A1 true WO2018171499A1 (zh) 2018-09-27

Family

ID=62872084

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/079111 WO2018171499A1 (zh) 2017-03-21 2018-03-15 Information detection method, device, and storage medium

Country Status (2)

Country Link
CN (1) CN108304366B (zh)
WO (1) WO2018171499A1 (zh)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110196982B (zh) * 2019-06-12 2022-12-27 Tencent Technology (Shenzhen) Company Limited Method, apparatus, and computer device for extracting hypernym-hyponym relations
CN110610000A (zh) * 2019-08-12 2019-12-24 CCTV International Networks Wuxi Co., Ltd. Method and system for detecting contextual errors in key person names

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103365912A (zh) * 2012-04-06 2013-10-23 Fujitsu Limited Method and device for clustering and extracting entity relationship patterns
CN104933183A (zh) * 2015-07-03 2015-09-23 Chongqing University of Posts and Telecommunications Query term rewriting method combining a word vector model and naive Bayes
CN105808525A (zh) * 2016-03-29 2016-07-27 National Computer Network and Information Security Administration Center Method for extracting domain-concept hypernym-hyponym relations based on similar concept pairs
CN106126588A (zh) * 2016-06-17 2016-11-16 Guangzhou Shiyuan Electronics Co., Ltd. Method and device for providing related words
CN106407211A (zh) * 2015-07-30 2017-02-15 Fujitsu Limited Method and device for classifying semantic relations between entity words

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104268296A (zh) * 2014-10-27 2015-01-07 Liu Sha Hypernym search method and device


Also Published As

Publication number Publication date
CN108304366B (zh) 2020-04-03
CN108304366A (zh) 2018-07-20


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18770394

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18770394

Country of ref document: EP

Kind code of ref document: A1