WO2018171515A1 - Vocabulary mining method, apparatus and device - Google Patents

Vocabulary mining method, apparatus and device

Info

Publication number
WO2018171515A1
WO2018171515A1 (PCT/CN2018/079259)
Authority
WO
WIPO (PCT)
Prior art keywords
word
candidate
pair
entity
word vector
Prior art date
Application number
PCT/CN2018/079259
Other languages
English (en)
French (fr)
Inventor
李潇
张锋
王策
Original Assignee
腾讯科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司
Publication of WO2018171515A1

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 2216/00 Indexing scheme relating to additional aspects of information retrieval not explicitly covered by G06F16/00 and subgroups
    • G06F 2216/03 Data mining

Definitions

  • The present application relates to the field of data mining technologies, and in particular, to a vocabulary mining method, apparatus and device.
  • A hypernym is defined as follows: if an entity word A and a word B form a hypernym-hyponym relationship and the entity word A is a hyponym of the word B, then the word B is a hypernym of the entity word A.
  • For example, "animal" is a hypernym of "tiger".
  • On this basis, a word pair composed of an entity word A and a word B that form a hypernym-hyponym relationship is called a hypernym pair.
  • For example, (tiger, animal) constitutes a hypernym pair.
  • In view of this, the present application provides a vocabulary mining method, apparatus, and device, which can be used to mine hypernym pairs at low cost and with high efficiency.
  • A vocabulary mining method, comprising the steps below.
  • A vocabulary mining apparatus, comprising:
  • a set determining unit, configured to determine, for each sentence contained in a corpus to be mined, a set of entity words contained in the sentence, and a set of candidate hypernyms composed of the nouns and noun phrases contained in the sentence;
  • a candidate word pair determining unit, configured to combine pairwise the entity words in the entity word set and the candidate hypernyms in the candidate hypernym set, a word pair formed by combining an entity word and a candidate hypernym serving as a candidate word pair;
  • a word vector determining unit, configured to determine the respective word vectors of the entity word and the candidate hypernym in the candidate word pair, the respective word vectors forming a candidate word vector pair;
  • a hypernym determining unit, configured to determine, according to the candidate word vector pair, whether the candidate word pair is a vocabulary mining result.
  • A computer device, comprising a processor and a memory, the memory storing at least one instruction, at least one program, a code set or an instruction set, which is loaded and executed by the processor to implement the vocabulary mining method described above.
  • A computer readable storage medium, having stored therein at least one instruction, at least one program, a code set or an instruction set, which is loaded and executed by a processor to implement the above vocabulary mining method.
  • The vocabulary mining method determines, for each sentence contained in the corpus to be mined, the set of entity words contained in the sentence and the set of candidate hypernyms composed of the nouns and noun phrases contained in the sentence; combines pairwise the entity words in the entity word set and the candidate hypernyms in the candidate hypernym set, a word pair formed by combining an entity word and a candidate hypernym serving as a candidate word pair; determines the respective word vectors of the entity word and the candidate hypernym in the candidate word pair, the respective word vectors forming a candidate word vector pair; and determines, according to the candidate word vector pair, whether the candidate word pair is a vocabulary mining result.
  • That is, the present application determines the set of entity words and the set of candidate hypernyms contained in a corpus sentence, combines pairwise the words in the two sets to obtain candidate word pairs, and further determines the respective word
  • vectors of the entity word and the candidate hypernym in each candidate word pair, determining according to the word vector pair whether the candidate word pair is a vocabulary mining result, for example, whether the candidate word pair is a hypernym pair.
  • The present application does not require manual organization of the corpus and realizes automatic mining of hypernym pairs, which greatly improves hypernym pair mining efficiency and reduces mining cost.
  • FIG. 1 is a schematic structural diagram of server hardware according to an embodiment of the present application.
  • FIG. 2 is a flowchart of a vocabulary mining method disclosed in an embodiment of the present application.
  • FIG. 3 is a flowchart of another vocabulary mining method disclosed in an embodiment of the present application.
  • FIG. 4 illustrates an architecture diagram of a bidirectional recurrent neural network model.
  • FIG. 5 is a schematic diagram of a hypernym pair mining process according to an example of the present application.
  • FIG. 6 is a schematic structural diagram of a vocabulary mining apparatus according to an embodiment of the present application.
  • FIG. 7 is a schematic structural diagram of a word vector determining unit according to an embodiment of the present application.
  • FIG. 8 is a schematic structural diagram of a mining result determining unit disclosed in an embodiment of the present application.
  • FIG. 9 is a schematic structural diagram of an initial word vector determining unit according to an embodiment of the present application.
  • FIG. 10 is a schematic structural diagram of another initial word vector determining unit according to an embodiment of the present application.
  • The embodiments of the present application provide an automatic vocabulary mining scheme, which can be used to mine hypernym pairs.
  • The mining scheme is implemented on a server, and the server may also be referred to as a computer device.
  • The server may be a processing device such as a desktop or notebook computer.
  • Before the vocabulary mining method of the present application is introduced, the hardware structure of the server is first introduced. As shown in FIG. 1, the server may include a processor 1, a communication interface 2, a memory 3, a communication bus 4, and a display screen 5.
  • Processor 1 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and the like.
  • The processor 1 may be a Central Processing Unit (CPU), an Application Specific Integrated Circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present application.
  • The memory 3 may include one or more computer readable storage media, which may be non-transitory.
  • The memory 3 may also include high-speed random access memory, as well as non-volatile memory such as one or more disk storage devices or flash storage devices.
  • The non-transitory computer readable storage medium in the memory 3 is configured to store at least one instruction, at least one program, a code set or an instruction set, which is executed by the processor 1 to implement the methods provided by the method embodiments described below.
  • The communication interface 2 may include one or more interfaces, such as interfaces between the server and other peripheral devices, for enabling communication between the server and those devices.
  • The display screen 5 is used to display a user interface (UI).
  • The UI may include graphics, text, icons, video, and any combination thereof.
  • The processor 1, the communication interface 2, the memory 3, and the display screen 5 communicate with one another via the communication bus 4.
  • The bus 4 may be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus.
  • Next, the vocabulary mining method of the present application is introduced with reference to the server hardware structure. As shown in FIG. 2, the method may include:
  • Step S200: for each sentence contained in the corpus to be mined, determine the set of entity words contained in the sentence, and the set of candidate hypernyms composed of the nouns and noun phrases contained in the sentence.
  • Specifically, the corpus to be mined is composed of a large number of sentences. For each sentence in the corpus to be mined, the set of entity words contained in the sentence is determined, along with the nouns and noun phrases contained in the sentence; the nouns and noun phrases serve as candidate hypernyms and form the set of candidate hypernyms.
  • An entity word may be a named entity such as a person's name, a place name, or an organization name; a named entity usually refers to an entity in a text that has a special meaning or a very strong referential function. Because the number of such named entities keeps growing, it is usually impossible to list them exhaustively in a dictionary, and their formation follows regularities of its own. Therefore, when determining the entity words contained in a sentence, this step may use a Named Entity Recognition (NER) method to identify the entity words contained in the sentence, the identified entity words forming the set of entity words.
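  • For illustration only, this NER step might look like the following minimal Python sketch, assuming the spaCy library and its pretrained Chinese pipeline "zh_core_web_sm" (both are assumptions; the application does not prescribe any particular NER tool or model):

```python
# Hypothetical entity-word-set step; spaCy and the model choice are assumptions.
import spacy

nlp = spacy.load("zh_core_web_sm")  # pretrained Chinese pipeline with an NER component

def entity_word_set(sentence: str) -> set:
    """Return the set of entity words (named entities) found in a sentence."""
    doc = nlp(sentence)
    # Keep person, place and organization names, matching the examples above.
    return {ent.text for ent in doc.ents if ent.label_ in {"PERSON", "GPE", "ORG"}}
```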
  • Candidate hypernyms are generally nouns and noun phrases.
  • A noun denotes the name of a person, a thing, a place or an abstract concept; nouns are divided into proper nouns and common nouns.
  • A noun phrase, also called a nominal phrase, refers to a type of phrase whose grammatical function is equivalent to that of a noun.
  • Noun phrases include modifier-head phrases with a noun as the head word (such as "great motherland" and "these children"), coordinate phrases composed of nouns (such as "workers and farmers"), appositive phrases (such as "the capital Beijing"), locative phrases (such as "on the desktop" and "in front of the building"), and "de" (的) phrases (such as "the one who keeps night watch"). Some modifier-head phrases whose head word is a verb or an adjective and whose modifier is a pronoun, a noun or another noun phrase are also noun phrases, such as "his departure", "China's liberation" and "the sincerity of his attitude".
  • When determining the candidate hypernyms contained in a sentence, this step may first segment the sentence into words, then identify the part of speech of each segmented word, and take the segmented words whose part of speech is noun, together with the noun phrases, as candidate hypernyms to form the set of candidate hypernyms.
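  • As a sketch of this segmentation-and-POS step, the following assumes the jieba segmenter and its part-of-speech tags (an assumption; any segmenter with POS output would serve, and noun-phrase chunking is omitted for brevity):

```python
# Hypothetical candidate-hypernym-set step; jieba is an assumption, not mandated.
import jieba.posseg as pseg

def candidate_hypernym_set(sentence: str) -> set:
    """Collect segmented words whose POS tag marks a noun (jieba tags 'n', 'nr', 'ns', ...)."""
    return {pair.word for pair in pseg.cut(sentence) if pair.flag.startswith("n")}
```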
  • In a specific implementation, the corpus to be mined may be stored in the memory 3 in advance through the communication interface 2.
  • During mining, the processor 1 determines, via the communication bus 4 and among the sentences of the corpus to be mined stored in the memory, the set of entity words contained in each sentence and the set of candidate hypernyms composed of the nouns and noun phrases contained in the sentence.
  • Step S210: combine pairwise the entity words in the entity word set and the candidate hypernyms in the candidate hypernym set, a word pair formed by combining an entity word and a candidate hypernym serving as a candidate word pair.
  • Assuming that there are N entity words in the entity word set and M candidate hypernyms in the candidate hypernym set, there are N*M pairwise combinations of words from the two sets; each word pair formed by combining an entity word and a candidate hypernym serves as a candidate word pair.
  • In a specific implementation, the processor 1 may combine pairwise the entity words in the entity word set and the candidate hypernyms in the candidate hypernym set.
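  • The N*M pairwise combination of step S210 reduces to a Cartesian product, as in this sketch:

```python
# Every (entity word, candidate hypernym) combination becomes a candidate word pair.
from itertools import product

def candidate_word_pairs(entity_words, candidate_hypernyms):
    return list(product(entity_words, candidate_hypernyms))

# e.g. 2 entity words x 3 candidate hypernyms -> 6 candidate word pairs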
  • Step S220: determine the respective word vectors of the entity word and the candidate hypernym in the candidate word pair, the respective word vectors forming a candidate word vector pair.
  • Specifically, word embedding is the process, in natural language processing, of mathematizing the words of a natural language, namely representing words in the form of mathematical vectors.
  • In this step, the word vector of the entity word and the word vector of the candidate hypernym in the candidate word pair are determined.
  • The word vector of the entity word and the word vector of the candidate hypernym form a candidate word vector pair, and the candidate word vector pair corresponds to the candidate word pair.
  • In a specific implementation, the processor 1 may determine the respective word vectors of the entity word and the candidate hypernym in the candidate word pair, the respective word vectors forming a candidate word vector pair.
  • Step S230: determine, according to the candidate word vector pair, whether the candidate word pair is a vocabulary mining result.
  • Specifically, after the candidate word vector pair has been determined, a pre-trained classification model may be used: the candidate word vector pair is input into the classification model, and the classification result output by the classification model is obtained.
  • The classification result indicates whether the candidate word pair is a vocabulary mining result, for example, whether the candidate word pair is a hypernym pair.
  • The classification model may be a softmax classification model.
  • The classification model is trained using training word vector pairs labeled with classification results in advance.
  • The candidate word vector pair is input into the trained classification model, and whether the candidate word pair is a hypernym pair is determined according to the output of the classification model.
  • In a specific implementation, the processor 1 may determine, according to the candidate word vector pair, whether the candidate word pair is a vocabulary mining result, and output the result for display on the display screen 5.
  • The vocabulary mining method determines, for each sentence contained in the corpus to be mined, the set of entity words contained in the sentence and the set of candidate hypernyms composed of the nouns and noun phrases contained in the sentence; combines pairwise the entity words in the entity word set and the candidate hypernyms in the candidate hypernym set, a word pair formed by combining an entity word and a candidate hypernym serving as a candidate word pair; determines the respective word vectors of the entity word and the candidate hypernym in the candidate word pair, the respective word vectors forming a candidate word vector pair; and determines, according to the candidate word vector pair, whether the candidate word pair is a vocabulary mining result.
  • That is, the present application determines the set of entity words and the set of candidate hypernyms contained in a corpus sentence, combines pairwise the words in the two sets to obtain candidate word pairs, and further determines the respective word
  • vectors of the entity word and the candidate hypernym in each candidate word pair, determining according to the word vector pair whether the candidate word pair is a vocabulary mining result, for example, whether the candidate word pair is a hypernym pair.
  • The present application does not require manual organization of the corpus and realizes automatic mining of hypernym pairs, which greatly improves hypernym pair mining efficiency and reduces mining cost.
  • In another embodiment, another vocabulary mining method is disclosed. Taking the case where the vocabulary mining result is a hypernym pair as an example, as shown in FIG. 3, the method includes:
  • Step S300: for each sentence contained in the corpus to be mined, determine the set of entity words contained in the sentence, and the set of candidate hypernyms composed of the nouns and noun phrases contained in the sentence.
  • Step S310: combine pairwise the entity words in the entity word set and the candidate hypernyms in the candidate hypernym set, a word pair formed by combining an entity word and a candidate hypernym serving as a candidate word pair.
  • Specifically, steps S300-S310 correspond one-to-one to the foregoing steps S200-S210, and details are not repeated here.
  • Step S320: determine the initial word vector of each word contained in the sentence, the initial word vectors of the words forming an initial word vector matrix.
  • Specifically, the present application may use random numbers to determine the initial word vector of each word contained in the sentence.
  • Alternatively, the present application may use the word2vec method to train on the whole corpus to be mined and convert every word contained in the corpus into vector form. Further, among the word vectors of the words contained in the corpus to be mined, the word vector corresponding to each word in the sentence is looked up and used as the initial word vector of that word.
  • word2vec is a tool, open-sourced by Google, that converts words into vector form.
  • It allows the processing of text content to be reduced to vector operations in a vector space, where similarity in the vector space is computed to represent semantic similarity of the text.
  • Assuming that the sentence contains L words and that word vectors are N-dimensional, the initial word vector matrix composed of the initial word vectors of the words contained in the sentence is an L*N matrix.
  • For example, the original sentence is "a b c"; after word segmentation it becomes "word1 word2 word3", where word1 = a, word2 = b, word3 = c.
  • The initial word vector (word embedding) of each segmented word is determined: word1 = word embedding1, word2 = word embedding2, word3 = word embedding3.
  • A 3*N matrix is constructed as shown in Table 1 (the sentence length is 3), with rows Word embedding1, Word embedding2, Word embedding3, one row per word.
  • Step S330: adjust the initial word vector matrix by using a recurrent neural network model to obtain an adjusted word vector matrix composed of the adjusted word vectors of the words.
  • The advantage of a Recurrent Neural Network (RNN) is that it can exploit contextual information in the mapping process between input and output sequences.
  • After the initial word vector of each word in the initial word vector matrix is adjusted by the recurrent neural network, the relationships between each word and the words before and after it are taken into account comprehensively, so that the adjusted word vector output for each word is more accurate.
  • The dimension of an adjusted word vector is H, where H equals the size of the hidden layer of the recurrent neural network. Therefore, the adjusted word vector matrix composed of the adjusted word vectors of the words is an L*H matrix.
  • Step S340: look up, in the adjusted word vector matrix, the adjusted word vectors respectively corresponding to the entity word and the candidate hypernym in the candidate word pair, the corresponding adjusted word vectors forming the candidate word vector pair.
  • Specifically, according to the positions of the entity word and the candidate hypernym in the sentence, the adjusted word vectors at the corresponding positions are looked up in the adjusted word vector matrix, thereby determining the adjusted word vector corresponding to the entity word and the adjusted word vector corresponding to the candidate hypernym.
  • Suppose the adjustment maps the rows of Table 1 to Word embedding11, Word embedding21 and Word embedding31 (Table 2). Still taking the sentence "a b c" as an example, with entity word b and candidate hypernym c, querying Table 2 determines that the adjusted word vector corresponding to the entity word b is Word embedding21, and the adjusted word vector corresponding to the candidate hypernym c is Word embedding31.
  • Step S350: determine, according to the candidate word vector pair, whether the candidate word pair is a hypernym pair.
  • This embodiment describes in detail the process of determining the respective word vectors of the entity word and the candidate hypernym in the candidate word pair.
  • After the respective initial word vectors have been determined, the initial word vectors are adjusted by using the recurrent neural network model, so that the adjusted word vectors take the contextual information of the words into account, and the determined word vectors of the entity word and the candidate hypernym are more accurate.
  • Optionally, the recurrent neural network model may be a bidirectional recurrent neural network model, such as a bidirectional Long Short-Term Memory (LSTM) model.
  • A standard recurrent neural network can access only a limited range of contextual information: the influence of an input on the hidden layer, and thus on the network output, decays as the network loop recurs. The bidirectional long short-term memory artificial neural network (bidirectional LSTM) model solves exactly this problem.
  • FIG. 4 illustrates an architecture diagram of a bidirectional recurrent neural network model.
  • The model includes an input layer, a forward hidden layer, a backward hidden layer, and an output layer.
  • In the forward hidden layer, the word vector adjustment process considers the preceding context, while in the backward hidden layer it considers the following context; the final output considers the adjustment results of both the forward hidden layer and the backward hidden layer, so that the adjusted word vector of each segmented word takes both sides of its context into account.
  • FIG. 5 is a schematic diagram of a hypernym pair mining process according to an example of the present application. The overall flow is as follows:
  • S1: segment the sentence to obtain the words it contains, and determine the initial word vector of each word.
  • S2: input the initial word vector of each word into the bidirectional long short-term memory (LSTM) model, and adjust the initial word vector of each word to obtain the adjusted word vector of each word.
  • S3: according to the entity word and the candidate hypernym contained in a candidate word pair determined from the sentence, determine the adjusted word vector of the entity word and the adjusted word vector of the candidate hypernym.
  • S4: merge the adjusted word vector of the entity word and the adjusted word vector of the candidate hypernym into one word vector matrix, and input it into the classifier to obtain the classification result, which indicates whether the candidate word pair is a hypernym pair.
  • The classifier may be a softmax classifier.
  • The vocabulary mining apparatus provided by the embodiments of the present application is described below.
  • The vocabulary mining apparatus described below and the vocabulary mining method described above may be referred to correspondingly.
  • FIG. 6 is a schematic structural diagram of a vocabulary mining apparatus according to an embodiment of the present application.
  • The apparatus has the function of implementing the above vocabulary mining method; the function may be implemented by hardware, or by hardware executing corresponding software.
  • As shown in FIG. 6, the apparatus may include:
  • a set determining unit 11, configured to determine, for each sentence contained in the corpus to be mined, the set of entity words contained in the sentence, and the set of candidate hypernyms composed of the nouns and noun phrases contained in the sentence;
  • specifically, in determining the set of entity words contained in the sentence, the set determining unit may use a named entity recognition method to identify the entity words contained in the sentence, the identified entity words forming the set of entity words;
  • a candidate word pair determining unit 12, configured to combine pairwise the entity words in the entity word set and the candidate hypernyms in the candidate hypernym set, a word pair formed by combining an entity word and a candidate hypernym serving as a candidate word pair;
  • a word vector determining unit 13, configured to determine the respective word vectors of the entity word and the candidate hypernym in the candidate word pair, the respective word vectors forming a candidate word vector pair;
  • a mining result determining unit 14, configured to determine, according to the candidate word vector pair, whether the candidate word pair is a vocabulary mining result.
  • The vocabulary mining apparatus determines, for each sentence contained in the corpus to be mined, the set of entity words contained in the sentence and the set of candidate hypernyms composed of the nouns and noun phrases contained in the sentence; combines pairwise the entity words in the entity word set and the candidate hypernyms in the candidate hypernym set, a word pair formed by combining an entity word and a candidate hypernym serving as a candidate word pair; determines the respective word vectors of the entity word and the candidate hypernym in the candidate word pair, the respective word vectors forming a candidate word vector pair; and determines, according to the candidate word vector pair, whether the candidate word pair is a vocabulary mining result.
  • That is, the present application determines the set of entity words and the set of candidate hypernyms contained in a corpus sentence, combines pairwise the words in the two sets to obtain candidate word pairs, and further determines the respective word
  • vectors of the entity word and the candidate hypernym in each candidate word pair, determining according to the word vector pair whether the candidate word pair is a vocabulary mining result, for example, whether the candidate word pair is a hypernym pair.
  • The present application does not require manual organization of the corpus and realizes automatic mining of hypernym pairs, which greatly improves hypernym pair mining efficiency and reduces mining cost.
  • The embodiments of the present application illustrate an optional structure of the above word vector determining unit 13.
  • Referring to FIG. 7, the word vector determining unit 13 may include:
  • an initial word vector determining unit 131, configured to determine the initial word vector of each word contained in the sentence, the initial word vectors of the words forming an initial word vector matrix;
  • an initial word vector matrix adjusting unit 132, configured to adjust the initial word vector matrix by using a recurrent neural network model to obtain an adjusted word vector matrix composed of the adjusted word vectors of the words;
  • optionally, the recurrent neural network model may include a bidirectional long short-term memory artificial neural network model;
  • an adjusted word vector searching unit 133, configured to look up, in the adjusted word vector matrix, the adjusted word vectors respectively corresponding to the entity word and the candidate hypernym in the candidate word pair.
  • The embodiments of the present application illustrate an optional structure of the above mining result determining unit 14, where the vocabulary mining result may be a hypernym pair.
  • Referring to FIG. 8, the mining result determining unit 14 may include:
  • a classification determining unit 141, configured to input the candidate word vector pair into a pre-trained classification model to obtain the classification result output by the classification model, the classification result indicating whether the candidate word pair is a hypernym pair.
  • The embodiments of the present application illustrate two optional structures of the above initial word vector determining unit 131, as shown in FIG. 9 and FIG. 10 respectively.
  • In the first structure, the initial word vector determining unit 131 may include:
  • a first initial word vector determining sub-unit 1311, configured to determine the initial word vector of each word contained in the sentence by using random numbers.
  • In the second structure, the initial word vector determining unit 131 may include:
  • a second initial word vector determining sub-unit 1312, configured to determine, by using the word2vec method, the word vector corresponding to each word contained in the sentence as the initial word vector.
  • In an exemplary embodiment, a computer readable storage medium is further provided, the storage medium storing at least one instruction, at least one program, a code set or an instruction set,
  • which is loaded and executed by a processor of a computer device to implement the steps of the above method embodiments.
  • Optionally, the computer readable storage medium may be a Read Only Memory (ROM), a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A vocabulary mining method, apparatus and device. A set of entity words and a set of candidate hypernyms contained in a corpus sentence are determined; the words in the two sets are combined pairwise to obtain candidate word pairs; the respective word vectors of the entity word and the candidate hypernym in each candidate word pair are further determined; and whether the candidate word pair is a vocabulary mining result, i.e. whether the candidate word pair is a hypernym pair, is determined according to the word vector pair. No manual organization of the corpus is needed: automatic mining of hypernym pairs is realized by means of machine learning, which greatly improves hypernym pair mining efficiency and reduces mining cost.

Description

Vocabulary mining method, apparatus and device
This application claims priority to Chinese Patent Application No. 201710169796.7, entitled "Vocabulary mining method and apparatus" and filed with the State Intellectual Property Office of China on March 21, 2017, which is incorporated herein by reference in its entirety.
TECHNICAL FIELD
The present application relates to the field of data mining technologies, and more specifically, to a vocabulary mining method, apparatus and device.
BACKGROUND
A hypernym is defined as follows: if an entity word A and a word B form a hypernym-hyponym relationship and the entity word A is a hyponym of the word B, then the word B is a hypernym of the entity word A. For example, "animal" is a hypernym of "tiger". On this basis, a word pair composed of an entity word A and a word B that form a hypernym-hyponym relationship is called a hypernym pair; for example, (tiger, animal) constitutes a hypernym pair.
Mining hypernym pairs from a large corpus can assist work such as discourse analysis. Existing hypernym pair mining methods generally rely on manual semantic analysis of the corpus to identify hypernym pairs. Obviously, manual mining is inefficient, requires the mining personnel to have certain domain knowledge, and incurs a high labor cost.
SUMMARY
In view of this, the present application provides a vocabulary mining method, apparatus and device, which can be used to mine hypernym pairs at low cost and with high efficiency.
To achieve the above object, the following solutions are proposed:
A vocabulary mining method, including:
for each sentence contained in a corpus to be mined, determining a set of entity words contained in the sentence, and a set of candidate hypernyms composed of the nouns and noun phrases contained in the sentence;
combining pairwise the entity words in the entity word set and the candidate hypernyms in the candidate hypernym set, a word pair formed by combining an entity word and a candidate hypernym serving as a candidate word pair;
determining respective word vectors of the entity word and the candidate hypernym in the candidate word pair, the respective word vectors forming a candidate word vector pair;
determining, according to the candidate word vector pair, whether the candidate word pair is a vocabulary mining result.
A vocabulary mining apparatus, including:
a set determining unit, configured to determine, for each sentence contained in a corpus to be mined, a set of entity words contained in the sentence, and a set of candidate hypernyms composed of the nouns and noun phrases contained in the sentence;
a candidate word pair determining unit, configured to combine pairwise the entity words in the entity word set and the candidate hypernyms in the candidate hypernym set, a word pair formed by combining an entity word and a candidate hypernym serving as a candidate word pair;
a word vector determining unit, configured to determine respective word vectors of the entity word and the candidate hypernym in the candidate word pair, the respective word vectors forming a candidate word vector pair;
a hypernym determining unit, configured to determine, according to the candidate word vector pair, whether the candidate word pair is a vocabulary mining result.
A computer device, including a processor and a memory, the memory storing at least one instruction, at least one program, a code set or an instruction set, which is loaded and executed by the processor to implement the above vocabulary mining method.
A computer readable storage medium, storing at least one instruction, at least one program, a code set or an instruction set, which is loaded and executed by a processor to implement the above vocabulary mining method.
A computer program product, which, when executed, is configured to perform the above vocabulary mining method.
The vocabulary mining method provided by the embodiments of the present application determines, for each sentence contained in the corpus to be mined, the set of entity words contained in the sentence and the set of candidate hypernyms composed of the nouns and noun phrases contained in the sentence; combines pairwise the entity words in the entity word set and the candidate hypernyms in the candidate hypernym set, a word pair formed by combining an entity word and a candidate hypernym serving as a candidate word pair; determines the respective word vectors of the entity word and the candidate hypernym in the candidate word pair, the respective word vectors forming a candidate word vector pair; and determines, according to the candidate word vector pair, whether the candidate word pair is a vocabulary mining result. The present application determines the set of entity words and the set of candidate hypernyms contained in a corpus sentence, combines pairwise the words in the two sets to obtain candidate word pairs, further determines the respective word vectors of the entity word and the candidate hypernym in each candidate word pair, and determines, according to the word vector pair, whether the candidate word pair is a vocabulary mining result, for example, whether the candidate word pair is a hypernym pair. The present application does not require manual organization of the corpus and realizes automatic mining of hypernym pairs, which greatly improves hypernym pair mining efficiency and reduces mining cost.
BRIEF DESCRIPTION OF THE DRAWINGS
To describe the technical solutions in the embodiments of the present application more clearly, the accompanying drawings required for describing the embodiments are briefly introduced below. Obviously, the accompanying drawings in the following description are merely some embodiments of the present application, and a person of ordinary skill in the art may derive other drawings from the provided drawings without creative effort.
FIG. 1 is a schematic structural diagram of server hardware disclosed in an embodiment of the present application;
FIG. 2 is a flowchart of a vocabulary mining method disclosed in an embodiment of the present application;
FIG. 3 is a flowchart of another vocabulary mining method disclosed in an embodiment of the present application;
FIG. 4 illustrates an architecture diagram of a bidirectional recurrent neural network model;
FIG. 5 is a schematic diagram of a hypernym pair mining process according to an example of the present application;
FIG. 6 is a schematic structural diagram of a vocabulary mining apparatus disclosed in an embodiment of the present application;
FIG. 7 is a schematic structural diagram of a word vector determining unit disclosed in an embodiment of the present application;
FIG. 8 is a schematic structural diagram of a mining result determining unit disclosed in an embodiment of the present application;
FIG. 9 is a schematic structural diagram of an initial word vector determining unit disclosed in an embodiment of the present application;
FIG. 10 is a schematic structural diagram of another initial word vector determining unit disclosed in an embodiment of the present application.
DETAILED DESCRIPTION
The technical solutions in the embodiments of the present application are described below clearly and completely with reference to the accompanying drawings in the embodiments of the present application. Obviously, the described embodiments are merely some rather than all of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative effort shall fall within the protection scope of the present application.
The embodiments of the present application provide an automatic vocabulary mining scheme, which can be used to mine hypernym pairs. The mining scheme is implemented on a server, and the server may also be referred to as a computer device. The server may be a processing device such as a desktop or notebook computer. Before the vocabulary mining method of the present application is introduced, the hardware structure of the server is first introduced. As shown in FIG. 1, the server may include a processor 1, a communication interface 2, a memory 3, a communication bus 4, and a display screen 5.
The processor 1 may include one or more processing cores, such as a 4-core processor or an 8-core processor. The processor 1 may be a Central Processing Unit (CPU), an Application Specific Integrated Circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present application.
The memory 3 may include one or more computer readable storage media, which may be non-transitory. The memory 3 may also include high-speed random access memory, as well as non-volatile memory such as one or more disk storage devices or flash storage devices. In some embodiments, the non-transitory computer readable storage medium in the memory 3 is configured to store at least one instruction, at least one program, a code set or an instruction set, which is executed by the processor 1 to implement the methods provided by the method embodiments below.
The communication interface 2 may include one or more interfaces, such as interfaces between the server and other peripheral devices, for enabling communication between the server and those devices.
The display screen 5 is used to display a user interface (UI). The UI may include graphics, text, icons, video, and any combination thereof.
The processor 1, the communication interface 2, the memory 3 and the display screen 5 communicate with one another via the communication bus 4. The bus 4 may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like.
Next, the vocabulary mining method of the present application is introduced with reference to the server hardware structure. As shown in FIG. 2, the method may include:
Step S200: for each sentence contained in the corpus to be mined, determine the set of entity words contained in the sentence, and the set of candidate hypernyms composed of the nouns and noun phrases contained in the sentence.
Specifically, the corpus to be mined is composed of a large number of sentences. For each sentence in the corpus to be mined, the set of entity words contained in the sentence is determined, along with the nouns and noun phrases contained in the sentence; the nouns and noun phrases serve as candidate hypernyms and form the set of candidate hypernyms.
An entity word may be a named entity such as a person's name, a place name or an organization name; a named entity usually refers to an entity in a text that has a special meaning or a very strong referential function. Because the number of such named entities keeps growing, it is usually impossible to list them exhaustively in a dictionary, and their formation follows regularities of its own. Therefore, when determining the entity words contained in a sentence, this step may use a Named Entity Recognition (NER) method to identify the entity words contained in the sentence, the identified entity words forming the set of entity words.
Candidate hypernyms are generally nouns and noun phrases. A noun denotes the name of a person, a thing, a place or an abstract concept; nouns are divided into proper nouns and common nouns. A noun phrase (nominal phrase) refers to a type of phrase whose grammatical function is equivalent to that of a noun. Noun phrases include modifier-head phrases with a noun as the head word (such as "伟大的祖国" (great motherland) and "这些孩子" (these children)), coordinate phrases composed of nouns (such as "工人农民" (workers and farmers)), appositive phrases (such as "首都北京" (the capital Beijing)), locative phrases (such as "桌面上" (on the desktop) and "大楼前面" (in front of the building)), and "的" phrases (such as "打更的" (the one who keeps night watch)); some modifier-head phrases whose head word is a verb or an adjective and whose modifier is a pronoun, a noun or another noun phrase are also noun phrases, such as "他的离开" (his departure), "中国的解放" (China's liberation) and "他态度的诚恳" (the sincerity of his attitude). When determining the candidate hypernyms contained in a sentence, this step may first segment the sentence, then identify the part of speech of each segmented word, and take the segmented words whose part of speech is noun, together with the noun phrases, as candidate hypernyms to form the set of candidate hypernyms.
In a specific implementation, the corpus to be mined may be stored in the memory 3 in advance through the communication interface 2. During mining, the processor 1 determines, via the communication bus 4 and among the sentences of the corpus to be mined stored in the memory, the set of entity words contained in each sentence and the set of candidate hypernyms composed of the nouns and noun phrases contained in the sentence.
Step S210: combine pairwise the entity words in the entity word set and the candidate hypernyms in the candidate hypernym set, a word pair formed by combining an entity word and a candidate hypernym serving as a candidate word pair.
Assuming that there are N entity words in the entity word set and M candidate hypernyms in the candidate hypernym set, there are N*M pairwise combinations of words from the two sets. Each word pair formed by combining an entity word and a candidate hypernym serves as a candidate word pair.
In a specific implementation, the processor 1 may combine pairwise the entity words in the entity word set and the candidate hypernyms in the candidate hypernym set.
Step S220: determine the respective word vectors of the entity word and the candidate hypernym in the candidate word pair, the respective word vectors forming a candidate word vector pair.
Specifically, word embedding is the process, in natural language processing, of mathematizing the words of a natural language, namely representing words in the form of mathematical vectors.
In this step, the word vector of the entity word and the word vector of the candidate hypernym in the candidate word pair are determined. The word vector of the entity word and the word vector of the candidate hypernym form a candidate word vector pair, and the candidate word vector pair corresponds to the candidate word pair.
In a specific implementation, the processor 1 may determine the respective word vectors of the entity word and the candidate hypernym in the candidate word pair, the respective word vectors forming a candidate word vector pair.
Step S230: determine, according to the candidate word vector pair, whether the candidate word pair is a vocabulary mining result.
Specifically, after the candidate word vector pair corresponding to the candidate word pair has been determined, a pre-trained classification model may be used: the candidate word vector pair is input into the classification model, and the classification result output by the classification model is obtained. The classification result indicates whether the candidate word pair is a vocabulary mining result, for example, whether the candidate word pair is a hypernym pair.
The classification model may be a softmax classification model. The classification model is trained using training word vector pairs labeled with classification results in advance. The candidate word vector pair is input into the trained classification model, and whether the candidate word pair is a hypernym pair is determined according to the output of the classification model.
In a specific implementation, the processor 1 may determine, according to the candidate word vector pair, whether the candidate word pair is a vocabulary mining result, and output the result for display on the display screen 5.
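As an illustration of the classification model of step S230, the following is a minimal sketch assuming PyTorch and an adjusted word vector dimension H of 128 (both assumptions; the description only requires a softmax classifier over the candidate word vector pair):

```python
# Hypothetical softmax classification model over a candidate word vector pair.
import torch
import torch.nn as nn

H = 128  # adjusted word vector dimension (assumption)

# A single linear layer followed by softmax; the input is the concatenation of
# the entity word vector and the candidate hypernym vector, and the output is
# [P(not a hypernym pair), P(hypernym pair)].
classifier = nn.Sequential(nn.Linear(2 * H, 2), nn.Softmax(dim=-1))

def is_hypernym_pair(entity_vec: torch.Tensor, hypernym_vec: torch.Tensor) -> bool:
    probs = classifier(torch.cat([entity_vec, hypernym_vec], dim=-1))
    return bool(probs.argmax(dim=-1).item() == 1)
```

Training such a model amounts to minimizing a cross-entropy loss over training word vector pairs labeled with classification results in advance, as stated above.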
The vocabulary mining method provided by the embodiments of the present application determines, for each sentence contained in the corpus to be mined, the set of entity words contained in the sentence and the set of candidate hypernyms composed of the nouns and noun phrases contained in the sentence; combines pairwise the entity words in the entity word set and the candidate hypernyms in the candidate hypernym set, a word pair formed by combining an entity word and a candidate hypernym serving as a candidate word pair; determines the respective word vectors of the entity word and the candidate hypernym in the candidate word pair, the respective word vectors forming a candidate word vector pair; and determines, according to the candidate word vector pair, whether the candidate word pair is a vocabulary mining result. The present application determines the set of entity words and the set of candidate hypernyms contained in a corpus sentence, combines pairwise the words in the two sets to obtain candidate word pairs, further determines the respective word vectors of the entity word and the candidate hypernym in each candidate word pair, and determines, according to the word vector pair, whether the candidate word pair is a vocabulary mining result, for example, whether the candidate word pair is a hypernym pair. The present application does not require manual organization of the corpus and realizes automatic mining of hypernym pairs, which greatly improves hypernym pair mining efficiency and reduces mining cost.
In another embodiment of the present application, another vocabulary mining method is disclosed. Taking the case where the vocabulary mining result is a hypernym pair as an example, as shown in FIG. 3, the method includes:
Step S300: for each sentence contained in the corpus to be mined, determine the set of entity words contained in the sentence, and the set of candidate hypernyms composed of the nouns and noun phrases contained in the sentence.
Step S310: combine pairwise the entity words in the entity word set and the candidate hypernyms in the candidate hypernym set, a word pair formed by combining an entity word and a candidate hypernym serving as a candidate word pair.
Specifically, steps S300-S310 correspond one-to-one to the foregoing steps S200-S210, and details are not repeated here.
Step S320: determine the initial word vector of each word contained in the sentence, the initial word vectors of the words forming an initial word vector matrix.
Specifically, the present application may use random numbers to determine the initial word vector of each word contained in the sentence.
Alternatively, the present application may use the word2vec method to train on the whole corpus to be mined and convert every word contained in the corpus into vector form. Further, among the word vectors of the words contained in the corpus to be mined, the word vector corresponding to each word in the sentence is looked up and used as the initial word vector of that word.
word2vec is a tool, open-sourced by Google, that converts words into vector form. It allows the processing of text content to be reduced to vector operations in a vector space, where similarity in the vector space is computed to represent semantic similarity of the text.
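A minimal sketch of the word2vec variant follows, assuming the gensim implementation (the description names word2vec but no particular implementation; the toy corpus and the dimension 100 are likewise assumptions):

```python
# Hypothetical word2vec training over a (toy) corpus to be mined.
from gensim.models import Word2Vec

corpus_sentences = [
    ["老虎", "是", "一种", "动物"],  # toy stand-in for the segmented corpus to be mined
    ["动物园", "里", "有", "老虎"],
]
model = Word2Vec(sentences=corpus_sentences, vector_size=100, min_count=1, seed=1)

initial_vector = model.wv["老虎"]  # the N-dimensional (here 100-d) initial word vector
```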
Assuming that the sentence contains L words and that word vectors are N-dimensional, the initial word vector matrix composed of the initial word vectors of the words contained in the sentence is an L*N matrix.
For example:
The original sentence is "a b c".
After word segmentation the sentence becomes "word1 word2 word3", where word1 = a, word2 = b, word3 = c.
The initial word vector (word embedding) of each segmented word is determined: word1 = word embedding1, word2 = word embedding2, word3 = word embedding3.
A 3*N matrix is constructed as shown in Table 1 below (the sentence length is 3):
Initial word vector
Word embedding1
Word embedding2
Word embedding3
Table 1
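A sketch of assembling the L*N initial word vector matrix of Table 1 with NumPy (an assumption; the random-number variant is shown here, and the word2vec variant would simply replace the dictionary below with looked-up trained vectors):

```python
# Hypothetical construction of the L*N initial word vector matrix (L = 3 here).
import numpy as np

N = 100  # word vector dimension (assumption)
sentence = ["word1", "word2", "word3"]  # the segmented sentence "a b c"

# Random-number initialisation of each word's initial word vector.
embeddings = {w: np.random.randn(N) for w in sentence}

initial_matrix = np.stack([embeddings[w] for w in sentence])  # shape (3, N), one row per word
```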
Step S330: adjust the initial word vector matrix by using a recurrent neural network model to obtain an adjusted word vector matrix composed of the adjusted word vectors of the words.
The advantage of a Recurrent Neural Network (RNN) is that it can exploit contextual information in the mapping process between input and output sequences. For the initial word vector of each word in the initial word vector matrix, the adjustment by the recurrent neural network comprehensively considers the relationships between the word and the words before and after it, so that the adjusted word vector output for each word is more accurate.
Specifically, the dimension of an adjusted word vector is H, where H equals the size of the hidden layer of the recurrent neural network. Therefore, the adjusted word vector matrix composed of the adjusted word vectors of the words is an L*H matrix.
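A sketch of step S330 assuming PyTorch: a bidirectional LSTM maps the L*N initial matrix to an L*H adjusted matrix, where H is realized here as the concatenated output width of the two directions (all sizes are assumptions):

```python
# Hypothetical adjustment of the initial word vector matrix with a bidirectional LSTM.
import torch
import torch.nn as nn

N, H = 100, 128  # initial and adjusted word vector dimensions (assumptions)

# Each direction outputs H/2 values per word; concatenated they give H.
bilstm = nn.LSTM(input_size=N, hidden_size=H // 2, bidirectional=True, batch_first=True)

initial_matrix = torch.randn(1, 3, N)        # a batch of one sentence with L = 3 words
adjusted_matrix, _ = bilstm(initial_matrix)  # shape (1, 3, H): context-aware adjusted vectors
```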
Step S340: look up, in the adjusted word vector matrix, the adjusted word vectors respectively corresponding to the entity word and the candidate hypernym in the candidate word pair, the corresponding adjusted word vectors forming the candidate word vector pair.
Specifically, according to the positions of the entity word and the candidate hypernym of the candidate word pair in the sentence, the adjusted word vectors at the corresponding positions are looked up in the adjusted word vector matrix, thereby determining the adjusted word vector corresponding to the entity word and the adjusted word vector corresponding to the candidate hypernym.
Suppose that, after the adjustment by the recurrent neural network model, the initial word vector matrix of Table 1 above yields the output shown in Table 2 below:
Initial word vector / Adjusted word vector
Word embedding1 / Word embedding11
Word embedding2 / Word embedding21
Word embedding3 / Word embedding31
Table 2
Still taking the sentence "a b c" as an example, suppose the entity word in the candidate word pair is b and the candidate hypernym is c. The entity word b is the second segmented word of the sentence and the candidate hypernym c is the third; therefore, querying Table 2 above determines that the adjusted word vector corresponding to the entity word b is Word embedding21, and the adjusted word vector corresponding to the candidate hypernym c is Word embedding31.
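Step S340 then reduces to row indexing on the adjusted matrix, for example:

```python
# Hypothetical lookup of the adjusted word vectors by sentence position.
import torch

L, H = 3, 128
adjusted_matrix = torch.randn(L, H)  # stand-in for the bidirectional LSTM output above

vec_b = adjusted_matrix[1]  # entity word b is the 2nd word -> "Word embedding21"
vec_c = adjusted_matrix[2]  # candidate hypernym c is the 3rd word -> "Word embedding31"

candidate_vector_pair = torch.cat([vec_b, vec_c])  # fed to the classifier in step S350
```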
Step S350: determine, according to the candidate word vector pair, whether the candidate word pair is a hypernym pair.
The method of this embodiment describes in detail the process of determining the respective word vectors of the entity word and the candidate hypernym in the candidate word pair. After the respective initial word vectors have been determined, the initial word vectors are adjusted by using the recurrent neural network model, so that the adjusted word vectors better account for the contextual information of the words, and the determined word vectors of the entity word and the candidate hypernym are more accurate.
Optionally, the above recurrent neural network model may be a bidirectional recurrent neural network model, such as a long short-term memory (LSTM) artificial neural network model.
A standard recurrent neural network can access only a limited range of contextual information. As a result, the influence of an input on the hidden layer, and thus on the network output, decays as the network loop recurs; the bidirectional long short-term memory artificial neural network (bidirectional LSTM) model solves exactly this problem.
Referring to FIG. 4, FIG. 4 illustrates an architecture diagram of a bidirectional recurrent neural network model.
The model includes an input layer, a forward hidden layer, a backward hidden layer and an output layer.
In the forward hidden layer, the word vector adjustment process considers the preceding context, while in the backward hidden layer it considers the following context; the final output considers the adjustment results of both the forward hidden layer and the backward hidden layer, so that the adjusted word vector of each segmented word takes both sides of its context into account.
Referring to FIG. 5, FIG. 5 is a schematic diagram of a hypernym pair mining process according to an example of the present application.
The overall flow of the scheme is introduced with reference to FIG. 5:
S1: segment the sentence to obtain the words it contains, and determine the initial word vector of each word.
S2: input the initial word vector of each word into the bidirectional long short-term memory (LSTM) model, and adjust the initial word vector of each word to obtain the adjusted word vector of each word.
S3: according to the entity word and the candidate hypernym contained in a candidate word pair determined from the sentence, determine the adjusted word vector of the entity word and the adjusted word vector of the candidate hypernym.
S4: merge the adjusted word vector of the entity word and the adjusted word vector of the candidate hypernym into one word vector matrix, and input it into the classifier to obtain the classification result of the classifier, the classification result indicating whether the candidate word pair is a hypernym pair.
The classifier may be a softmax classifier.
The vocabulary mining apparatus provided by the embodiments of the present application is described below; the vocabulary mining apparatus described below and the vocabulary mining method described above may be referred to correspondingly.
Referring to FIG. 6, FIG. 6 is a schematic structural diagram of a vocabulary mining apparatus disclosed in an embodiment of the present application. The apparatus has the function of implementing the above vocabulary mining method, and the function may be implemented by hardware, or by hardware executing corresponding software.
As shown in FIG. 6, the apparatus may include:
a set determining unit 11, configured to determine, for each sentence contained in the corpus to be mined, the set of entity words contained in the sentence, and the set of candidate hypernyms composed of the nouns and noun phrases contained in the sentence;
specifically, in determining the set of entity words contained in the sentence, the set determining unit may use a named entity recognition method to identify the entity words contained in the sentence, the identified entity words forming the set of entity words;
a candidate word pair determining unit 12, configured to combine pairwise the entity words in the entity word set and the candidate hypernyms in the candidate hypernym set, a word pair formed by combining an entity word and a candidate hypernym serving as a candidate word pair;
a word vector determining unit 13, configured to determine the respective word vectors of the entity word and the candidate hypernym in the candidate word pair, the respective word vectors forming a candidate word vector pair;
a mining result determining unit 14, configured to determine, according to the candidate word vector pair, whether the candidate word pair is a vocabulary mining result.
The vocabulary mining apparatus provided by the embodiments of the present application determines, for each sentence contained in the corpus to be mined, the set of entity words contained in the sentence and the set of candidate hypernyms composed of the nouns and noun phrases contained in the sentence; combines pairwise the entity words in the entity word set and the candidate hypernyms in the candidate hypernym set, a word pair formed by combining an entity word and a candidate hypernym serving as a candidate word pair; determines the respective word vectors of the entity word and the candidate hypernym in the candidate word pair, the respective word vectors forming a candidate word vector pair; and determines, according to the candidate word vector pair, whether the candidate word pair is a vocabulary mining result. The present application determines the set of entity words and the set of candidate hypernyms contained in a corpus sentence, combines pairwise the words in the two sets to obtain candidate word pairs, further determines the respective word vectors of the entity word and the candidate hypernym in each candidate word pair, and determines, according to the word vector pair, whether the candidate word pair is a vocabulary mining result, for example, whether the candidate word pair is a hypernym pair. The present application does not require manual organization of the corpus and realizes automatic mining of hypernym pairs, which greatly improves hypernym pair mining efficiency and reduces mining cost.
Optionally, the embodiments of the present application illustrate an optional structure of the above word vector determining unit 13. As can be seen from FIG. 7, the word vector determining unit 13 may include:
an initial word vector determining unit 131, configured to determine the initial word vector of each word contained in the sentence, the initial word vectors of the words forming an initial word vector matrix;
an initial word vector matrix adjusting unit 132, configured to adjust the initial word vector matrix by using a recurrent neural network model to obtain an adjusted word vector matrix composed of the adjusted word vectors of the words;
optionally, the recurrent neural network model may include a bidirectional long short-term memory artificial neural network model;
an adjusted word vector searching unit 133, configured to look up, in the adjusted word vector matrix, the adjusted word vectors respectively corresponding to the entity word and the candidate hypernym in the candidate word pair.
Optionally, the embodiments of the present application illustrate an optional structure of the above mining result determining unit 14, where the vocabulary mining result may be a hypernym pair. As can be seen from FIG. 8, the mining result determining unit 14 may include:
a classification determining unit 141, configured to input the candidate word vector pair into a pre-trained classification model to obtain the classification result output by the classification model, the classification result indicating whether the candidate word pair is a hypernym pair.
Optionally, the embodiments of the present application illustrate two optional structures of the above initial word vector determining unit 131, as shown in FIG. 9 and FIG. 10 respectively:
In the first structure, the initial word vector determining unit 131 may include:
a first initial word vector determining sub-unit 1311, configured to determine the initial word vector of each word contained in the sentence by using random numbers.
In the second structure, the initial word vector determining unit 131 may include:
a second initial word vector determining sub-unit 1312, configured to determine, by using the word2vec method, the word vector corresponding to each word contained in the sentence as the initial word vector.
It should be noted that, when the apparatus provided by the above embodiments implements its functions, the division into the above functional modules is merely used as an example for description; in practical applications, the above functions may be assigned to different functional modules as needed, that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above. In addition, the apparatus provided by the above embodiments and the method embodiments belong to the same concept; for the specific implementation process, refer to the method embodiments, and details are not repeated here.
In an exemplary embodiment, a computer readable storage medium is further provided, the storage medium storing at least one instruction, at least one program, a code set or an instruction set, which is loaded and executed by a processor of a computer device to implement the steps of the above method embodiments.
Optionally, the above computer readable storage medium may be a Read Only Memory (ROM), a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.
In an exemplary embodiment, a computer program product is further provided, which, when executed, is configured to implement the functions of the steps of the above method embodiments.
Finally, it should also be noted that, in this document, relational terms such as first and second are used merely to distinguish one entity or operation from another entity or operation, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise" or any other variant thereof are intended to cover a non-exclusive inclusion, so that a process, method, article or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, article or device that includes the element.
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for identical or similar parts among the embodiments, reference may be made to one another.
The above description of the disclosed embodiments enables a person skilled in the art to implement or use the present application. Various modifications to these embodiments will be obvious to a person skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present application. Therefore, the present application will not be limited to the embodiments shown herein, but shall conform to the widest scope consistent with the principles and novel features disclosed herein.

Claims (14)

  1. A vocabulary mining method, characterized in that the method comprises:
    for each sentence contained in a corpus to be mined, determining a set of entity words contained in the sentence, and a set of candidate hypernyms composed of the nouns and noun phrases contained in the sentence;
    combining pairwise the entity words in the entity word set and the candidate hypernyms in the candidate hypernym set, a word pair formed by combining an entity word and a candidate hypernym serving as a candidate word pair;
    determining respective word vectors of the entity word and the candidate hypernym in the candidate word pair, the respective word vectors forming a candidate word vector pair;
    determining, according to the candidate word vector pair, whether the candidate word pair is a vocabulary mining result.
  2. The method according to claim 1, characterized in that the determining respective word vectors of the entity word and the candidate hypernym in the candidate word pair comprises:
    determining an initial word vector of each word contained in the sentence, the initial word vectors of the words forming an initial word vector matrix;
    adjusting the initial word vector matrix by using a recurrent neural network model to obtain an adjusted word vector matrix composed of the adjusted word vectors of the words;
    looking up, in the adjusted word vector matrix, the adjusted word vectors respectively corresponding to the entity word and the candidate hypernym in the candidate word pair.
  3. The method according to claim 1 or 2, characterized in that the vocabulary mining result is a hypernym pair, and the determining, according to the candidate word vector pair, whether the candidate word pair is a vocabulary mining result comprises:
    inputting the candidate word vector pair into a pre-trained classification model to obtain a classification result output by the classification model, the classification result indicating whether the candidate word pair is the hypernym pair.
  4. The method according to claim 2, characterized in that the determining an initial word vector of each word contained in the sentence comprises:
    determining the initial word vector of each word contained in the sentence by using random numbers;
    or,
    determining, by using the word2vec method, the word vector corresponding to each word contained in the sentence as the initial word vector.
  5. The method according to claim 1, characterized in that the determining a set of entity words contained in the sentence comprises:
    identifying, by using a named entity recognition method, the entity words contained in the sentence, the entity words forming the set of entity words.
  6. The method according to claim 2, characterized in that the recurrent neural network model comprises:
    a bidirectional long short-term memory artificial neural network model.
  7. A vocabulary mining apparatus, characterized in that the apparatus comprises:
    a set determining unit, configured to determine, for each sentence contained in a corpus to be mined, a set of entity words contained in the sentence, and a set of candidate hypernyms composed of the nouns and noun phrases contained in the sentence;
    a candidate word pair determining unit, configured to combine pairwise the entity words in the entity word set and the candidate hypernyms in the candidate hypernym set, a word pair formed by combining an entity word and a candidate hypernym serving as a candidate word pair;
    a word vector determining unit, configured to determine respective word vectors of the entity word and the candidate hypernym in the candidate word pair, the respective word vectors forming a candidate word vector pair;
    a mining result determining unit, configured to determine, according to the candidate word vector pair, whether the candidate word pair is a vocabulary mining result.
  8. The apparatus according to claim 7, characterized in that the word vector determining unit comprises:
    an initial word vector determining unit, configured to determine an initial word vector of each word contained in the sentence, the initial word vectors of the words forming an initial word vector matrix;
    an initial word vector matrix adjusting unit, configured to adjust the initial word vector matrix by using a recurrent neural network model to obtain an adjusted word vector matrix composed of the adjusted word vectors of the words;
    an adjusted word vector searching unit, configured to look up, in the adjusted word vector matrix, the adjusted word vectors respectively corresponding to the entity word and the candidate hypernym in the candidate word pair.
  9. The apparatus according to claim 7 or 8, characterized in that the vocabulary mining result is a hypernym pair, and the mining result determining unit comprises:
    a classification determining unit, configured to input the candidate word vector pair into a pre-trained classification model to obtain a classification result output by the classification model, the classification result indicating whether the candidate word pair is a hypernym pair.
  10. The apparatus according to claim 8, characterized in that the initial word vector determining unit comprises:
    a first initial word vector determining sub-unit, configured to determine the initial word vector of each word contained in the sentence by using random numbers;
    or,
    a second initial word vector determining sub-unit, configured to determine, by using the word2vec method, the word vector corresponding to each word contained in the sentence as the initial word vector.
  11. The apparatus according to claim 7, characterized in that the process in which the set determining unit determines the set of entity words contained in the sentence specifically comprises:
    identifying, by using a named entity recognition method, the entity words contained in the sentence, the entity words forming the set of entity words.
  12. The apparatus according to claim 8, characterized in that the recurrent neural network model comprises:
    a bidirectional long short-term memory artificial neural network model.
  13. A computer device, characterized in that the computer device comprises a processor and a memory, the memory storing at least one instruction, at least one program, a code set or an instruction set, which is loaded and executed by the processor to implement the vocabulary mining method according to any one of claims 1 to 6.
  14. A computer readable storage medium, characterized in that the storage medium stores at least one instruction, at least one program, a code set or an instruction set, which is loaded and executed by a processor to implement the vocabulary mining method according to any one of claims 1 to 6.
PCT/CN2018/079259 2017-03-21 2018-03-16 Vocabulary mining method, apparatus and device WO2018171515A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710169796.7 2017-03-21
CN201710169796.7A CN108628821B (zh) 2017-03-21 2017-03-21 Vocabulary mining method and device

Publications (1)

Publication Number Publication Date
WO2018171515A1 (zh)

Family

ID=63584662

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/079259 WO2018171515A1 (zh) 2017-03-21 2018-03-16 一种词汇挖掘方法、装置及设备

Country Status (2)

Country Link
CN (1) CN108628821B (zh)
WO (1) WO2018171515A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110969549A (zh) * 2018-09-30 2020-04-07 北京国双科技有限公司 Judicial data processing method and system

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110196982B (zh) * 2019-06-12 2022-12-27 腾讯科技(深圳)有限公司 Hypernym-hyponym relation extraction method and apparatus, and computer device
CN112559711A (zh) * 2020-12-23 2021-03-26 作业帮教育科技(北京)有限公司 Synonymous text prompting method and apparatus, and electronic device
CN114020880B (zh) * 2022-01-06 2022-04-19 杭州费尔斯通科技有限公司 Method, system, electronic device and storage medium for extracting hypernyms

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090265344A1 (en) * 2008-04-22 2009-10-22 Ntt Docomo, Inc. Document processing device and document processing method
CN101794303A (zh) * 2010-02-11 2010-08-04 重庆邮电大学 Method and device for classifying texts using feature expansion and constructing a text classifier
CN106407211A (zh) * 2015-07-30 2017-02-15 富士通株式会社 Method and device for classifying semantic relations of entity words

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102214189B (zh) * 2010-04-09 2013-04-24 腾讯科技(深圳)有限公司 System and method for acquiring word usage knowledge based on data mining
CN103942198B (zh) * 2013-01-18 2017-07-28 佳能株式会社 Method and device for mining intentions
US9129013B2 (en) * 2013-03-12 2015-09-08 Nuance Communications, Inc. Methods and apparatus for entity detection
CN104679836B (zh) * 2015-02-06 2018-11-20 中国农业大学 Automatic agricultural ontology expansion method
CN104881399B (zh) * 2015-05-15 2017-10-27 中国科学院自动化研究所 Event recognition method and system based on probabilistic soft logic (PSL)
CN105574092B (zh) * 2015-12-10 2019-08-23 百度在线网络技术(北京)有限公司 Information mining method and device
CN106095872A (zh) * 2016-06-07 2016-11-09 北京高地信息技术有限公司 Answer ranking method and device for an intelligent question answering system

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090265344A1 (en) * 2008-04-22 2009-10-22 Ntt Docomo, Inc. Document processing device and document processing method
CN101794303A (zh) * 2010-02-11 2010-08-04 重庆邮电大学 采用特征扩展分类文本及构造文本分类器的方法和装置
CN106407211A (zh) * 2015-07-30 2017-02-15 富士通株式会社 对实体词的语义关系进行分类的方法和装置

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110969549A (zh) * 2018-09-30 2020-04-07 北京国双科技有限公司 Judicial data processing method and system
CN110969549B (zh) * 2018-09-30 2023-08-25 北京国双科技有限公司 Judicial data processing method and system

Also Published As

Publication number Publication date
CN108628821A (zh) 2018-10-09
CN108628821B (zh) 2022-11-25

Similar Documents

Publication Publication Date Title
US10176804B2 (en) Analyzing textual data
US10114809B2 (en) Method and apparatus for phonetically annotating text
WO2019153737A1 (zh) 用于对评论进行评估的方法、装置、设备和存储介质
KR102354716B1 (ko) 딥 러닝 모델을 이용한 상황 의존 검색 기법
WO2018171515A1 (zh) 一种词汇挖掘方法、装置及设备
US20170139899A1 (en) Keyword extraction method and electronic device
US10282419B2 (en) Multi-domain natural language processing architecture
WO2018157789A1 (zh) 一种语音识别的方法、计算机、存储介质以及电子装置
US10803253B2 (en) Method and device for extracting point of interest from natural language sentences
WO2020063092A1 (zh) 知识图谱的处理方法及装置
CN110162771B (zh) 事件触发词的识别方法、装置、电子设备
US11144729B2 (en) Summary generation method and summary generation apparatus
US11327971B2 (en) Assertion-based question answering
CN107967256B (zh) 词语权重预测模型生成方法、职位推荐方法及计算设备
CN110083681B (zh) 基于数据分析的搜索方法、装置及终端
US11756094B2 (en) Method and device for evaluating comment quality, and computer readable storage medium
US10592542B2 (en) Document ranking by contextual vectors from natural language query
US9703773B2 (en) Pattern identification and correction of document misinterpretations in a natural language processing system
TWI553491B (zh) 問句處理系統及其方法
Todi et al. Building a kannada pos tagger using machine learning and neural network models
WO2022042125A1 (zh) 一种命名实体识别方法
WO2014036827A1 (zh) 一种文本校正方法及用户设备
TW201822190A (zh) 語音辨識系統及其方法、詞彙建立方法與電腦程式產品
Habib et al. An exploratory approach to find a novel metric based optimum language model for automatic bangla word prediction
US20220365956A1 (en) Method and apparatus for generating patent summary information, and electronic device and medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18772665

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18772665

Country of ref document: EP

Kind code of ref document: A1