WO2018153265A1 - Keyword extraction method, computer device and storage medium - Google Patents

Keyword extraction method, computer device and storage medium

Info

Publication number
WO2018153265A1
Authority
WO
WIPO (PCT)
Prior art keywords
word
words
determined
text
processed
Prior art date
Application number
PCT/CN2018/075711
Other languages
English (en)
French (fr)
Inventor
王煦祥
尹庆宇
Original Assignee
腾讯科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司
Priority to JP2019521096A (patent document JP6956177B2, ja)
Priority to EP18758452.9A (patent document EP3518122A4, en)
Priority to KR1020197017920A (patent document KR102304673B1, ko)
Publication of WO2018153265A1
Priority to US16/363,646 (patent document US10963637B2, en)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/18 Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Definitions

  • the present application relates to the field of information technology, and in particular, to a keyword extraction method, a computer device, and a storage medium.
  • the traditional keyword extraction method uses a machine learning algorithm based on feature selection, which requires manual extraction of effective features according to the characteristics of the data. Since the way of human participation involves a large subjective idea, it is difficult to guarantee the accuracy of the keyword.
  • a keyword extraction method, a computer device, and a storage medium are provided.
  • a keyword extraction method, applied to a user terminal or a server, including:
  • a computer device comprising a memory and a processor, the memory storing computer readable instructions, the computer readable instructions being executed by the processor such that the processor performs the following steps:
  • One or more non-volatile storage media storing computer readable instructions, when executed by one or more processors, cause one or more processors to perform the following steps:
  • FIG. 1 is a schematic diagram of an application environment of a keyword extraction method according to an embodiment
  • FIG. 2 is a schematic diagram showing the internal structure of a computer device of an embodiment
  • FIG. 3 is a flow chart of a keyword extraction method of an embodiment
  • FIG. 4 is a flow chart of a keyword extraction method of another embodiment
  • FIG. 5 is a structural diagram of an LSTM unit of an embodiment
  • FIG. 6 is a schematic structural diagram of a model corresponding to a keyword extraction method according to an embodiment
  • FIG. 7 is a block diagram showing the structure of a computer device of an embodiment
  • FIG. 8 is a block diagram showing the structure of a computer device of another embodiment.
  • FIG. 1 is a schematic diagram of an application environment of a keyword extraction method provided by an embodiment.
  • the application environment includes a user terminal 110 and a server 120, and the user terminal 110 is communicatively coupled to the server 120.
  • the user terminal 110 is installed with a search engine or a question answering system.
  • the user inputs text through the user terminal 110, and the input text is sent to the server 120 through the communication network.
  • the server 120 processes the input text, extracts the keywords in the input text, and provides the user with search results or question-and-answer results.
  • alternatively, the user inputs text through the user terminal 110, the user terminal 110 processes the input text, extracts the keywords of the input text, and transmits the keywords to the server 120 through the communication network, and the server 120 provides the user with search results or question-and-answer results.
  • the computer device includes a processor, memory, and network interface connected by a system bus.
  • the processor is used to provide computing and control capabilities to support the operation of the entire computer device.
  • the memory includes a non-volatile storage medium and an internal memory; the non-volatile storage medium stores an operating system and computer readable instructions, and the internal memory provides a running environment for the operating system and the computer readable instructions in the non-volatile storage medium.
  • the computer readable instructions, when executed by the processor, cause the processor to perform a keyword extraction method.
  • the network interface is used for network communication with an external terminal.
  • FIG. 2 is only a block diagram of part of the structure related to the solution of the present application and does not constitute a limitation on the computer device to which the solution of the present application is applied; a specific computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
  • a keyword extraction method is provided.
  • the method runs in the server 120 shown in FIG. 1, and the method includes the following steps:
  • S310 Acquire each to-be-determined word of the text to be processed.
  • the text to be processed usually consists of individual characters. Compared with single characters, words express semantics better and carry more practical meaning.
  • the text to be processed can be preprocessed to obtain the to-be-determined words of the text to be processed.
  • a word to be judged is a word in the text to be processed for which it needs to be determined whether it is a keyword of the text to be processed.
  • the word to be judged may be a word of the text to be processed obtained after the word segmentation processing, that is, the preprocessing may include word segmentation processing.
  • the to-be-determined word may also be a word with practical meaning extracted from the words of the text to be processed, that is, the pre-processing may further include a process of identifying the stop word and excluding the stop word.
  • the method may further include the step of: acquiring the text to be processed.
  • the user inputs text through the user terminal, and the server acquires the text input by the user through the communication network to obtain the text to be processed.
  • S320 Determine the preceding words corresponding to each word to be judged, where a preceding word is a word that appears before the word to be judged in the text to be processed.
  • by this definition, the preceding words corresponding to each word to be judged can be determined from the text to be processed. Specifically, after the text to be processed is preprocessed (e.g., segmented into words), the preceding words appearing before each word to be judged are determined from the order in which the resulting words appear in the text to be processed.
  • S330 Determine a sequence of words of each to-be-determined word according to an order in which the preceding words corresponding to each of the to-be-determined words and the respective to-be-determined words appear in the to-be-processed text.
  • the first word to be judged in the text to be processed may have no corresponding preceding word; the word sequence of the first word to be judged may then consist of the first word to be judged itself.
  • every word to be judged other than the first necessarily has preceding words. Its corresponding word sequence consists of its preceding words plus the word itself, arranged in the order in which these words (the preceding words plus the word to be judged) appear in the text to be processed, as sketched below.
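  • As an illustration only (the patent itself contains no code), the sequence construction of S320-S330 can be sketched in Python; the function name and the plain-list representation are assumptions for exposition:

```python
def build_word_sequences(words):
    """For each word to be judged, collect its preceding words plus the
    word itself, in their order of appearance in the text."""
    return [words[: i + 1] for i in range(len(words))]

# Example with an already segmented, stop-word-filtered text:
words = ["宁波", "特产", "上海", "世博会"]
for seq in build_word_sequences(words):
    print(seq)
# ['宁波']
# ['宁波', '特产']
# ['宁波', '特产', '上海']
# ['宁波', '特产', '上海', '世博会']
```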
  • S350 Input the word sequence of each word to be judged into the trained recurrent neural network model, and obtain the probability that each word to be judged is a keyword of the text to be processed.
  • the recurrent neural network model in this embodiment may be an RNN (Recurrent Neural Network) model, a Long Short-Term Memory (LSTM) model, or a GRU (Gated Recurrent Unit) model.
  • the recurrent neural network model includes an input layer, a hidden layer, and an output layer; the hidden units in the hidden layer do the most important work, producing, from the input word sequence of a word to be judged, the probability that the word to be judged is a keyword of the text to be processed. Since the word sequence input into the trained recurrent neural network model is determined by the word to be judged and its preceding words, the preceding context can be fully considered, yielding a more accurate probability that the word to be judged is a keyword.
  • S360 Determine the keywords of the text to be processed according to the probability that each word to be judged is a keyword of the text to be processed and a preset threshold.
  • the probability that each word to be judged in the text to be processed is a keyword is compared with the preset threshold, and the words to be judged whose probability is greater than (or not less than) the preset threshold are determined to be keywords of the text to be processed.
  • the threshold setting depends on the specific requirements: if the threshold is set high, precision is high and recall correspondingly drops; if the threshold is set low, precision is low and recall is high.
  • the user can set the threshold as needed; for example, the threshold can be set to 0.5.
  • the above keyword extraction method does not require manually extracting effective features according to the characteristics of the data; instead, the word sequence is input into the trained recurrent neural network model to obtain the probability that the corresponding word to be judged is a keyword. Because the word sequence input into the trained recurrent neural network model is determined by the word to be judged and its preceding words, the preceding context can be fully considered, yielding a more accurate probability that each word to be judged is a keyword of the text to be processed and thereby improving the accuracy of the extracted keywords.
  • step S310 includes the following steps:
  • Step a Perform word segmentation processing on the text to be processed to obtain the words in the text to be processed.
  • Step b Identify the stop words in the text to be processed, and determine words other than the stop words in the text to be processed as the words to be judged.
  • the stop words in a stop word list can be compared against the words in the text to be processed to determine the stop words in the text to be processed.
  • commonly used stop words include "的", "了", and "什么" (roughly "of", "-ed", and "what"); such words can never serve as keywords.
  • in this embodiment, the words in the text to be processed other than the stop words are determined to be the words to be judged, and words other than stop words are usually content words. Taking content words rather than stop words as the words to be judged, on the one hand, prevents the outputs for stop words from degrading the accuracy of keyword extraction and, on the other hand, speeds up keyword extraction; a sketch of this preprocessing follows.
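  • A hedged illustration of steps a and b (the patent names no segmenter; jieba and the tiny stop list here are assumptions for exposition):

```python
import jieba  # an off-the-shelf Chinese word segmenter, assumed for illustration

STOP_WORDS = {"的", "了", "什么", "有", "能", "在", "呢"}  # illustrative stop list

def get_words_to_judge(text):
    words = jieba.lcut(text)                          # step a: word segmentation
    return [w for w in words if w not in STOP_WORDS]  # step b: drop stop words

print(get_words_to_judge("宁波有什么特产能在上海世博会占有一席之地呢"))
# e.g. ['宁波', '特产', '上海', '世博会', '占有', '一席之地']
```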
  • in one embodiment, the preceding words include the words, other than stop words, that appear before the word to be judged in the text to be processed. It can be understood that these are the content words appearing before the word to be judged in the text to be processed.
  • in another embodiment, the preceding words may include all words appearing before the word to be judged in the text to be processed, i.e., both the stop words and the content words appearing before the word to be judged.
  • step S330 can include:
  • Step a acquiring the word vectors of the preceding words corresponding to each word to be judged and the word vector of each word to be judged.
  • Step b determining the word sequence of each word to be judged from the word vectors of its corresponding preceding words and its own word vector, according to the order in which the preceding words and the word to be judged appear in the text to be processed; the word sequence is a sequence of word vectors.
  • the word vector is a vector representation of a word, which is a way to digitize words in natural language.
  • Word vectors can be trained using language models.
  • the commonly used language model is Word2vec, which uses the idea of deep learning to simplify the processing of text content into vector operations in K-dimensional vector space through training.
  • in a specific implementation, the word vector of each word can be obtained by training Word2vec on large-scale text data; the word vector of each word in the text to be processed can then be obtained by lookup, giving the word vectors of the preceding words corresponding to each word to be judged and the word vector of each word to be judged, as sketched below.
  • representing each word by its word vector better captures word-level semantic information, thereby further improving the accuracy of the extracted keywords.
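  • A minimal sketch of the lookup, assuming pretrained vectors loaded with gensim (the file name is hypothetical; the patent only requires vectors trained on large-scale text):

```python
import numpy as np
from gensim.models import KeyedVectors

# Hypothetical file of Word2vec vectors trained on a large corpus.
wv = KeyedVectors.load_word2vec_format("zh_word2vec_300d.bin", binary=True)

def to_vector_sequence(word_sequence):
    """Map a word sequence (preceding words + word to be judged) to the
    sequence of word vectors fed to the recurrent network."""
    return np.stack([wv[w] for w in word_sequence if w in wv])
```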
  • when the word sequence input into the trained recurrent neural network model is a sequence of word vectors, the output of the model's hidden layer is also a vector; to map this vector into the range 0-1 to represent the probability of each word to be judged, a Softmax function or a Sigmoid function may be used.
  • the Softmax function is a commonly used multi-class regression model. Judging whether a word to be judged is a keyword can be cast as a two-class problem, so the corresponding Softmax output has two dimensions: one dimension represents the probability of being a keyword, and the second represents the probability of not being a keyword.
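  • For concreteness, a numerically stable Softmax over an assumed two-dimensional output (the logit values are purely illustrative):

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([1.2, -0.4])     # [is-keyword score, not-keyword score]
p_keyword = softmax(logits)[0]     # mapped into the 0-1 range
print(round(float(p_keyword), 3))  # 0.832
```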
  • furthermore, the word vectors corresponding to the preceding words and to the word to be judged are obtained through training on a large-scale corpus.
  • using word vectors trained on a large-scale corpus makes full use of the semantic information of words to help identify keywords at the semantic level, so that the accuracy of the extracted keywords can be further improved.
  • the method further includes the following steps:
  • S340 Acquire training samples, and train the recurrent neural network model to obtain the trained recurrent neural network model; the training samples include element pairs, each element pair including a training word of a training text and the probability that the training word is a keyword of that training text.
  • in an element pair, the probability that the training word is a keyword of the training text takes the value 0 or 1: a value of 0 indicates that the training word is not a keyword of the training text, and a value of 1 indicates that it is.
  • during training, a Gaussian distribution can be used to initialize the network parameters of the recurrent neural network model.
  • the i-th word to be judged of the training text and its preceding words are formed into a word sequence in the order in which the words appear in the text; each word vector in the word sequence is input into the recurrent neural network model in turn to obtain the loss of the i-th word to be judged, and thus the loss of every word to be judged.
  • during training, the gradient descent method can be used to update the parameters of the recurrent neural network model; a sketch follows.
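  • A hedged training sketch in PyTorch (the patent prescribes no framework; shapes, the optimizer choice, and all names are assumptions). Here `model` is assumed to map a (seq_len, 1, embed_dim) word-vector sequence to two-class logits of shape (1, 2):

```python
import torch
import torch.nn as nn

def init_gaussian(model, std=0.1):
    """Gaussian initialization of the network parameters."""
    for p in model.parameters():
        nn.init.normal_(p, mean=0.0, std=std)

def train_epoch(model, optimizer, sequences, labels):
    """One pass over element pairs: each label is 0 or 1, i.e. whether the
    training word is a keyword of the training text."""
    criterion = nn.CrossEntropyLoss()
    total = 0.0
    for seq, label in zip(sequences, labels):            # seq: (seq_len, 1, embed_dim)
        optimizer.zero_grad()
        logits = model(seq)                              # (1, 2) logits
        loss = criterion(logits, torch.tensor([label]))  # loss of the i-th word
        loss.backward()
        optimizer.step()                                 # gradient-descent update
        total += loss.item()
    return total
```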
  • in one embodiment, the recurrent neural network model is an LSTM model.
  • the LSTM model builds on the RNN model: the hidden units in the recurrent neural network model are LSTM units.
  • the structure of an LSTM unit is shown in FIG. 5.
  • the memory cell is used to store history information, and the update and use of the history information are controlled by three gates: an Input Gate, a Forget Gate, and an Output Gate. Because the LSTM model overcomes the difficulty of variable-length sequence input, it can store history information better and can therefore further improve the accuracy of the extracted keywords.
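  • For reference (the patent does not reproduce them), the standard LSTM gate equations governing the memory cell are given below, where x_t is the word vector input at step t, h_t the hidden output, c_t the memory cell storing history information, and σ the sigmoid function:

```latex
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(input gate)} \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(forget gate)} \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```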
  • taking FIG. 5 and FIG. 6 together, and taking the LSTM model as the recurrent neural network model processing the word sequence of one word to be judged as an example: the word vector (word embedding) of each preceding word in the word sequence corresponding to the word to be judged, and the word vector of the word to be judged itself, are input to the trained LSTM model in the order in which the words appear in the text to be processed. The LSTM model is thus applied once per word to be judged, yielding the probability that each word to be judged is a keyword of the text to be processed.
  • each word to be judged is the input of the last LSTM unit of the LSTM model, so the output result for each word to be judged takes into account its preceding history, i.e., the semantic information of all the preceding words corresponding to that word to be judged. The result output for each word to be judged is the output of the last hidden layer (LSTM unit) of the LSTM model.
  • a model corresponding to the keyword extraction method includes LSTM units and a Softmax function.
  • with this model, a classifier can be constructed that determines, for each word to be judged of the text to be processed, the probability that it is a keyword. For a given word to be judged, all the words from the beginning of the sentence up to that word are extracted to form its word sequence.
  • the input of the model is word vectors; each LSTM unit outputs a result for the word whose word vector was input to it, and this result is combined with the next word vector in the word sequence as the input of the next LSTM unit.
  • the last LSTM unit takes the output of the previous LSTM unit combined with the word vector of the word to be judged as its input; its output is the result for the word to be judged, expressed in vector form, and this vector is passed through the Softmax function to determine the probability that the word to be judged is a keyword.
  • comparing the probability that the word to be judged is a keyword with the preset threshold determines whether the word to be judged is a keyword.
  • taking the text to be processed "宁波有什么特产能在上海世博会占有一席之地呢" ("What specialty of Ningbo can take a place at the Shanghai World Expo?") as an example, the words to be judged after word segmentation include "宁波" (Ningbo), "特产" (specialty), "上海" (Shanghai), "世博会" (World Expo), "占有" (take), and "一席之地" (a place). The word vector of each word to be judged and the word vectors of its preceding words are input into the trained recurrent neural network model in the order in which they appear in the text to be processed, yielding the probability that each word to be judged is a keyword of the text to be processed.
  • for example, when the word to be judged is "世博会", the corresponding word vectors can be input into the recurrent neural network model in the order "宁波", "有", "特产", "上海", "世博会", as shown in FIG. 6: the word vector of "宁波" is input to the first LSTM unit of the LSTM model, the word vector of "有" to the second LSTM unit, and so on, with the word vector of the word to be judged, "世博会", input to the last LSTM unit; each LSTM unit is affected by the output of the previous LSTM unit.
  • the output of the LSTM model is the probability obtained by mapping the output vector of the last LSTM unit through the Softmax function, giving the probability that each word to be judged is a keyword of the text to be processed. Since the input word vector sequence itself includes the vectors of the preceding words corresponding to the word to be judged and the word vector of the word to be judged, the preceding context is taken into account; moreover, inside the LSTM model history information can be stored well, so an even more accurate probability that the word to be judged is a keyword of the text to be processed is obtained. The sketch below puts the pieces together.
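  • A minimal end-to-end sketch in PyTorch (the patent prescribes no framework; layer sizes, names, and the random stand-in vectors are assumptions; the 0.5 threshold follows the text above):

```python
import torch
import torch.nn as nn

class KeywordLSTM(nn.Module):
    """Sketch of the FIG. 6 pipeline: an LSTM consumes the word-vector
    sequence; the last unit's hidden state is mapped to two-class scores."""
    def __init__(self, embed_dim=300, hidden_dim=128):
        super().__init__()
        self.lstm = nn.LSTM(embed_dim, hidden_dim)
        self.fc = nn.Linear(hidden_dim, 2)

    def forward(self, seq):                  # seq: (seq_len, 1, embed_dim)
        _, (h_n, _) = self.lstm(seq)         # h_n: last LSTM unit's hidden state
        return self.fc(h_n.squeeze(0))       # (1, 2) logits

def keyword_probability(model, seq):
    with torch.no_grad():
        return torch.softmax(model(seq), dim=-1)[0, 0].item()

# Usage: one forward pass per word to be judged, then threshold (S360).
model = KeywordLSTM()
seq = torch.randn(5, 1, 300)   # stand-in for the five word vectors of FIG. 6
is_keyword = keyword_probability(model, seq) > 0.5
```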
  • it should be understood that although the steps in the flowcharts of FIGS. 3 and 4 are displayed in sequence as indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, the execution of these steps is not strictly ordered, and the steps may be performed in other orders. Moreover, at least some of the steps in FIGS. 3 and 4 may include a plurality of sub-steps or stages, which are not necessarily completed at the same moment but may be performed at different moments; their execution order is also not necessarily sequential, and they may be performed in turn or alternately with other steps or with at least a part of the sub-steps or stages of other steps.
  • a computer device is further provided, whose internal structure may be as shown in FIG. 2; the computer device is provided with a keyword extraction apparatus, which includes modules that may each be implemented in whole or in part by software, hardware, or a combination thereof.
  • a keyword extraction apparatus includes: a to-be-judged word obtaining module 710, a preceding word determination module 720, a word sequence determination module 730, a probability determination module 750, and a keyword determination module 760.
  • the to-be-determined word obtaining module 710 is configured to obtain each to-be-determined word of the text to be processed.
  • the preceding word determination module 720 is configured to determine the preceding words corresponding to each word to be judged, where a preceding word is a word that appears before the word to be judged in the text to be processed.
  • the word sequence determination module 730 is configured to determine the word sequence according to the order in which each word to be judged and its corresponding preceding words appear in the text to be processed.
  • the probability determination module 750 is configured to input the word sequence of each word to be judged into the trained recurrent neural network model to obtain the probability that each word to be judged is a keyword of the text to be processed.
  • the keyword determination module 760 is configured to determine the keywords of the text to be processed according to the probability that each word to be judged is a keyword of the text to be processed and a preset threshold.
  • the above keyword extraction apparatus does not require manually extracting effective features according to the characteristics of the data; it inputs the word sequence into the trained recurrent neural network model to obtain the probability that the corresponding word to be judged is a keyword. Because the word sequence input into the trained recurrent neural network model is determined by the word to be judged and its preceding words, the preceding context can be fully considered, yielding a more accurate probability that each word to be judged is a keyword of the text to be processed and thereby improving the accuracy of the extracted keywords.
  • the to-be-judged word obtaining module 710 includes a word segmentation processing unit 711 and an identification determination unit 713.
  • the word segmentation processing unit 711 is configured to perform word segmentation processing on the text to be processed, and obtain words in the text to be processed.
  • the identification determining unit 713 is configured to identify a stop word in the to-be-processed text, and determine a word other than the stop word in the to-be-processed text as the to-be-determined word.
  • the preceding words include the words, other than stop words, that appear before the word to be judged in the text to be processed.
  • the word sequence determining module 730 includes: a word vector obtaining unit 731 and a word sequence determining unit 733;
  • the word vector obtaining unit 731 is configured to acquire the word vectors of the preceding words corresponding to each word to be judged and the word vector of each word to be judged;
  • the word sequence determination unit 733 is configured to determine the word sequence of each word to be judged from the word vectors of its corresponding preceding words and its own word vector, according to the order in which these words appear in the text to be processed; the word sequence is a sequence of word vectors.
  • the method further includes:
  • a model training module 740, configured to acquire training samples and train the recurrent neural network model to obtain the trained recurrent neural network model;
  • the training samples include element pairs, each element pair including a training word of a training text and the probability that the training word is a keyword of that training text.
  • the recurrent neural network model is an LSTM model.
  • a computer apparatus comprising a memory and a processor, the memory storing computer readable instructions that, when executed by the processor, cause the processor to perform the following steps:
  • the step of obtaining a to-be-determined word of the text to be processed includes:
  • the step of determining the word sequence of each word to be judged according to the order in which each word to be judged and its corresponding preceding words appear in the text to be processed includes: acquiring the word vectors of the preceding words corresponding to each word to be judged and the word vector of each word to be judged; and determining the word sequence of each word to be judged from these word vectors according to that order, the word sequence being a sequence of word vectors.
  • before the word sequence of each word to be judged is input into the trained recurrent neural network model, the method further includes the steps of: acquiring training samples and training the recurrent neural network model to obtain the trained recurrent neural network model; the training samples include element pairs, each element pair including a training word of a training text and the probability that the training word is a keyword of that training text.
  • the recurrent neural network model is an LSTM model.
  • Non-volatile memory can include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory can include random access memory (RAM) or external cache memory.
  • RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • Mathematical Optimization (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Pure & Applied Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Medical Informatics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Operations Research (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Character Discrimination (AREA)

Abstract

A keyword extraction method, applied to a user terminal or a server, the method comprising: acquiring each word to be judged of a text to be processed; determining the preceding words corresponding to each word to be judged, a preceding word being a word that appears before the word to be judged in the text to be processed; determining the word sequence of each word to be judged according to the order in which the word to be judged and its corresponding preceding words appear in the text to be processed; inputting the word sequence of each word to be judged into a trained recurrent neural network model to obtain the probability that each word to be judged is a keyword of the text to be processed; and determining the keywords of the text to be processed according to the probability that each word to be judged is a keyword of the text to be processed and a preset threshold.

Description

Keyword extraction method, computer device and storage medium
This application claims priority to Chinese patent application No. 2017101010131, filed with the Chinese Patent Office on February 23, 2017 and entitled "Keyword extraction method and apparatus", the entire contents of which are incorporated herein by reference.
TECHNICAL FIELD
This application relates to the field of information technology, and in particular to a keyword extraction method, a computer device, and a storage medium.
BACKGROUND
Ways of expressing information have become increasingly diverse with the development of the information age, and among them the use of text to express information remains irreplaceable. With the development of the Internet, the amount of online text information has grown explosively, and manually obtaining the required text information has become ever more difficult; how to obtain information efficiently has therefore become a very important topic.
To process massive amounts of text data effectively, researchers have carried out extensive work on text classification, text clustering, automatic summarization, information retrieval, and related directions, all of which involve a key and fundamental problem: how to obtain the keywords of a text.
Traditional keyword extraction methods use machine learning algorithms based on feature selection, which require manually extracting effective features according to the characteristics of the data. Because manual involvement carries substantial subjective judgment, the accuracy of the keywords is difficult to guarantee.
SUMMARY
According to the various embodiments provided in this application, a keyword extraction method, a computer device, and a storage medium are provided.
To achieve the above objective, the embodiments of this application adopt the following technical solutions:
A keyword extraction method, applied to a user terminal or a server, including:
acquiring each word to be judged of a text to be processed;
determining the preceding words corresponding to each word to be judged, the preceding words being words that appear before the word to be judged in the text to be processed;
determining the word sequence of each word to be judged according to the order in which the word to be judged and its corresponding preceding words appear in the text to be processed;
respectively inputting the word sequence of each word to be judged into a trained recurrent neural network model to obtain the probability that each word to be judged is a keyword of the text to be processed; and
determining the keywords of the text to be processed according to the probability that each word to be judged is a keyword of the text to be processed and a preset threshold.
A computer device, including a memory and a processor, the memory storing computer readable instructions that, when executed by the processor, cause the processor to perform the following steps:
acquiring each word to be judged of a text to be processed;
determining the preceding words corresponding to each word to be judged, the preceding words being words that appear before the word to be judged in the text to be processed;
determining the word sequence of each word to be judged according to the order in which the word to be judged and its corresponding preceding words appear in the text to be processed;
respectively inputting the word sequence of each word to be judged into a trained recurrent neural network model to obtain the probability that each word to be judged is a keyword of the text to be processed; and
determining the keywords of the text to be processed according to the probability that each word to be judged is a keyword of the text to be processed and a preset threshold.
One or more non-volatile storage media storing computer readable instructions that, when executed by one or more processors, cause the one or more processors to perform the following steps:
acquiring each word to be judged of a text to be processed;
determining the preceding words corresponding to each word to be judged, the preceding words being words that appear before the word to be judged in the text to be processed;
determining the word sequence of each word to be judged according to the order in which the word to be judged and its corresponding preceding words appear in the text to be processed;
respectively inputting the word sequence of each word to be judged into a trained recurrent neural network model to obtain the probability that each word to be judged is a keyword of the text to be processed; and
determining the keywords of the text to be processed according to the probability that each word to be judged is a keyword of the text to be processed and a preset threshold.
Details of one or more embodiments of this application are set forth in the accompanying drawings and the description below. Other features, objectives, and advantages of this application will become apparent from the specification, the drawings, and the claims.
BRIEF DESCRIPTION OF THE DRAWINGS
To describe the technical solutions in the embodiments of this application more clearly, the drawings required for describing the embodiments are briefly introduced below. Apparently, the drawings in the following description show only some embodiments of this application, and a person of ordinary skill in the art may derive other drawings from them without creative effort.
FIG. 1 is a schematic diagram of the application environment of a keyword extraction method according to an embodiment;
FIG. 2 is a schematic diagram of the internal structure of a computer device according to an embodiment;
FIG. 3 is a flowchart of a keyword extraction method according to an embodiment;
FIG. 4 is a flowchart of a keyword extraction method according to another embodiment;
FIG. 5 is a structural diagram of an LSTM unit according to an embodiment;
FIG. 6 is a schematic structural diagram of a model corresponding to a keyword extraction method according to an embodiment;
FIG. 7 is a structural block diagram of a computer device according to an embodiment;
FIG. 8 is a structural block diagram of a computer device according to another embodiment.
DETAILED DESCRIPTION
To make the objectives, technical solutions, and advantages of this application clearer, this application is further described in detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here are merely intended to explain this application and do not limit its scope of protection.
FIG. 1 is a schematic diagram of the application environment of a keyword extraction method provided by an embodiment. As shown in FIG. 1, the application environment includes a user terminal 110 and a server 120, the user terminal 110 being communicatively connected to the server 120. The user terminal 110 is installed with a search engine or a question answering system; a user inputs text through the user terminal 110, the input text is sent to the server 120 through a communication network, and the server 120 processes the input text, extracts the keywords in the input text, and provides the user with search results or question-and-answer results. Alternatively, the user inputs text through the user terminal 110, the user terminal 110 processes the input text, extracts the keywords of the input text, and sends the keywords to the server 120 through the communication network, and the server 120 provides the user with search results or question-and-answer results.
FIG. 2 is a schematic diagram of the internal structure of a computer device in an embodiment; the computer device may be a user terminal or a server. As shown in FIG. 2, the computer device includes a processor, a memory, and a network interface connected through a system bus. The processor provides computing and control capabilities and supports the operation of the entire computer device. The memory includes a non-volatile storage medium and an internal memory; the non-volatile storage medium can store an operating system and computer readable instructions, and the internal memory provides a running environment for the operating system and the computer readable instructions in the non-volatile storage medium. The computer readable instructions, when executed by the processor, can cause the processor to perform a keyword extraction method. The network interface is used for network communication with external terminals.
A person skilled in the art can understand that the structure shown in FIG. 2 is merely a block diagram of part of the structure related to the solution of this application and does not limit the computer device to which the solution of this application is applied; a specific computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
Referring to FIG. 3, in one embodiment a keyword extraction method is provided. The method runs in the server 120 shown in FIG. 1 and includes the following steps:
S310: Acquire each word to be judged of the text to be processed.
The text to be processed usually consists of individual characters. Compared with single characters, words express semantics better and carry more practical meaning.
The words to be judged of the text to be processed can be obtained by preprocessing the text to be processed. A word to be judged is a word in the text to be processed for which it needs to be determined whether it is a keyword of the text. A word to be judged may be a word of the text to be processed obtained after word segmentation, i.e., the preprocessing may include word segmentation. To improve processing efficiency, a word to be judged may also be a word with practical meaning extracted from the words of the text to be processed, i.e., the preprocessing may further include identifying stop words and excluding them.
In one implementation, before step S310 the method may further include the step of acquiring the text to be processed. The user inputs text through the user terminal, and the server obtains the text input by the user through the communication network as the text to be processed.
S320: Determine the preceding words corresponding to each word to be judged, a preceding word being a word that appears before the word to be judged in the text to be processed.
By the definition of a preceding word, a preceding word is a word appearing before the word to be judged in the text to be processed, so the preceding words corresponding to each word to be judged can be determined from the text to be processed. Specifically, after the text to be processed is preprocessed (e.g., segmented into words), the preceding words appearing before each word to be judged can be determined from the order in which the resulting words appear in the text to be processed.
S330: Determine the word sequence of each word to be judged according to the order in which the word to be judged and its corresponding preceding words appear in the text to be processed.
It should be noted that the first word to be judged in the text to be processed may have no corresponding preceding word; the word sequence of the first word to be judged may consist of the first word to be judged itself.
Every word to be judged other than the first necessarily has preceding words; its corresponding word sequence consists of its preceding words plus the word itself, a sequence determined by the order in which these words (the preceding words plus the word to be judged) appear in the text to be processed.
S350: Respectively input the word sequence of each word to be judged into the trained recurrent neural network model to obtain the probability that each word to be judged is a keyword of the text to be processed.
The recurrent neural network model in this embodiment may be an RNN (Recurrent Neural Network) model, a Long Short-Term Memory (LSTM) model, or a GRU (Gated Recurrent Unit) model. The recurrent neural network model includes an input layer, a hidden layer, and an output layer; the hidden units in the hidden layer do the most important work, producing from the input word sequence of a word to be judged the probability that the word to be judged is a keyword of the text to be processed. Since the word sequence input into the trained recurrent neural network model is determined by the word to be judged and its preceding words, the preceding context can be fully considered, yielding a more accurate probability that the word to be judged is a keyword of the text to be processed.
S360: Determine the keywords of the text to be processed according to the probability that each word to be judged is a keyword of the text to be processed and a preset threshold.
The probability that each word to be judged in the text to be processed is a keyword is compared with the preset threshold, and the words to be judged whose probability is greater than (or not less than) the preset threshold are determined to be keywords of the text to be processed.
The threshold setting depends on the specific requirements: if the threshold is set high, precision is high and recall correspondingly drops; if the threshold is set low, precision is low and recall is high. The user can set the threshold as needed, for example to 0.5.
The above keyword extraction method does not require manually extracting effective features according to the characteristics of the data; instead, the word sequence is input into the trained recurrent neural network model to obtain the probability that the corresponding word to be judged is a keyword. Because the word sequence input into the trained recurrent neural network model is determined by the word to be judged and its preceding words, the preceding context can be fully considered, yielding a more accurate probability that each word to be judged is a keyword of the text to be processed and thereby improving the accuracy of the extracted keywords.
In one embodiment, the step of acquiring the words to be judged of the text to be processed, i.e., step S310, includes the following steps:
Step a: Perform word segmentation on the text to be processed to obtain the words in the text to be processed.
Step b: Identify the stop words in the text to be processed, and determine the words in the text to be processed other than the stop words as the words to be judged.
The stop words in a stop word list can be compared against the words in the text to be processed to determine the stop words in the text to be processed. For example, commonly used stop words include "的", "了", and "什么"; such words can never serve as keywords. In this embodiment, the words in the text to be processed other than the stop words are determined to be the words to be judged, and words other than stop words are usually content words. Taking content words rather than stop words as the words to be judged, on the one hand, prevents the outputs for stop words from degrading the accuracy of keyword extraction and, on the other hand, speeds up keyword extraction.
In one embodiment, the preceding words include the words, other than stop words, appearing before the word to be judged in the text to be processed. Understandably, the words other than stop words appearing before the word to be judged are the content words appearing before the word to be judged in the text to be processed.
In another embodiment, the preceding words may include all words appearing before the word to be judged in the text to be processed, i.e., both the stop words and the content words appearing before the word to be judged.
In one embodiment, step S330 may include:
Step a: Acquire the word vectors of the preceding words corresponding to each word to be judged and the word vector of each word to be judged.
Step b: Determine the word sequence of each word to be judged from the word vectors of its corresponding preceding words and its own word vector, according to the order in which the preceding words and the word to be judged appear in the text to be processed; the word sequence is a sequence of word vectors.
A word vector is the vector representation of a word and is a way of digitizing the words of natural language; word vectors can be obtained by training a language model. A commonly used language model is Word2vec, which draws on ideas from deep learning and, through training, reduces the processing of text content to vector operations in a K-dimensional vector space. In a specific implementation, the word vector of each word can be obtained by training Word2vec on large-scale text data; the word vector of each word in the text to be processed can then be obtained by lookup, giving the word vectors of the preceding words corresponding to each word to be judged and the word vector of each word to be judged.
In this way, each word is represented by its word vector, which better captures word-level semantic information and thereby further improves the accuracy of the extracted keywords.
It should be noted that when the word sequence input into the trained recurrent neural network model is a sequence of word vectors, the output of the hidden layer of the trained recurrent neural network model is also a vector. To map this vector into the range 0-1 so as to represent the probability of each word to be judged, a Softmax function or a Sigmoid function may be used. The Softmax function is a commonly used multi-class regression model. Judging whether a word to be judged is a keyword can be cast as a two-class problem, so the corresponding Softmax output has two dimensions: one dimension represents the probability of being a keyword, and the second dimension represents the probability of not being a keyword.
Furthermore, the word vectors corresponding to the preceding words and to the word to be judged are obtained through training on a large-scale corpus. Using word vectors obtained by training on a large-scale corpus makes full use of the semantic information of words to help identify keywords at the semantic level, so the accuracy of the extracted keywords can be further improved.
Referring to FIG. 4, in one embodiment, to further improve the accuracy of the extracted keywords, before the word sequence of each word to be judged is respectively input into the trained recurrent neural network model, the method further includes the step:
S340: Acquire training samples, and train the recurrent neural network model to obtain the trained recurrent neural network model; the training samples include element pairs, each element pair including a training word of a training text and the probability that the training word is a keyword of that training text.
In an element pair, the probability that the training word is a keyword of the training text takes the value 0 or 1: a value of 0 indicates that the training word is not a keyword of the training text, and a value of 1 indicates that it is.
During training, a Gaussian distribution can be used to initialize the network parameters of the recurrent neural network model. During training, the i-th word to be judged of the training text and the preceding words of that word are formed into a word sequence in the order in which the words appear in the text to be processed, and the word vectors in the word sequence are input into the recurrent neural network model in turn to obtain the loss of the i-th word to be judged, and thus the loss of every word to be judged. It should also be noted that during training the gradient descent method can be used to update the parameters of the recurrent neural network model.
In one embodiment, the recurrent neural network model is an LSTM model.
The LSTM model builds on the RNN model, the hidden units in the recurrent neural network model being LSTM units. The structure of an LSTM unit is shown in FIG. 5. The memory cell is used to store history information, and the update and use of the history information are controlled by three gates: an Input Gate, a Forget Gate, and an Output Gate. Because the LSTM model can overcome the difficulty of variable-length sequence input, it can store history information better and can therefore further improve the accuracy of the extracted keywords.
Referring to FIG. 5 and FIG. 6 together, and taking the LSTM model as the recurrent neural network model processing the word sequence of one word to be judged as an example: the word vector (word embedding) of each preceding word in the word sequence corresponding to the word to be judged, and the word vector of the word to be judged itself, are input into the trained LSTM model in the order in which the words appear in the text to be processed; the LSTM model is thus applied once per word to be judged, yielding the probability that each word to be judged is a keyword of the text to be processed. Moreover, each word to be judged is the input of the last LSTM unit of the LSTM model, so the output result for each word to be judged takes into account its preceding history, i.e., the semantic information of all the preceding words corresponding to that word to be judged. The result output by the LSTM model for each word to be judged is the output of the last hidden layer (LSTM unit) of the LSTM model.
The keyword extraction method of the present invention is described below with reference to specific embodiments.
Continuing with FIG. 6, a model corresponding to the keyword extraction method includes LSTM units and a Softmax function. With this model, a classifier can be constructed that determines, for each word to be judged of the text to be processed, the probability that it becomes a keyword. For a given word to be judged, all the words from the beginning of the sentence up to that word are extracted to form its word sequence. The input of the model is word vectors; each LSTM unit can output a result for the word whose word vector was input to it, and this result is combined with the next word vector in the word sequence as the input of the next LSTM unit. The last LSTM unit takes the output of the previous LSTM unit combined with the word vector of the word to be judged as its input; its output is the result for the word to be judged, expressed in vector form, and this vector-form result is passed through the Softmax function to determine the probability that the word to be judged is a keyword. Comparing the probability that the word to be judged is a keyword with the preset threshold determines whether the word to be judged is a keyword.
Take the text to be processed "宁波有什么特产能在上海世博会占有一席之地呢" ("What specialty of Ningbo can take a place at the Shanghai World Expo?") as an example. After word segmentation, the determined words to be judged include "宁波" (Ningbo), "特产" (specialty), "上海" (Shanghai), "世博会" (World Expo), "占有" (take), and "一席之地" (a place). The word vector of each word to be judged and the word vectors of its preceding words are respectively input into the trained recurrent neural network model in the order in which they appear in the text to be processed, yielding the probability that each word to be judged is a keyword of the text to be processed. For example, when the word to be judged is "世博会", the corresponding word vectors can be input into the recurrent neural network model in the order "宁波", "有", "特产", "上海", "世博会", as shown in FIG. 6: the word vector of "宁波" is input to the first LSTM unit of the LSTM model, the word vector of "有" to the second LSTM unit, and so on, with the word vector of the word to be judged, "世博会", input to the last LSTM unit; each LSTM unit is affected by the output of the previous LSTM unit. The output of the LSTM model is the probability obtained by mapping the output vector of the last LSTM unit through the Softmax function, giving the probability that each word to be judged is a keyword of the text to be processed. Since the input word vector sequence itself includes the vectors of the preceding words corresponding to the word to be judged and the word vector of the word to be judged, the preceding context is taken into account; moreover, inside the LSTM model history information can be stored well, so an even more accurate probability that the word to be judged is a keyword of the text to be processed can be obtained.
It should be understood that although the steps in the flowcharts of FIG. 3 and FIG. 4 are displayed in sequence as indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, the execution of these steps is not strictly ordered, and the steps may be performed in other orders. Moreover, at least some of the steps in FIG. 3 and FIG. 4 may include a plurality of sub-steps or stages, which are not necessarily completed at the same moment but may be performed at different moments; their execution order is also not necessarily sequential, and they may be performed in turn or alternately with other steps or with at least a part of the sub-steps or stages of other steps.
In one embodiment, a computer device is further provided, whose internal structure may be as shown in FIG. 2. The computer device is provided with a keyword extraction apparatus, which includes modules that may each be implemented in whole or in part by software, hardware, or a combination thereof.
In one embodiment, a keyword extraction apparatus is provided; as shown in FIG. 7, it includes: a to-be-judged word obtaining module 710, a preceding word determination module 720, a word sequence determination module 730, a probability determination module 750, and a keyword determination module 760.
The to-be-judged word obtaining module 710 is configured to obtain each word to be judged of the text to be processed.
The preceding word determination module 720 is configured to determine the preceding words corresponding to each word to be judged, a preceding word being a word that appears before the word to be judged in the text to be processed.
The word sequence determination module 730 is configured to determine the word sequence according to the order in which each word to be judged and its corresponding preceding words appear in the text to be processed.
The probability determination module 750 is configured to respectively input the word sequence of each word to be judged into the trained recurrent neural network model to obtain the probability that each word to be judged is a keyword of the text to be processed.
The keyword determination module 760 is configured to determine the keywords of the text to be processed according to the probability that each word to be judged is a keyword of the text to be processed and a preset threshold.
The above keyword extraction apparatus does not require manually extracting effective features according to the characteristics of the data; it inputs the word sequence into the trained recurrent neural network model to obtain the probability that the corresponding word to be judged is a keyword. Because the word sequence input into the trained recurrent neural network model is determined by the word to be judged and its preceding words, the preceding context can be fully considered, yielding a more accurate probability that each word to be judged is a keyword of the text to be processed and thereby improving the accuracy of the extracted keywords.
Referring to FIG. 8, in one embodiment the to-be-judged word obtaining module 710 includes a word segmentation processing unit 711 and an identification determination unit 713.
The word segmentation processing unit 711 is configured to perform word segmentation on the text to be processed to obtain the words in the text to be processed.
The identification determination unit 713 is configured to identify the stop words in the text to be processed and determine the words in the text to be processed other than the stop words as the words to be judged.
In one embodiment, the preceding words include the words, other than stop words, appearing before the word to be judged in the text to be processed.
In one embodiment, the word sequence determination module 730 includes a word vector obtaining unit 731 and a word sequence determination unit 733.
The word vector obtaining unit 731 is configured to acquire the word vectors of the preceding words corresponding to each word to be judged and the word vector of each word to be judged.
The word sequence determination unit 733 is configured to determine the word sequence of each word to be judged from the word vectors of its corresponding preceding words and its own word vector, according to the order in which the preceding words and each word to be judged appear in the text to be processed; the word sequence is a sequence of word vectors.
Continuing with FIG. 8, in one embodiment the apparatus further includes:
a model training module 740, configured to acquire training samples and train the recurrent neural network model to obtain the trained recurrent neural network model; the training samples include element pairs, each element pair including a training word of a training text and the probability that the training word is a keyword of that training text.
In one embodiment, the recurrent neural network model is an LSTM model.
Since the above keyword extraction apparatus corresponds to the above keyword extraction method, the specific technical features of the apparatus that correspond to the method are not repeated here.
In one embodiment, a computer device is further provided, including a memory and a processor, the memory storing computer readable instructions that, when executed by the processor, cause the processor to perform the following steps:
acquiring each word to be judged of a text to be processed;
determining the preceding words corresponding to each word to be judged, the preceding words being words that appear before the word to be judged in the text to be processed;
determining the word sequence of each word to be judged according to the order in which the word to be judged and its corresponding preceding words appear in the text to be processed;
respectively inputting the word sequence of each word to be judged into a trained recurrent neural network model to obtain the probability that each word to be judged is a keyword of the text to be processed; and
determining the keywords of the text to be processed according to the probability that each word to be judged is a keyword of the text to be processed and a preset threshold.
In one embodiment, the step of acquiring the words to be judged of the text to be processed includes:
performing word segmentation on the text to be processed to obtain the words in the text to be processed;
identifying the stop words in the text to be processed, and determining the words in the text to be processed other than the stop words as the words to be judged.
In one embodiment, the step of determining the word sequence of each word to be judged according to the order in which each word to be judged and its corresponding preceding words appear in the text to be processed includes:
acquiring the word vectors of the preceding words corresponding to each word to be judged and the word vector of each word to be judged;
determining the word sequence of each word to be judged from the word vectors of its corresponding preceding words and its own word vector, according to the order in which the preceding words and each word to be judged appear in the text to be processed, the word sequence being a sequence of word vectors.
In one embodiment, before the word sequence of each word to be judged is respectively input into the trained recurrent neural network model, the method further includes the step of:
acquiring training samples, and training the recurrent neural network model to obtain the trained recurrent neural network model; the training samples include element pairs, each element pair including a training word of a training text and the probability that the training word is a keyword of that training text.
In one embodiment, the recurrent neural network model is an LSTM model.
A person of ordinary skill in the art can understand that all or part of the procedures of the methods in the above embodiments can be implemented by computer programs instructing the relevant hardware; the programs may be stored in a non-volatile computer readable storage medium and, when executed, may include the procedures of the embodiments of the above methods. Any reference to memory, storage, a database, or another medium used in the embodiments provided in this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments can be combined arbitrarily. For brevity of description, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of these technical features contains no contradiction, it should be considered within the scope of this specification. The above embodiments express only several implementations of this application; their description is specific and detailed, but they should not be construed as limiting the scope of the patent. It should be noted that a person of ordinary skill in the art can make several variations and improvements without departing from the concept of this application, and these all fall within the protection scope of this application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (15)

  1. A keyword extraction method, applied to a user terminal or a server, comprising:
    acquiring each word to be judged of a text to be processed;
    determining preceding words respectively corresponding to each word to be judged, the preceding words being words that appear before the word to be judged in the text to be processed;
    determining a word sequence of each word to be judged according to an order in which the word to be judged and its corresponding preceding words appear in the text to be processed;
    respectively inputting the word sequence of each word to be judged into a trained recurrent neural network model to obtain a probability that each word to be judged is a keyword of the text to be processed; and
    determining keywords of the text to be processed according to the probability that each word to be judged is a keyword of the text to be processed and a preset threshold.
  2. The keyword extraction method according to claim 1, wherein the step of acquiring the words to be judged of the text to be processed comprises:
    performing word segmentation on the text to be processed to obtain words in the text to be processed;
    identifying stop words in the text to be processed, and determining words in the text to be processed other than the stop words as the words to be judged.
  3. The keyword extraction method according to claim 1, wherein the step of determining the word sequence of each word to be judged according to the order in which the word to be judged and its corresponding preceding words appear in the text to be processed comprises:
    acquiring word vectors of the preceding words corresponding to each word to be judged and a word vector of each word to be judged;
    determining the word sequence of each word to be judged from the word vectors of its corresponding preceding words and its own word vector, according to the order in which the preceding words and the word to be judged appear in the text to be processed, the word sequence being a sequence of word vectors.
  4. The keyword extraction method according to claim 1, further comprising, before respectively inputting the word sequence of each word to be judged into the trained recurrent neural network model, the step of:
    acquiring training samples, and training a recurrent neural network model to obtain the trained recurrent neural network model, the training samples comprising element pairs, each element pair comprising a training word of a training text and a probability that the training word is a keyword of the training text.
  5. The keyword extraction method according to claim 1, wherein the recurrent neural network model is an LSTM model.
  6. A computer device, comprising a memory and a processor, the memory storing computer readable instructions, wherein the computer readable instructions, when executed by the processor, cause the processor to perform the following steps:
    acquiring each word to be judged of a text to be processed;
    determining preceding words respectively corresponding to each word to be judged, the preceding words being words that appear before the word to be judged in the text to be processed;
    determining a word sequence of each word to be judged according to an order in which the word to be judged and its corresponding preceding words appear in the text to be processed;
    respectively inputting the word sequence of each word to be judged into a trained recurrent neural network model to obtain a probability that each word to be judged is a keyword of the text to be processed; and
    determining keywords of the text to be processed according to the probability that each word to be judged is a keyword of the text to be processed and a preset threshold.
  7. The computer device according to claim 6, wherein the step of acquiring the words to be judged of the text to be processed comprises:
    performing word segmentation on the text to be processed to obtain words in the text to be processed;
    identifying stop words in the text to be processed, and determining words in the text to be processed other than the stop words as the words to be judged.
  8. The computer device according to claim 6, wherein the step of determining the word sequence of each word to be judged according to the order in which the word to be judged and its corresponding preceding words appear in the text to be processed comprises:
    acquiring word vectors of the preceding words corresponding to each word to be judged and a word vector of each word to be judged;
    determining the word sequence of each word to be judged from the word vectors of its corresponding preceding words and its own word vector, according to the order in which the preceding words and the word to be judged appear in the text to be processed, the word sequence being a sequence of word vectors.
  9. The computer device according to claim 6, further comprising, before respectively inputting the word sequence of each word to be judged into the trained recurrent neural network model, the step of:
    acquiring training samples, and training a recurrent neural network model to obtain the trained recurrent neural network model, the training samples comprising element pairs, each element pair comprising a training word of a training text and a probability that the training word is a keyword of the training text.
  10. The computer device according to claim 6, wherein the recurrent neural network model is an LSTM model.
  11. One or more non-volatile storage media storing computer readable instructions, wherein the computer readable instructions, when executed by one or more processors, cause the one or more processors to perform the following steps:
    acquiring each word to be judged of a text to be processed;
    determining preceding words respectively corresponding to each word to be judged, the preceding words being words that appear before the word to be judged in the text to be processed;
    determining a word sequence of each word to be judged according to an order in which the word to be judged and its corresponding preceding words appear in the text to be processed;
    respectively inputting the word sequence of each word to be judged into a trained recurrent neural network model to obtain a probability that each word to be judged is a keyword of the text to be processed; and
    determining keywords of the text to be processed according to the probability that each word to be judged is a keyword of the text to be processed and a preset threshold.
  12. The storage medium according to claim 11, wherein the step of acquiring the words to be judged of the text to be processed comprises:
    performing word segmentation on the text to be processed to obtain words in the text to be processed;
    identifying stop words in the text to be processed, and determining words in the text to be processed other than the stop words as the words to be judged.
  13. The storage medium according to claim 11, wherein the step of determining the word sequence of each word to be judged according to the order in which the word to be judged and its corresponding preceding words appear in the text to be processed comprises:
    acquiring word vectors of the preceding words corresponding to each word to be judged and a word vector of each word to be judged;
    determining the word sequence of each word to be judged from the word vectors of its corresponding preceding words and its own word vector, according to the order in which the preceding words and the word to be judged appear in the text to be processed, the word sequence being a sequence of word vectors.
  14. The storage medium according to claim 11, further comprising, before respectively inputting the word sequence of each word to be judged into the trained recurrent neural network model, the step of:
    acquiring training samples, and training a recurrent neural network model to obtain the trained recurrent neural network model, the training samples comprising element pairs, each element pair comprising a training word of a training text and a probability that the training word is a keyword of the training text.
  15. The storage medium according to claim 11, wherein the recurrent neural network model is an LSTM model.
PCT/CN2018/075711 2017-02-23 2018-02-08 Keyword extraction method, computer device and storage medium WO2018153265A1 (zh)

Priority Applications (4)

Application Number Priority Date Filing Date Title
JP2019521096A JP6956177B2 (ja) 2017-02-23 2018-02-08 Keyword extraction method, computer device and storage medium
EP18758452.9A EP3518122A4 (en) 2017-02-23 2018-02-08 METHOD OF EXTRACTING KEYWORDS, COMPUTER DEVICE AND INFORMATION CARRIER
KR1020197017920A KR102304673B1 (ko) 2017-02-23 2018-02-08 Keyword extraction method, computer device, and storage medium
US16/363,646 US10963637B2 (en) 2017-02-23 2019-03-25 Keyword extraction method, computer equipment and storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710101013.1 2017-02-23
CN201710101013.1A CN108304365A (zh) 2017-02-23 2017-02-23 Keyword extraction method and apparatus

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US16/363,646 Continuation US10963637B2 (en) 2017-02-23 2019-03-25 Keyword extraction method, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
WO2018153265A1 true WO2018153265A1 (zh) 2018-08-30

Family

ID=62872540

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/075711 WO2018153265A1 (zh) 2018-02-08 Keyword extraction method, computer device and storage medium

Country Status (6)

Country Link
US (1) US10963637B2 (zh)
EP (1) EP3518122A4 (zh)
JP (1) JP6956177B2 (zh)
KR (1) KR102304673B1 (zh)
CN (1) CN108304365A (zh)
WO (1) WO2018153265A1 (zh)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109615241A (zh) * 2018-12-13 2019-04-12 A software bug triage method based on convolutional and recurrent neural networks
CN109635288A (zh) * 2018-11-29 2019-04-16 A resume extraction method based on deep neural networks
CN109740152A (zh) * 2018-12-25 2019-05-10 Text category determination method, apparatus, storage medium and computer device
CN109902273A (zh) * 2019-01-30 2019-06-18 Modeling method and apparatus for a keyword generation model
CN110110330A (zh) * 2019-04-30 2019-08-09 Text-based keyword extraction method and computer device
CN111460096A (zh) * 2020-03-26 2020-07-28 Fragmented text processing method, apparatus and electronic device
CN111709230A (zh) * 2020-04-30 2020-09-25 Automatic short text summarization method based on a part-of-speech soft template attention mechanism
CN111859940A (zh) * 2019-04-23 2020-10-30 Keyword extraction method, apparatus, electronic device and storage medium
CN112015884A (zh) * 2020-08-28 2020-12-01 User visit data keyword extraction method, apparatus and storage medium
CN113076756A (zh) * 2020-01-06 2021-07-06 Text generation method and apparatus

Families Citing this family (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220215292A1 (en) * 2018-08-01 2022-07-07 Intuit Inc. Method to identify incorrect account numbers
CN111027313A (zh) * 2018-10-08 2020-04-17 中国科学院沈阳计算技术研究所有限公司 Attention-mechanism-based BiGRU analysis method for the tendency of judgment results
US11537664B2 (en) 2019-05-23 2022-12-27 Google Llc Learning to select vocabularies for categorical features
US11316810B2 (en) * 2019-06-07 2022-04-26 International Business Machines Corporation Messaging system for automatically generating semantic contextual messages
CN110598095B (zh) * 2019-08-27 2024-02-13 深圳市雅阅科技有限公司 Method, apparatus and storage medium for identifying articles containing specified information
CN111144127B (zh) * 2019-12-25 2023-07-25 科大讯飞股份有限公司 Text semantic recognition method, method for obtaining its model, and related apparatus
CN111738791B (zh) * 2020-01-20 2024-05-24 北京沃东天骏信息技术有限公司 Text processing method, apparatus, device and storage medium
CN113221553A (zh) * 2020-01-21 2021-08-06 腾讯科技(深圳)有限公司 Text processing method, apparatus, device and readable storage medium
KR102216066B1 (ko) * 2020-05-04 2021-02-18 호서대학교 산학협력단 Method for providing search results for sentence-type queries
KR102418260B1 (ko) * 2020-05-27 2022-07-06 삼성생명보험주식회사 Method for analyzing customer consultation records
CN111737996B (zh) * 2020-05-29 2024-03-26 北京百度网讯科技有限公司 Method, apparatus, device and storage medium for obtaining word vectors based on a language model
CN111831821B (zh) * 2020-06-03 2024-01-09 北京百度网讯科技有限公司 Training sample generation method and apparatus for a text classification model, and electronic device
CN111967268B (zh) * 2020-06-30 2024-03-19 北京百度网讯科技有限公司 Method and apparatus for extracting events from text, electronic device and storage medium
CN112131877B (zh) * 2020-09-21 2024-04-09 民生科技有限责任公司 Real-time Chinese text word segmentation method for massive data
CN112052375B (zh) * 2020-09-30 2024-06-11 北京百度网讯科技有限公司 Public opinion acquisition and word viscosity model training method and device, server and medium
CN112884440B (zh) * 2021-03-02 2024-05-24 岭东核电有限公司 Test procedure execution method and apparatus in nuclear power testing, and computer device
KR102620697B1 (ko) * 2021-07-12 2024-01-02 주식회사 카카오뱅크 Method and apparatus for determining transfer information in messages through deep-learning-based natural language processing
CN113761161A (zh) * 2021-08-10 2021-12-07 紫金诚征信有限公司 Text keyword extraction method, apparatus, computer device and storage medium
CN113808758B (zh) * 2021-08-31 2024-06-07 联仁健康医疗大数据科技股份有限公司 Method and apparatus for standardizing test data, electronic device and storage medium
CN113609843B (zh) * 2021-10-12 2022-02-01 京华信息科技股份有限公司 Sentence-word probability calculation method and system based on gradient boosting decision trees

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101122909A (zh) * 2006-08-10 2008-02-13 Text information retrieval apparatus and text information retrieval method
CN101944099A (zh) * 2010-06-24 2011-01-12 Method for automatic text document classification using an ontology
US20110313865A1 (en) * 2010-06-17 2011-12-22 Microsoft Corporation Ad copy quality detection and scoring
CN105139237A (zh) * 2015-09-25 2015-12-09 Information push method and apparatus

Family Cites Families (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4360122B2 (ja) * 2003-05-26 2009-11-11 富士ゼロックス株式会社 Keyword extraction apparatus
US7493322B2 (en) * 2003-10-15 2009-02-17 Xerox Corporation System and method for computing a measure of similarity between documents
US7519588B2 (en) * 2005-06-20 2009-04-14 Efficient Frontier Keyword characterization and application
US8346534B2 (en) * 2008-11-06 2013-01-01 University of North Texas System Method, system and apparatus for automatic keyword extraction
US9715660B2 (en) * 2013-11-04 2017-07-25 Google Inc. Transfer learning for deep neural network based hotword detection
JP6230190B2 (ja) * 2014-01-09 2017-11-15 日本放送協会 Important word extraction apparatus and program
KR102305584B1 (ko) * 2015-01-19 2021-09-27 삼성전자주식회사 Language model training method and apparatus, and language recognition method and apparatus
KR101656741B1 (ko) * 2015-04-23 2016-09-12 고려대학교 산학협력단 Frame-based opinion spam determination apparatus and method, and computer program and computer readable recording medium for frame-based opinion spam determination
US9916376B2 (en) * 2015-08-11 2018-03-13 Fujitsu Limited Digital document keyword generation
CN105260359B (zh) * 2015-10-16 2018-10-02 晶赞广告(上海)有限公司 Semantic keyword extraction method and apparatus
CN105955952A (zh) * 2016-05-03 2016-09-21 成都数联铭品科技有限公司 Information extraction method based on bidirectional recurrent neural networks
CN106095749A (zh) * 2016-06-03 2016-11-09 杭州量知数据科技有限公司 Text keyword extraction method based on deep learning
US10056083B2 (en) * 2016-10-18 2018-08-21 Yen4Ken, Inc. Method and system for processing multimedia content to dynamically generate text transcript
US10255269B2 (en) * 2016-12-30 2019-04-09 Microsoft Technology Licensing, Llc Graph long short term memory for syntactic relationship discovery
CN111078838B (zh) * 2019-12-13 2023-08-18 北京小米智能科技有限公司 Keyword extraction method, keyword extraction apparatus and electronic device

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101122909A (zh) * 2006-08-10 2008-02-13 株式会社日立制作所 Text information retrieval apparatus and text information retrieval method
US20110313865A1 (en) * 2010-06-17 2011-12-22 Microsoft Corporation Ad copy quality detection and scoring
CN101944099A (zh) * 2010-06-24 2011-01-12 西北工业大学 Method for automatic text document classification using an ontology
CN105139237A (zh) * 2015-09-25 2015-12-09 百度在线网络技术(北京)有限公司 Information push method and apparatus

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP3518122A4 *

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109635288B (zh) * 2018-11-29 2023-05-23 A resume extraction method based on deep neural networks
CN109635288A (zh) * 2018-11-29 2019-04-16 A resume extraction method based on deep neural networks
CN109615241A (zh) * 2018-12-13 2019-04-12 A software bug triage method based on convolutional and recurrent neural networks
CN109740152A (zh) * 2018-12-25 2019-05-10 Text category determination method, apparatus, storage medium and computer device
CN109902273A (zh) * 2019-01-30 2019-06-18 Modeling method and apparatus for a keyword generation model
CN109902273B (zh) * 2019-01-30 2024-05-07 Modeling method and apparatus for a keyword generation model
CN111859940B (zh) * 2019-04-23 2024-05-14 Keyword extraction method, apparatus, electronic device and storage medium
CN111859940A (zh) * 2019-04-23 2020-10-30 Keyword extraction method, apparatus, electronic device and storage medium
CN110110330B (zh) * 2019-04-30 2023-08-11 Text-based keyword extraction method and computer device
CN110110330A (zh) * 2019-04-30 2019-08-09 Text-based keyword extraction method and computer device
CN113076756A (zh) * 2020-01-06 2021-07-06 Text generation method and apparatus
CN111460096B (zh) * 2020-03-26 2023-12-22 Fragmented text processing method, apparatus and electronic device
CN111460096A (zh) * 2020-03-26 2020-07-28 Fragmented text processing method, apparatus and electronic device
CN111709230B (zh) * 2020-04-30 2023-04-07 Automatic short text summarization method based on a part-of-speech soft template attention mechanism
CN111709230A (zh) * 2020-04-30 2020-09-25 Automatic short text summarization method based on a part-of-speech soft template attention mechanism
CN112015884A (zh) * 2020-08-28 2020-12-01 User visit data keyword extraction method, apparatus and storage medium

Also Published As

Publication number Publication date
CN108304365A (zh) 2018-07-20
KR20190085098A (ko) 2019-07-17
US20190220514A1 (en) 2019-07-18
JP6956177B2 (ja) 2021-11-02
EP3518122A1 (en) 2019-07-31
EP3518122A4 (en) 2019-11-20
US10963637B2 (en) 2021-03-30
JP2019531562A (ja) 2019-10-31
KR102304673B1 (ko) 2021-09-23

Similar Documents

Publication Publication Date Title
WO2018153265A1 (zh) Keyword extraction method, computer device and storage medium
AU2018214675B2 (en) Systems and methods for automatic semantic token tagging
CN109783655B (zh) Cross-modal retrieval method and apparatus, computer device and storage medium
US11941366B2 (en) Context-based multi-turn dialogue method and storage medium
CN108694225B (zh) Image search method, feature vector generation method, apparatus and electronic device
WO2021068321A1 (zh) Information push method and apparatus based on human-computer interaction, and computer device
WO2020177230A1 (zh) Machine-learning-based medical data classification method and apparatus, computer device and storage medium
WO2021042503A1 (zh) Information classification and extraction method and apparatus, computer device and storage medium
WO2019136993A1 (zh) Text similarity calculation method and apparatus, computer device and storage medium
CN111709243B (zh) Knowledge extraction method and apparatus based on deep learning
US11657802B2 (en) Utilizing a dynamic memory network for state tracking
CN112015900B (zh) Medical attribute knowledge graph construction method, apparatus, device and medium
KR102194200B1 (ko) Stock index prediction method and apparatus by news article analysis using an artificial neural network model
WO2020114100A1 (zh) Information processing method, apparatus and computer storage medium
CN112766319B (zh) Dialogue intent recognition model training method and apparatus, computer device and medium
CN111191032B (zh) Corpus expansion method and apparatus, computer device and storage medium
WO2022116436A1 (zh) Semantic matching method and apparatus for long and short sentence texts, computer device and storage medium
CN111191002B (zh) Neural code search method and apparatus based on hierarchical embedding
WO2022134805A1 (zh) Document classification prediction method and apparatus, computer device and storage medium
CN112380837B (zh) Similar sentence matching method based on a translation model, apparatus, device and medium
CN112580329B (zh) Text noise data recognition method and apparatus, computer device and storage medium
CN111680132B (zh) Noise filtering and automatic classification method for Internet text information
WO2020132933A1 (zh) Short text filtering method, apparatus, medium and computer device
CN112307048A (zh) Semantic matching model training method, matching method, apparatus, device and storage medium
CN115374786A (zh) Joint entity and relation extraction method and apparatus, storage medium and terminal

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18758452

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2019521096

Country of ref document: JP

Kind code of ref document: A

ENP Entry into the national phase

Ref document number: 2018758452

Country of ref document: EP

Effective date: 20190425

ENP Entry into the national phase

Ref document number: 20197017920

Country of ref document: KR

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE