WO2018153265A1 - Keyword extraction method, computer device and storage medium - Google Patents
Keyword extraction method, computer device and storage medium
- Publication number
- WO2018153265A1 (PCT/CN2018/075711)
- Authority: WO (WIPO (PCT))
- Prior art keywords: word, words, determined, text, processed
- Prior art date
Classifications
- G06F40/216—Parsing using statistical methods
- G06F17/18—Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis
- G06F40/205—Parsing
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
- G06N20/00—Machine learning
- G06N3/044—Recurrent networks, e.g. Hopfield networks
- G06N3/08—Learning methods
Definitions
- the present application relates to the field of information technology, and in particular, to a keyword extraction method, a computer device, and a storage medium.
- the traditional keyword extraction method uses machine learning algorithms based on feature selection, which require manually extracting effective features according to the characteristics of the data. Because this manual involvement introduces substantial subjectivity, the accuracy of the extracted keywords is difficult to guarantee.
- a keyword extraction method, a computer device, and a storage medium are provided.
- a keyword extraction method, applied to a user terminal or a server, including:
- a computer device comprising a memory and a processor, the memory storing computer readable instructions, the computer readable instructions being executed by the processor such that the processor performs the following steps:
- One or more non-volatile storage media storing computer readable instructions, when executed by one or more processors, cause one or more processors to perform the following steps:
- FIG. 1 is a schematic diagram of an application environment of a keyword extraction method according to an embodiment
- FIG. 2 is a schematic diagram showing the internal structure of a computer device of an embodiment
- FIG. 3 is a flow chart of a keyword extraction method of an embodiment
- FIG. 5 is a structural diagram of an LSTM unit of an embodiment
- FIG. 6 is a schematic structural diagram of a model corresponding to a keyword extraction method according to an embodiment
- FIG. 7 is a block diagram showing the structure of a computer device of an embodiment
- FIG. 8 is a block diagram showing the structure of a computer device of another embodiment.
- FIG. 1 is a schematic diagram of an application environment of a keyword extraction method provided by an embodiment.
- the application environment includes a user terminal 110 and a server 120, and the user terminal 110 is communicatively coupled to the server 120.
- the user terminal 110 is installed with a search engine or a question answering system.
- the user inputs text through the user terminal 110, and the input text is sent to the server 120 through the communication network.
- the server 120 processes the input text, extracts keywords from the input text, and provides the user with search results or question-and-answer results.
- the user inputs text through the user terminal 110
- the user terminal 110 processes the input text, extracts the keyword of the input text, and transmits the keyword to the server 120 through the communication network, and the server 120 provides the user with a search result or a question and answer result.
- the computer device includes a processor, memory, and network interface connected by a system bus.
- the processor is used to provide computing and control capabilities to support the operation of the entire computer device.
- the memory includes a non-volatile storage medium and an internal memory; the non-volatile storage medium stores an operating system and computer readable instructions, and the internal memory provides a runtime environment for the operating system and the computer readable instructions in the non-volatile storage medium. The computer readable instructions, when executed by the processor, cause the processor to perform a keyword extraction method.
- the network interface is used for network communication with an external terminal.
- FIG. 2 is only a block diagram of the part of the structure related to the solution of the present application and does not limit the computer devices to which the solution may be applied; a specific computer device may include more or fewer components than shown in the figure, combine certain components, or arrange the components differently.
- a keyword extraction method is provided.
- the method runs in the server 120 shown in FIG. 1, and the method includes the following steps:
- S310 Acquire each to-be-determined word of the text to be processed.
- the text to be processed usually consists of individual characters; compared with single characters, words express semantics more faithfully.
- the text to be processed can be preprocessed to obtain the to-be-determined words of the text to be processed.
- the word to be judged is a word in the text to be processed for which it needs to be determined whether it is a keyword of that text.
- the word to be judged may be a word obtained after word segmentation of the text to be processed; that is, the preprocessing may include word segmentation.
- the word to be judged may also be a word with substantive meaning extracted from the words of the text to be processed; that is, the preprocessing may further include identifying stop words and excluding them, as sketched below.
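A minimal preprocessing sketch of this step, assuming Python with the jieba segmenter and an illustrative stop-word set (the patent names neither a segmenter nor a specific stop-word list):

```python
# Hedged sketch of step S310: word segmentation plus stop-word removal.
# jieba is an assumed segmenter; STOP_WORDS is an illustrative set.
import jieba

STOP_WORDS = {"的", "了", "什么"}  # e.g. "of", "a", "what"

def get_words_to_judge(text: str) -> list[str]:
    words = jieba.lcut(text)  # segment the text into words
    # keep only substantive words: drop whitespace tokens and stop words
    return [w for w in words if w.strip() and w not in STOP_WORDS]
```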
- the method may further include the step of: acquiring the text to be processed.
- the user inputs text through the user terminal, and the server acquires the text input by the user through the communication network to obtain the text to be processed.
- S320: determine the preceding words corresponding to each word to be judged, a preceding word being a word that appears before the word to be judged in the text to be processed.
- the preceding words corresponding to each word to be judged can be determined from the text to be processed itself. Specifically, after preprocessing (e.g., word segmentation) of the text to be processed, the order in which the resulting words appear in the text determines which words precede each word to be judged.
- S330 Determine a sequence of words of each to-be-determined word according to an order in which the preceding words corresponding to each of the to-be-determined words and the respective to-be-determined words appear in the to-be-processed text.
- the first word to be judged in the text to be processed may have no corresponding preceding word; in that case its word sequence may consist of the first word alone.
- for every other word to be judged, the corresponding word sequence is its preceding words plus the word itself, arranged in the order in which these words appear in the text to be processed, as sketched below.
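As a sketch of this step (plain Python, names illustrative), the word sequence for the i-th word to be judged is simply the words up to and including position i, in text order:

```python
# Hedged sketch of step S330: one word sequence per word to be judged,
# consisting of its preceding words plus the word itself, in text order.
def build_word_sequences(words: list[str]) -> list[list[str]]:
    return [words[: i + 1] for i in range(len(words))]

# e.g. ["Ningbo", "Specialty", "Expo"] yields
# [["Ningbo"], ["Ningbo", "Specialty"], ["Ningbo", "Specialty", "Expo"]]
```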
- S350: input the word sequence of each word to be judged into the trained recurrent neural network model, and obtain the probability that each word to be judged is a keyword of the text to be processed.
- the recurrent neural network model in this embodiment may be an RNN (Recurrent Neural Network) model, a Long Short-Term Memory (LSTM) model, or a GRU (Gated Recurrent Unit) model.
- the recurrent neural network model includes an input layer, a hidden layer, and an output layer; the hidden units in the hidden layer do the most important work, producing, from the input word sequence, the probability that the word to be judged is a keyword of the text to be processed.
- since the word sequence fed into the trained model is determined by the word to be judged together with its preceding words, the preceding context can be fully taken into account, yielding a more accurate probability that the word to be judged is a keyword.
- S360: determine the keywords of the text to be processed according to the probability that each word to be judged is a keyword of the text to be processed and a preset threshold.
- the probability that each word to be judged is a keyword is compared with the preset threshold, and the words whose probability is greater than (or not less than) the preset threshold are determined to be keywords of the text to be processed.
- the threshold setting depends on the specific requirements.
- if the threshold is set high, precision is high but the recall rate is correspondingly lower; if the threshold is set low, precision is lower but the recall rate is higher.
- the user can set the threshold as needed; for example, the threshold can be set to 0.5, as in the sketch below.
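A one-line sketch of the threshold decision in step S360 (names illustrative):

```python
# Hedged sketch of step S360: keep the words whose predicted keyword
# probability exceeds the preset threshold (0.5 in the example above).
def select_keywords(words, probabilities, threshold=0.5):
    return [w for w, p in zip(words, probabilities) if p > threshold]
```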
- the above keyword extraction method does not require manually extracting effective features according to the characteristics of the data; instead, the word sequence is input into the trained recurrent neural network model to obtain the probability that the corresponding word to be judged is a keyword. Because the input word sequence is determined by the word to be judged together with its preceding words, the preceding context is fully taken into account, a more accurate keyword probability is obtained, and the accuracy of the extracted keywords is thereby improved.
- step S310 includes the following steps:
- Step a: perform word segmentation on the text to be processed to obtain the words in the text to be processed.
- Step b: identify the stop words in the text to be processed, and determine the words other than the stop words as the words to be judged.
- the stop words in a stop-word list can be compared with the words in the text to be processed to identify the stop words in that text.
- commonly used stop words include "of", "a", "what", etc.; such words cannot serve as keywords.
- the words other than the stop words in the text to be processed are determined to be the words to be judged. These are usually substantive words; using substantive words rather than stop words as the words to be judged avoids stop-word outputs degrading the accuracy of keyword extraction and also improves the speed of keyword extraction.
- the preceding words may include only the words, other than stop words, that appear before the word to be judged in the text to be processed; that is, the substantive words appearing before the word to be judged.
- alternatively, the preceding words may include all words appearing before the word to be judged in the text to be processed, i.e., both the stop words and the substantive words appearing before it.
- step S330 can include:
- Step a: acquire the word vectors of the preceding words corresponding to each word to be judged and the word vector of each word to be judged.
- Step b: according to the order in which the preceding words and the word to be judged appear in the text to be processed, use these word vectors to determine the word sequence of each word to be judged; the word sequence is a sequence of word vectors.
- the word vector is a vector representation of a word, which is a way to digitize words in natural language.
- Word vectors can be trained using language models.
- a commonly used language model is Word2vec, which applies ideas from deep learning to reduce, through training, the processing of text content to vector operations in a K-dimensional vector space.
- the word vector of each word can be obtained by training Word2vec on large-scale text data; the word vector of each word in the text to be processed can then be obtained by lookup, giving the vectors of the words to be judged and of their preceding words.
- representing each word by its word vector captures word-level semantic information better, thereby further improving the accuracy of the extracted keywords.
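A hedged sketch of obtaining word vectors with Word2vec; gensim is one common implementation (the patent names only the Word2vec model, not a library), and the toy corpus stands in for large-scale text data:

```python
# Train Word2vec on a (here, toy) pre-segmented corpus, then look up vectors.
from gensim.models import Word2Vec

corpus = [["Ningbo", "Specialty", "Expo"], ["Shanghai", "Expo"]]  # stand-in
model = Word2Vec(corpus, vector_size=100, window=5, min_count=1)
vector = model.wv["Ningbo"]  # 100-dimensional word vector of a word
```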
- the output of the hidden layer of the trained recurrent neural network model is also a vector; to map this vector into the range 0-1 so that it represents the probability of each word to be judged, a Softmax function or a Sigmoid function may be used.
- the Softmax function is a commonly used multi-class regression model. Judging whether a word is a keyword can be cast as a two-class problem, so the corresponding Softmax output has two dimensions: one dimension represents the probability of being a keyword and the other the probability of not being a keyword.
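In the two-class formulation, with $z = (z_1, z_2)$ the vector produced from the model's final hidden state, the standard Softmax (the patent names the function but does not write it out) gives

```latex
P(\text{keyword}) = \frac{e^{z_1}}{e^{z_1} + e^{z_2}}, \qquad
P(\text{not keyword}) = \frac{e^{z_2}}{e^{z_1} + e^{z_2}}
```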
- the word vectors of the preceding words and of the word to be judged are obtained through large-scale corpus training.
- in this way, the semantic information of the words can be fully exploited to help identify keywords at the semantic level, further improving the accuracy of the extracted keywords.
- the method further includes the following steps:
- S340: acquire training samples and train the recurrent neural network model to obtain the trained recurrent neural network model; a training sample includes element pairs, and each element pair includes a training word corresponding to the training text and the probability that the training word is a keyword of the training text.
- the probability that a training word in an element pair is a keyword of the training text takes the value 0 or 1: 0 indicates that the training word is not a keyword of the training text, and 1 indicates that it is.
- a Gaussian distribution can be used to initialize the network parameters of the recurrent neural network model.
- during training, for the i-th word to be judged in the training text, a word sequence is formed from that word and its preceding words in the order in which they appear in the text; each word vector in the sequence is input into the recurrent neural network model to obtain the loss for the i-th word to be judged, and hence the loss for every word to be judged.
- the gradient descent method can be used to update the parameters of the recurrent neural network model.
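A hedged sketch of one training update, assuming PyTorch: the label is the 0/1 value from the element pair, the loss is cross-entropy, and parameters are updated by gradient descent (the patent says only "gradient descent method"; `KeywordLSTM` is the illustrative model sketched after the FIG. 6 discussion below):

```python
# One gradient-descent update on a single (word sequence, label) pair.
import torch
import torch.nn as nn

model = KeywordLSTM(embed_dim=100, hidden_dim=128)  # sketched further below
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()

def train_step(word_vectors: torch.Tensor, label: int) -> float:
    # word_vectors: (seq_len, embed_dim); label: 1 = keyword, 0 = not
    optimizer.zero_grad()
    logits = model(word_vectors.unsqueeze(0))  # add a batch dimension
    loss = criterion(logits, torch.tensor([label]))
    loss.backward()   # back-propagate the loss of this word to be judged
    optimizer.step()  # gradient-descent parameter update
    return loss.item()
```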
- in an embodiment, the recurrent neural network model is an LSTM model.
- the LSTM model is based on the RNN model; the hidden units in the model are LSTM units, and the structure of an LSTM unit is shown in FIG. 5.
- the memory cell stores history information, and the update and use of the history information are controlled by three gates: an Input Gate, a Forget Gate, and an Output Gate. Because the LSTM model overcomes the difficulty of handling variable-length input sequences and stores history information well, it can further improve the accuracy of the extracted keywords. The standard gate equations are given below.
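For reference, the standard LSTM cell equations consistent with this description (the patent names the gates but does not write the formulas; $\sigma$ is the sigmoid, $\odot$ the elementwise product, $x_t$ the input word vector, $h_t$ the hidden state, $c_t$ the memory cell):

```latex
i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i) \quad \text{(input gate)}
f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f) \quad \text{(forget gate)}
o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o) \quad \text{(output gate)}
\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c)
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t
h_t = o_t \odot \tanh(c_t)
```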
- FIG. 5 and FIG. 6 illustrate, taking the recurrent neural network model as an example, how the LSTM model processes the word sequence of one word to be judged: the word vector (word embedding) of each preceding word in the sequence and the word vector of the word to be judged are input into the trained LSTM model in the order in which the words appear in the text to be processed. The LSTM model is thus applied once per word to be judged, obtaining each word's probability of being a keyword of the text to be processed.
- each word to be judged is the input of the last LSTM unit of the LSTM model, so the output for each word takes its preceding context into account; that is, the semantic information of each word's preceding words is carried into the output of the last hidden-layer unit (LSTM unit) of the model.
- a model corresponding to the keyword extraction method includes LSTM units and a Softmax function.
- a classifier can be constructed to determine, for each word to be judged of the text to be processed, the probability that it is a keyword. For a given word to be judged, all the words from the beginning of the sentence up to that word are extracted to form its word sequence.
- the input of the model is word vectors; each LSTM unit outputs a result for the word whose vector it received, and that result is combined with the next word vector in the sequence as the input of the next LSTM unit.
- the last LSTM unit combines the output of the previous LSTM unit with the word vector of the word to be judged as its input; its output is the result for the word to be judged, expressed as a vector.
- from this vector, the Softmax function determines the probability that the word to be judged is a keyword.
- this probability is compared with the preset threshold to determine whether the word to be judged is a keyword.
- in one example, the words to be judged include “Ningbo”, “Specialty”, “Shanghai”, and “Expo”, among others. The word vector of each word to be judged and the word vectors of its preceding words are input into the trained recurrent neural network model in the order in which the words appear in the text to be processed, yielding the probability that each word to be judged is a keyword of the text.
- the corresponding word vectors can be input into the recurrent neural network in the order “Ningbo”, “Yes”, “Specialty”, “Shanghai”, “Expo”, as shown in FIG. 6.
- the word vector of “Ningbo” is input into the first LSTM unit of the LSTM model, the word vector of “Yes” into the second LSTM unit, and so on, until the word vector of the word to be judged, “Expo”, is input into the last LSTM unit.
- each LSTM unit is affected by the output of the previous LSTM unit.
- the output of the LSTM model is the probability value obtained by mapping the output vector of the last LSTM unit through the Softmax function, giving the probability that each word to be judged is a keyword of the text to be processed. Since the input sequence of word vectors itself includes the vectors of the word to be judged and of each of its preceding words, the preceding context is taken into account; and because the LSTM model stores history information well internally, an even more accurate probability that the word to be judged is a keyword of the text to be processed is obtained.
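A hedged end-to-end sketch of the FIG. 6 pipeline, assuming PyTorch: the word vectors are fed through an LSTM in text order, and the final hidden state (the output of the last LSTM unit) is projected to two logits and mapped through Softmax. `KeywordLSTM` and all sizes are illustrative, not from the patent:

```python
import torch
import torch.nn as nn

class KeywordLSTM(nn.Module):
    def __init__(self, embed_dim: int = 100, hidden_dim: int = 128):
        super().__init__()
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, 2)  # keyword / not-keyword logits

    def forward(self, word_vectors: torch.Tensor) -> torch.Tensor:
        # word_vectors: (batch, seq_len, embed_dim), in text order
        _, (h_n, _) = self.lstm(word_vectors)
        return self.out(h_n[-1])  # logits from the last LSTM unit's state

# e.g. the 5-word sequence "Ningbo", "Yes", "Specialty", "Shanghai", "Expo"
model = KeywordLSTM()
sequence = torch.randn(1, 5, 100)  # stand-in for looked-up word vectors
p_keyword = torch.softmax(model(sequence), dim=-1)[0, 0].item()
```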
- although the steps in FIG. 3 and FIG. 4 are displayed sequentially as indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, the execution order of these steps is not strictly limited, and they may be performed in other orders. Moreover, at least some of the steps in FIG. 3 and FIG. 4 may include multiple sub-steps or stages, which need not be completed at the same moment and may be performed at different times; their execution order also need not be sequential, and they may be performed in turn or alternately with other steps or with at least a portion of the sub-steps or stages of other steps.
- a computer device is further provided, whose internal structure is shown in FIG. 2. The computer device is provided with a keyword extraction apparatus, which includes the modules below; each module may be implemented in whole or in part by software, hardware, or a combination thereof.
- a keyword extraction apparatus includes: a to-be-determined word obtaining module 710, a preceding-word determination module 720, a word sequence determination module 730, a probability determination module 750, and a keyword determination module 760.
- the to-be-determined word obtaining module 710 is configured to obtain each to-be-determined word of the text to be processed.
- the preceding-word determination module 720 is configured to determine the preceding words corresponding to each word to be judged, a preceding word being a word that appears before the word to be judged in the text to be processed.
- the word sequence determining module 730 is configured to determine the word sequence according to the order in which the preceding words corresponding to each of the to-be-determined words and the respective to-be-determined words appear in the to-be-processed text.
- the probability determining module 750 is configured to input the word sequence of each word to be judged into the trained recurrent neural network model, and obtain the probability that each word to be judged is a keyword of the text to be processed.
- the keyword determining module 760 is configured to determine a keyword of the to-be-processed text according to a probability that each to-be-determined word is a keyword of the to-be-processed text and a preset threshold.
- the above keyword extraction apparatus does not require manually extracting effective features according to the characteristics of the data; it inputs the word sequence into the trained recurrent neural network model to obtain the probability that the corresponding word to be judged is a keyword. Because the input word sequence is determined by the word to be judged and its preceding words, the preceding context is fully considered, a more accurate probability that the word to be judged is a keyword of the text to be processed is obtained, and the accuracy of the extracted keywords is thereby improved.
- the to-be-determined word obtaining module 710 includes a word segmentation processing unit 711 and an identification determination unit 713.
- the word segmentation processing unit 711 is configured to perform word segmentation processing on the text to be processed, and obtain words in the text to be processed.
- the identification determining unit 713 is configured to identify a stop word in the to-be-processed text, and determine a word other than the stop word in the to-be-processed text as the to-be-determined word.
- the preceding words include the words, other than stop words, that appear before the word to be judged in the text to be processed.
- the word sequence determining module 730 includes: a word vector obtaining unit 731 and a word sequence determining unit 733;
- the word vector obtaining unit 731 is configured to acquire the word vectors of the preceding words corresponding to each word to be judged and the word vector of each word to be judged;
- the word sequence determining unit 733 is configured to determine the word sequence of each word to be judged; the word sequence is a sequence of word vectors.
- in an embodiment, the apparatus further includes:
- a model training module 740, configured to acquire training samples and train the recurrent neural network model to obtain the trained recurrent neural network model;
- the training sample includes element pairs; each element pair includes a training word corresponding to the training text and the probability that the training word is a keyword of the training text.
- the recurrent neural network model is an LSTM model.
- a computer apparatus comprising a memory and a processor, the memory storing computer readable instructions that, when executed by the processor, cause the processor to execute The following steps:
- the step of obtaining the words to be judged of the text to be processed includes: performing word segmentation on the text to be processed to obtain the words in the text, identifying the stop words in the text, and determining the words other than the stop words as the words to be judged.
- the step of determining the word sequence of each word to be judged, according to the order in which the word and its corresponding preceding words appear in the text to be processed, includes: acquiring the word vectors of the preceding words and of each word to be judged, and using these word vectors to determine the word sequence of each word to be judged, the word sequence being a sequence of word vectors.
- before the word sequence of each word to be judged is input into the trained recurrent neural network model, the steps further include: acquiring training samples and training the recurrent neural network model to obtain the trained model; a training sample includes element pairs, each of which includes a training word corresponding to the training text and the probability that the training word is a keyword of the training text.
- the recurrent neural network model is an LSTM model.
- Non-volatile memory can include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
- Volatile memory can include random access memory (RAM) or external cache memory.
- RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
Claims (15)
- 1. A keyword extraction method, applied to a user terminal or a server, comprising: acquiring each word to be judged of a text to be processed; determining the preceding words corresponding to each word to be judged, a preceding word being a word that appears before the word to be judged in the text to be processed; determining a word sequence for each word to be judged according to the order in which the word to be judged and its corresponding preceding words appear in the text to be processed; inputting the word sequence of each word to be judged into a trained recurrent neural network model to obtain the probability that each word to be judged is a keyword of the text to be processed; and determining the keywords of the text to be processed according to the probability that each word to be judged is a keyword of the text to be processed and a preset threshold.
- 2. The keyword extraction method according to claim 1, wherein the step of acquiring the words to be judged of the text to be processed comprises: performing word segmentation on the text to be processed to obtain the words in the text to be processed; and identifying the stop words in the text to be processed, and determining the words other than the stop words in the text to be processed as the words to be judged.
- 3. The keyword extraction method according to claim 1, wherein the step of determining the word sequence of each word to be judged according to the order in which the word to be judged and its corresponding preceding words appear in the text to be processed comprises: acquiring the word vectors of the preceding words corresponding to each word to be judged and the word vector of each word to be judged; and, according to the order in which the preceding words and the word to be judged appear in the text to be processed, using the word vectors of the preceding words and of the word to be judged to determine the word sequence of each word to be judged, the word sequence being a sequence of word vectors.
- 4. The keyword extraction method according to claim 1, further comprising, before inputting the word sequence of each word to be judged into the trained recurrent neural network model: acquiring training samples and training a recurrent neural network model to obtain the trained recurrent neural network model, wherein the training samples include element pairs, and each element pair includes a training word corresponding to a training text and the probability that the training word is a keyword of the training text.
- 5. The keyword extraction method according to claim 1, wherein the recurrent neural network model is an LSTM model.
- 6. A computer device, comprising a memory and a processor, the memory storing computer readable instructions, wherein the computer readable instructions, when executed by the processor, cause the processor to perform the following steps: acquiring each word to be judged of a text to be processed; determining the preceding words corresponding to each word to be judged, a preceding word being a word that appears before the word to be judged in the text to be processed; determining a word sequence for each word to be judged according to the order in which the word to be judged and its corresponding preceding words appear in the text to be processed; inputting the word sequence of each word to be judged into a trained recurrent neural network model to obtain the probability that each word to be judged is a keyword of the text to be processed; and determining the keywords of the text to be processed according to the probability that each word to be judged is a keyword of the text to be processed and a preset threshold.
- 7. The computer device according to claim 6, wherein the step of acquiring the words to be judged of the text to be processed comprises: performing word segmentation on the text to be processed to obtain the words in the text to be processed; and identifying the stop words in the text to be processed, and determining the words other than the stop words in the text to be processed as the words to be judged.
- 8. The computer device according to claim 6, wherein the step of determining the word sequence of each word to be judged according to the order in which the word to be judged and its corresponding preceding words appear in the text to be processed comprises: acquiring the word vectors of the preceding words corresponding to each word to be judged and the word vector of each word to be judged; and, according to the order in which the preceding words and the word to be judged appear in the text to be processed, using the word vectors of the preceding words and of the word to be judged to determine the word sequence of each word to be judged, the word sequence being a sequence of word vectors.
- 9. The computer device according to claim 6, wherein before the word sequence of each word to be judged is input into the trained recurrent neural network model, the steps further comprise: acquiring training samples and training a recurrent neural network model to obtain the trained recurrent neural network model, wherein the training samples include element pairs, and each element pair includes a training word corresponding to a training text and the probability that the training word is a keyword of the training text.
- 10. The computer device according to claim 6, wherein the recurrent neural network model is an LSTM model.
- 11. One or more non-volatile storage media storing computer readable instructions, wherein the computer readable instructions, when executed by one or more processors, cause the one or more processors to perform the following steps: acquiring each word to be judged of a text to be processed; determining the preceding words corresponding to each word to be judged, a preceding word being a word that appears before the word to be judged in the text to be processed; determining a word sequence for each word to be judged according to the order in which the word to be judged and its corresponding preceding words appear in the text to be processed; inputting the word sequence of each word to be judged into a trained recurrent neural network model to obtain the probability that each word to be judged is a keyword of the text to be processed; and determining the keywords of the text to be processed according to the probability that each word to be judged is a keyword of the text to be processed and a preset threshold.
- 12. The storage medium according to claim 11, wherein the step of acquiring the words to be judged of the text to be processed comprises: performing word segmentation on the text to be processed to obtain the words in the text to be processed; and identifying the stop words in the text to be processed, and determining the words other than the stop words in the text to be processed as the words to be judged.
- 13. The storage medium according to claim 11, wherein the step of determining the word sequence of each word to be judged according to the order in which the word to be judged and its corresponding preceding words appear in the text to be processed comprises: acquiring the word vectors of the preceding words corresponding to each word to be judged and the word vector of each word to be judged; and, according to the order in which the preceding words and the word to be judged appear in the text to be processed, using the word vectors of the preceding words and of the word to be judged to determine the word sequence of each word to be judged, the word sequence being a sequence of word vectors.
- 14. The storage medium according to claim 11, wherein before the word sequence of each word to be judged is input into the trained recurrent neural network model, the steps further comprise: acquiring training samples and training a recurrent neural network model to obtain the trained recurrent neural network model, wherein the training samples include element pairs, and each element pair includes a training word corresponding to a training text and the probability that the training word is a keyword of the training text.
- 15. The storage medium according to claim 11, wherein the recurrent neural network model is an LSTM model.
Priority Applications (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2019521096A JP6956177B2 (ja) | 2017-02-23 | 2018-02-08 | キーワード抽出方法、コンピュータ装置及び記憶媒体 |
EP18758452.9A EP3518122A4 (en) | 2017-02-23 | 2018-02-08 | METHOD OF EXTRACTING KEYWORDS, COMPUTER DEVICE AND INFORMATION CARRIER |
KR1020197017920A KR102304673B1 (ko) | 2017-02-23 | 2018-02-08 | 키워드 추출 방법, 컴퓨터 장치, 및 저장 매체 |
US16/363,646 US10963637B2 (en) | 2017-02-23 | 2019-03-25 | Keyword extraction method, computer equipment and storage medium |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710101013.1 | 2017-02-23 | ||
CN201710101013.1A CN108304365A (zh) | 2017-02-23 | 2017-02-23 | 关键词提取方法及装置 |
Related Child Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US16/363,646 Continuation US10963637B2 (en) | 2017-02-23 | 2019-03-25 | Keyword extraction method, computer equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2018153265A1 true WO2018153265A1 (zh) | 2018-08-30 |
Family
ID=62872540
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2018/075711 WO2018153265A1 (zh) | 2017-02-23 | 2018-02-08 | 关键词提取方法、计算机设备和存储介质 |
Country Status (6)
Country | Link |
---|---|
US (1) | US10963637B2 (zh) |
EP (1) | EP3518122A4 (zh) |
JP (1) | JP6956177B2 (zh) |
KR (1) | KR102304673B1 (zh) |
CN (1) | CN108304365A (zh) |
WO (1) | WO2018153265A1 (zh) |
Families Citing this family (20)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20220215292A1 (en) * | 2018-08-01 | 2022-07-07 | Intuit Inc. | Method to identify incorrect account numbers |
CN111027313A (zh) * | 2018-10-08 | 2020-04-17 | 中国科学院沈阳计算技术研究所有限公司 | 基于注意力机制的BiGRU判决结果倾向性分析方法 |
US11537664B2 (en) | 2019-05-23 | 2022-12-27 | Google Llc | Learning to select vocabularies for categorical features |
US11316810B2 (en) * | 2019-06-07 | 2022-04-26 | International Business Machines Corporation | Messaging system for automatically generating semantic contextual messages |
CN110598095B (zh) * | 2019-08-27 | 2024-02-13 | 深圳市雅阅科技有限公司 | 一种识别包含指定信息文章的方法、装置及存储介质 |
CN111144127B (zh) * | 2019-12-25 | 2023-07-25 | 科大讯飞股份有限公司 | 文本语义识别方法及其模型的获取方法及相关装置 |
CN111738791B (zh) * | 2020-01-20 | 2024-05-24 | 北京沃东天骏信息技术有限公司 | 一种文本处理方法、装置、设备和存储介质 |
CN113221553A (zh) * | 2020-01-21 | 2021-08-06 | 腾讯科技(深圳)有限公司 | 一种文本处理方法、装置、设备以及可读存储介质 |
KR102216066B1 (ko) * | 2020-05-04 | 2021-02-18 | 호서대학교 산학협력단 | 문장형 쿼리에 대해 검색결과를 제공하는 방법 |
KR102418260B1 (ko) * | 2020-05-27 | 2022-07-06 | 삼성생명보험주식회사 | 고객 상담 기록 분석 방법 |
CN111737996B (zh) * | 2020-05-29 | 2024-03-26 | 北京百度网讯科技有限公司 | 基于语言模型获取词向量的方法、装置、设备及存储介质 |
CN111831821B (zh) * | 2020-06-03 | 2024-01-09 | 北京百度网讯科技有限公司 | 文本分类模型的训练样本生成方法、装置和电子设备 |
CN111967268B (zh) * | 2020-06-30 | 2024-03-19 | 北京百度网讯科技有限公司 | 文本中的事件抽取方法、装置、电子设备和存储介质 |
CN112131877B (zh) * | 2020-09-21 | 2024-04-09 | 民生科技有限责任公司 | 一种海量数据下的实时中文文本分词方法 |
CN112052375B (zh) * | 2020-09-30 | 2024-06-11 | 北京百度网讯科技有限公司 | 舆情获取和词粘度模型训练方法及设备、服务器和介质 |
CN112884440B (zh) * | 2021-03-02 | 2024-05-24 | 岭东核电有限公司 | 核电试验中的试验工序执行方法、装置和计算机设备 |
KR102620697B1 (ko) * | 2021-07-12 | 2024-01-02 | 주식회사 카카오뱅크 | 딥러닝 기반의 자연어 처리를 통한 메시지 내 이체 정보 판단 방법 및 장치 |
CN113761161A (zh) * | 2021-08-10 | 2021-12-07 | 紫金诚征信有限公司 | 文本关键词提取方法、装置、计算机设备和存储介质 |
CN113808758B (zh) * | 2021-08-31 | 2024-06-07 | 联仁健康医疗大数据科技股份有限公司 | 一种检验数据标准化的方法、装置、电子设备和存储介质 |
CN113609843B (zh) * | 2021-10-12 | 2022-02-01 | 京华信息科技股份有限公司 | 一种基于梯度提升决策树的句词概率计算方法及系统 |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101122909A (zh) * | 2006-08-10 | 2008-02-13 | 株式会社日立制作所 | 文本信息检索装置以及文本信息检索方法 |
CN101944099A (zh) * | 2010-06-24 | 2011-01-12 | 西北工业大学 | 一种使用本体进行文本文档自动分类的方法 |
US20110313865A1 (en) * | 2010-06-17 | 2011-12-22 | Microsoft Corporation | Ad copy quality detection and scoring |
CN105139237A (zh) * | 2015-09-25 | 2015-12-09 | 百度在线网络技术(北京)有限公司 | 信息推送的方法和装置 |
Family Cites Families (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP4360122B2 (ja) * | 2003-05-26 | 2009-11-11 | 富士ゼロックス株式会社 | キーワード抽出装置 |
US7493322B2 (en) * | 2003-10-15 | 2009-02-17 | Xerox Corporation | System and method for computing a measure of similarity between documents |
US7519588B2 (en) * | 2005-06-20 | 2009-04-14 | Efficient Frontier | Keyword characterization and application |
US8346534B2 (en) * | 2008-11-06 | 2013-01-01 | University of North Texas System | Method, system and apparatus for automatic keyword extraction |
US9715660B2 (en) * | 2013-11-04 | 2017-07-25 | Google Inc. | Transfer learning for deep neural network based hotword detection |
JP6230190B2 (ja) * | 2014-01-09 | 2017-11-15 | 日本放送協会 | 重要語抽出装置、及びプログラム |
KR102305584B1 (ko) * | 2015-01-19 | 2021-09-27 | 삼성전자주식회사 | 언어 모델 학습 방법 및 장치, 언어 인식 방법 및 장치 |
KR101656741B1 (ko) * | 2015-04-23 | 2016-09-12 | 고려대학교 산학협력단 | 프레임 기반 의견스팸 판단장치, 프레임 기반 의견스팸 판단방법, 프레임 기반으로 의견스팸을 판단하기 위한 컴퓨터 프로그램 및 컴퓨터 판독가능 기록매체 |
US9916376B2 (en) * | 2015-08-11 | 2018-03-13 | Fujitsu Limited | Digital document keyword generation |
CN105260359B (zh) * | 2015-10-16 | 2018-10-02 | 晶赞广告(上海)有限公司 | 语义关键词提取方法及装置 |
CN105955952A (zh) * | 2016-05-03 | 2016-09-21 | 成都数联铭品科技有限公司 | 一种基于双向递归神经网络的信息提取方法 |
CN106095749A (zh) * | 2016-06-03 | 2016-11-09 | 杭州量知数据科技有限公司 | 一种基于深度学习的文本关键词提取方法 |
US10056083B2 (en) * | 2016-10-18 | 2018-08-21 | Yen4Ken, Inc. | Method and system for processing multimedia content to dynamically generate text transcript |
US10255269B2 (en) * | 2016-12-30 | 2019-04-09 | Microsoft Technology Licensing, Llc | Graph long short term memory for syntactic relationship discovery |
CN111078838B (zh) * | 2019-12-13 | 2023-08-18 | 北京小米智能科技有限公司 | 关键词提取方法、关键词提取装置及电子设备 |
- 2017-02-23: CN CN201710101013.1A patent/CN108304365A/zh active Pending
- 2018-02-08: JP JP2019521096A patent/JP6956177B2/ja active Active
- 2018-02-08: KR KR1020197017920A patent/KR102304673B1/ko active IP Right Grant
- 2018-02-08: WO PCT/CN2018/075711 patent/WO2018153265A1/zh unknown
- 2018-02-08: EP EP18758452.9A patent/EP3518122A4/en not_active Ceased
- 2019-03-25: US US16/363,646 patent/US10963637B2/en active Active
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101122909A (zh) * | 2006-08-10 | 2008-02-13 | 株式会社日立制作所 | 文本信息检索装置以及文本信息检索方法 |
US20110313865A1 (en) * | 2010-06-17 | 2011-12-22 | Microsoft Corporation | Ad copy quality detection and scoring |
CN101944099A (zh) * | 2010-06-24 | 2011-01-12 | 西北工业大学 | 一种使用本体进行文本文档自动分类的方法 |
CN105139237A (zh) * | 2015-09-25 | 2015-12-09 | 百度在线网络技术(北京)有限公司 | 信息推送的方法和装置 |
Non-Patent Citations (1)
Title |
---|
See also references of EP3518122A4 * |
Cited By (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109635288B (zh) * | 2018-11-29 | 2023-05-23 | 东莞理工学院 | 一种基于深度神经网络的简历抽取方法 |
CN109635288A (zh) * | 2018-11-29 | 2019-04-16 | 东莞理工学院 | 一种基于深度神经网络的简历抽取方法 |
CN109615241A (zh) * | 2018-12-13 | 2019-04-12 | 大连海事大学 | 一种基于卷积和循环神经网络的软件Bug分派方法 |
CN109740152A (zh) * | 2018-12-25 | 2019-05-10 | 腾讯科技(深圳)有限公司 | 文本类目的确定方法、装置、存储介质和计算机设备 |
CN109902273A (zh) * | 2019-01-30 | 2019-06-18 | 平安科技(深圳)有限公司 | 关键词生成模型的建模方法和装置 |
CN109902273B (zh) * | 2019-01-30 | 2024-05-07 | 平安科技(深圳)有限公司 | 关键词生成模型的建模方法和装置 |
CN111859940B (zh) * | 2019-04-23 | 2024-05-14 | 北京嘀嘀无限科技发展有限公司 | 一种关键词提取方法、装置、电子设备及存储介质 |
CN111859940A (zh) * | 2019-04-23 | 2020-10-30 | 北京嘀嘀无限科技发展有限公司 | 一种关键词提取方法、装置、电子设备及存储介质 |
CN110110330B (zh) * | 2019-04-30 | 2023-08-11 | 腾讯科技(深圳)有限公司 | 基于文本的关键词提取方法和计算机设备 |
CN110110330A (zh) * | 2019-04-30 | 2019-08-09 | 腾讯科技(深圳)有限公司 | 基于文本的关键词提取方法和计算机设备 |
CN113076756A (zh) * | 2020-01-06 | 2021-07-06 | 北京沃东天骏信息技术有限公司 | 一种文本生成方法和装置 |
CN111460096B (zh) * | 2020-03-26 | 2023-12-22 | 北京金山安全软件有限公司 | 一种碎片文本的处理方法、装置及电子设备 |
CN111460096A (zh) * | 2020-03-26 | 2020-07-28 | 北京金山安全软件有限公司 | 一种碎片文本的处理方法、装置及电子设备 |
CN111709230B (zh) * | 2020-04-30 | 2023-04-07 | 昆明理工大学 | 基于词性软模板注意力机制的短文本自动摘要方法 |
CN111709230A (zh) * | 2020-04-30 | 2020-09-25 | 昆明理工大学 | 基于词性软模板注意力机制的短文本自动摘要方法 |
CN112015884A (zh) * | 2020-08-28 | 2020-12-01 | 欧冶云商股份有限公司 | 一种用户走访数据关键词提取方法、装置及存储介质 |
Also Published As
Publication number | Publication date |
---|---|
CN108304365A (zh) | 2018-07-20 |
KR20190085098A (ko) | 2019-07-17 |
US20190220514A1 (en) | 2019-07-18 |
JP6956177B2 (ja) | 2021-11-02 |
EP3518122A1 (en) | 2019-07-31 |
EP3518122A4 (en) | 2019-11-20 |
US10963637B2 (en) | 2021-03-30 |
JP2019531562A (ja) | 2019-10-31 |
KR102304673B1 (ko) | 2021-09-23 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2018153265A1 (zh) | 关键词提取方法、计算机设备和存储介质 | |
AU2018214675B2 (en) | Systems and methods for automatic semantic token tagging | |
CN109783655B (zh) | 一种跨模态检索方法、装置、计算机设备和存储介质 | |
US11941366B2 (en) | Context-based multi-turn dialogue method and storage medium | |
CN108694225B (zh) | 一种图像搜索方法、特征向量的生成方法、装置及电子设备 | |
WO2021068321A1 (zh) | 基于人机交互的信息推送方法、装置和计算机设备 | |
WO2020177230A1 (zh) | 基于机器学习的医疗数据分类方法、装置、计算机设备及存储介质 | |
WO2021042503A1 (zh) | 信息分类抽取方法、装置、计算机设备和存储介质 | |
WO2019136993A1 (zh) | 文本相似度计算方法、装置、计算机设备和存储介质 | |
CN111709243B (zh) | 一种基于深度学习的知识抽取方法与装置 | |
US11657802B2 (en) | Utilizing a dynamic memory network for state tracking | |
CN112015900B (zh) | 医学属性知识图谱构建方法、装置、设备及介质 | |
KR102194200B1 (ko) | 인공신경망 모델을 이용한 뉴스 기사 분석에 의한 주가지수 예측 방법 및 장치 | |
WO2020114100A1 (zh) | 一种信息处理方法、装置和计算机存储介质 | |
CN112766319B (zh) | 对话意图识别模型训练方法、装置、计算机设备及介质 | |
CN111191032B (zh) | 语料扩充方法、装置、计算机设备和存储介质 | |
WO2022116436A1 (zh) | 长短句文本语义匹配方法、装置、计算机设备及存储介质 | |
CN111191002A (zh) | 一种基于分层嵌入的神经代码搜索方法及装置 | |
WO2022134805A1 (zh) | 文档分类预测方法、装置、计算机设备及存储介质 | |
CN112380837B (zh) | 基于翻译模型的相似句子匹配方法、装置、设备及介质 | |
CN112580329B (zh) | 文本噪声数据识别方法、装置、计算机设备和存储介质 | |
CN111680132B (zh) | 一种用于互联网文本信息的噪声过滤和自动分类方法 | |
WO2020132933A1 (zh) | 短文本过滤方法、装置、介质及计算机设备 | |
CN112307048A (zh) | 语义匹配模型训练方法、匹配方法、装置、设备及存储介质 | |
CN115374786A (zh) | 实体和关系联合抽取方法及装置、存储介质和终端 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 18758452; Country of ref document: EP; Kind code of ref document: A1 |
| ENP | Entry into the national phase | Ref document number: 2019521096; Country of ref document: JP; Kind code of ref document: A |
| ENP | Entry into the national phase | Ref document number: 2018758452; Country of ref document: EP; Effective date: 20190425 |
| ENP | Entry into the national phase | Ref document number: 20197017920; Country of ref document: KR; Kind code of ref document: A |
| NENP | Non-entry into the national phase | Ref country code: DE |