WO2020258662A1 - Keyword determination method and apparatus, electronic device, and storage medium - Google Patents

Keyword determination method and apparatus, electronic device, and storage medium

Info

Publication number
WO2020258662A1
Authority
WO
WIPO (PCT)
Prior art keywords
words
keyword
sub
event text
topic
Prior art date
Application number
PCT/CN2019/118013
Other languages
French (fr)
Chinese (zh)
Inventor
郑子欧
汪伟
Original Assignee
平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Publication of WO2020258662A1 publication Critical patent/WO2020258662A1/en

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35 - Clustering; Classification

Definitions

  • This application relates to the field of data analysis technology, and in particular to a method, device, electronic device, and storage medium for determining keywords.
  • in the prior art, the keywords of an event text are mainly determined manually: the keywords of the event are decided by hand first, and event extraction is performed afterwards. This not only involves a large amount of manual work, but also yields few fine-grained tags for the event; in most cases a single event label is attached to the text as a whole, and that label is then used as the keyword of the text.
  • the above approach has certain drawbacks. First, because most of the classification and extraction relies on manual judgment, accuracy cannot be guaranteed and efficiency is low. Second, without quantitative analysis, the obtained keywords cannot describe the original text fully and in depth, so they do not live up to the name "keywords". Finally, traditional methods only extract text keywords and do not expand them into trigger keywords, which prevents deeper mining of the text.
  • a keyword determination method includes: obtaining event text when a keyword extraction instruction is received; introducing background words using an ET-TAG model to extract sub-topics of the event text; merging the sub-topics to obtain a target sub-topic; performing feature learning on the target sub-topic to obtain a first keyword of the target sub-topic; retrieving related words of the event text from a configuration database through relevance calculation; determining a second keyword of the event text from the related words based on contribution values; and merging the first keyword and the second keyword to obtain the target keyword of the event text.
  • An electronic device comprising: a memory that stores at least one instruction; and a processor that executes the instructions stored in the memory to implement the keyword determination method.
  • a computer-readable storage medium stores at least one instruction, and the at least one instruction is executed by a processor in an electronic device to implement the keyword determination method.
  • this application obtains the event text when a keyword extraction instruction is received and introduces background words with the ET-TAG model to extract the sub-topics of the event text, which makes keyword extraction more accurate. The sub-topics are then merged to obtain a target sub-topic, feature learning is performed on the target sub-topic to obtain its first keyword, the related words of the event text are retrieved from the configuration database through relevance calculation, and a second keyword of the event text is determined from those related words based on contribution values, thereby introducing external data as extension data. Finally, the first keyword and the second keyword are merged to obtain the target keyword of the event text. The keywords of the event text are thus determined automatically, and because data from other databases is introduced, the keyword determination is more accurate.
  • Fig. 1 is a flowchart of a preferred embodiment of a method for determining keywords of the present application.
  • Fig. 2 is a functional module diagram of a preferred embodiment of the keyword determining device of the present application.
  • Fig. 3 is a schematic structural diagram of an electronic device implementing a preferred embodiment of the keyword determining method according to the present application.
  • FIG. 1 is a flowchart of a preferred embodiment of the keyword determination method of the present application. According to different needs, the order of the steps in the flowchart can be changed, and some steps can be omitted.
  • the keyword determination method is applied to one or more electronic devices.
  • the electronic device is a device that can automatically perform numerical calculation and/or information processing according to preset or stored instructions; its hardware includes, but is not limited to, microprocessors, application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), digital signal processors (DSPs), embedded devices, etc.
  • the electronic device can be any electronic product capable of human-computer interaction with a user, such as a personal computer, a tablet computer, a smartphone, a personal digital assistant (PDA), a game console, an Internet Protocol Television (IPTV), a smart wearable device, etc.
  • the electronic device may also include a network device and/or user equipment.
  • the network device includes, but is not limited to, a single network server, a server group composed of multiple network servers, or a cloud composed of a large number of hosts or network servers based on Cloud Computing.
  • the network where the electronic device is located includes but is not limited to the Internet, wide area network, metropolitan area network, local area network, virtual private network (Virtual Private Network, VPN), etc.
  • the keyword extraction instruction may be triggered by the user, which is not limited in the present application.
  • the event text refers to articles such as news reports related to public opinion events.
  • the electronic device using the ET-TAG model to introduce background words and extract the sub-topics of the event text includes:
  • based on the ET-TAG model, the electronic device adopts the PLSA-BLM (Probabilistic Latent Semantic Analysis) algorithm to introduce background words and deletes the background words from the event text to obtain an updated text; the electronic device then extracts topics from the updated text as the sub-topics of the event text.
  • the electronic device first loads the event text from which keywords need to be extracted, and uses the ET-TAG model to discover the characteristics of the event subtopics from the perspective of word distribution.
  • a supervised approach is adopted to introduce the conceptual system of a specific category of public-opinion events from an external knowledge base and to treat it as the labels of the event's internal sub-topics, which improves the intelligibility of the labels.
  • the PLSA-BLM algorithm in the ET-TAG model models the hidden topics in the text.
  • the electronic device introduces a background topic X on the basis of the original topics A, B, C, ...; after the background topic X is introduced, the high-frequency words, stop words, and the like are absorbed into the background topic X, so these words no longer influence topic extraction.
  • the model can extract the subtopics of the event text according to the distinctive topic features.
  • the electronic device will use the obtained difference in word frequency distribution between different sub-topics to measure the difference between the sub-topics, and merge the sub-topics.
  • the electronic device merging the sub-topics to obtain the target sub-topic includes:
  • the electronic device uses KL divergence (Kullback-Leibler Divergence, KL) algorithm to calculate the divergence between the sub-topics.
  • the electronic device uses the KL divergence algorithm to calculate the divergence between the sub-topics on the obtained sub-topics.
  • the KL divergence algorithm is also called relative entropy, which can measure the difference between two distributions in the same event space.
  • the physical meaning of KL divergence is that, in the same event space, if the distribution P(x) is encoded using Q(x), it measures how many extra bits are needed on average per symbol; it is denoted D(P||Q) and computed as D(P||Q) = ∑P(x)log(P(x)/Q(x)), which is then used to calculate the distance between the word frequency distributions of different sub-topics.
  • when the divergence (distance) between two sub-topics is less than the configured threshold, the electronic device merges the two sub-topics and determines the merged sub-topic as the target sub-topic.
  • the electronic device can remove redundancy and improve the accuracy of sub-topic division, so that the obtained target sub-topic is more accurate.
  • S13 Perform feature learning on the target subtopic to obtain the first keyword of the target subtopic.
  • the electronic device performs feature learning on the target subtopic, and obtaining the first keyword of the target subtopic includes:
  • the electronic device extracts the label words in the target sub-topic and uses a Lasso classification model to perform linear regression analysis on the label words to calculate their scores; the electronic device then takes the label words whose scores are greater than or equal to a preset value as the first keyword.
  • the Lasso classification model based on the self-attention algorithm is a model for linear regression feature reduction and selection.
  • feature learning can be performed on the label words in the target subtopic.
  • the electronic device uses the Lasso classification model to calculate the score of the tag word.
  • the score of each candidate label word is equal to the weight score ⁇ i given by the Lasso classification model.
  • the greater the value of ⁇ i the stronger the correlation between the label term and the sub-topic. Therefore, the electronic device uses the label term with the largest ⁇ i term as the keyword of the sub-topic, that is, as the first keyword .
  • the electronic device extracting the tag words in the target subtopic includes:
  • the electronic device obtains all the words in the target sub-topic and uses the PLSA-BLM algorithm to calculate their word frequencies; the electronic device then sorts all the words by word frequency and selects the words at the preset positions as the label words.
  • the preset position can be customized to obtain the label words meeting the quantity requirement.
  • the words under each subtopic are arranged in descending order of word frequency.
  • the configuration database may be any form of database, and all text reports, commentary articles, etc. related to events (current events, news, etc.) are stored in the configuration database.
  • the electronic device calls the knowledge base related to the event text in the search engine.
  • the electronic device retrieves the associated words of the event text from the configuration database by calculating the degree of association, including:
  • the electronic device calculates the degree of relevance between each word in the configuration database and the event text, takes the words whose degree of relevance is greater than or equal to a configured relevance threshold as words related to the event text, and performs word-segmentation preprocessing on those words to obtain the related words.
  • the configuration association degree can be customized, which is not limited in this application.
  • the electronic device sorts and preprocesses the knowledge base data according to internal concepts to calculate the probability that each word in the knowledge base is related to the first keyword.
  • S15 Determine a second keyword of the event text from the related words based on the contribution value.
  • the electronic device determining the second keyword of the event text from the related words based on the contribution value includes:
  • the electronic device calculates the contribution value of the related word to the first keyword, and obtains the related word with the highest contribution value as the second keyword.
  • the word with the largest contribution value to the first keyword is used as the second keyword, thereby realizing the expansion of the keyword.
  • the electronic device aggregates the expanded trigger words of the sub-keywords to obtain the expanded trigger words of each sub-topic, and finally summarizes them to obtain the expanded trigger words of the entire text, that is, the expanded trigger words of the event text.
  • this application obtains the event text when a keyword extraction instruction is received and introduces background words with the ET-TAG model to extract the sub-topics of the event text, which makes keyword extraction more accurate. The sub-topics are then merged to obtain a target sub-topic, feature learning is performed on the target sub-topic to obtain its first keyword, the related words of the event text are retrieved from the configuration database through relevance calculation, and a second keyword of the event text is determined from those related words based on contribution values, thereby introducing external data as extension data. Finally, the first keyword and the second keyword are merged to obtain the target keyword of the event text. The keywords of the event text are thus determined automatically, and because data from other databases is introduced, the keyword determination is more accurate.
  • the keyword determination device 11 includes an acquisition unit 110, an extraction unit 111, a merging unit 112, a learning unit 113, a retrieval unit 114, and a determination unit 115.
  • the module/unit referred to in this application refers to a series of computer program segments that can be executed by the processor 13 and can complete fixed functions, and are stored in the memory 12. In this embodiment, the functions of each module/unit will be described in detail in subsequent embodiments.
  • the obtaining unit 110 obtains the event text.
  • the keyword extraction instruction may be triggered by the user, which is not limited in the present application.
  • the event text refers to articles such as news reports related to public opinion events.
  • the extraction unit 111 adopts the ET-TAG model to introduce background words, and extracts subtopics of the event text.
  • the extraction unit 111 adopts an ET-TAG model to introduce background words, and extracting subtopics of the event text includes:
  • the extraction unit 111, based on the ET-TAG model, adopts the PLSA-BLM (Probabilistic Latent Semantic Analysis) algorithm to introduce background words and deletes the background words from the event text to obtain an updated text; the extraction unit 111 then extracts topics from the updated text as the sub-topics of the event text.
  • the extraction unit 111 first loads the event text for which keywords need to be extracted, and uses the ET-TAG model to discover the characteristics of the event subtopic from the perspective of word distribution.
  • a supervised approach is adopted to introduce the conceptual system of a specific category of public-opinion events from an external knowledge base and to treat it as the labels of the event's internal sub-topics, which improves the intelligibility of the labels.
  • the PLSA-BLM algorithm in the ET-TAG model models the hidden topics in the text.
  • the extraction unit 111 introduces a background topic X on the basis of the original topics A, B, C, ...; after the background topic X is introduced, the high-frequency words, stop words, and the like that appear frequently in the original topics are absorbed into the background topic X, so these words no longer affect topic extraction.
  • the model can extract the subtopics of the event text according to the distinctive topic features.
  • the merging unit 112 merges the sub-topics to obtain the target sub-topic.
  • the merging unit 112 will use the obtained difference in word frequency distribution between the different sub-topics to measure the difference between the sub-topics, and merge the sub-topics.
  • the merging unit 112 merging the sub-topics to obtain the target sub-topic includes:
  • the merging unit 112 uses a KL divergence (Kullback-Leibler divergence, KL) algorithm to calculate the divergence between the sub-topics. When the divergence between two of the sub-topics is less than a configured threshold, the merging unit 112 merges the two sub-topics and determines the merged sub-topic as the target sub-topic.
  • the merging unit 112 uses the KL divergence algorithm to calculate the divergence between the sub-topics on the obtained subtopics.
  • the KL divergence algorithm is also called relative entropy; it can measure the difference between two distributions in the same event space.
  • the physical meaning of KL divergence is that, in the same event space, if the distribution P(x) is encoded using Q(x), it measures how many extra bits are needed on average per symbol; it is denoted D(P||Q) and computed as D(P||Q) = ∑P(x)log(P(x)/Q(x)), which is then used to calculate the distance between the word frequency distributions of different sub-topics.
  • when the divergence (distance) between two sub-topics is less than the configured threshold, the merging unit 112 merges the two sub-topics and determines the merged sub-topic as the target sub-topic.
  • the learning unit 113 performs feature learning on the target sub-topic to obtain the first keyword of the target sub-topic.
  • the learning unit 113 performs feature learning on the target subtopic, and obtaining the first keyword of the target subtopic includes:
  • the learning unit 113 extracts the label words in the target sub-topic and uses a Lasso classification model to perform linear regression analysis on the label words to calculate their scores; the learning unit 113 then takes the label words whose scores are greater than or equal to a preset value as the first keyword.
  • the Lasso classification model based on the self-attention algorithm is a model for linear regression feature reduction and selection.
  • feature learning can be performed on the label words in the target subtopic.
  • the learning unit 113 uses the Lasso classification model to calculate the score of the label word.
  • the score of each candidate label word is equal to the weight score ⁇ i given by the Lasso classification model.
  • the greater the value of ⁇ i the stronger the correlation between the label term and the sub-topic. Therefore, the learning unit 113 uses the label term with the largest ⁇ i term as the keyword of the sub-topic, which is the first key. word.
  • the extraction of label words in the target subtopic by the learning unit 113 includes:
  • the learning unit 113 obtains all the words in the target sub-topic, and uses the PLSA-BLM algorithm to calculate the word frequency of all the words.
  • the learning unit 113 sorts all the words according to their word frequencies and filters out the words at the preset positions as the label words.
  • the preset position can be customized to obtain the label words meeting the quantity requirement.
  • the words under each sub-topic are arranged in descending order of word frequency.
  • the higher the word frequency the more representative the word is in the current sub-topic.
  • the retrieval unit 114 retrieves the related words of the event text from the configuration database by calculating the degree of relevance.
  • the configuration database may be any form of database, and all text reports, commentary articles, etc. related to events (current events, news, etc.) are stored in the configuration database.
  • the retrieval unit 114 calls the knowledge base related to the event text in the search engine.
  • the retrieval unit 114 retrieves the related words of the event text from the configuration database by calculating the degree of relevance, including:
  • the retrieval unit 114 calculates the degree of relevance between each word in the configuration database and the event text, takes the words whose degree of relevance is greater than or equal to the configured relevance threshold as words related to the event text, and performs word-segmentation preprocessing on those words to obtain the related words.
  • the configuration association degree can be customized, which is not limited in this application.
  • the retrieval unit 114 sorts the knowledge base data according to internal concepts and preprocesses word segmentation to calculate the probability that each word in the knowledge base is related to the first keyword.
  • the determining unit 115 determines the second keyword of the event text from the related words based on the contribution value.
  • the determining unit 115 determining the second keyword of the event text from the related words based on the contribution value includes:
  • the determining unit 115 calculates the contribution value of the related word to the first keyword, and obtains the related word with the highest contribution value as the second keyword.
  • the word with the largest contribution to the first keyword is used as the second keyword, thereby realizing the expansion of the keyword.
  • the merging unit 112 merges the first keyword and the second keyword to obtain the target keyword of the event text.
  • the merging unit 112 aggregates the expanded trigger words of the sub-keywords to obtain the expanded trigger words of each sub-topic, and finally summarizes them to obtain the expanded trigger words of the entire text, that is, the expanded trigger words of the event text.
  • this application obtains the event text when a keyword extraction instruction is received and introduces background words with the ET-TAG model to extract the sub-topics of the event text, which makes keyword extraction more accurate. The sub-topics are then merged to obtain a target sub-topic, feature learning is performed on the target sub-topic to obtain its first keyword, the related words of the event text are retrieved from the configuration database through relevance calculation, and a second keyword of the event text is determined from those related words based on contribution values, thereby introducing external data as extension data. Finally, the first keyword and the second keyword are merged to obtain the target keyword of the event text. The keywords of the event text are thus determined automatically, and because data from other databases is introduced, the keyword determination is more accurate.
  • FIG. 3 is a schematic structural diagram of an electronic device implementing a preferred embodiment of the keyword determination method according to the present application.
  • the electronic device 1 is a device that can automatically perform numerical calculation and/or information processing according to preset or stored instructions; its hardware includes, but is not limited to, a microprocessor, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a digital signal processor (DSP), embedded equipment, etc.
  • the electronic device 1 can also be, but is not limited to, any electronic product that can interact with the user through a keyboard, a mouse, a remote control, a touch panel, or a voice control device, for example, a personal computer, a tablet computer, a smartphone, a personal digital assistant (PDA), a game console, an Internet Protocol Television (IPTV), a smart wearable device, etc.
  • the electronic device 1 may also be a computing device such as a desktop computer, a notebook, a palmtop computer, and a cloud server.
  • the network where the electronic device 1 is located includes, but is not limited to, the Internet, a wide area network, a metropolitan area network, a local area network, a virtual private network (Virtual Private Network, VPN), etc.
  • the electronic device 1 includes, but is not limited to, a memory 12, a processor 13, and a computer program stored in the memory 12 and running on the processor 13, such as a keyword determination program.
  • the schematic diagram is only an example of the electronic device 1 and does not constitute a limitation on the electronic device 1; it may include more or fewer components than those shown in the figure, a combination of certain components, or different components. For example, the electronic device 1 may also include input and output devices, network access devices, buses, and the like.
  • the processor 13 may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
  • the general-purpose processor can be a microprocessor or the processor can also be any conventional processor, etc.
  • the processor 13 is the computing core and control center of the electronic device 1; it connects all parts of the entire electronic device 1 with various interfaces and lines, and executes the operating system of the electronic device 1 as well as various installed applications, program codes, etc.
  • the processor 13 executes the operating system of the electronic device 1 and various installed applications.
  • the processor 13 executes the application program to implement the steps in the above keyword determination method embodiments, such as steps S10, S11, S12, S13, S14, S15, and S16 shown in FIG. 1.
  • the functions of the modules/units in the above device embodiments are implemented, for example: when a keyword extraction instruction is received, the event text is obtained; the ET-TAG model is used to introduce background words and extract the sub-topics of the event text; the sub-topics are merged to obtain the target sub-topic; feature learning is performed on the target sub-topic to obtain the first keyword of the target sub-topic; the related words of the event text are retrieved from the configuration database through relevance calculation; the second keyword of the event text is determined from the related words based on the contribution value; and the first keyword and the second keyword are merged to obtain the target keyword of the event text.
  • the computer program may be divided into one or more modules/units, and the one or more modules/units are stored in the memory 12 and executed by the processor 13 to complete this Application.
  • the one or more modules/units may be a series of computer program instruction segments capable of completing specific functions, and the instruction segments are used to describe the execution process of the computer program in the electronic device 1.
  • the computer program may be divided into an acquiring unit 110, an extracting unit 111, a merging unit 112, a learning unit 113, a retrieval unit 114, and a determining unit 115.
  • the memory 12 may be used to store the computer programs and/or modules, and the processor 13 realizes various functions of the electronic device 1 by running or executing the computer programs and/or modules stored in the memory 12 and calling the data stored in the memory 12.
  • the memory 12 may mainly include a storage program area and a storage data area.
  • the storage program area may store an operating system, an application program required by at least one function (such as a sound playback function, an image playback function, etc.), and the like; the storage data area may store data (such as audio data, a phone book, etc.) created according to the use of the mobile phone.
  • the memory 12 may include a high-speed random access memory, and may also include a non-volatile memory, such as a hard disk, an internal memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash card, at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device.
  • the memory 12 may be an external memory and/or an internal memory of the electronic device 1. Further, the memory 12 may be a circuit with a storage function without a physical form in an integrated circuit, such as RAM (Random-Access Memory, random access memory), FIFO (First In First Out), etc. Alternatively, the memory 12 may also be a memory in physical form, such as a memory stick, a TF card (Trans-flash Card), and so on.
  • if the integrated modules/units of the electronic device 1 are implemented in the form of software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium. Based on this understanding, all or part of the processes in the methods of the above embodiments of this application can also be completed by instructing the relevant hardware through a computer program.
  • the computer program can be stored in a computer-readable storage medium. When the program is executed by the processor, the steps of the foregoing method embodiments can be implemented.
  • the computer program includes computer program code
  • the computer program code may be in the form of source code, object code, executable file, or some intermediate forms.
  • the computer-readable medium may include: any entity or device capable of carrying the computer program code, recording medium, U disk, mobile hard disk, magnetic disk, optical disk, computer memory, read-only memory (ROM, Read-Only Memory) , Random Access Memory (RAM, Random Access Memory), electrical carrier signal, telecommunications signal, and software distribution media, etc.
  • the content contained in the computer-readable medium can be appropriately added or deleted according to the requirements of legislation and patent practice in the jurisdiction; for example, in some jurisdictions, according to legislation and patent practice, the computer-readable medium does not include electrical carrier signals and telecommunication signals.
  • the memory 12 in the electronic device 1 stores multiple instructions to implement a keyword determination method, and the processor 13 can execute the multiple instructions to implement: obtaining the event text when a keyword extraction instruction is received; introducing background words using the ET-TAG model to extract the sub-topics of the event text; merging the sub-topics to obtain a target sub-topic; performing feature learning on the target sub-topic to obtain the first keyword of the target sub-topic; retrieving the related words of the event text from the configuration database through relevance calculation; determining the second keyword of the event text from the related words based on contribution values; and merging the first keyword and the second keyword to obtain the target keyword of the event text.
  • the execution of multiple instructions by the processor 13 includes:
  • the processor 13 further executing multiple instructions includes:
  • the combined sub-topic is determined as the target sub-topic.
  • the processor 13 further executing multiple instructions includes:
  • the processor 13 further executing multiple instructions includes:
  • a word in a preset position among all the words is selected as the label word.
  • the processor 13 further executing multiple instructions includes:
  • the processor 13 further executing multiple instructions includes:
  • the related word with the highest contribution value is obtained as the second keyword.
  • modules described as separate components may or may not be physically separated, and the components displayed as modules may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the modules may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • the functional modules in the various embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated unit can be implemented in the form of hardware or in the form of hardware plus software functional modules.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

A keyword determination method and apparatus, an electronic device, and a storage medium. The keyword determination method is capable of: obtaining an event text when receiving a keyword extraction instruction; extracting sub-topics of the event text using an ET-TAG model to make keyword extraction more accurate; further combining the sub-topics to obtain a target sub-topic; performing feature learning on the target sub-topic to obtain a first keyword of the target sub-topic; then calling keywords of the event text from a configuration database by means of degree of correlation calculation; determining a second keyword of the event text from the keywords on the basis of a contribution value to introduce external data as extension data; and combining the first keyword and the second keyword to obtain a target keyword of the event text. Therefore, the keyword of the event text is automatically determined, and because data in other databases is introduced, the determination of the keyword is more accurate.

Description

Keyword determination method and apparatus, electronic device, and storage medium
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on June 25, 2019 with application number 201910554221.6 and invention title "Keyword Determination Method, Device, Electronic Equipment and Storage Medium", the entire content of which is incorporated into this application by reference.
Technical field
This application relates to the field of data analysis technology, and in particular to a keyword determination method and apparatus, an electronic device, and a storage medium.
Background art
In the prior art, the keywords of an event text are mainly determined manually: the keywords of the event are decided by hand first, and event extraction is performed afterwards. This not only involves a large amount of manual work, but also yields few fine-grained tags for the event; in most cases a single event label is attached to the text as a whole, and that label is then used as the keyword of the text.
The above approach has certain drawbacks. First, because most of the classification and extraction relies on manual judgment, accuracy cannot be guaranteed and efficiency is low. Second, without quantitative analysis, the obtained keywords cannot describe the original text fully and in depth, so they do not live up to the name "keywords". Finally, traditional methods only extract text keywords and do not expand them into trigger keywords, which prevents deeper mining of the text.
Summary of the invention
In view of the above, it is necessary to provide a keyword determination method and apparatus, an electronic device, and a storage medium that can automatically determine the keywords of an event text and, because data from other databases is introduced, make the keyword determination more accurate.
A keyword determination method includes: obtaining event text when a keyword extraction instruction is received; introducing background words using an ET-TAG model to extract sub-topics of the event text; merging the sub-topics to obtain a target sub-topic; performing feature learning on the target sub-topic to obtain a first keyword of the target sub-topic; retrieving related words of the event text from a configuration database through relevance calculation; determining a second keyword of the event text from the related words based on contribution values; and merging the first keyword and the second keyword to obtain the target keyword of the event text.
An electronic device includes: a memory storing at least one instruction; and a processor executing the instructions stored in the memory to implement the keyword determination method.
A computer-readable storage medium stores at least one instruction, and the at least one instruction is executed by a processor in an electronic device to implement the keyword determination method.
It can be seen from the above technical solutions that this application obtains the event text when a keyword extraction instruction is received and introduces background words with the ET-TAG model to extract the sub-topics of the event text, making keyword extraction more accurate. The sub-topics are then merged to obtain a target sub-topic, feature learning is performed on the target sub-topic to obtain its first keyword, the related words of the event text are retrieved from the configuration database through relevance calculation, and a second keyword is determined from those related words based on contribution values, thereby introducing external data as extension data. Finally, the first keyword and the second keyword are merged to obtain the target keyword of the event text. The keywords of the event text are thus determined automatically, and because data from other databases is introduced, the keyword determination is more accurate.
Description of the drawings
Fig. 1 is a flowchart of a preferred embodiment of the keyword determination method of the present application.
Fig. 2 is a functional module diagram of a preferred embodiment of the keyword determination device of the present application.
Fig. 3 is a schematic structural diagram of an electronic device implementing a preferred embodiment of the keyword determination method of the present application.
Description of main component symbols
Electronic device 1
Memory 12
Processor 13
Keyword determination device 11
Acquisition unit 110
Extraction unit 111
Merging unit 112
Learning unit 113
Retrieval unit 114
Determination unit 115
Detailed description of the embodiments
In order to make the objectives, technical solutions, and advantages of the present application clearer, the present application is described in detail below with reference to the accompanying drawings and specific embodiments.
Fig. 1 is a flowchart of a preferred embodiment of the keyword determination method of the present application. According to different needs, the order of the steps in the flowchart can be changed, and some steps can be omitted.
The keyword determination method is applied to one or more electronic devices. An electronic device is a device that can automatically perform numerical calculation and/or information processing according to preset or stored instructions; its hardware includes, but is not limited to, a microprocessor, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a digital signal processor (DSP), an embedded device, and the like.
The electronic device can be any electronic product capable of human-computer interaction with a user, such as a personal computer, a tablet computer, a smartphone, a personal digital assistant (PDA), a game console, an Internet Protocol Television (IPTV), or a smart wearable device.
The electronic device may also include a network device and/or user equipment. The network device includes, but is not limited to, a single network server, a server group composed of multiple network servers, or a cloud composed of a large number of hosts or network servers based on cloud computing.
The network where the electronic device is located includes, but is not limited to, the Internet, a wide area network, a metropolitan area network, a local area network, a virtual private network (VPN), and the like.
S10: Obtain the event text when a keyword extraction instruction is received.
In at least one embodiment of the present application, the keyword extraction instruction may be triggered by the user, which is not limited in the present application.
In at least one embodiment of the present application, the event text refers to articles such as news reports related to public-opinion events.
S11: Introduce background words using the ET-TAG model and extract the sub-topics of the event text.
In at least one embodiment of the present application, the electronic device introducing background words with the ET-TAG model and extracting the sub-topics of the event text includes: based on the ET-TAG model, the electronic device adopts the PLSA-BLM (Probabilistic Latent Semantic Analysis) algorithm to introduce background words and deletes the background words from the event text to obtain an updated text; the electronic device then extracts topics from the updated text as the sub-topics of the event text.
Specifically, the electronic device first loads the event text from which keywords need to be extracted, and uses the ET-TAG model to discover the characteristics of the event sub-topics from the perspective of word distribution.
Further, in the ET-TAG model a supervised approach is adopted: the conceptual system of a specific category of public-opinion events is introduced from an external knowledge base and treated as the labels of the event's internal sub-topics, which improves the intelligibility of the labels; the PLSA-BLM algorithm in the ET-TAG model then models the hidden topics in the text.
Furthermore, the electronic device introduces a background topic X on the basis of the original topics A, B, C, ...; after the background topic X is introduced, the high-frequency words, stop words, and the like that appear frequently across the original topics are absorbed into the background topic X, so that these words no longer influence topic extraction.
Through the above implementation, after this redundant vocabulary is removed, the model can extract the sub-topics of the event text according to distinctive topic features.
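As a concrete illustration of the background-word step above, the following is a minimal sketch and not the patent's implementation: it assumes the event texts are already tokenized and approximates the background topic X as the globally most frequent words plus a stop-word list; the names `extract_background_words` and `remove_background_words` are illustrative.

```python
from collections import Counter

def extract_background_words(tokenized_docs, top_ratio=0.05, stop_words=frozenset()):
    """Approximate the background topic X as the globally most frequent words plus stop words."""
    counts = Counter(word for doc in tokenized_docs for word in doc)
    n_top = max(1, int(len(counts) * top_ratio))
    high_frequency = {word for word, _ in counts.most_common(n_top)}
    return high_frequency | set(stop_words)

def remove_background_words(doc_tokens, background_words):
    """Delete background words from a tokenized event text to obtain the updated text."""
    return [word for word in doc_tokens if word not in background_words]

# usage with hypothetical tokenized event texts
docs = [["the", "flood", "hit", "the", "city"], ["the", "city", "opened", "shelters"]]
background = extract_background_words(docs, top_ratio=0.2, stop_words={"the"})
updated_docs = [remove_background_words(doc, background) for doc in docs]
# topic extraction (the PLSA-BLM step) would then run on updated_docs rather than the raw texts
```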
S12: Merge the sub-topics to obtain the target sub-topic.
It is understandable that the PLSA-BLM algorithm yields a different word frequency distribution for each extracted sub-topic, but these distributions may be very similar to one another. Therefore, the electronic device uses the differences between the word frequency distributions of the different sub-topics to measure the differences between the sub-topics, and merges the sub-topics accordingly.
In at least one embodiment of the present application, the electronic device merging the sub-topics to obtain the target sub-topic includes: the electronic device uses a KL divergence (Kullback-Leibler divergence, KL) algorithm to calculate the divergence between the sub-topics, and when the divergence between two sub-topics is less than a configured threshold, the electronic device merges the two sub-topics and determines the merged sub-topic as the target sub-topic.
Specifically, the electronic device applies the KL divergence algorithm to the obtained sub-topics to calculate the divergence between them. KL divergence, also called relative entropy, measures the difference between two distributions in the same event space. According to its definition in information theory, the physical meaning of KL divergence is, in the same event space, how many extra bits per symbol are needed on average when the distribution P(x) is encoded using Q(x); it is denoted D(P||Q) and computed as D(P||Q) = ∑P(x)log(P(x)/Q(x)), which is then used to calculate the distance between the word frequency distributions of different sub-topics.
Further, when the divergence (distance) between two sub-topics is less than the configured threshold, the electronic device merges the two sub-topics and determines the merged sub-topic as the target sub-topic.
Through the above implementation, the electronic device removes redundancy and improves the accuracy of sub-topic division, so that the obtained target sub-topic is more accurate.
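A small sketch of the merge criterion described above, assuming each sub-topic is represented as a dictionary mapping words to probabilities; the symmetric check and the smoothing constant `eps` are additions for numerical robustness and are not specified in the patent.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """D(P||Q) = sum over x of P(x) * log(P(x) / Q(x)), with smoothing to avoid log(0)."""
    return sum(p_x * math.log((p_x + eps) / (q.get(x, 0.0) + eps))
               for x, p_x in p.items() if p_x > 0)

def merge_subtopics(subtopic_distributions, threshold):
    """Greedily merge sub-topics whose word-frequency distributions are closer than the threshold."""
    merged = []
    for dist in subtopic_distributions:
        for target in merged:
            if kl_divergence(dist, target) < threshold and kl_divergence(target, dist) < threshold:
                for word, prob in dist.items():          # merge: sum the distributions
                    target[word] = target.get(word, 0.0) + prob
                total = sum(target.values())             # then renormalize
                for word in target:
                    target[word] /= total
                break
        else:
            merged.append(dict(dist))
    return merged
```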
S13: Perform feature learning on the target sub-topic to obtain the first keyword of the target sub-topic.
In at least one embodiment of the present application, the electronic device performing feature learning on the target sub-topic to obtain its first keyword includes: the electronic device extracts the label words of the target sub-topic and uses a Lasso classification model to perform linear regression analysis on the label words to calculate their scores; the electronic device then takes the label words whose scores are greater than or equal to a preset value as the first keyword.
The Lasso classification model based on the self-attention algorithm is a model for feature reduction and selection in linear regression; through the Lasso classification model, feature learning can be performed on the label words of the target sub-topic.
Specifically, the electronic device constructs a standard training set D = {(x1, y1), (x2, y2), ..., (xi, yi)}, where xi denotes the word vector of each label word and yi denotes the correlation coefficient between the word vector and the sub-topic to which it belongs.
Further, the electronic device uses the Lasso classification model to calculate the score of each label word. The score of each candidate label word equals the weight βi given by the Lasso classification model, and the greater the value of βi, the stronger the correlation between the label word and the sub-topic; therefore, the electronic device takes the label word with the largest βi as the keyword of that sub-topic, that is, as the first keyword.
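The patent does not fully specify how the design matrix for the Lasso regression is built. One possible sketch, using scikit-learn's Lasso, treats each label word as a feature whose column records its frequency in the documents of the target sub-topic, so that each label word receives a coefficient βi; the function names, matrix layout, and threshold are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import Lasso

def score_label_words(doc_word_matrix, doc_relevance, alpha=0.01):
    """Fit a Lasso regression so that each label word (one column) receives a weight beta_i.

    doc_word_matrix: shape (n_documents, n_label_words), frequency of each label word
                     in each document of the target sub-topic.
    doc_relevance:   shape (n_documents,), relevance of each document to the sub-topic.
    """
    model = Lasso(alpha=alpha)
    model.fit(np.asarray(doc_word_matrix), np.asarray(doc_relevance))
    return model.coef_  # beta_i per label word; a larger value means a stronger association

def pick_first_keywords(label_words, betas, preset_value):
    """Keep the label words whose score is greater than or equal to the preset value."""
    return [word for word, beta in zip(label_words, betas) if beta >= preset_value]
```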
In at least one embodiment of the present application, the electronic device extracting the label words of the target sub-topic includes: the electronic device obtains all the words in the target sub-topic and uses the PLSA-BLM algorithm to calculate their word frequencies; the electronic device then sorts all the words by word frequency and selects the words at the preset positions as the label words.
The preset positions can be configured as needed, so that enough label words are obtained to meet the quantity requirement.
For example, after the sub-topics are merged, the words under each sub-topic are arranged in descending order of word frequency; the higher the word frequency, the more representative the word is of the current sub-topic. The electronic device takes the k words with the highest frequency (for example, k = 100) and uses them as the label words.
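A short sketch of this label-word selection, with k standing in for the configurable preset positions (k = 100 in the example above):

```python
from collections import Counter

def select_label_words(subtopic_tokens, k=100):
    """Rank the words of a merged sub-topic by frequency and keep the top k as label words."""
    frequencies = Counter(subtopic_tokens)
    return [word for word, _ in frequencies.most_common(k)]
```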
Through the above implementation, sufficient and accurate label words can be obtained for subsequent feature learning.
S14: Retrieve the related words of the event text from the configuration database through relevance calculation.
In at least one embodiment of the present application, the configuration database may be any form of database; it stores all the text reports, commentary articles, and the like related to events (current affairs, news, etc.).
For example, the electronic device calls the knowledge base related to the event text in a search engine.
It is understandable that expanding keywords requires knowledge bases that contain public-opinion events related to the event text and cover all aspects of the field. Based on such a knowledge base, information about the event text can be summarized from different angles; the process of expanding keywords is essentially to establish a mapping from the keywords to the event concept system of the public-opinion knowledge base, and then to use the related vocabulary for expansion.
Specifically, the electronic device retrieving the related words of the event text from the configuration database through relevance calculation includes: the electronic device calculates the degree of relevance between each word in the configuration database and the event text, takes the words whose degree of relevance is greater than or equal to a configured relevance threshold as words related to the event text, and performs word-segmentation preprocessing on those words to obtain the related words.
The configured relevance threshold can be set as needed, which is not limited in this application.
For example, the electronic device organizes the knowledge base data according to its internal concepts and performs word-segmentation preprocessing to calculate the probability that each word in the knowledge base is related to the first keyword.
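The patent leaves the exact relevance measure open. The sketch below is one possible stand-in that scores each knowledge-base word by the vocabulary overlap of the documents it appears in with the event text; the threshold corresponds to the configured relevance threshold, and all names are illustrative.

```python
from collections import Counter

def related_words(knowledge_docs, event_tokens, relevance_threshold=0.1):
    """Return knowledge-base words whose estimated relevance to the event text
    meets the configured threshold (a simple co-occurrence based stand-in)."""
    event_vocab = set(event_tokens)
    overlap_sum = Counter()
    doc_count = Counter()
    for doc in knowledge_docs:                       # doc: list of tokens
        overlap = len(event_vocab & set(doc)) / max(len(event_vocab), 1)
        for word in set(doc):
            overlap_sum[word] += overlap
            doc_count[word] += 1
    return {word: overlap_sum[word] / doc_count[word]
            for word in overlap_sum
            if word not in event_vocab
            and overlap_sum[word] / doc_count[word] >= relevance_threshold}
```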
Through the above implementation, data from an external database is introduced as a basis, which makes the keyword determination more accurate.
S15: Determine the second keyword of the event text from the related words based on contribution values.
In at least one embodiment of the present application, determining, by the electronic device, the second keyword of the event text from the related words based on contribution values includes:
the electronic device calculates the contribution value of each related word to the first keyword and takes the related word with the highest contribution value as the second keyword.
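Because the text does not define how the contribution value is computed, the sketch below only shows the selection rule itself; `contribution` is an assumed scoring callable (for example a co-occurrence or embedding-similarity score), not an API defined by the application.

```python
def second_keyword(related_words, first_keywords, contribution):
    """Keep the related word whose contribution to the first keywords is highest.

    `contribution(word, first_keywords)` must return a numeric score; how that
    score is defined is left open here, mirroring the text.
    """
    return max(related_words, key=lambda w: contribution(w, first_keywords))
```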
In this way, the word that contributes most to the first keyword is used as the second keyword, thereby expanding the keywords.
S16: Merge the first keyword and the second keyword to obtain the target keyword of the event text.
In at least one embodiment of the present application, the electronic device aggregates the expanded trigger words of the sub-keywords to obtain the expanded trigger words of each sub-topic, and then aggregates these to obtain the expanded trigger words of the entire text, i.e. the expanded trigger words of the event text.
As can be seen from the above technical solution, when a keyword extraction instruction is received, the present application obtains the event text and uses the ET-TAG model to introduce background words and extract the sub-topics of the event text, which makes the keyword extraction more accurate; it further merges the sub-topics to obtain the target sub-topics and performs feature learning on them to obtain the first keyword of each target sub-topic; it then retrieves the related words of the event text from the configuration database by means of a relevance calculation and determines the second keyword of the event text from those related words based on contribution values, thereby introducing external data as expansion data; finally, it merges the first keyword and the second keyword to obtain the target keyword of the event text. The keywords of the event text are thus determined automatically, and because data from other databases is introduced, the keyword determination is more accurate.
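Read end to end, steps S10 to S16 compose as in the following sketch; every parameter name is a placeholder for the corresponding step described above and is passed in as a callable, since this application describes the steps rather than a concrete API.

```python
def determine_keywords(event_text, kb_documents, *, extract_subtopics, merge_subtopics,
                       learn_first_keywords, retrieve_related, pick_second, merge):
    """Illustrative composition of the keyword determination flow (assumed names)."""
    subtopics = extract_subtopics(event_text)                  # S11: ET-TAG + background words
    target_subtopics = merge_subtopics(subtopics)              # S12: KL-divergence based merging
    first_keywords = learn_first_keywords(target_subtopics)    # S13: Lasso feature learning
    related = retrieve_related(kb_documents, first_keywords)   # S14: relevance retrieval
    second = pick_second(related, first_keywords)              # S15: contribution value
    return merge(first_keywords, second)                       # S16: merge into target keywords
```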
As shown in FIG. 2, which is a functional block diagram of a preferred embodiment of the keyword determination apparatus of the present application, the keyword determination apparatus 11 includes an acquisition unit 110, an extraction unit 111, a merging unit 112, a learning unit 113, a retrieval unit 114, and a determination unit 115. A module/unit referred to in this application is a series of computer program segments that can be executed by the processor 13 to perform a fixed function and that are stored in the memory 12. In this embodiment, the function of each module/unit is described in detail in the following embodiments.
When a keyword extraction instruction is received, the acquisition unit 110 obtains the event text.
In at least one embodiment of the present application, the keyword extraction instruction may be triggered by a user; this application does not limit it.
In at least one embodiment of the present application, the event text refers to articles, such as news reports, that relate to public-opinion events.
The extraction unit 111 introduces background words using the ET-TAG model to extract the sub-topics of the event text.
In at least one embodiment of the present application, introducing, by the extraction unit 111, background words using the ET-TAG model to extract the sub-topics of the event text includes:
based on the ET-TAG model, the extraction unit 111 uses the PLSA-BLM (Probabilistic Latent Semantic Analysis) algorithm to introduce background words and deletes the background words from the event text to obtain an updated text; the extraction unit 111 then extracts topics from the updated text as the sub-topics of the event text.
Specifically, the extraction unit 111 first loads the event text whose keywords are to be extracted and uses the ET-TAG model to discover the characteristics of the event's sub-topics from the perspective of word distribution.
Further, the ET-TAG model adopts a supervised idea: the concept system of a specific category of public-opinion events is introduced from an external knowledge base and used as labels for the sub-topics inside the event, which improves the intelligibility of the labels; the hidden topics in the text are then modeled by the PLSA-BLM algorithm within the ET-TAG model.
Still further, the extraction unit 111 introduces a background topic X on top of the original topics A, B, C, ...; after the background topic X is introduced, the high-frequency words, stop words, and the like that appear in large numbers in the original topics are absorbed into the background topic X, so that these words no longer affect topic extraction.
Through the above implementation, once this redundant vocabulary has been removed, the model can extract the sub-topics of the event text according to their distinctive topic features.
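As an illustration of the effect of the background topic X, the sketch below drops high-document-frequency words and stop words before topic extraction. The real PLSA-BLM model assigns such words to X probabilistically, so this simple frequency rule, the `max_doc_ratio` cut-off, and the list-of-words document representation are all assumptions made for the example.

```python
from collections import Counter

def remove_background_words(documents, stopwords, max_doc_ratio=0.6):
    """Filter out 'background' words before topic extraction (simplified stand-in).

    `documents` is assumed to be a list of documents, each a list of segmented
    words. Words appearing in more than `max_doc_ratio` of the documents, plus
    explicit stop words, are treated as background and removed.
    """
    doc_sets = [set(doc) for doc in documents]
    doc_freq = Counter(w for words in doc_sets for w in words)   # document frequency
    background = {w for w, c in doc_freq.items() if c / len(documents) > max_doc_ratio}
    background |= set(stopwords)
    return [[w for w in doc if w not in background] for doc in documents]
```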
The merging unit 112 merges the sub-topics to obtain the target sub-topic.
It will be understood that the PLSA-BLM algorithm yields a word-frequency distribution for each sub-topic as it extracts it, and the distributions obtained for different sub-topics may be strongly similar. The merging unit 112 therefore uses the differences between the word-frequency distributions of the different sub-topics to measure how different the sub-topics are, and merges the sub-topics accordingly.
In at least one embodiment of the present application, merging, by the merging unit 112, the sub-topics to obtain the target sub-topic includes:
the merging unit 112 uses the KL divergence (Kullback-Leibler divergence) algorithm to compute the divergence between the sub-topics; when the divergence between two of the sub-topics is smaller than a configured threshold, the merging unit 112 merges the two sub-topics and determines the merged sub-topic as the target sub-topic.
Specifically, the merging unit 112 applies the KL divergence algorithm to the obtained sub-topics to compute the divergence between them. KL divergence, also called relative entropy, measures the difference between two distributions over the same event space. By its information-theoretic definition, the KL divergence expresses how many extra bits are needed per symbol, on average, when the distribution P(x) is encoded with a code designed for Q(x) in the same event space; it is denoted D(P||Q) and computed as D(P||Q) = Σ P(x) log(P(x)/Q(x)). This is used to compute the distance between the word-frequency distributions of different sub-topics.
Further, when the divergence (distance) between two of the sub-topics is smaller than the configured threshold, the merging unit 112 merges the two sub-topics and determines the merged sub-topic as the target sub-topic.
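The divergence computation and the merge rule can be sketched as follows; the merge order and the re-normalisation of the pooled distribution are assumptions, since the text only states the merge criterion.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """D(P||Q) = sum over x of P(x) * log(P(x) / Q(x)), computed on the union
    vocabulary; a small epsilon keeps the logarithm finite when a word is
    missing from Q. p and q map word -> probability."""
    vocab = set(p) | set(q)
    return sum(p[w] * math.log((p[w] + eps) / (q.get(w, 0.0) + eps))
               for w in vocab if p.get(w, 0.0) > 0.0)

def _normalize(dist):
    total = sum(dist.values())
    return {w: v / total for w, v in dist.items()}

def merge_close_subtopics(subtopics, threshold):
    """Greedy sketch: pool two sub-topics' word distributions whenever their
    KL divergence is below `threshold`, then re-normalise the result."""
    topics, merged = [dict(t) for t in subtopics], []
    while topics:
        base = topics.pop(0)
        remaining = []
        for other in topics:
            if kl_divergence(base, other) < threshold:
                pooled = {w: base.get(w, 0.0) + other.get(w, 0.0)
                          for w in set(base) | set(other)}
                base = _normalize(pooled)
            else:
                remaining.append(other)
        merged.append(base)
        topics = remaining
    return merged
```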
Through the above implementation, redundancy can be removed and the accuracy of the sub-topic division improved, so that the resulting target sub-topics are more accurate.
The learning unit 113 performs feature learning on the target sub-topic to obtain the first keyword of the target sub-topic.
In at least one embodiment of the present application, performing, by the learning unit 113, feature learning on the target sub-topic to obtain the first keyword of the target sub-topic includes:
the learning unit 113 extracts the label words of the target sub-topic and performs a linear regression analysis on the label words with a Lasso classification model to compute a score for each label word; the learning unit 113 then takes the label words whose score is greater than or equal to a preset value as the first keyword.
The Lasso classification model based on the self-attention algorithm is a model for feature shrinkage and selection in linear regression; with it, feature learning can be performed on the label words of the target sub-topic.
Specifically, the learning unit 113 constructs a standard training set D = {(x1, y1), (x2, y2), ..., (xi, yi)}, where xi denotes the word vector of a label word and yi denotes the correlation coefficient between that word vector and the sub-topic it belongs to.
Further, the learning unit 113 uses the Lasso classification model to compute the score of each label word: the score of a candidate label word equals the weight βi given to it by the Lasso model. The larger βi is, the more strongly the label word is associated with the sub-topic; the learning unit 113 therefore takes the label word with the largest βi as the keyword of that sub-topic, i.e. as the first keyword.
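A hedged sketch of this scoring step using scikit-learn's Lasso estimator is shown below. The document-term design matrix, the per-document relevance target, and the `positive=True` constraint are illustrative modelling choices, since the text only states that word vectors and correlation coefficients form the training set and that the coefficient βi serves as the score.

```python
import numpy as np
from sklearn.linear_model import Lasso

def score_label_words(doc_term_matrix, relevance, label_words, alpha=0.01):
    """Score candidate label words by their Lasso regression coefficients.

    `doc_term_matrix` is an (n_docs x n_label_words) matrix of counts or
    tf-idf values for the candidate label words, and `relevance` holds each
    document's correlation with the target sub-topic; both are assumed
    representations. The fitted coefficient of column i is used as the score
    of label word i; the highest-scoring words become the first keywords.
    """
    model = Lasso(alpha=alpha, positive=True)   # positive=True keeps scores comparable
    model.fit(np.asarray(doc_term_matrix), np.asarray(relevance))
    scores = dict(zip(label_words, model.coef_))
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```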
In at least one embodiment of the present application, extracting, by the learning unit 113, the label words of the target sub-topic includes:
the learning unit 113 obtains all the words of the target sub-topic and computes their word frequencies with the PLSA-BLM algorithm; the learning unit 113 then sorts all the words by their word frequencies and selects the words at preset positions among them as the label words.
The preset positions can be configured as desired, so that enough label words are obtained to meet the required quantity.
For example, after the sub-topics are merged, the words under each sub-topic are sorted in descending order of word frequency; the higher a word's frequency, the more representative it is of the current sub-topic. The learning unit 113 then takes the k most frequent words (for example, k = 100) as the label words.
In this way, a sufficient number of accurate label words can be obtained for the subsequent feature learning.
The retrieval unit 114 retrieves the related words of the event text from the configuration database by means of a relevance calculation.
In at least one embodiment of the present application, the configuration database may be a database of any form; it stores all text reports, commentary articles, and the like that relate to events (current affairs, news, etc.).
For example, the retrieval unit 114 invokes a knowledge base in a search engine that is related to the event text.
It will be understood that expanding the keywords requires knowledge bases that contain public-opinion events related to the event text and that cover every aspect of the field concerned. With such a knowledge base, information about the event text can be summarized from different angles; expanding the keywords essentially means building a mapping from the keywords to the event concept system of the public-opinion knowledge base, so that related vocabulary can be used for the expansion.
Specifically, retrieving, by the retrieval unit 114, the related words of the event text from the configuration database by means of the relevance calculation includes:
the retrieval unit 114 calculates the relevance between every word in the configuration database and the event text, takes the words whose relevance is greater than or equal to a configured relevance threshold as the words related to the event text, and performs word-segmentation preprocessing on those words to obtain the related words.
The configured relevance threshold can be set as desired; this application does not limit it.
For example, the retrieval unit 114 organizes the knowledge-base data by its internal concepts and performs word-segmentation preprocessing, in order to compute, for every word in the knowledge base, the probability that it is related to the first keyword.
In this way, data from an external database is introduced as a basis, which makes the keyword determination more accurate.
The determination unit 115 determines the second keyword of the event text from the related words based on contribution values.
In at least one embodiment of the present application, determining, by the determination unit 115, the second keyword of the event text from the related words based on contribution values includes:
the determination unit 115 calculates the contribution value of each related word to the first keyword and takes the related word with the highest contribution value as the second keyword.
In this way, the word that contributes most to the first keyword is used as the second keyword, thereby expanding the keywords.
The merging unit 112 merges the first keyword and the second keyword to obtain the target keyword of the event text.
In at least one embodiment of the present application, the merging unit 112 aggregates the expanded trigger words of the sub-keywords to obtain the expanded trigger words of each sub-topic, and then aggregates these to obtain the expanded trigger words of the entire text, i.e. the expanded trigger words of the event text.
As can be seen from the above technical solution, when a keyword extraction instruction is received, the present application obtains the event text and uses the ET-TAG model to introduce background words and extract the sub-topics of the event text, which makes the keyword extraction more accurate; it further merges the sub-topics to obtain the target sub-topics and performs feature learning on them to obtain the first keyword of each target sub-topic; it then retrieves the related words of the event text from the configuration database by means of a relevance calculation and determines the second keyword of the event text from those related words based on contribution values, thereby introducing external data as expansion data; finally, it merges the first keyword and the second keyword to obtain the target keyword of the event text. The keywords of the event text are thus determined automatically, and because data from other databases is introduced, the keyword determination is more accurate.
FIG. 3 is a schematic structural diagram of an electronic device implementing a preferred embodiment of the keyword determination method of the present application.
The electronic device 1 is a device capable of automatically performing numerical computation and/or information processing according to preset or stored instructions; its hardware includes, but is not limited to, a microprocessor, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a digital signal processor (DSP), an embedded device, and the like.
The electronic device 1 may also be, but is not limited to, any electronic product that can interact with a user via a keyboard, mouse, remote control, touch pad, voice-control device, or the like, for example a personal computer, a tablet computer, a smartphone, a personal digital assistant (PDA), a game console, an Internet Protocol television (IPTV), or a smart wearable device.
The electronic device 1 may also be a computing device such as a desktop computer, a notebook, a palmtop computer, or a cloud server.
The network in which the electronic device 1 is located includes, but is not limited to, the Internet, a wide area network, a metropolitan area network, a local area network, a virtual private network (VPN), and the like.
In one embodiment of the present application, the electronic device 1 includes, but is not limited to, a memory 12, a processor 13, and a computer program, such as a keyword determination program, that is stored in the memory 12 and can run on the processor 13.
Those skilled in the art will understand that the schematic diagram is merely an example of the electronic device 1 and does not constitute a limitation on it; the electronic device 1 may include more or fewer components than shown, combine certain components, or use different components; for example, it may also include input/output devices, network access devices, buses, and so on.
The processor 13 may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), an off-the-shelf field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor. The processor 13 is the computing core and control center of the electronic device 1; it connects all parts of the electronic device 1 through various interfaces and lines, and runs the operating system of the electronic device 1 as well as the installed applications, program code, and so on.
The processor 13 executes the operating system of the electronic device 1 and the installed applications. The processor 13 executes the applications to implement the steps of the keyword determination method embodiments described above, for example steps S10, S11, S12, S13, S14, S15, and S16 shown in FIG. 1.
Alternatively, when the processor 13 executes the computer program, it implements the functions of the modules/units in the apparatus embodiments described above, for example: when a keyword extraction instruction is received, obtaining the event text; introducing background words using the ET-TAG model to extract the sub-topics of the event text; merging the sub-topics to obtain the target sub-topic; performing feature learning on the target sub-topic to obtain the first keyword of the target sub-topic; retrieving the related words of the event text from the configuration database by means of a relevance calculation; determining the second keyword of the event text from the related words based on contribution values; and merging the first keyword and the second keyword to obtain the target keyword of the event text.
Illustratively, the computer program may be divided into one or more modules/units, which are stored in the memory 12 and executed by the processor 13 to complete this application. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, the instruction segments describing the execution of the computer program in the electronic device 1. For example, the computer program may be divided into the acquisition unit 110, the extraction unit 111, the merging unit 112, the learning unit 113, the retrieval unit 114, and the determination unit 115.
The memory 12 may be used to store the computer program and/or modules; the processor 13 implements the various functions of the electronic device 1 by running or executing the computer programs and/or modules stored in the memory 12 and by invoking the data stored in the memory 12. The memory 12 may mainly include a program storage area and a data storage area, where the program storage area may store the operating system and the applications required by at least one function (such as a sound playback function or an image playback function), and the data storage area may store data created according to the use of the device (such as audio data or a phone book). In addition, the memory 12 may include high-speed random access memory, and may also include non-volatile memory such as a hard disk, memory, plug-in hard disk, smart media card (SMC), secure digital (SD) card, flash card, at least one magnetic disk storage device, flash memory device, or other volatile solid-state storage device.
The memory 12 may be an external memory and/or an internal memory of the electronic device 1. Further, the memory 12 may be a circuit with a storage function that has no physical form within an integrated circuit, such as a RAM (random-access memory) or a FIFO (first in, first out) memory. Alternatively, the memory 12 may be a memory with a physical form, such as a memory stick or a TF card (trans-flash card).
If the integrated modules/units of the electronic device 1 are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on this understanding, this application implements all or part of the processes in the method embodiments described above, which may also be accomplished by instructing the relevant hardware through a computer program; the computer program may be stored in a computer-readable storage medium, and when executed by a processor, the computer program can implement the steps of the method embodiments described above.
The computer program includes computer program code, which may be in source-code form, object-code form, an executable file, some intermediate form, or the like. The computer-readable medium may include: any entity or apparatus capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunication signal, a software distribution medium, and the like. It should be noted that the content contained in the computer-readable medium may be appropriately added to or removed from in accordance with the requirements of legislation and patent practice in a jurisdiction; for example, in some jurisdictions, according to legislation and patent practice, the computer-readable medium does not include electrical carrier signals and telecommunication signals.
With reference to FIG. 1, the memory 12 in the electronic device 1 stores a plurality of instructions to implement a keyword determination method, and the processor 13 can execute the plurality of instructions to implement: when a keyword extraction instruction is received, obtaining the event text; introducing background words using the ET-TAG model to extract the sub-topics of the event text; merging the sub-topics to obtain the target sub-topic; performing feature learning on the target sub-topic to obtain the first keyword of the target sub-topic; retrieving the related words of the event text from the configuration database by means of a relevance calculation; determining the second keyword of the event text from the related words based on contribution values; and merging the first keyword and the second keyword to obtain the target keyword of the event text.
According to a preferred embodiment of the present application, the processor 13 executes instructions including:
based on the ET-TAG model, using the PLSA-BLM algorithm to compute the words whose probability of appearing in the event text is greater than a preset probability as the background words;
deleting the background words from the event text to obtain an updated text;
extracting topics from the updated text as the sub-topics of the event text.
According to a preferred embodiment of the present application, the processor 13 further executes instructions including:
using the KL divergence algorithm to compute the divergence between the sub-topics;
when the divergence between two of the sub-topics is smaller than a configured threshold, merging the two sub-topics;
determining the merged sub-topic as the target sub-topic.
According to a preferred embodiment of the present application, the processor 13 further executes instructions including:
extracting the label words of the target sub-topic;
performing a linear regression analysis on the label words with a Lasso classification model to compute the scores of the label words;
taking the label words whose score is greater than or equal to a preset value as the first keyword.
According to a preferred embodiment of the present application, the processor 13 further executes instructions including:
obtaining all the words of the target sub-topic;
computing the word frequencies of all the words with the PLSA-BLM algorithm;
sorting all the words according to their word frequencies;
selecting the words at preset positions among all the words as the label words.
According to a preferred embodiment of the present application, the processor 13 further executes instructions including:
calculating the relevance between all the words in the configuration database and the event text;
taking the words whose relevance is greater than or equal to a configured relevance threshold as the words related to the event text;
performing word-segmentation preprocessing on the words related to the event text to obtain the related words.
According to a preferred embodiment of the present application, the processor 13 further executes instructions including:
calculating the contribution value of each related word to the first keyword;
taking the related word with the highest contribution value as the second keyword.
Specifically, for the way in which the processor 13 implements the above instructions, reference may be made to the description of the relevant steps in the embodiment corresponding to FIG. 1, which is not repeated here.
In the several embodiments provided in this application, it should be understood that the disclosed system, apparatus, and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; the division into modules is only a division by logical function, and there may be other ways of dividing them in an actual implementation.
The modules described as separate components may or may not be physically separate, and the components shown as modules may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purposes of the solution of this embodiment.
In addition, the functional modules in the embodiments of this application may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware, or in the form of hardware plus software functional modules.
It is obvious to those skilled in the art that this application is not limited to the details of the exemplary embodiments described above, and that this application can be implemented in other specific forms without departing from the spirit or essential characteristics of this application.
Therefore, from whichever point of view, the embodiments should be regarded as exemplary and non-limiting. The scope of this application is defined by the appended claims rather than by the above description, and it is therefore intended that all changes falling within the meaning and scope of equivalents of the claims be embraced by this application. No reference sign in the claims should be construed as limiting the claim concerned.
Furthermore, it is clear that the word "comprising" does not exclude other units or steps, and the singular does not exclude the plural. A plurality of units or apparatuses recited in the system claims may also be implemented by one unit or apparatus through software or hardware. Words such as "second" are used to denote names and do not denote any particular order.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solution of this application and not to limit it. Although this application has been described in detail with reference to the preferred embodiments, those of ordinary skill in the art should understand that the technical solution of this application may be modified or equivalently replaced without departing from the spirit and scope of the technical solution of this application.

Claims (20)

  1. A keyword determination method, wherein the method comprises:
    when a keyword extraction instruction is received, obtaining an event text;
    introducing background words using an ET-TAG model to extract sub-topics of the event text;
    merging the sub-topics to obtain a target sub-topic;
    performing feature learning on the target sub-topic to obtain a first keyword of the target sub-topic;
    retrieving related words of the event text from a configuration database by means of a relevance calculation;
    determining a second keyword of the event text from the related words based on contribution values;
    merging the first keyword and the second keyword to obtain a target keyword of the event text.
  2. The keyword determination method according to claim 1, wherein introducing background words using the ET-TAG model to extract the sub-topics of the event text comprises:
    based on the ET-TAG model, using a PLSA-BLM algorithm to compute words whose probability of appearing in the event text is greater than a preset probability as the background words;
    deleting the background words from the event text to obtain an updated text;
    extracting topics from the updated text as the sub-topics of the event text.
  3. The keyword determination method according to claim 1, wherein merging the sub-topics to obtain the target sub-topic comprises:
    using a KL divergence algorithm to compute the divergence between the sub-topics;
    when the divergence between two of the sub-topics is smaller than a configured threshold, merging the two sub-topics;
    determining the merged sub-topic as the target sub-topic.
  4. The keyword determination method according to claim 1, wherein performing feature learning on the target sub-topic to obtain the first keyword of the target sub-topic comprises:
    extracting label words of the target sub-topic;
    performing a linear regression analysis on the label words with a Lasso classification model to compute scores of the label words;
    taking label words whose score is greater than or equal to a preset value as the first keyword.
  5. The keyword determination method according to claim 4, wherein extracting the label words of the target sub-topic comprises:
    obtaining all words of the target sub-topic;
    computing the word frequencies of all the words with the PLSA-BLM algorithm;
    sorting all the words according to their word frequencies;
    selecting the words at preset positions among all the words as the label words.
  6. The keyword determination method according to claim 1, wherein retrieving the related words of the event text from the configuration database by means of the relevance calculation comprises:
    calculating the relevance between all words in the configuration database and the event text;
    taking words whose relevance is greater than or equal to a configured relevance threshold as words related to the event text;
    performing word-segmentation preprocessing on the words related to the event text to obtain the related words.
  7. The keyword determination method according to claim 1, wherein determining the second keyword of the event text from the related words based on contribution values comprises:
    calculating the contribution value of each related word to the first keyword;
    taking the related word with the highest contribution value as the second keyword.
  8. An electronic device, wherein the electronic device comprises: a memory storing at least one instruction; and a processor executing the instructions stored in the memory to implement the following steps:
    when a keyword extraction instruction is received, obtaining an event text;
    introducing background words using an ET-TAG model to extract sub-topics of the event text;
    merging the sub-topics to obtain a target sub-topic;
    performing feature learning on the target sub-topic to obtain a first keyword of the target sub-topic;
    retrieving related words of the event text from a configuration database by means of a relevance calculation;
    determining a second keyword of the event text from the related words based on contribution values;
    merging the first keyword and the second keyword to obtain a target keyword of the event text.
  9. The electronic device according to claim 8, wherein introducing background words using the ET-TAG model to extract the sub-topics of the event text comprises:
    based on the ET-TAG model, using a PLSA-BLM algorithm to compute words whose probability of appearing in the event text is greater than a preset probability as the background words;
    deleting the background words from the event text to obtain an updated text;
    extracting topics from the updated text as the sub-topics of the event text.
  10. The electronic device according to claim 8, wherein merging the sub-topics to obtain the target sub-topic comprises:
    using a KL divergence algorithm to compute the divergence between the sub-topics;
    when the divergence between two of the sub-topics is smaller than a configured threshold, merging the two sub-topics;
    determining the merged sub-topic as the target sub-topic.
  11. The electronic device according to claim 8, wherein performing feature learning on the target sub-topic to obtain the first keyword of the target sub-topic comprises:
    extracting label words of the target sub-topic;
    performing a linear regression analysis on the label words with a Lasso classification model to compute scores of the label words;
    taking label words whose score is greater than or equal to a preset value as the first keyword.
  12. The electronic device according to claim 11, wherein extracting the label words of the target sub-topic comprises:
    obtaining all words of the target sub-topic;
    computing the word frequencies of all the words with the PLSA-BLM algorithm;
    sorting all the words according to their word frequencies;
    selecting the words at preset positions among all the words as the label words.
  13. The electronic device according to claim 8, wherein retrieving the related words of the event text from the configuration database by means of the relevance calculation comprises:
    calculating the relevance between all words in the configuration database and the event text;
    taking words whose relevance is greater than or equal to a configured relevance threshold as words related to the event text;
    performing word-segmentation preprocessing on the words related to the event text to obtain the related words.
  14. The electronic device according to claim 8, wherein determining the second keyword of the event text from the related words based on contribution values comprises:
    calculating the contribution value of each related word to the first keyword;
    taking the related word with the highest contribution value as the second keyword.
  15. A computer-readable storage medium, wherein the computer-readable storage medium stores at least one instruction, and the at least one instruction is executed by a processor in an electronic device to implement the following steps:
    when a keyword extraction instruction is received, obtaining an event text;
    introducing background words using an ET-TAG model to extract sub-topics of the event text;
    merging the sub-topics to obtain a target sub-topic;
    performing feature learning on the target sub-topic to obtain a first keyword of the target sub-topic;
    retrieving related words of the event text from a configuration database by means of a relevance calculation;
    determining a second keyword of the event text from the related words based on contribution values;
    merging the first keyword and the second keyword to obtain a target keyword of the event text.
  16. The computer-readable storage medium according to claim 15, wherein introducing background words using the ET-TAG model to extract the sub-topics of the event text comprises:
    based on the ET-TAG model, using a PLSA-BLM algorithm to compute words whose probability of appearing in the event text is greater than a preset probability as the background words;
    deleting the background words from the event text to obtain an updated text;
    extracting topics from the updated text as the sub-topics of the event text.
  17. The computer-readable storage medium according to claim 15, wherein merging the sub-topics to obtain the target sub-topic comprises:
    using a KL divergence algorithm to compute the divergence between the sub-topics;
    when the divergence between two of the sub-topics is smaller than a configured threshold, merging the two sub-topics;
    determining the merged sub-topic as the target sub-topic.
  18. The computer-readable storage medium according to claim 15, wherein performing feature learning on the target sub-topic to obtain the first keyword of the target sub-topic comprises:
    extracting label words of the target sub-topic;
    performing a linear regression analysis on the label words with a Lasso classification model to compute scores of the label words;
    taking label words whose score is greater than or equal to a preset value as the first keyword.
  19. The computer-readable storage medium according to claim 18, wherein extracting the label words of the target sub-topic comprises:
    obtaining all words of the target sub-topic;
    computing the word frequencies of all the words with the PLSA-BLM algorithm;
    sorting all the words according to their word frequencies;
    selecting the words at preset positions among all the words as the label words.
  20. The computer-readable storage medium according to claim 15, wherein retrieving the related words of the event text from the configuration database by means of the relevance calculation comprises:
    calculating the relevance between all words in the configuration database and the event text;
    taking words whose relevance is greater than or equal to a configured relevance threshold as words related to the event text;
    performing word-segmentation preprocessing on the words related to the event text to obtain the related words.
PCT/CN2019/118013 2019-06-25 2019-11-13 Keyword determination method and apparatus, electronic device, and storage medium WO2020258662A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910554221.6 2019-06-25
CN201910554221.6A CN110457672B (en) 2019-06-25 2019-06-25 Keyword determination method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
WO2020258662A1 true WO2020258662A1 (en) 2020-12-30

Family

ID=68480833

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/118013 WO2020258662A1 (en) 2019-06-25 2019-11-13 Keyword determination method and apparatus, electronic device, and storage medium

Country Status (2)

Country Link
CN (1) CN110457672B (en)
WO (1) WO2020258662A1 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111538891B (en) * 2020-04-21 2023-04-07 招商局金融科技有限公司 Hot event monitoring method and device, computer device and readable storage medium
CN112732893B (en) * 2021-01-13 2024-01-19 上海明略人工智能(集团)有限公司 Text information extraction method and device, storage medium and electronic equipment
CN113536077B (en) * 2021-05-31 2022-06-17 烟台中科网络技术研究所 Mobile APP specific event content detection method and device
CN114662474A (en) * 2022-04-13 2022-06-24 马上消费金融股份有限公司 Keyword determination method and device, electronic equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106294396A (en) * 2015-05-20 2017-01-04 北京大学 Keyword expansion method and keyword expansion system
CN106570120A (en) * 2016-11-02 2017-04-19 四川用联信息技术有限公司 Process for realizing searching engine optimization through improved keyword optimization
CN107220232A (en) * 2017-04-06 2017-09-29 北京百度网讯科技有限公司 Keyword extracting method and device, equipment and computer-readable recording medium based on artificial intelligence
CN108268668A (en) * 2018-02-28 2018-07-10 福州大学 One kind is based on the multifarious text data viewpoint abstract method for digging of topic

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060276996A1 (en) * 2005-06-01 2006-12-07 Keerthi Sathiya S Fast tracking system and method for generalized LARS/LASSO
CN106708880B (en) * 2015-11-16 2020-05-22 北京国双科技有限公司 Topic associated word acquisition method and device
CN107168943B (en) * 2017-04-07 2018-07-03 平安科技(深圳)有限公司 The method and apparatus of topic early warning
CN107590505B (en) * 2017-08-01 2021-08-27 天津大学 Learning method combining low-rank representation and sparse regression
CN109885675B (en) * 2019-02-25 2020-11-27 合肥工业大学 Text subtopic discovery method based on improved LDA

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106294396A (en) * 2015-05-20 2017-01-04 北京大学 Keyword expansion method and keyword expansion system
CN106570120A (en) * 2016-11-02 2017-04-19 四川用联信息技术有限公司 Process for realizing searching engine optimization through improved keyword optimization
CN107220232A (en) * 2017-04-06 2017-09-29 北京百度网讯科技有限公司 Keyword extracting method and device, equipment and computer-readable recording medium based on artificial intelligence
CN108268668A (en) * 2018-02-28 2018-07-10 福州大学 One kind is based on the multifarious text data viewpoint abstract method for digging of topic

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZHOU NAN, DU PAN; JIN XIAO-LONG ;LIU YUE; CHENG XUE-QI: "ET-TAG:A Tag Generation Model for the Sub-Topics of Public Opinion Events", CHINESE JOURNAL OF COMPUTERS., vol. 41, no. 7, 1 July 2018 (2018-07-01), pages 1490 - 1503, XP055774603 *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112926297A (en) * 2021-02-26 2021-06-08 北京百度网讯科技有限公司 Method, apparatus, device and storage medium for processing information
CN112926297B (en) * 2021-02-26 2023-06-30 北京百度网讯科技有限公司 Method, apparatus, device and storage medium for processing information
CN113326350A (en) * 2021-05-31 2021-08-31 江汉大学 Keyword extraction method, system, device and storage medium based on remote learning
CN113326350B (en) * 2021-05-31 2023-05-26 江汉大学 Keyword extraction method, system, equipment and storage medium based on remote learning
CN113704587A (en) * 2021-08-31 2021-11-26 中国平安财产保险股份有限公司 User adhesion analysis method, device, equipment and medium based on stage division
CN113704587B (en) * 2021-08-31 2023-06-06 中国平安财产保险股份有限公司 User adhesion analysis method, device, equipment and medium based on stage division
CN114880496A (en) * 2022-04-28 2022-08-09 国家计算机网络与信息安全管理中心 Multimedia information topic analysis method, device, equipment and storage medium
CN114938477A (en) * 2022-06-23 2022-08-23 阿里巴巴(中国)有限公司 Video topic determination method, device and equipment
CN114938477B (en) * 2022-06-23 2024-05-03 阿里巴巴(中国)有限公司 Video topic determination method, device and equipment
CN115277616A (en) * 2022-07-25 2022-11-01 深圳美克拉网络技术有限公司 Mobile terminal display method and terminal

Also Published As

Publication number Publication date
CN110457672A (en) 2019-11-15
CN110457672B (en) 2023-01-17

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19935261

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19935261

Country of ref document: EP

Kind code of ref document: A1