WO2020258662A1 - Keyword determination method and apparatus, electronic device, and storage medium - Google Patents

Keyword determination method and apparatus, electronic device, and storage medium

Info

Publication number
WO2020258662A1
Authority
WO
WIPO (PCT)
Prior art keywords
words
keyword
sub
event text
topic
Prior art date
Application number
PCT/CN2019/118013
Other languages
French (fr)
Chinese (zh)
Inventor
郑子欧
汪伟
Original Assignee
平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Publication of WO2020258662A1 publication Critical patent/WO2020258662A1/en

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35 - Clustering; Classification

Definitions

  • This application relates to the field of data analysis technology, and in particular to a method, device, electronic device, and storage medium for determining keywords.
  • in the prior art, the keywords of an event text are mainly determined manually: the keywords of the event are decided by hand first, and event extraction is performed afterwards. This not only involves a large amount of manual work, but also yields few fine-grained tags for the event; in most cases a single event label is attached to the text as a whole, and that label is then used as the keyword of the text.
  • the above approach has certain drawbacks. First, because most of the classification and extraction relies on manual judgment, accuracy cannot be guaranteed and efficiency is low. Second, without quantitative analysis, the obtained keywords cannot describe the original text fully and in depth, so they do not live up to the name "keywords". Finally, traditional methods only extract text keywords and do not expand them into trigger keywords, which prevents deeper mining of the text.
  • a keyword determination method includes: obtaining event text when a keyword extraction instruction is received; introducing background words using an ET-TAG model to extract sub-topics of the event text; merging the sub-topics to obtain a target sub-topic; performing feature learning on the target sub-topic to obtain a first keyword of the target sub-topic; retrieving related words of the event text from a configuration database through relevance calculation; determining a second keyword of the event text from the related words based on contribution values; and merging the first keyword and the second keyword to obtain the target keyword of the event text.
  • An electronic device comprising: a memory that stores at least one instruction; and a processor that executes the instructions stored in the memory to implement the keyword determination method.
  • a computer-readable storage medium stores at least one instruction, and the at least one instruction is executed by a processor in an electronic device to implement the keyword determination method.
  • this application obtains the event text when a keyword extraction instruction is received and introduces background words with the ET-TAG model to extract the sub-topics of the event text, which makes keyword extraction more accurate. The sub-topics are then merged to obtain a target sub-topic, feature learning is performed on the target sub-topic to obtain its first keyword, the related words of the event text are retrieved from the configuration database through relevance calculation, and a second keyword of the event text is determined from those related words based on contribution values, thereby introducing external data as extension data. Finally, the first keyword and the second keyword are merged to obtain the target keyword of the event text. The keywords of the event text are thus determined automatically, and because data from other databases is introduced, the keyword determination is more accurate.
  • Fig. 1 is a flowchart of a preferred embodiment of a method for determining keywords of the present application.
  • Fig. 2 is a functional module diagram of a preferred embodiment of the keyword determining device of the present application.
  • Fig. 3 is a schematic structural diagram of an electronic device implementing a preferred embodiment of the keyword determining method according to the present application.
  • FIG. 1 is a flowchart of a preferred embodiment of the keyword determination method of the present application. According to different needs, the order of the steps in the flowchart can be changed, and some steps can be omitted.
  • the keyword determination method is applied to one or more electronic devices.
  • the electronic device is a device that can automatically perform numerical calculation and/or information processing according to preset or stored instructions; its hardware includes, but is not limited to, microprocessors, application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), digital signal processors (DSPs), embedded devices, etc.
  • the electronic device can be any electronic product capable of human-computer interaction with a user, such as a personal computer, a tablet computer, a smartphone, a personal digital assistant (PDA), a game console, an Internet Protocol Television (IPTV), a smart wearable device, etc.
  • the electronic device may also include a network device and/or user equipment.
  • the network device includes, but is not limited to, a single network server, a server group composed of multiple network servers, or a cloud composed of a large number of hosts or network servers based on Cloud Computing.
  • the network where the electronic device is located includes but is not limited to the Internet, wide area network, metropolitan area network, local area network, virtual private network (Virtual Private Network, VPN), etc.
  • the keyword extraction instruction may be triggered by the user, which is not limited in the present application.
  • the event text refers to articles such as news reports related to public opinion events.
  • the electronic device using the ET-TAG model to introduce background words and extract the sub-topics of the event text includes:
  • based on the ET-TAG model, the electronic device adopts the PLSA-BLM (Probabilistic Latent Semantic Analysis) algorithm to introduce background words and deletes the background words from the event text to obtain an updated text; the electronic device then extracts topics from the updated text as the sub-topics of the event text.
  • the electronic device first loads the event text from which keywords need to be extracted, and uses the ET-TAG model to discover the characteristics of the event subtopics from the perspective of word distribution.
  • a supervised approach is adopted to introduce the conceptual system of a specific category of public-opinion events from an external knowledge base and to treat it as the labels of the event's internal sub-topics, which improves the intelligibility of the labels.
  • the PLSA-BLM algorithm in the ET-TAG model models the hidden topics in the text.
  • the electronic device introduces a background topic X on the basis of the original topics A, B, C, ...; after the background topic X is introduced, the high-frequency words, stop words, and the like are absorbed into the background topic X, so these words no longer influence topic extraction.
  • the model can extract the subtopics of the event text according to the distinctive topic features.
  • the electronic device will use the obtained difference in word frequency distribution between different sub-topics to measure the difference between the sub-topics, and merge the sub-topics.
  • the electronic device merging the sub-topics to obtain the target sub-topic includes:
  • the electronic device uses KL divergence (Kullback-Leibler Divergence, KL) algorithm to calculate the divergence between the sub-topics.
  • the electronic device uses the KL divergence algorithm to calculate the divergence between the sub-topics on the obtained sub-topics.
  • the KL divergence algorithm is also called relative entropy, which can measure the difference between two distributions in the same event space.
  • the physical meaning of KL divergence is that, in the same event space, if the distribution P(x) is encoded using Q(x), it measures how many extra bits are needed on average per symbol; it is denoted D(P||Q) and computed as D(P||Q) = ∑P(x)log(P(x)/Q(x)), which is then used to calculate the distance between the word frequency distributions of different sub-topics.
  • when the divergence (distance) between two sub-topics is less than the configured threshold, the electronic device merges the two sub-topics and determines the merged sub-topic as the target sub-topic.
  • the electronic device can remove redundancy and improve the accuracy of sub-topic division, so that the obtained target sub-topic is more accurate.
  • S13 Perform feature learning on the target subtopic to obtain the first keyword of the target subtopic.
  • the electronic device performs feature learning on the target subtopic, and obtaining the first keyword of the target subtopic includes:
  • the electronic device extracts the label words in the target sub-topic and uses a Lasso classification model to perform linear regression analysis on the label words to calculate their scores; the electronic device then takes the label words whose scores are greater than or equal to a preset value as the first keyword.
  • the Lasso classification model based on the self-attention algorithm is a model for linear regression feature reduction and selection.
  • feature learning can be performed on the label words in the target subtopic.
  • the electronic device uses the Lasso classification model to calculate the score of the tag word.
  • the score of each candidate label word is equal to the weight score ⁇ i given by the Lasso classification model.
  • the greater the value of ⁇ i the stronger the correlation between the label term and the sub-topic. Therefore, the electronic device uses the label term with the largest ⁇ i term as the keyword of the sub-topic, that is, as the first keyword .
  • the electronic device extracting the tag words in the target subtopic includes:
  • the electronic device obtains all the words in the target sub-topic and uses the PLSA-BLM algorithm to calculate their word frequencies; the electronic device then sorts all the words by word frequency and selects the words at the preset positions as the label words.
  • the preset position can be customized to obtain the label words meeting the quantity requirement.
  • the words under each subtopic are arranged in descending order of word frequency.
  • the configuration database may be any form of database, and all text reports, commentary articles, etc. related to events (current events, news, etc.) are stored in the configuration database.
  • the electronic device calls the knowledge base related to the event text in the search engine.
  • the electronic device retrieves the associated words of the event text from the configuration database by calculating the degree of association, including:
  • the electronic device calculates the degree of relevance between each word in the configuration database and the event text, takes the words whose degree of relevance is greater than or equal to a configured relevance threshold as words related to the event text, and performs word-segmentation preprocessing on those words to obtain the related words.
  • the configuration association degree can be customized, which is not limited in this application.
  • the electronic device sorts and preprocesses the knowledge base data according to internal concepts to calculate the probability that each word in the knowledge base is related to the first keyword.
  • S15 Determine a second keyword of the event text from the related words based on the contribution value.
  • the electronic device determining the second keyword of the event text from the related words based on the contribution value includes:
  • the electronic device calculates the contribution value of the related word to the first keyword, and obtains the related word with the highest contribution value as the second keyword.
  • the word with the largest contribution value to the first keyword is used as the second keyword, thereby realizing the expansion of the keyword.
  • the electronic device aggregates the expanded trigger words of the sub-keywords to obtain the expanded trigger words of each sub-topic, and finally summarizes them to obtain the expanded trigger words of the entire text, that is, the expanded trigger words of the event text.
  • this application obtains the event text when a keyword extraction instruction is received and introduces background words with the ET-TAG model to extract the sub-topics of the event text, which makes keyword extraction more accurate. The sub-topics are then merged to obtain a target sub-topic, feature learning is performed on the target sub-topic to obtain its first keyword, the related words of the event text are retrieved from the configuration database through relevance calculation, and a second keyword of the event text is determined from those related words based on contribution values, thereby introducing external data as extension data. Finally, the first keyword and the second keyword are merged to obtain the target keyword of the event text. The keywords of the event text are thus determined automatically, and because data from other databases is introduced, the keyword determination is more accurate.
  • the keyword determination device 11 includes an acquisition unit 110, an extraction unit 111, a merging unit 112, a learning unit 113, a retrieval unit 114, and a determination unit 115.
  • the module/unit referred to in this application refers to a series of computer program segments that can be executed by the processor 13 and can complete fixed functions, and are stored in the memory 12. In this embodiment, the functions of each module/unit will be described in detail in subsequent embodiments.
  • the obtaining unit 110 obtains the event text.
  • the keyword extraction instruction may be triggered by the user, which is not limited in the present application.
  • the event text refers to articles such as news reports related to public opinion events.
  • the extraction unit 111 adopts the ET-TAG model to introduce background words, and extracts subtopics of the event text.
  • the extraction unit 111 adopts an ET-TAG model to introduce background words, and extracting subtopics of the event text includes:
  • the extraction unit 111, based on the ET-TAG model, adopts the PLSA-BLM (Probabilistic Latent Semantic Analysis) algorithm to introduce background words and deletes the background words from the event text to obtain an updated text; the extraction unit 111 then extracts topics from the updated text as the sub-topics of the event text.
  • the extraction unit 111 first loads the event text for which keywords need to be extracted, and uses the ET-TAG model to discover the characteristics of the event subtopic from the perspective of word distribution.
  • a supervised approach is adopted to introduce the conceptual system of a specific category of public-opinion events from an external knowledge base and to treat it as the labels of the event's internal sub-topics, which improves the intelligibility of the labels.
  • the PLSA-BLM algorithm in the ET-TAG model models the hidden topics in the text.
  • the extraction unit 111 introduces a background topic X on the basis of the original topics A, B, C, ...; after the background topic X is introduced, the high-frequency words, stop words, and the like that appear frequently in the original topics are absorbed into the background topic X, so these words no longer affect topic extraction.
  • the model can extract the subtopics of the event text according to the distinctive topic features.
  • the merging unit 112 merges the sub-topics to obtain the target sub-topic.
  • the merging unit 112 will use the obtained difference in word frequency distribution between the different sub-topics to measure the difference between the sub-topics, and merge the sub-topics.
  • the merging unit 112 merging the sub-topics to obtain the target sub-topic includes:
  • the merging unit 112 uses a KL divergence (Kullback-Leibler divergence, KL) algorithm to calculate the divergence between the sub-topics. When the divergence between two of the sub-topics is less than a configured threshold, the merging unit 112 merges the two sub-topics and determines the merged sub-topic as the target sub-topic.
  • the merging unit 112 uses the KL divergence algorithm to calculate the divergence between the sub-topics on the obtained subtopics.
  • the KL divergence algorithm is also called relative entropy; it can measure the difference between two distributions in the same event space.
  • the physical meaning of KL divergence is that, in the same event space, if the distribution P(x) is encoded using Q(x), it measures how many extra bits are needed on average per symbol; it is denoted D(P||Q) and computed as D(P||Q) = ∑P(x)log(P(x)/Q(x)), which is then used to calculate the distance between the word frequency distributions of different sub-topics.
  • when the divergence (distance) between two sub-topics is less than the configured threshold, the merging unit 112 merges the two sub-topics and determines the merged sub-topic as the target sub-topic.
  • the learning unit 113 performs feature learning on the target sub-topic to obtain the first keyword of the target sub-topic.
  • the learning unit 113 performs feature learning on the target subtopic, and obtaining the first keyword of the target subtopic includes:
  • the learning unit 113 extracts the label words in the target sub-topic and uses a Lasso classification model to perform linear regression analysis on the label words to calculate their scores; the learning unit 113 then takes the label words whose scores are greater than or equal to a preset value as the first keyword.
  • the Lasso classification model based on the self-attention algorithm is a model for linear regression feature reduction and selection.
  • feature learning can be performed on the label words in the target subtopic.
  • the learning unit 113 uses the Lasso classification model to calculate the score of the label word.
  • the score of each candidate label word is equal to the weight score ⁇ i given by the Lasso classification model.
  • the greater the value of ⁇ i the stronger the correlation between the label term and the sub-topic. Therefore, the learning unit 113 uses the label term with the largest ⁇ i term as the keyword of the sub-topic, which is the first key. word.
  • the extraction of label words in the target subtopic by the learning unit 113 includes:
  • the learning unit 113 obtains all the words in the target sub-topic, and uses the PLSA-BLM algorithm to calculate the word frequency of all the words.
  • the learning unit 113 sorts all the words according to their word frequencies and filters out the words at the preset positions as the label words.
  • the preset position can be customized to obtain the label words meeting the quantity requirement.
  • the words under each sub-topic are arranged in descending order of word frequency.
  • the higher the word frequency the more representative the word is in the current sub-topic.
  • the retrieval unit 114 retrieves the related words of the event text from the configuration database by calculating the degree of relevance.
  • the configuration database may be any form of database, and all text reports, commentary articles, etc. related to events (current events, news, etc.) are stored in the configuration database.
  • the retrieval unit 114 calls the knowledge base related to the event text in the search engine.
  • the retrieval unit 114 retrieves the related words of the event text from the configuration database by calculating the degree of relevance, including:
  • the retrieval unit 114 calculates the degree of relevance between each word in the configuration database and the event text, takes the words whose degree of relevance is greater than or equal to the configured relevance threshold as words related to the event text, and performs word-segmentation preprocessing on those words to obtain the related words.
  • the configuration association degree can be customized, which is not limited in this application.
  • the retrieval unit 114 sorts the knowledge base data according to internal concepts and preprocesses word segmentation to calculate the probability that each word in the knowledge base is related to the first keyword.
  • the determining unit 115 determines the second keyword of the event text from the related words based on the contribution value.
  • the determining unit 115 determining the second keyword of the event text from the related words based on the contribution value includes:
  • the determining unit 115 calculates the contribution value of the related word to the first keyword, and obtains the related word with the highest contribution value as the second keyword.
  • the word with the largest contribution to the first keyword is used as the second keyword, thereby realizing the expansion of the keyword.
  • the merging unit 112 merges the first keyword and the second keyword to obtain the target keyword of the event text.
  • the merging unit 112 aggregates the expanded trigger words of the sub-keywords to obtain the expanded trigger words of each sub-topic, and finally summarizes them to obtain the expanded trigger words of the entire text, that is, the expanded trigger words of the event text.
  • this application obtains the event text when a keyword extraction instruction is received and introduces background words with the ET-TAG model to extract the sub-topics of the event text, which makes keyword extraction more accurate. The sub-topics are then merged to obtain a target sub-topic, feature learning is performed on the target sub-topic to obtain its first keyword, the related words of the event text are retrieved from the configuration database through relevance calculation, and a second keyword of the event text is determined from those related words based on contribution values, thereby introducing external data as extension data. Finally, the first keyword and the second keyword are merged to obtain the target keyword of the event text. The keywords of the event text are thus determined automatically, and because data from other databases is introduced, the keyword determination is more accurate.
  • FIG. 3 is a schematic structural diagram of an electronic device implementing a preferred embodiment of the keyword determination method according to the present application.
  • the electronic device 1 is a device that can automatically perform numerical calculation and/or information processing according to preset or stored instructions; its hardware includes, but is not limited to, a microprocessor, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a digital signal processor (DSP), embedded equipment, etc.
  • the electronic device 1 can also be, but is not limited to, any electronic product that can interact with the user through a keyboard, a mouse, a remote control, a touch panel, or a voice control device, for example, a personal computer, a tablet computer, a smartphone, a personal digital assistant (PDA), a game console, an Internet Protocol Television (IPTV), a smart wearable device, etc.
  • the electronic device 1 may also be a computing device such as a desktop computer, a notebook, a palmtop computer, and a cloud server.
  • the network where the electronic device 1 is located includes, but is not limited to, the Internet, a wide area network, a metropolitan area network, a local area network, a virtual private network (Virtual Private Network, VPN), etc.
  • the electronic device 1 includes, but is not limited to, a memory 12, a processor 13, and a computer program stored in the memory 12 and running on the processor 13, such as a keyword determination program.
  • the schematic diagram is only an example of the electronic device 1 and does not constitute a limitation on the electronic device 1; it may include more or fewer components than those shown in the figure, a combination of certain components, or different components. For example, the electronic device 1 may also include input and output devices, network access devices, buses, and the like.
  • the processor 13 may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
  • the general-purpose processor can be a microprocessor or the processor can also be any conventional processor, etc.
  • the processor 13 is the computing core and control center of the electronic device 1; it connects all parts of the entire electronic device 1 with various interfaces and lines, and executes the operating system of the electronic device 1 as well as various installed applications, program codes, etc.
  • the processor 13 executes the operating system of the electronic device 1 and various installed applications.
  • the processor 13 executes the application program to implement the steps in the above keyword determination method embodiments, such as steps S10, S11, S12, S13, S14, S15, and S16 shown in FIG. 1.
  • the functions of the modules/units in the above device embodiments are implemented, for example: when a keyword extraction instruction is received, the event text is obtained; the ET-TAG model is used to introduce background words and extract the sub-topics of the event text; the sub-topics are merged to obtain the target sub-topic; feature learning is performed on the target sub-topic to obtain the first keyword of the target sub-topic; the related words of the event text are retrieved from the configuration database through relevance calculation; the second keyword of the event text is determined from the related words based on the contribution value; and the first keyword and the second keyword are merged to obtain the target keyword of the event text.
  • the computer program may be divided into one or more modules/units, and the one or more modules/units are stored in the memory 12 and executed by the processor 13 to complete this Application.
  • the one or more modules/units may be a series of computer program instruction segments capable of completing specific functions, and the instruction segments are used to describe the execution process of the computer program in the electronic device 1.
  • the computer program may be divided into an acquiring unit 110, an extracting unit 111, a merging unit 112, a learning unit 113, a retrieval unit 114, and a determining unit 115.
  • the memory 12 may be used to store the computer programs and/or modules, and the processor 13 realizes various functions of the electronic device 1 by running or executing the computer programs and/or modules stored in the memory 12 and calling the data stored in the memory 12.
  • the memory 12 may mainly include a storage program area and a storage data area.
  • the storage program area may store an operating system, an application program required by at least one function (such as a sound playback function, an image playback function, etc.), and the like; the storage data area may store data (such as audio data, a phone book, etc.) created according to the use of the mobile phone.
  • the memory 12 may include a high-speed random access memory, and may also include a non-volatile memory, such as a hard disk, an internal memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash card, at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device.
  • the memory 12 may be an external memory and/or an internal memory of the electronic device 1. Further, the memory 12 may be a circuit with a storage function without a physical form in an integrated circuit, such as RAM (Random-Access Memory, random access memory), FIFO (First In First Out), etc. Alternatively, the memory 12 may also be a memory in physical form, such as a memory stick, a TF card (Trans-flash Card), and so on.
  • if the integrated modules/units of the electronic device 1 are implemented in the form of software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium. Based on this understanding, all or part of the processes in the methods of the above embodiments of this application can also be completed by instructing the relevant hardware through a computer program.
  • the computer program can be stored in a computer-readable storage medium. When the program is executed by the processor, the steps of the foregoing method embodiments can be implemented.
  • the computer program includes computer program code
  • the computer program code may be in the form of source code, object code, executable file, or some intermediate forms.
  • the computer-readable medium may include: any entity or device capable of carrying the computer program code, recording medium, U disk, mobile hard disk, magnetic disk, optical disk, computer memory, read-only memory (ROM, Read-Only Memory) , Random Access Memory (RAM, Random Access Memory), electrical carrier signal, telecommunications signal, and software distribution media, etc.
  • the content contained in the computer-readable medium can be appropriately added or deleted according to the requirements of legislation and patent practice in the jurisdiction; for example, in some jurisdictions, according to legislation and patent practice, the computer-readable medium does not include electrical carrier signals and telecommunication signals.
  • the memory 12 in the electronic device 1 stores multiple instructions to implement a keyword determination method, and the processor 13 can execute the multiple instructions to implement: obtaining the event text when a keyword extraction instruction is received; introducing background words using the ET-TAG model to extract the sub-topics of the event text; merging the sub-topics to obtain a target sub-topic; performing feature learning on the target sub-topic to obtain the first keyword of the target sub-topic; retrieving the related words of the event text from the configuration database through relevance calculation; determining the second keyword of the event text from the related words based on contribution values; and merging the first keyword and the second keyword to obtain the target keyword of the event text.
  • the execution of multiple instructions by the processor 13 includes:
  • the processor 13 further executing multiple instructions includes:
  • the combined sub-topic is determined as the target sub-topic.
  • the processor 13 further executing multiple instructions includes:
  • the processor 13 further executing multiple instructions includes:
  • a word in a preset position among all the words is selected as the label word.
  • the processor 13 further executing multiple instructions includes:
  • the processor 13 further executing multiple instructions includes:
  • the related word with the highest contribution value is obtained as the second keyword.
  • modules described as separate components may or may not be physically separated, and the components displayed as modules may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the modules may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • the functional modules in the various embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated unit can be implemented in the form of hardware or in the form of hardware plus software functional modules.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

A keyword determination method and apparatus, an electronic device, and a storage medium. The keyword determination method is capable of: obtaining an event text when receiving a keyword extraction instruction; extracting sub-topics of the event text using an ET-TAG model to make keyword extraction more accurate; further combining the sub-topics to obtain a target sub-topic; performing feature learning on the target sub-topic to obtain a first keyword of the target sub-topic; then calling keywords of the event text from a configuration database by means of degree of correlation calculation; determining a second keyword of the event text from the keywords on the basis of a contribution value to introduce external data as extension data; and combining the first keyword and the second keyword to obtain a target keyword of the event text. Therefore, the keyword of the event text is automatically determined, and because data in other databases is introduced, the determination of the keyword is more accurate.

Description

Keyword determination method and apparatus, electronic device, and storage medium
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on June 25, 2019 with application number 201910554221.6 and invention title "Keyword Determination Method, Device, Electronic Equipment and Storage Medium", the entire content of which is incorporated into this application by reference.
Technical field
This application relates to the field of data analysis technology, and in particular to a keyword determination method and apparatus, an electronic device, and a storage medium.
Background art
In the prior art, the keywords of an event text are mainly determined manually: the keywords of the event are decided by hand first, and event extraction is performed afterwards. This not only involves a large amount of manual work, but also yields few fine-grained tags for the event; in most cases a single event label is attached to the text as a whole, and that label is then used as the keyword of the text.
The above approach has certain drawbacks. First, because most of the classification and extraction relies on manual judgment, accuracy cannot be guaranteed and efficiency is low. Second, without quantitative analysis, the obtained keywords cannot describe the original text fully and in depth, so they do not live up to the name "keywords". Finally, traditional methods only extract text keywords and do not expand them into trigger keywords, which prevents deeper mining of the text.
Summary of the invention
In view of the above, it is necessary to provide a keyword determination method and apparatus, an electronic device, and a storage medium that can automatically determine the keywords of an event text and, because data from other databases is introduced, make the keyword determination more accurate.
A keyword determination method includes: obtaining event text when a keyword extraction instruction is received; introducing background words using an ET-TAG model to extract sub-topics of the event text; merging the sub-topics to obtain a target sub-topic; performing feature learning on the target sub-topic to obtain a first keyword of the target sub-topic; retrieving related words of the event text from a configuration database through relevance calculation; determining a second keyword of the event text from the related words based on contribution values; and merging the first keyword and the second keyword to obtain the target keyword of the event text.
An electronic device includes: a memory storing at least one instruction; and a processor executing the instructions stored in the memory to implement the keyword determination method.
A computer-readable storage medium stores at least one instruction, and the at least one instruction is executed by a processor in an electronic device to implement the keyword determination method.
It can be seen from the above technical solutions that this application obtains the event text when a keyword extraction instruction is received and introduces background words with the ET-TAG model to extract the sub-topics of the event text, making keyword extraction more accurate. The sub-topics are then merged to obtain a target sub-topic, feature learning is performed on the target sub-topic to obtain its first keyword, the related words of the event text are retrieved from the configuration database through relevance calculation, and a second keyword is determined from those related words based on contribution values, thereby introducing external data as extension data. Finally, the first keyword and the second keyword are merged to obtain the target keyword of the event text. The keywords of the event text are thus determined automatically, and because data from other databases is introduced, the keyword determination is more accurate.
Description of the drawings
Fig. 1 is a flowchart of a preferred embodiment of the keyword determination method of the present application.
Fig. 2 is a functional module diagram of a preferred embodiment of the keyword determination device of the present application.
Fig. 3 is a schematic structural diagram of an electronic device implementing a preferred embodiment of the keyword determination method of the present application.
Description of main component symbols
Electronic device 1
Memory 12
Processor 13
Keyword determination device 11
Acquisition unit 110
Extraction unit 111
Merging unit 112
Learning unit 113
Retrieval unit 114
Determination unit 115
Detailed description of the embodiments
In order to make the objectives, technical solutions, and advantages of the present application clearer, the present application is described in detail below with reference to the accompanying drawings and specific embodiments.
Fig. 1 is a flowchart of a preferred embodiment of the keyword determination method of the present application. According to different needs, the order of the steps in the flowchart can be changed, and some steps can be omitted.
The keyword determination method is applied to one or more electronic devices. An electronic device is a device that can automatically perform numerical calculation and/or information processing according to preset or stored instructions; its hardware includes, but is not limited to, a microprocessor, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a digital signal processor (DSP), an embedded device, and the like.
The electronic device can be any electronic product capable of human-computer interaction with a user, such as a personal computer, a tablet computer, a smartphone, a personal digital assistant (PDA), a game console, an Internet Protocol Television (IPTV), or a smart wearable device.
The electronic device may also include a network device and/or user equipment. The network device includes, but is not limited to, a single network server, a server group composed of multiple network servers, or a cloud composed of a large number of hosts or network servers based on cloud computing.
The network where the electronic device is located includes, but is not limited to, the Internet, a wide area network, a metropolitan area network, a local area network, a virtual private network (VPN), and the like.
S10: Obtain the event text when a keyword extraction instruction is received.
In at least one embodiment of the present application, the keyword extraction instruction may be triggered by the user, which is not limited in the present application.
In at least one embodiment of the present application, the event text refers to articles such as news reports related to public-opinion events.
S11: Introduce background words using the ET-TAG model and extract the sub-topics of the event text.
In at least one embodiment of the present application, the electronic device introducing background words with the ET-TAG model and extracting the sub-topics of the event text includes: based on the ET-TAG model, the electronic device adopts the PLSA-BLM (Probabilistic Latent Semantic Analysis) algorithm to introduce background words and deletes the background words from the event text to obtain an updated text; the electronic device then extracts topics from the updated text as the sub-topics of the event text.
Specifically, the electronic device first loads the event text from which keywords need to be extracted, and uses the ET-TAG model to discover the characteristics of the event sub-topics from the perspective of word distribution.
Further, in the ET-TAG model a supervised approach is adopted: the conceptual system of a specific category of public-opinion events is introduced from an external knowledge base and treated as the labels of the event's internal sub-topics, which improves the intelligibility of the labels; the PLSA-BLM algorithm in the ET-TAG model then models the hidden topics in the text.
Furthermore, the electronic device introduces a background topic X on the basis of the original topics A, B, C, ...; after the background topic X is introduced, the high-frequency words, stop words, and the like that appear frequently across the original topics are absorbed into the background topic X, so that these words no longer influence topic extraction.
Through the above implementation, after this redundant vocabulary is removed, the model can extract the sub-topics of the event text according to distinctive topic features.
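As a concrete illustration of the background-word step above, the following is a minimal sketch and not the patent's implementation: it assumes the event texts are already tokenized and approximates the background topic X as the globally most frequent words plus a stop-word list; the names `extract_background_words` and `remove_background_words` are illustrative.

```python
from collections import Counter

def extract_background_words(tokenized_docs, top_ratio=0.05, stop_words=frozenset()):
    """Approximate the background topic X as the globally most frequent words plus stop words."""
    counts = Counter(word for doc in tokenized_docs for word in doc)
    n_top = max(1, int(len(counts) * top_ratio))
    high_frequency = {word for word, _ in counts.most_common(n_top)}
    return high_frequency | set(stop_words)

def remove_background_words(doc_tokens, background_words):
    """Delete background words from a tokenized event text to obtain the updated text."""
    return [word for word in doc_tokens if word not in background_words]

# usage with hypothetical tokenized event texts
docs = [["the", "flood", "hit", "the", "city"], ["the", "city", "opened", "shelters"]]
background = extract_background_words(docs, top_ratio=0.2, stop_words={"the"})
updated_docs = [remove_background_words(doc, background) for doc in docs]
# topic extraction (the PLSA-BLM step) would then run on updated_docs rather than the raw texts
```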
S12: Merge the sub-topics to obtain the target sub-topic.
It is understandable that the PLSA-BLM algorithm yields a different word frequency distribution for each extracted sub-topic, but these distributions may be very similar to one another. Therefore, the electronic device uses the differences between the word frequency distributions of the different sub-topics to measure the differences between the sub-topics, and merges the sub-topics accordingly.
In at least one embodiment of the present application, the electronic device merging the sub-topics to obtain the target sub-topic includes: the electronic device uses a KL divergence (Kullback-Leibler divergence, KL) algorithm to calculate the divergence between the sub-topics, and when the divergence between two sub-topics is less than a configured threshold, the electronic device merges the two sub-topics and determines the merged sub-topic as the target sub-topic.
Specifically, the electronic device applies the KL divergence algorithm to the obtained sub-topics to calculate the divergence between them. KL divergence, also called relative entropy, measures the difference between two distributions in the same event space. According to its definition in information theory, the physical meaning of KL divergence is, in the same event space, how many extra bits per symbol are needed on average when the distribution P(x) is encoded using Q(x); it is denoted D(P||Q) and computed as D(P||Q) = ∑P(x)log(P(x)/Q(x)), which is then used to calculate the distance between the word frequency distributions of different sub-topics.
Further, when the divergence (distance) between two sub-topics is less than the configured threshold, the electronic device merges the two sub-topics and determines the merged sub-topic as the target sub-topic.
Through the above implementation, the electronic device removes redundancy and improves the accuracy of sub-topic division, so that the obtained target sub-topic is more accurate.
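A small sketch of the merge criterion described above, assuming each sub-topic is represented as a dictionary mapping words to probabilities; the symmetric check and the smoothing constant `eps` are additions for numerical robustness and are not specified in the patent.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """D(P||Q) = sum over x of P(x) * log(P(x) / Q(x)), with smoothing to avoid log(0)."""
    return sum(p_x * math.log((p_x + eps) / (q.get(x, 0.0) + eps))
               for x, p_x in p.items() if p_x > 0)

def merge_subtopics(subtopic_distributions, threshold):
    """Greedily merge sub-topics whose word-frequency distributions are closer than the threshold."""
    merged = []
    for dist in subtopic_distributions:
        for target in merged:
            if kl_divergence(dist, target) < threshold and kl_divergence(target, dist) < threshold:
                for word, prob in dist.items():          # merge: sum the distributions
                    target[word] = target.get(word, 0.0) + prob
                total = sum(target.values())             # then renormalize
                for word in target:
                    target[word] /= total
                break
        else:
            merged.append(dict(dist))
    return merged
```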
S13: Perform feature learning on the target sub-topic to obtain the first keyword of the target sub-topic.
In at least one embodiment of the present application, the electronic device performing feature learning on the target sub-topic to obtain its first keyword includes: the electronic device extracts the label words of the target sub-topic and uses a Lasso classification model to perform linear regression analysis on the label words to calculate their scores; the electronic device then takes the label words whose scores are greater than or equal to a preset value as the first keyword.
The Lasso classification model based on the self-attention algorithm is a model for feature reduction and selection in linear regression; through the Lasso classification model, feature learning can be performed on the label words of the target sub-topic.
Specifically, the electronic device constructs a standard training set D = {(x1, y1), (x2, y2), ..., (xi, yi)}, where xi denotes the word vector of each label word and yi denotes the correlation coefficient between the word vector and the sub-topic to which it belongs.
Further, the electronic device uses the Lasso classification model to calculate the score of each label word. The score of each candidate label word equals the weight βi given by the Lasso classification model, and the greater the value of βi, the stronger the correlation between the label word and the sub-topic; therefore, the electronic device takes the label word with the largest βi as the keyword of that sub-topic, that is, as the first keyword.
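The patent does not fully specify how the design matrix for the Lasso regression is built. One possible sketch, using scikit-learn's Lasso, treats each label word as a feature whose column records its frequency in the documents of the target sub-topic, so that each label word receives a coefficient βi; the function names, matrix layout, and threshold are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import Lasso

def score_label_words(doc_word_matrix, doc_relevance, alpha=0.01):
    """Fit a Lasso regression so that each label word (one column) receives a weight beta_i.

    doc_word_matrix: shape (n_documents, n_label_words), frequency of each label word
                     in each document of the target sub-topic.
    doc_relevance:   shape (n_documents,), relevance of each document to the sub-topic.
    """
    model = Lasso(alpha=alpha)
    model.fit(np.asarray(doc_word_matrix), np.asarray(doc_relevance))
    return model.coef_  # beta_i per label word; a larger value means a stronger association

def pick_first_keywords(label_words, betas, preset_value):
    """Keep the label words whose score is greater than or equal to the preset value."""
    return [word for word, beta in zip(label_words, betas) if beta >= preset_value]
```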
In at least one embodiment of the present application, the electronic device extracting the label words of the target sub-topic includes: the electronic device obtains all the words in the target sub-topic and uses the PLSA-BLM algorithm to calculate their word frequencies; the electronic device then sorts all the words by word frequency and selects the words at the preset positions as the label words.
The preset positions can be configured as needed, so that enough label words are obtained to meet the quantity requirement.
For example, after the sub-topics are merged, the words under each sub-topic are arranged in descending order of word frequency; the higher the word frequency, the more representative the word is of the current sub-topic. The electronic device takes the k words with the highest frequency (for example, k = 100) and uses them as the label words.
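A short sketch of this label-word selection, with k standing in for the configurable preset positions (k = 100 in the example above):

```python
from collections import Counter

def select_label_words(subtopic_tokens, k=100):
    """Rank the words of a merged sub-topic by frequency and keep the top k as label words."""
    frequencies = Counter(subtopic_tokens)
    return [word for word, _ in frequencies.most_common(k)]
```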
Through the above implementation, sufficient and accurate label words can be obtained for subsequent feature learning.
S14: Retrieve the related words of the event text from the configuration database through relevance calculation.
In at least one embodiment of the present application, the configuration database may be any form of database; it stores all the text reports, commentary articles, and the like related to events (current affairs, news, etc.).
For example, the electronic device calls the knowledge base related to the event text in a search engine.
It is understandable that expanding keywords requires knowledge bases that contain public-opinion events related to the event text and cover all aspects of the field. Based on such a knowledge base, information about the event text can be summarized from different angles; the process of expanding keywords is essentially to establish a mapping from the keywords to the event concept system of the public-opinion knowledge base, and then to use the related vocabulary for expansion.
Specifically, the electronic device retrieving the related words of the event text from the configuration database through relevance calculation includes: the electronic device calculates the degree of relevance between each word in the configuration database and the event text, takes the words whose degree of relevance is greater than or equal to a configured relevance threshold as words related to the event text, and performs word-segmentation preprocessing on those words to obtain the related words.
The configured relevance threshold can be set as needed, which is not limited in this application.
For example, the electronic device organizes the knowledge base data according to its internal concepts and performs word-segmentation preprocessing to calculate the probability that each word in the knowledge base is related to the first keyword.
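The patent leaves the exact relevance measure open. The sketch below is one possible stand-in that scores each knowledge-base word by the vocabulary overlap of the documents it appears in with the event text; the threshold corresponds to the configured relevance threshold, and all names are illustrative.

```python
from collections import Counter

def related_words(knowledge_docs, event_tokens, relevance_threshold=0.1):
    """Return knowledge-base words whose estimated relevance to the event text
    meets the configured threshold (a simple co-occurrence based stand-in)."""
    event_vocab = set(event_tokens)
    overlap_sum = Counter()
    doc_count = Counter()
    for doc in knowledge_docs:                       # doc: list of tokens
        overlap = len(event_vocab & set(doc)) / max(len(event_vocab), 1)
        for word in set(doc):
            overlap_sum[word] += overlap
            doc_count[word] += 1
    return {word: overlap_sum[word] / doc_count[word]
            for word in overlap_sum
            if word not in event_vocab
            and overlap_sum[word] / doc_count[word] >= relevance_threshold}
```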
Through the above implementation, data from an external database is introduced as a basis, which makes the keyword determination more accurate.
S15: Determine the second keyword of the event text from the related words based on contribution values.
In at least one embodiment of the present application, determining, by the electronic device, the second keyword of the event text from the related words based on contribution values includes:
the electronic device calculates the contribution value of each related word to the first keyword and takes the related word with the highest contribution value as the second keyword.
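Because the text does not define how the contribution value is computed, the sketch below only shows the selection rule itself; `contribution` is an assumed scoring callable (for example a co-occurrence or embedding-similarity score), not an API defined by the application.

```python
def second_keyword(related_words, first_keywords, contribution):
    """Keep the related word whose contribution to the first keywords is highest.

    `contribution(word, first_keywords)` must return a numeric score; how that
    score is defined is left open here, mirroring the text.
    """
    return max(related_words, key=lambda w: contribution(w, first_keywords))
```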
In this way, the word that contributes most to the first keyword is used as the second keyword, thereby expanding the keywords.
S16: Merge the first keyword and the second keyword to obtain the target keyword of the event text.
In at least one embodiment of the present application, the electronic device aggregates the expanded trigger words of the sub-keywords to obtain the expanded trigger words of each sub-topic, and then aggregates these to obtain the expanded trigger words of the entire text, i.e. the expanded trigger words of the event text.
As can be seen from the above technical solution, when a keyword extraction instruction is received, the present application obtains the event text and uses the ET-TAG model to introduce background words and extract the sub-topics of the event text, which makes the keyword extraction more accurate; it further merges the sub-topics to obtain the target sub-topics and performs feature learning on them to obtain the first keyword of each target sub-topic; it then retrieves the related words of the event text from the configuration database by means of a relevance calculation and determines the second keyword of the event text from those related words based on contribution values, thereby introducing external data as expansion data; finally, it merges the first keyword and the second keyword to obtain the target keyword of the event text. The keywords of the event text are thus determined automatically, and because data from other databases is introduced, the keyword determination is more accurate.
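Read end to end, steps S10 to S16 compose as in the following sketch; every parameter name is a placeholder for the corresponding step described above and is passed in as a callable, since this application describes the steps rather than a concrete API.

```python
def determine_keywords(event_text, kb_documents, *, extract_subtopics, merge_subtopics,
                       learn_first_keywords, retrieve_related, pick_second, merge):
    """Illustrative composition of the keyword determination flow (assumed names)."""
    subtopics = extract_subtopics(event_text)                  # S11: ET-TAG + background words
    target_subtopics = merge_subtopics(subtopics)              # S12: KL-divergence based merging
    first_keywords = learn_first_keywords(target_subtopics)    # S13: Lasso feature learning
    related = retrieve_related(kb_documents, first_keywords)   # S14: relevance retrieval
    second = pick_second(related, first_keywords)              # S15: contribution value
    return merge(first_keywords, second)                       # S16: merge into target keywords
```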
As shown in FIG. 2, which is a functional block diagram of a preferred embodiment of the keyword determination apparatus of the present application, the keyword determination apparatus 11 includes an acquisition unit 110, an extraction unit 111, a merging unit 112, a learning unit 113, a retrieval unit 114, and a determination unit 115. A module/unit referred to in this application is a series of computer program segments that can be executed by the processor 13 to perform a fixed function and that are stored in the memory 12. In this embodiment, the function of each module/unit is described in detail in the following embodiments.
When a keyword extraction instruction is received, the acquisition unit 110 obtains the event text.
In at least one embodiment of the present application, the keyword extraction instruction may be triggered by a user; this application does not limit it.
In at least one embodiment of the present application, the event text refers to articles, such as news reports, that relate to public-opinion events.
The extraction unit 111 introduces background words using the ET-TAG model to extract the sub-topics of the event text.
In at least one embodiment of the present application, introducing, by the extraction unit 111, background words using the ET-TAG model to extract the sub-topics of the event text includes:
based on the ET-TAG model, the extraction unit 111 uses the PLSA-BLM (Probabilistic Latent Semantic Analysis) algorithm to introduce background words and deletes the background words from the event text to obtain an updated text; the extraction unit 111 then extracts topics from the updated text as the sub-topics of the event text.
Specifically, the extraction unit 111 first loads the event text whose keywords are to be extracted and uses the ET-TAG model to discover the characteristics of the event's sub-topics from the perspective of word distribution.
Further, the ET-TAG model adopts a supervised idea: the concept system of a specific category of public-opinion events is introduced from an external knowledge base and used as labels for the sub-topics inside the event, which improves the intelligibility of the labels; the hidden topics in the text are then modeled by the PLSA-BLM algorithm within the ET-TAG model.
Still further, the extraction unit 111 introduces a background topic X on top of the original topics A, B, C, ...; after the background topic X is introduced, the high-frequency words, stop words, and the like that appear in large numbers in the original topics are absorbed into the background topic X, so that these words no longer affect topic extraction.
Through the above implementation, once this redundant vocabulary has been removed, the model can extract the sub-topics of the event text according to their distinctive topic features.
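As an illustration of the effect of the background topic X, the sketch below drops high-document-frequency words and stop words before topic extraction. The real PLSA-BLM model assigns such words to X probabilistically, so this simple frequency rule, the `max_doc_ratio` cut-off, and the list-of-words document representation are all assumptions made for the example.

```python
from collections import Counter

def remove_background_words(documents, stopwords, max_doc_ratio=0.6):
    """Filter out 'background' words before topic extraction (simplified stand-in).

    `documents` is assumed to be a list of documents, each a list of segmented
    words. Words appearing in more than `max_doc_ratio` of the documents, plus
    explicit stop words, are treated as background and removed.
    """
    doc_sets = [set(doc) for doc in documents]
    doc_freq = Counter(w for words in doc_sets for w in words)   # document frequency
    background = {w for w, c in doc_freq.items() if c / len(documents) > max_doc_ratio}
    background |= set(stopwords)
    return [[w for w in doc if w not in background] for doc in documents]
```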
The merging unit 112 merges the sub-topics to obtain the target sub-topic.
It will be understood that the PLSA-BLM algorithm yields a word-frequency distribution for each sub-topic as it extracts it, and the distributions obtained for different sub-topics may be strongly similar. The merging unit 112 therefore uses the differences between the word-frequency distributions of the different sub-topics to measure how different the sub-topics are, and merges the sub-topics accordingly.
In at least one embodiment of the present application, merging, by the merging unit 112, the sub-topics to obtain the target sub-topic includes:
the merging unit 112 uses the KL divergence (Kullback-Leibler divergence) algorithm to compute the divergence between the sub-topics; when the divergence between two of the sub-topics is smaller than a configured threshold, the merging unit 112 merges the two sub-topics and determines the merged sub-topic as the target sub-topic.
Specifically, the merging unit 112 applies the KL divergence algorithm to the obtained sub-topics to compute the divergence between them. KL divergence, also called relative entropy, measures the difference between two distributions over the same event space. By its information-theoretic definition, the KL divergence expresses how many extra bits are needed per symbol, on average, when the distribution P(x) is encoded with a code designed for Q(x) in the same event space; it is denoted D(P||Q) and computed as D(P||Q) = Σ P(x) log(P(x)/Q(x)). This is used to compute the distance between the word-frequency distributions of different sub-topics.
Further, when the divergence (distance) between two of the sub-topics is smaller than the configured threshold, the merging unit 112 merges the two sub-topics and determines the merged sub-topic as the target sub-topic.
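The divergence computation and the merge rule can be sketched as follows; the merge order and the re-normalisation of the pooled distribution are assumptions, since the text only states the merge criterion.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """D(P||Q) = sum over x of P(x) * log(P(x) / Q(x)), computed on the union
    vocabulary; a small epsilon keeps the logarithm finite when a word is
    missing from Q. p and q map word -> probability."""
    vocab = set(p) | set(q)
    return sum(p[w] * math.log((p[w] + eps) / (q.get(w, 0.0) + eps))
               for w in vocab if p.get(w, 0.0) > 0.0)

def _normalize(dist):
    total = sum(dist.values())
    return {w: v / total for w, v in dist.items()}

def merge_close_subtopics(subtopics, threshold):
    """Greedy sketch: pool two sub-topics' word distributions whenever their
    KL divergence is below `threshold`, then re-normalise the result."""
    topics, merged = [dict(t) for t in subtopics], []
    while topics:
        base = topics.pop(0)
        remaining = []
        for other in topics:
            if kl_divergence(base, other) < threshold:
                pooled = {w: base.get(w, 0.0) + other.get(w, 0.0)
                          for w in set(base) | set(other)}
                base = _normalize(pooled)
            else:
                remaining.append(other)
        merged.append(base)
        topics = remaining
    return merged
```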
Through the above implementation, redundancy can be removed and the accuracy of the sub-topic division improved, so that the resulting target sub-topics are more accurate.
The learning unit 113 performs feature learning on the target sub-topic to obtain the first keyword of the target sub-topic.
In at least one embodiment of the present application, performing, by the learning unit 113, feature learning on the target sub-topic to obtain the first keyword of the target sub-topic includes:
the learning unit 113 extracts the label words of the target sub-topic and performs a linear regression analysis on the label words with a Lasso classification model to compute a score for each label word; the learning unit 113 then takes the label words whose score is greater than or equal to a preset value as the first keyword.
The Lasso classification model based on the self-attention algorithm is a model for feature shrinkage and selection in linear regression; with it, feature learning can be performed on the label words of the target sub-topic.
Specifically, the learning unit 113 constructs a standard training set D = {(x1, y1), (x2, y2), ..., (xi, yi)}, where xi denotes the word vector of a label word and yi denotes the correlation coefficient between that word vector and the sub-topic it belongs to.
Further, the learning unit 113 uses the Lasso classification model to compute the score of each label word: the score of a candidate label word equals the weight βi given to it by the Lasso model. The larger βi is, the more strongly the label word is associated with the sub-topic; the learning unit 113 therefore takes the label word with the largest βi as the keyword of that sub-topic, i.e. as the first keyword.
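A hedged sketch of this scoring step using scikit-learn's Lasso estimator is shown below. The document-term design matrix, the per-document relevance target, and the `positive=True` constraint are illustrative modelling choices, since the text only states that word vectors and correlation coefficients form the training set and that the coefficient βi serves as the score.

```python
import numpy as np
from sklearn.linear_model import Lasso

def score_label_words(doc_term_matrix, relevance, label_words, alpha=0.01):
    """Score candidate label words by their Lasso regression coefficients.

    `doc_term_matrix` is an (n_docs x n_label_words) matrix of counts or
    tf-idf values for the candidate label words, and `relevance` holds each
    document's correlation with the target sub-topic; both are assumed
    representations. The fitted coefficient of column i is used as the score
    of label word i; the highest-scoring words become the first keywords.
    """
    model = Lasso(alpha=alpha, positive=True)   # positive=True keeps scores comparable
    model.fit(np.asarray(doc_term_matrix), np.asarray(relevance))
    scores = dict(zip(label_words, model.coef_))
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```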
In at least one embodiment of the present application, extracting, by the learning unit 113, the label words of the target sub-topic includes:
the learning unit 113 obtains all the words of the target sub-topic and computes their word frequencies with the PLSA-BLM algorithm; the learning unit 113 then sorts all the words by their word frequencies and selects the words at preset positions among them as the label words.
The preset positions can be configured as desired, so that enough label words are obtained to meet the required quantity.
For example, after the sub-topics are merged, the words under each sub-topic are sorted in descending order of word frequency; the higher a word's frequency, the more representative it is of the current sub-topic. The learning unit 113 then takes the k most frequent words (for example, k = 100) as the label words.
In this way, a sufficient number of accurate label words can be obtained for the subsequent feature learning.
The retrieval unit 114 retrieves the related words of the event text from the configuration database by means of a relevance calculation.
In at least one embodiment of the present application, the configuration database may be a database of any form; it stores all text reports, commentary articles, and the like that relate to events (current affairs, news, etc.).
For example, the retrieval unit 114 invokes a knowledge base in a search engine that is related to the event text.
It will be understood that expanding the keywords requires knowledge bases that contain public-opinion events related to the event text and that cover every aspect of the field concerned. With such a knowledge base, information about the event text can be summarized from different angles; expanding the keywords essentially means building a mapping from the keywords to the event concept system of the public-opinion knowledge base, so that related vocabulary can be used for the expansion.
Specifically, retrieving, by the retrieval unit 114, the related words of the event text from the configuration database by means of the relevance calculation includes:
the retrieval unit 114 calculates the relevance between every word in the configuration database and the event text, takes the words whose relevance is greater than or equal to a configured relevance threshold as the words related to the event text, and performs word-segmentation preprocessing on those words to obtain the related words.
The configured relevance threshold can be set as desired; this application does not limit it.
For example, the retrieval unit 114 organizes the knowledge-base data by its internal concepts and performs word-segmentation preprocessing, in order to compute, for every word in the knowledge base, the probability that it is related to the first keyword.
In this way, data from an external database is introduced as a basis, which makes the keyword determination more accurate.
The determination unit 115 determines the second keyword of the event text from the related words based on contribution values.
In at least one embodiment of the present application, determining, by the determination unit 115, the second keyword of the event text from the related words based on contribution values includes:
the determination unit 115 calculates the contribution value of each related word to the first keyword and takes the related word with the highest contribution value as the second keyword.
In this way, the word that contributes most to the first keyword is used as the second keyword, thereby expanding the keywords.
The merging unit 112 merges the first keyword and the second keyword to obtain the target keyword of the event text.
In at least one embodiment of the present application, the merging unit 112 aggregates the expanded trigger words of the sub-keywords to obtain the expanded trigger words of each sub-topic, and then aggregates these to obtain the expanded trigger words of the entire text, i.e. the expanded trigger words of the event text.
As can be seen from the above technical solution, when a keyword extraction instruction is received, the present application obtains the event text and uses the ET-TAG model to introduce background words and extract the sub-topics of the event text, which makes the keyword extraction more accurate; it further merges the sub-topics to obtain the target sub-topics and performs feature learning on them to obtain the first keyword of each target sub-topic; it then retrieves the related words of the event text from the configuration database by means of a relevance calculation and determines the second keyword of the event text from those related words based on contribution values, thereby introducing external data as expansion data; finally, it merges the first keyword and the second keyword to obtain the target keyword of the event text. The keywords of the event text are thus determined automatically, and because data from other databases is introduced, the keyword determination is more accurate.
FIG. 3 is a schematic structural diagram of an electronic device implementing a preferred embodiment of the keyword determination method of the present application.
The electronic device 1 is a device capable of automatically performing numerical computation and/or information processing according to preset or stored instructions; its hardware includes, but is not limited to, a microprocessor, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a digital signal processor (DSP), an embedded device, and the like.
The electronic device 1 may also be, but is not limited to, any electronic product that can interact with a user via a keyboard, mouse, remote control, touch pad, voice-control device, or the like, for example a personal computer, a tablet computer, a smartphone, a personal digital assistant (PDA), a game console, an Internet Protocol television (IPTV), or a smart wearable device.
The electronic device 1 may also be a computing device such as a desktop computer, a notebook, a palmtop computer, or a cloud server.
The network in which the electronic device 1 is located includes, but is not limited to, the Internet, a wide area network, a metropolitan area network, a local area network, a virtual private network (VPN), and the like.
In one embodiment of the present application, the electronic device 1 includes, but is not limited to, a memory 12, a processor 13, and a computer program, such as a keyword determination program, that is stored in the memory 12 and can run on the processor 13.
Those skilled in the art will understand that the schematic diagram is merely an example of the electronic device 1 and does not constitute a limitation on it; the electronic device 1 may include more or fewer components than shown, combine certain components, or use different components; for example, it may also include input/output devices, network access devices, buses, and so on.
The processor 13 may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), an off-the-shelf field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor. The processor 13 is the computing core and control center of the electronic device 1; it connects all parts of the electronic device 1 through various interfaces and lines, and runs the operating system of the electronic device 1 as well as the installed applications, program code, and so on.
The processor 13 executes the operating system of the electronic device 1 and the installed applications. The processor 13 executes the applications to implement the steps of the keyword determination method embodiments described above, for example steps S10, S11, S12, S13, S14, S15, and S16 shown in FIG. 1.
Alternatively, when the processor 13 executes the computer program, it implements the functions of the modules/units in the apparatus embodiments described above, for example: when a keyword extraction instruction is received, obtaining the event text; introducing background words using the ET-TAG model to extract the sub-topics of the event text; merging the sub-topics to obtain the target sub-topic; performing feature learning on the target sub-topic to obtain the first keyword of the target sub-topic; retrieving the related words of the event text from the configuration database by means of a relevance calculation; determining the second keyword of the event text from the related words based on contribution values; and merging the first keyword and the second keyword to obtain the target keyword of the event text.
Illustratively, the computer program may be divided into one or more modules/units, which are stored in the memory 12 and executed by the processor 13 to complete this application. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, the instruction segments describing the execution of the computer program in the electronic device 1. For example, the computer program may be divided into the acquisition unit 110, the extraction unit 111, the merging unit 112, the learning unit 113, the retrieval unit 114, and the determination unit 115.
The memory 12 may be used to store the computer program and/or modules; the processor 13 implements the various functions of the electronic device 1 by running or executing the computer programs and/or modules stored in the memory 12 and by invoking the data stored in the memory 12. The memory 12 may mainly include a program storage area and a data storage area, where the program storage area may store the operating system and the applications required by at least one function (such as a sound playback function or an image playback function), and the data storage area may store data created according to the use of the device (such as audio data or a phone book). In addition, the memory 12 may include high-speed random access memory, and may also include non-volatile memory such as a hard disk, memory, plug-in hard disk, smart media card (SMC), secure digital (SD) card, flash card, at least one magnetic disk storage device, flash memory device, or other volatile solid-state storage device.
The memory 12 may be an external memory and/or an internal memory of the electronic device 1. Further, the memory 12 may be a circuit with a storage function that has no physical form within an integrated circuit, such as a RAM (random-access memory) or a FIFO (first in, first out) memory. Alternatively, the memory 12 may be a memory with a physical form, such as a memory stick or a TF card (trans-flash card).
If the integrated modules/units of the electronic device 1 are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on this understanding, this application implements all or part of the processes in the method embodiments described above, which may also be accomplished by instructing the relevant hardware through a computer program; the computer program may be stored in a computer-readable storage medium, and when executed by a processor, the computer program can implement the steps of the method embodiments described above.
The computer program includes computer program code, which may be in source-code form, object-code form, an executable file, some intermediate form, or the like. The computer-readable medium may include: any entity or apparatus capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunication signal, a software distribution medium, and the like. It should be noted that the content contained in the computer-readable medium may be appropriately added to or removed from in accordance with the requirements of legislation and patent practice in a jurisdiction; for example, in some jurisdictions, according to legislation and patent practice, the computer-readable medium does not include electrical carrier signals and telecommunication signals.
With reference to FIG. 1, the memory 12 in the electronic device 1 stores a plurality of instructions to implement a keyword determination method, and the processor 13 can execute the plurality of instructions to implement: when a keyword extraction instruction is received, obtaining the event text; introducing background words using the ET-TAG model to extract the sub-topics of the event text; merging the sub-topics to obtain the target sub-topic; performing feature learning on the target sub-topic to obtain the first keyword of the target sub-topic; retrieving the related words of the event text from the configuration database by means of a relevance calculation; determining the second keyword of the event text from the related words based on contribution values; and merging the first keyword and the second keyword to obtain the target keyword of the event text.
According to a preferred embodiment of the present application, the processor 13 executes instructions including:
based on the ET-TAG model, using the PLSA-BLM algorithm to compute the words whose probability of appearing in the event text is greater than a preset probability as the background words;
deleting the background words from the event text to obtain an updated text;
extracting topics from the updated text as the sub-topics of the event text.
According to a preferred embodiment of the present application, the processor 13 further executes instructions including:
using the KL divergence algorithm to compute the divergence between the sub-topics;
when the divergence between two of the sub-topics is smaller than a configured threshold, merging the two sub-topics;
determining the merged sub-topic as the target sub-topic.
According to a preferred embodiment of the present application, the processor 13 further executes instructions including:
extracting the label words of the target sub-topic;
performing a linear regression analysis on the label words with a Lasso classification model to compute the scores of the label words;
taking the label words whose score is greater than or equal to a preset value as the first keyword.
According to a preferred embodiment of the present application, the processor 13 further executes instructions including:
obtaining all the words of the target sub-topic;
computing the word frequencies of all the words with the PLSA-BLM algorithm;
sorting all the words according to their word frequencies;
selecting the words at preset positions among all the words as the label words.
According to a preferred embodiment of the present application, the processor 13 further executes instructions including:
calculating the relevance between all the words in the configuration database and the event text;
taking the words whose relevance is greater than or equal to a configured relevance threshold as the words related to the event text;
performing word-segmentation preprocessing on the words related to the event text to obtain the related words.
According to a preferred embodiment of the present application, the processor 13 further executes instructions including:
calculating the contribution value of each related word to the first keyword;
taking the related word with the highest contribution value as the second keyword.
Specifically, for the way in which the processor 13 implements the above instructions, reference may be made to the description of the relevant steps in the embodiment corresponding to FIG. 1, which is not repeated here.
In the several embodiments provided in this application, it should be understood that the disclosed system, apparatus, and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; the division into modules is only a division by logical function, and there may be other ways of dividing them in an actual implementation.
The modules described as separate components may or may not be physically separate, and the components shown as modules may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purposes of the solution of this embodiment.
In addition, the functional modules in the embodiments of this application may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware, or in the form of hardware plus software functional modules.
It is obvious to those skilled in the art that this application is not limited to the details of the exemplary embodiments described above, and that this application can be implemented in other specific forms without departing from the spirit or essential characteristics of this application.
Therefore, from whichever point of view, the embodiments should be regarded as exemplary and non-limiting. The scope of this application is defined by the appended claims rather than by the above description, and it is therefore intended that all changes falling within the meaning and scope of equivalents of the claims be embraced by this application. No reference sign in the claims should be construed as limiting the claim concerned.
Furthermore, it is clear that the word "comprising" does not exclude other units or steps, and the singular does not exclude the plural. A plurality of units or apparatuses recited in the system claims may also be implemented by one unit or apparatus through software or hardware. Words such as "second" are used to denote names and do not denote any particular order.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solution of this application and not to limit it. Although this application has been described in detail with reference to the preferred embodiments, those of ordinary skill in the art should understand that the technical solution of this application may be modified or equivalently replaced without departing from the spirit and scope of the technical solution of this application.

Claims (20)

  1. A keyword determination method, wherein the method comprises:
    when a keyword extraction instruction is received, obtaining an event text;
    introducing background words using an ET-TAG model to extract sub-topics of the event text;
    merging the sub-topics to obtain a target sub-topic;
    performing feature learning on the target sub-topic to obtain a first keyword of the target sub-topic;
    retrieving related words of the event text from a configuration database by means of a relevance calculation;
    determining a second keyword of the event text from the related words based on contribution values;
    merging the first keyword and the second keyword to obtain a target keyword of the event text.
  2. The keyword determination method according to claim 1, wherein introducing background words using the ET-TAG model to extract the sub-topics of the event text comprises:
    based on the ET-TAG model, using a PLSA-BLM algorithm to compute words whose probability of appearing in the event text is greater than a preset probability as the background words;
    deleting the background words from the event text to obtain an updated text;
    extracting topics from the updated text as the sub-topics of the event text.
  3. The keyword determination method according to claim 1, wherein merging the sub-topics to obtain the target sub-topic comprises:
    using a KL divergence algorithm to compute the divergence between the sub-topics;
    when the divergence between two of the sub-topics is smaller than a configured threshold, merging the two sub-topics;
    determining the merged sub-topic as the target sub-topic.
  4. The keyword determination method according to claim 1, wherein performing feature learning on the target sub-topic to obtain the first keyword of the target sub-topic comprises:
    extracting label words of the target sub-topic;
    performing a linear regression analysis on the label words with a Lasso classification model to compute scores of the label words;
    taking label words whose score is greater than or equal to a preset value as the first keyword.
  5. The keyword determination method according to claim 4, wherein extracting the label words of the target sub-topic comprises:
    obtaining all words of the target sub-topic;
    computing the word frequencies of all the words with the PLSA-BLM algorithm;
    sorting all the words according to their word frequencies;
    selecting the words at preset positions among all the words as the label words.
  6. The keyword determination method according to claim 1, wherein retrieving the related words of the event text from the configuration database by means of the relevance calculation comprises:
    calculating the relevance between all words in the configuration database and the event text;
    taking words whose relevance is greater than or equal to a configured relevance threshold as words related to the event text;
    performing word-segmentation preprocessing on the words related to the event text to obtain the related words.
  7. The keyword determination method according to claim 1, wherein determining the second keyword of the event text from the related words based on contribution values comprises:
    calculating the contribution value of each related word to the first keyword;
    taking the related word with the highest contribution value as the second keyword.
  8. An electronic device, wherein the electronic device comprises: a memory storing at least one instruction; and a processor executing the instructions stored in the memory to implement the following steps:
    when a keyword extraction instruction is received, obtaining an event text;
    introducing background words using an ET-TAG model to extract sub-topics of the event text;
    merging the sub-topics to obtain a target sub-topic;
    performing feature learning on the target sub-topic to obtain a first keyword of the target sub-topic;
    retrieving related words of the event text from a configuration database by means of a relevance calculation;
    determining a second keyword of the event text from the related words based on contribution values;
    merging the first keyword and the second keyword to obtain a target keyword of the event text.
  9. The electronic device according to claim 8, wherein introducing background words using the ET-TAG model to extract the sub-topics of the event text comprises:
    based on the ET-TAG model, using a PLSA-BLM algorithm to compute words whose probability of appearing in the event text is greater than a preset probability as the background words;
    deleting the background words from the event text to obtain an updated text;
    extracting topics from the updated text as the sub-topics of the event text.
  10. The electronic device according to claim 8, wherein merging the sub-topics to obtain the target sub-topic comprises:
    using a KL divergence algorithm to compute the divergence between the sub-topics;
    when the divergence between two of the sub-topics is smaller than a configured threshold, merging the two sub-topics;
    determining the merged sub-topic as the target sub-topic.
  11. The electronic device according to claim 8, wherein performing feature learning on the target sub-topic to obtain the first keyword of the target sub-topic comprises:
    extracting label words of the target sub-topic;
    performing a linear regression analysis on the label words with a Lasso classification model to compute scores of the label words;
    taking label words whose score is greater than or equal to a preset value as the first keyword.
  12. The electronic device according to claim 11, wherein extracting the label words of the target sub-topic comprises:
    obtaining all words of the target sub-topic;
    computing the word frequencies of all the words with the PLSA-BLM algorithm;
    sorting all the words according to their word frequencies;
    selecting the words at preset positions among all the words as the label words.
  13. The electronic device according to claim 8, wherein retrieving the related words of the event text from the configuration database by means of the relevance calculation comprises:
    calculating the relevance between all words in the configuration database and the event text;
    taking words whose relevance is greater than or equal to a configured relevance threshold as words related to the event text;
    performing word-segmentation preprocessing on the words related to the event text to obtain the related words.
  14. The electronic device according to claim 8, wherein determining the second keyword of the event text from the related words based on contribution values comprises:
    calculating the contribution value of each related word to the first keyword;
    taking the related word with the highest contribution value as the second keyword.
  15. A computer-readable storage medium, wherein the computer-readable storage medium stores at least one instruction, and the at least one instruction is executed by a processor in an electronic device to implement the following steps:
    when a keyword extraction instruction is received, obtaining an event text;
    introducing background words using an ET-TAG model to extract sub-topics of the event text;
    merging the sub-topics to obtain a target sub-topic;
    performing feature learning on the target sub-topic to obtain a first keyword of the target sub-topic;
    retrieving related words of the event text from a configuration database by means of a relevance calculation;
    determining a second keyword of the event text from the related words based on contribution values;
    merging the first keyword and the second keyword to obtain a target keyword of the event text.
  16. The computer-readable storage medium according to claim 15, wherein introducing background words using the ET-TAG model to extract the sub-topics of the event text comprises:
    based on the ET-TAG model, using a PLSA-BLM algorithm to compute words whose probability of appearing in the event text is greater than a preset probability as the background words;
    deleting the background words from the event text to obtain an updated text;
    extracting topics from the updated text as the sub-topics of the event text.
  17. The computer-readable storage medium according to claim 15, wherein merging the sub-topics to obtain the target sub-topic comprises:
    using a KL divergence algorithm to compute the divergence between the sub-topics;
    when the divergence between two of the sub-topics is smaller than a configured threshold, merging the two sub-topics;
    determining the merged sub-topic as the target sub-topic.
  18. The computer-readable storage medium according to claim 15, wherein performing feature learning on the target sub-topic to obtain the first keyword of the target sub-topic comprises:
    extracting label words of the target sub-topic;
    performing a linear regression analysis on the label words with a Lasso classification model to compute scores of the label words;
    taking label words whose score is greater than or equal to a preset value as the first keyword.
  19. The computer-readable storage medium according to claim 18, wherein extracting the label words of the target sub-topic comprises:
    obtaining all words of the target sub-topic;
    computing the word frequencies of all the words with the PLSA-BLM algorithm;
    sorting all the words according to their word frequencies;
    selecting the words at preset positions among all the words as the label words.
  20. The computer-readable storage medium according to claim 15, wherein retrieving the related words of the event text from the configuration database by means of the relevance calculation comprises:
    calculating the relevance between all words in the configuration database and the event text;
    taking words whose relevance is greater than or equal to a configured relevance threshold as words related to the event text;
    performing word-segmentation preprocessing on the words related to the event text to obtain the related words.
PCT/CN2019/118013 2019-06-25 2019-11-13 Keyword determination method and apparatus, electronic device, and storage medium WO2020258662A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910554221.6 2019-06-25
CN201910554221.6A CN110457672B (en) 2019-06-25 2019-06-25 Keyword determination method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
WO2020258662A1 true WO2020258662A1 (en) 2020-12-30

Family

ID=68480833

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/118013 WO2020258662A1 (en) 2019-06-25 2019-11-13 Keyword determination method and apparatus, electronic device, and storage medium

Country Status (2)

Country Link
CN (1) CN110457672B (en)
WO (1) WO2020258662A1 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111538891B (en) * 2020-04-21 2023-04-07 招商局金融科技有限公司 Hot event monitoring method and device, computer device and readable storage medium
CN112732893B (en) * 2021-01-13 2024-01-19 上海明略人工智能(集团)有限公司 Text information extraction method and device, storage medium and electronic equipment
CN113536077B (en) * 2021-05-31 2022-06-17 烟台中科网络技术研究所 Mobile APP specific event content detection method and device
CN114662474A (en) * 2022-04-13 2022-06-24 马上消费金融股份有限公司 Keyword determination method and device, electronic equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106294396A (en) * 2015-05-20 2017-01-04 北京大学 Keyword expansion method and keyword expansion system
CN106570120A (en) * 2016-11-02 2017-04-19 四川用联信息技术有限公司 Process for realizing searching engine optimization through improved keyword optimization
CN107220232A (en) * 2017-04-06 2017-09-29 北京百度网讯科技有限公司 Keyword extracting method and device, equipment and computer-readable recording medium based on artificial intelligence
CN108268668A (en) * 2018-02-28 2018-07-10 福州大学 One kind is based on the multifarious text data viewpoint abstract method for digging of topic

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060276996A1 (en) * 2005-06-01 2006-12-07 Keerthi Sathiya S Fast tracking system and method for generalized LARS/LASSO
CN106708880B (en) * 2015-11-16 2020-05-22 北京国双科技有限公司 Topic associated word acquisition method and device
CN107168943B (en) * 2017-04-07 2018-07-03 平安科技(深圳)有限公司 The method and apparatus of topic early warning
CN107590505B (en) * 2017-08-01 2021-08-27 天津大学 Learning method combining low-rank representation and sparse regression
CN109885675B (en) * 2019-02-25 2020-11-27 合肥工业大学 Text subtopic discovery method based on improved LDA

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106294396A (en) * 2015-05-20 2017-01-04 北京大学 Keyword expansion method and keyword expansion system
CN106570120A (en) * 2016-11-02 2017-04-19 四川用联信息技术有限公司 Process for realizing searching engine optimization through improved keyword optimization
CN107220232A (en) * 2017-04-06 2017-09-29 北京百度网讯科技有限公司 Keyword extracting method and device, equipment and computer-readable recording medium based on artificial intelligence
CN108268668A (en) * 2018-02-28 2018-07-10 福州大学 One kind is based on the multifarious text data viewpoint abstract method for digging of topic

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZHOU NAN, DU PAN; JIN XIAO-LONG ;LIU YUE; CHENG XUE-QI: "ET-TAG:A Tag Generation Model for the Sub-Topics of Public Opinion Events", CHINESE JOURNAL OF COMPUTERS., vol. 41, no. 7, 1 July 2018 (2018-07-01), pages 1490 - 1503, XP055774603 *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112926297A (en) * 2021-02-26 2021-06-08 北京百度网讯科技有限公司 Method, apparatus, device and storage medium for processing information
CN112926297B (en) * 2021-02-26 2023-06-30 北京百度网讯科技有限公司 Method, apparatus, device and storage medium for processing information
CN113326350A (en) * 2021-05-31 2021-08-31 江汉大学 Keyword extraction method, system, device and storage medium based on remote learning
CN113326350B (en) * 2021-05-31 2023-05-26 江汉大学 Keyword extraction method, system, equipment and storage medium based on remote learning
CN113704587A (en) * 2021-08-31 2021-11-26 中国平安财产保险股份有限公司 User adhesion analysis method, device, equipment and medium based on stage division
CN113704587B (en) * 2021-08-31 2023-06-06 中国平安财产保险股份有限公司 User adhesion analysis method, device, equipment and medium based on stage division
CN114880496A (en) * 2022-04-28 2022-08-09 国家计算机网络与信息安全管理中心 Multimedia information topic analysis method, device, equipment and storage medium
CN114938477A (en) * 2022-06-23 2022-08-23 阿里巴巴(中国)有限公司 Video topic determination method, device and equipment
CN114938477B (en) * 2022-06-23 2024-05-03 阿里巴巴(中国)有限公司 Video topic determination method, device and equipment
CN115277616A (en) * 2022-07-25 2022-11-01 深圳美克拉网络技术有限公司 Mobile terminal display method and terminal

Also Published As

Publication number Publication date
CN110457672A (en) 2019-11-15
CN110457672B (en) 2023-01-17

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19935261

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19935261

Country of ref document: EP

Kind code of ref document: A1