CN117454888A - Keyword extraction and model training method, device, equipment and medium thereof - Google Patents

Keyword extraction and model training method, device, equipment and medium thereof

Info

Publication number
CN117454888A
CN117454888A (application CN202311238472.6A)
Authority
CN
China
Prior art keywords
sample
words
word
keyword extraction
dimension
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311238472.6A
Other languages
Chinese (zh)
Inventor
刘杨
张文斌
林跃
卢品吟
李运洋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Donson Times Information Technology Co ltd
Original Assignee
Donson Times Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Donson Times Information Technology Co ltd filed Critical Donson Times Information Technology Co ltd
Priority to CN202311238472.6A priority Critical patent/CN117454888A/en
Publication of CN117454888A publication Critical patent/CN117454888A/en
Pending legal-status Critical Current

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention relates to the technical field of artificial intelligence, and in particular to a keyword extraction model training method. The method comprises the following steps: processing the acquired text data to obtain sample data; matching a preset subject word database against the sample data through a matching layer in a preset training model to obtain sample subject words; extracting the content around the sample subject words in the sample data, within the boundary distance of a boundary layer, to obtain boundary content; mining dimension words in the boundary content, based on the sample subject words, through preset association rules in a mining layer to obtain sample dimension words; calculating the semantic similarity between each sample subject word and each of its sample dimension words, and determining a prediction loss value; and determining the converged preset training model as the keyword extraction model. By limiting extraction to the boundary distance, the invention avoids mining dimension words from the full context and improves keyword extraction efficiency; mining with preset association rules between subject words and dimension words achieves efficient keyword extraction.

Description

Keyword extraction and model training method, device, equipment and medium thereof
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a keyword extraction model training method, device, equipment and medium.
Background
Subject word recognition and dimension word recognition models belong to the field of entity recognition in natural language processing (NLP). NLP is an important branch of artificial intelligence that aims to enable computers to understand, process and generate natural language. As the volume of large-scale text data continues to grow, subject word and dimension word recognition techniques have attracted widespread attention and research.
In the prior art, sequence labeling models are usually used to identify dimension words, for example hidden Markov models and conditional random fields. These models can capture the dependency between subject words and dimension words in text and make predictions based on context. However, some dimension words are meaningful only in the words immediately before and after the subject word; predicting them from the full-text context is therefore time-consuming, yields dimension words with little reference value, and results in low extraction efficiency.
Disclosure of Invention
Based on the above, it is necessary to provide a keyword extraction model training method, device, equipment and medium to solve the problem of low extraction efficiency caused by predicting dimension words over the whole text in the prior art.
A keyword extraction model training method comprises the following steps:
acquiring text data, and performing word segmentation cleaning processing on all the text data to obtain sample data corresponding to each text data;
inputting all the sample data into a preset training model, and carrying out subject word matching on a preset subject word database and all the sample data through a matching layer in the preset training model to obtain sample subject words;
extracting the content of all the sample subject words in each sample data through boundary distances in boundary layers in the preset training model to obtain boundary content;
mining dimension words in the boundary content based on the sample subject words through preset association rules in a mining layer in the preset training model to obtain sample dimension words corresponding to the sample subject words;
calculating semantic similarity between the sample subject word and each corresponding sample dimension word, and determining a prediction loss value of a preset training model;
and when the predicted loss value reaches a preset convergence condition, determining a preset training model after convergence as a keyword extraction model.
A keyword extraction method, comprising:
acquiring at least one text to be processed;
invoking a keyword extraction model, wherein the keyword extraction model is obtained according to the keyword extraction model training method;
and extracting keywords from all the texts to be processed based on the keyword extraction model to obtain keyword extraction results.
A keyword extraction model training device comprises:
the sample data acquisition module is used for acquiring text data, and performing word segmentation and cleaning processing on all the text data to obtain sample data corresponding to each text data;
the subject word extraction module is used for inputting all the sample data into a preset training model, and carrying out subject word matching between a preset subject word database and all the sample data through a matching layer in the preset training model to obtain sample subject words;
the boundary content extraction module is used for extracting the content of all the sample subject words in each sample data through boundary distances in boundary layers in the preset training model to obtain boundary content;
the dimension word mining module is used for mining dimension words in the boundary content based on the sample subject words through preset association rules in a mining layer in the preset training model to obtain sample dimension words corresponding to the sample subject words;
the loss value prediction module is used for calculating semantic similarity between the sample subject word and each corresponding sample dimension word and determining a predicted loss value of a preset training model;
and the model convergence module is used for determining a preset training model after convergence as a keyword extraction model when the predicted loss value reaches a preset convergence condition.
A keyword extraction apparatus comprising:
the acquisition module is used for acquiring at least one text to be processed;
the calling module is used for calling a keyword extraction model, wherein the keyword extraction model is obtained according to the keyword extraction model training method;
and the extraction module is used for extracting keywords from all the texts to be processed based on the keyword extraction model to obtain keyword extraction results.
A computer device comprises a memory, a processor and computer readable instructions stored in the memory and executable on the processor, the processor implementing the keyword extraction model training method or the keyword extraction method when executing the computer readable instructions.
One or more readable storage media storing computer-readable instructions that, when executed by one or more processors, cause the one or more processors to perform the keyword extraction model training method described above, or to perform the keyword extraction method described above.
According to the keyword extraction model training method, device, equipment and medium, sample data are acquired by performing word segmentation and cleaning processing on all the acquired text data, which improves extraction efficiency. Subject word matching between a preset subject word database and all sample data through the matching layer in the preset training model realizes extraction of the subject words in all sample data. Content extraction for all sample subject words in each sample data through the boundary distance in the boundary layer realizes extraction of the boundary content, and in turn the mining of all dimension words in the boundary content. Calculating the semantic similarity between each sample subject word and each of its corresponding sample dimension words realizes the calculation of the model loss value. Further, the boundary distance in the boundary layer avoids extracting all dimension words from the full context, improving keyword extraction efficiency, and mining with preset association rules between subject words and dimension words achieves efficient keyword extraction.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments of the present invention will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a keyword extraction model training method according to an embodiment of the invention;
FIG. 2 is a flow chart of a keyword extraction method according to an embodiment of the invention;
FIG. 3 is a schematic diagram of a training device for keyword extraction model according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a keyword extraction apparatus according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a computer device in accordance with an embodiment of the invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The keyword extraction model training method provided by the embodiment can be applied to an application environment as shown in fig. 1, wherein a client communicates with a server. Clients include, but are not limited to, various personal computers, notebook computers, smart phones, tablet computers, and portable wearable devices. The server may be implemented by a stand-alone server or a server cluster formed by a plurality of servers.
In one embodiment, as shown in fig. 1, a keyword extraction model training method is provided, which includes the following steps:
s10, acquiring text data, and performing word segmentation cleaning processing on all the text data to obtain sample data corresponding to each text data.
It will be appreciated that text data may be collected from different databases, or may be text or sentences collected from different clients, such as descriptive or review text for a product, or news reports about an event. The sample data is the text obtained by cleaning the text data.
Specifically, at least one piece of text data is acquired and preprocessed. First, the text data is segmented into words with a Chinese word segmentation algorithm that exploits the relations between contexts: all possible segmentations are listed and formed into a directed acyclic graph, with candidate words as nodes and weighted edges between them; the path with the smallest total weight is selected as the best segmentation, yielding at least one word. Next, noise data and outlier data are cleaned out by classification, clustering and linear regression. Finally, a preset filtering list, which may contain stop words and other meaningless words, is matched against the segmentation results so that stop words and other meaningless words are removed, giving the sample data corresponding to each piece of text data.
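The directed-acyclic-graph segmentation described above can be sketched in Python. The toy dictionary, its frequency counts, and the use of negative log frequency as the edge weight are illustrative assumptions; production segmenters apply the same shortest-path idea over a large lexicon.

```python
import math

# Toy dictionary with word frequencies; a real system loads a large lexicon.
# All entries and counts here are illustrative, not from the patent.
FREQ = {"北京": 10, "北京大学": 8, "大学": 12, "生": 5, "大学生": 6, "北": 1, "京": 1, "学": 1}
TOTAL = sum(FREQ.values())

def build_dag(sentence):
    """For each position, list the end index of every dictionary word starting
    there; the result is a directed acyclic graph over character positions."""
    dag = {}
    for i in range(len(sentence)):
        ends = [i + len(w) for w in FREQ if sentence.startswith(w, i)]
        dag[i] = ends or [i + 1]  # unknown single character falls back to itself
    return dag

def segment(sentence):
    """Pick the path whose summed edge weights (negative log frequency) are smallest."""
    dag = build_dag(sentence)
    n = len(sentence)
    cost = [math.inf] * (n + 1)
    cost[n] = 0.0
    back = [n] * (n + 1)
    for i in range(n - 1, -1, -1):       # dynamic programming from the end
        for j in dag[i]:
            w = FREQ.get(sentence[i:j], 1)
            c = -math.log(w / TOTAL) + cost[j]
            if c < cost[i]:
                cost[i], back[i] = c, j
    words, i = [], 0
    while i < n:                          # walk the best path forward
        words.append(sentence[i:back[i]])
        i = back[i]
    return words
```

With this toy lexicon, `segment("北京大学生")` prefers the split 北京 / 大学生, because that path has the smallest summed negative log frequency.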
S20, inputting all the sample data into a preset training model, and carrying out subject word matching on a preset subject word database and all the sample data through a matching layer in the preset training model to obtain sample subject words.
A sample subject word is a subject word extracted from the sample data. A subject word is a keyword describing an article or an event; it may be a single word or a compound of several words, for example 'milk drink'. The preset subject word database is built by loading subject words collected by experts into an Aho-Corasick (AC) automaton, whose efficient matching algorithm is used to extract and mine, from massive text data, the text content containing those subject words.
Specifically, all sample data are input into the preset training model, and subject word matching between the preset subject word database and all sample data is performed through the matching layer in the preset training model: the preset subject word database is obtained first, and the preset subject words in it are then matched against the characters of each sample data through a prefix tree in the matching layer. Starting from the first character of each preset subject word, the characters are matched in sequence along the prefix tree, so that the corresponding subject words are matched in each sample data; the matched subject words are extracted to obtain the sample subject words.
In a specific embodiment, take matching the word 'Wednesday' (星期三). An initialized root node is added first. The character '星' is added: the root node's pass is incremented; since the string is not yet finished, end is unchanged; a node is created at the other end of the path (two nodes form a path), and the new node's pass is incremented while its end is likewise unchanged. The character '期' is added: a new node is created to represent its path, and on that node pass is incremented and end is unchanged. The character '三' is added: a new node and path are created, pass is incremented on the new node, and, the string now being finished, end is incremented. The addition of the string '星期三' is thus complete; subject words such as '星期二' (Tuesday) are added by the same method. Each addition starts from the root node, the first node of the prefix tree, whose pass records how many strings are stored in the tree altogether. The biggest feature of the prefix tree is the reuse of characters: starting from the root node, a new path is created only when no reusable prefix exists; an existing path is reused where possible, and the number of strings passing through each character is recorded.
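A minimal prefix tree with the pass/end counters from this embodiment might look like the sketch below. Class and method names are our own, and the AC automaton mentioned earlier would additionally carry failure links, which are omitted here for brevity.

```python
class TrieNode:
    def __init__(self):
        self.pass_ = 0    # how many stored strings pass through this node
        self.end = 0      # how many stored strings end at this node
        self.children = {}

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        node = self.root
        node.pass_ += 1                    # every insertion starts at the root
        for ch in word:
            # Reuse an existing path for a shared prefix, else create one.
            node = node.children.setdefault(ch, TrieNode())
            node.pass_ += 1
        node.end += 1                      # the string finishes here

    def match_in(self, text):
        """Return every stored subject word that occurs as a substring of `text`."""
        hits = set()
        for i in range(len(text)):
            node = self.root
            for j in range(i, len(text)):
                node = node.children.get(text[j])
                if node is None:
                    break
                if node.end:
                    hits.add(text[i:j + 1])
        return hits
```

Because of prefix reuse, inserting both '星期三' and '星期二' shares the 星 and 期 nodes, and the root's pass counter reports the total number of stored strings.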
S30, extracting contents of all the sample subject words in each sample data through boundary distances in boundary layers in the preset training model, and obtaining boundary contents.
Understandably, the boundary content is the content extracted from the sample data based on the boundary distance. The boundary distance is a distance of a preset number of characters from the sample subject word, for example 30 characters.
Specifically, content extraction for all sample subject words in each sample data is performed through the boundary distance in the boundary layer of the preset training model: the position of each sample subject word in each sample data is determined first; the boundaries of the content to extract are then determined from that position and the boundary distance, that is, a preset number of characters is counted forward and backward from the position of each sample subject word to find the boundaries of the content, and the content between them is extracted to obtain the boundary content.
S40, mining dimension words in the boundary content based on the sample subject words through preset association rules in a mining layer in the preset training model to obtain sample dimension words corresponding to the sample subject words.
A sample dimension word is a dimension word extracted from the sample data. Dimension words describe different aspects of a subject, for example 'tastes good' for a drink or 'looks beautiful' for a flower. In this embodiment, the preset association rule uses the Apriori algorithm.
Specifically, dimension words in the boundary content are mined based on the sample subject words through the preset association rules in the mining layer of the preset training model: the co-occurrence degree between each sample subject word and the candidate words in the sample data is calculated through the preset association rules, giving the co-occurrence value corresponding to each candidate word. All candidate words corresponding to each sample subject word are then screened using the co-occurrence value and the rule threshold in the preset association rule, giving the sample dimension words.
S50, calculating semantic similarity between the sample subject word and each corresponding sample dimension word, and determining a predicted loss value of a preset training model.
Understandably, the predicted loss value is the loss generated while extracting subject words and dimension words from the sample data.
Specifically, after obtaining the sample dimension words corresponding to each sample subject word, the semantic similarity values between the sample subject word and all of its sample dimension words are calculated; that is, when one sample subject word has several dimension words, the similarity between the subject word and every one of those dimension words is computed to obtain the semantic similarity values. All sample dimension words corresponding to the sample subject word are screened based on the semantic similarity values, yielding retained dimension words and removed dimension words. A loss function is then applied to the retained and removed dimension words corresponding to the same sample subject word, giving the predicted loss value.
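The patent does not give the loss formula, so the following is only one plausible reading: retained dimension words should score a margin above removed ones, and every retained/removed pair that violates the margin is penalized. The function name and the margin value are assumptions.

```python
def predicted_loss(retained_sims, removed_sims, margin=0.3):
    """Hypothetical margin loss over similarity values: for each pair of a
    retained and a removed dimension word, penalize max(0, margin - s_kept +
    s_removed). `margin` is an illustrative hyperparameter, not from the patent."""
    pairs = [(k, r) for k in retained_sims for r in removed_sims]
    if not pairs:
        return 0.0
    return sum(max(0.0, margin - k + r) for k, r in pairs) / len(pairs)
```

When retained words already clear the margin over removed ones, the loss is zero, matching the intuition that a well-separated screening needs no further parameter adjustment.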
S60, when the predicted loss value reaches a preset convergence condition, determining the preset training model after convergence as the keyword extraction model.
It is to be understood that the convergence condition may be that the predicted loss value falls below a set threshold, or that after 500 rounds of calculation the predicted loss value is small and no longer drops, at which point training stops.
Specifically, after the predicted loss value is obtained, if it does not reach the preset convergence condition, the initial parameters of the preset training model are adjusted according to the predicted loss value, all sample data are input again into the model with adjusted initial parameters, and the adjusted model is trained iteratively to obtain a new predicted loss value. Whenever the predicted loss value still fails to reach the preset convergence condition, the initial parameters are readjusted accordingly, until the predicted loss value of the readjusted model reaches the preset convergence condition. The predicted results are thus drawn ever closer to the correct results and the accuracy of the preset training model keeps increasing; once the predicted loss value reaches the preset convergence condition, the converged preset training model is determined as the keyword extraction model.
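The iterate-until-convergence procedure above can be sketched as follows. `forward_loss` and `adjust` are hypothetical stand-ins for the model's real interfaces, and the threshold, patience and round-limit values are illustrative.

```python
def train(model, samples, *, loss_threshold=1e-3, patience=500, max_rounds=10_000):
    """Iterate until the predicted loss value meets a convergence condition:
    either it falls below `loss_threshold`, or it stops dropping for `patience`
    consecutive rounds. Threshold values are illustrative, not from the patent."""
    best, stale = float("inf"), 0
    for _ in range(max_rounds):
        loss = model.forward_loss(samples)   # hypothetical model API
        if loss < loss_threshold:
            return model                     # converged below the threshold
        if loss < best - 1e-9:
            best, stale = loss, 0            # loss still dropping
        else:
            stale += 1
            if stale >= patience:
                return model                 # loss no longer drops
        model.adjust(loss)                   # readjust initial parameters
    return model
```

A dummy model whose loss halves each round converges once the loss dips under the threshold, without ever triggering the patience branch.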
In the embodiment of the invention, sample data are acquired by performing word segmentation and cleaning processing on all the acquired text data, which improves extraction efficiency. Subject word matching between the preset subject word database and all sample data through the matching layer in the preset training model realizes extraction of the subject words in all sample data. Content extraction for all sample subject words in each sample data through the boundary distance in the boundary layer realizes extraction of the boundary content, and in turn the mining of all dimension words in the boundary content. Calculating the semantic similarity between each sample subject word and each of its corresponding sample dimension words realizes the calculation of the model loss value. Further, the boundary distance in the boundary layer avoids extracting all dimension words from the full context, improving keyword extraction efficiency, and mining with preset association rules between subject words and dimension words achieves efficient keyword extraction.
In an embodiment, step S30, namely extracting the content of all the sample subject words in each sample data through the boundary distance in the boundary layer of the preset training model to obtain boundary content, includes:
S301, extracting boundary contents within the boundary distance of the sample subject words in the sample data.
Boundary content is understood to mean the content in the sample data that lies within the boundary distance of a sample subject word. The boundary distance refers to the maximum limiting distance between a dimension word and the subject word that it describes.
Specifically, after the sample subject words are obtained, the boundary content within the boundary distance of each sample subject word is extracted from the sample data: the position of each sample subject word in each sample data is determined first, the boundary distance is obtained, and the character positions within the boundary distance are determined from the positions of the first and last characters of the sample subject word, which fixes the boundary positions in the sample data. The text between the two boundary positions is extracted, giving the boundary content corresponding to each sample subject word. For example, with a boundary distance of 30 characters, the first boundary is found by counting 30 characters before the first character of the sample subject word, and the second boundary by counting 30 characters after its last character; extracting the text within these boundaries yields the boundary content for that sample subject word. That is, in this embodiment, the boundary distance in the boundary layer avoids extracting dimension words from the full context and improves keyword extraction efficiency.
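The per-occurrence boundary extraction reads as a minimal sketch below; the 30-character distance mirrors the example in the text, while the function name is our own.

```python
def extract_boundary_content(sample, subject, distance=30):
    """For each occurrence of `subject` in `sample`, keep only the characters
    within `distance` positions before its first character and after its last,
    clamped to the ends of the sample."""
    contents = []
    start = sample.find(subject)
    while start != -1:
        end = start + len(subject)
        left = max(0, start - distance)            # first boundary
        right = min(len(sample), end + distance)   # second boundary
        contents.append(sample[left:right])
        start = sample.find(subject, end)          # next occurrence
    return contents
```

Only the characters between the two boundaries are handed to the mining layer, which is what keeps dimension-word mining away from the full context.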
In an embodiment, in step S40, mining the dimension words in the boundary content based on the sample subject words to obtain sample dimension words corresponding to each sample subject word includes:
S401, performing co-occurrence calculation on the sample subject word and the candidate words in the sample data through a preset association rule to obtain co-occurrence values corresponding to each candidate word.
The co-occurrence value is understandably the frequency with which the sample subject word and the candidate word occur together. The preset association rule is the Apriori algorithm. In this example, a minimum support threshold is set over a large volume of text content, frequent item sets are screened out, and the subject words and their corresponding dimension words are then extracted from the frequent item sets. The rule threshold refers to a co-occurrence threshold, set according to the actual situation, for example 10.
Specifically, co-occurrence calculation between the sample subject word and the candidate words in the sample data is performed through the preset association rule: the corresponding item set is found in each sample data, where each item set includes at least one subject word and at least one candidate word. Join and prune operations are performed on all item sets to obtain the frequent item sets, where each item in a frequent item set includes a sample subject word and at least one candidate word. The frequency of each item of the frequent item sets across all sample data is counted to obtain the frequency corresponding to each item, and that frequency is determined as the co-occurrence value.
S402, screening all candidate words corresponding to each sample subject word through the co-occurrence value and a rule threshold in the preset association rule to obtain sample dimension words.
Further, all candidate words corresponding to each sample subject word are screened using the co-occurrence value and the rule threshold in the preset association rule: the co-occurrence value of each candidate word is compared with the rule threshold, and when the co-occurrence value is smaller than the rule threshold, the candidate word is deleted; when the co-occurrence value is greater than or equal to the rule threshold, the candidate word is retained and determined as a sample dimension word. Screening all candidate words of all sample subject words yields the corresponding sample dimension words. That is, in this example, mining the association between subject words and dimension words with the preset association rules achieves efficient keyword extraction.
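The threshold screening in S402 is a one-line filter; the rule threshold of 10 follows the earlier example, and the mapping from candidate word to co-occurrence value is an assumed input shape.

```python
def screen_by_cooccurrence(cooccurrence, rule_threshold=10):
    """Retain the candidate words whose co-occurrence value with the sample
    subject word is greater than or equal to the rule threshold; the rest
    are deleted. `cooccurrence` maps candidate word -> co-occurrence value."""
    return {w for w, v in cooccurrence.items() if v >= rule_threshold}
```

With the co-occurrence counts from the later example (raining 6, sunny 12) and a threshold of 10, only 'sunny' survives as a dimension word.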
In an embodiment, in step S401, that is, performing co-occurrence computation on the candidate words in the sample data and the sample subject word by using a preset association rule to obtain co-occurrence values corresponding to each candidate word, the method includes:
S4011, finding the corresponding item set in each sample data; the item set includes at least one sample subject word and at least one candidate word.
S4012, performing join and prune operations on all the item sets to obtain frequent item sets; each item in the frequent item sets includes a sample subject word and at least one candidate word.
S4013, counting the frequency of each item of the frequent item sets in all the sample data to obtain the frequency corresponding to each item, and determining the frequency of each item of the frequent item sets as a co-occurrence value.
An item set is understood to mean a set of subject words and candidate words extracted from each sample data. Each item set includes at least one subject word and at least one candidate word. Each item in a frequent item set includes a sample subject word and at least one candidate word.
Specifically, after the boundary content is obtained, the corresponding item set is found in each sample data: all sample subject words and candidate words are extracted by way of a prefix tree (with a root node and pass/end counters), which avoids repeated paths, prevents the generation of a large number of candidate item sets, and improves mining efficiency. Each sample subject word and its corresponding candidate words are determined as an item set. Join and prune operations are then performed on all item sets, that is, the join step and prune step separate the frequent item sets from the infrequent ones, giving the frequent item sets. The frequency of each item of each frequent item set across all sample data is then counted, that is, the number of times the sample subject word and candidate word of each item appear together in all sample data, giving the frequency of each item, which is determined as the co-occurrence value. For example, 'Wednesday' and 'raining' are counted to occur together 6 times, so the co-occurrence value is 6; 'Wednesday' and 'sunny' occur together 12 times, so the co-occurrence value is 12.
Mining with the preset association rules comprises two main steps: first, find all frequent item sets; second, generate association rules from the frequent item sets. To find all frequent item sets: set a minimum support; count the frequency of each item of the item set in the whole collection and screen out the items whose frequency is greater than the minimum support; combine the screened items pairwise to generate new item sets, recount the frequency of each new item set in the whole collection, and screen again by the minimum support; and so on, until no new item set can be combined, at which point all frequent item sets have been obtained.
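The frequent-item-set search just described (grow candidates pairwise, prune by minimum support) can be sketched as follows. This is a didactic Apriori that rescans all transactions for each candidate, not an optimized implementation.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Apriori frequent-itemset search: start from frequent single items, then
    repeatedly join k-itemsets into (k+1)-itemsets and prune those whose
    support count falls below `min_support`."""
    def support(cand):
        return sum(1 for t in transactions if cand <= t)

    items = {frozenset([i]) for t in transactions for i in t}
    current = {c: support(c) for c in items if support(c) >= min_support}
    result = dict(current)
    while current:
        keys = list(current)
        # Join step: merge pairs of same-size itemsets differing by one element.
        candidates = {a | b for a, b in combinations(keys, 2)
                      if len(a | b) == len(a) + 1}
        # Prune step: keep only candidates that still meet the minimum support.
        current = {c: s for c in candidates
                   if (s := support(c)) >= min_support}
        result.update(current)
    return result
```

Feeding in transactions built from boundary contents, the support count of a {subject word, candidate word} pair is exactly the co-occurrence value used in S4013.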
In an embodiment, in step S50, that is, calculating semantic similarity between the sample subject word and each corresponding sample dimension word, determining a predicted loss value of a preset training model includes:
S501, determining semantic similarity values between the sample subject word and all corresponding sample dimension words.
S502, screening all sample dimension words corresponding to the sample subject words based on the semantic similarity value to obtain reserved dimension words and removed dimension words.
Understandably, the semantic similarity value refers to the semantic similarity between a sample dimension word and the sample subject word. Reserved dimension words are the sample dimension words with larger semantic similarity values. Removed dimension words are the remaining sample dimension words, that is, those screened out for having smaller semantic similarity values.
Specifically, after the sample dimension words are obtained, the semantic similarity values between each sample subject word and all of its corresponding sample dimension words are calculated. The sample subject word and all sample dimension words are vector-encoded, producing a dimension encoding vector for each sample dimension word and a subject encoding vector for the sample subject word. The cosine similarity between the subject encoding vector and each dimension encoding vector is then calculated, giving the semantic similarity value corresponding to each sample dimension word. All sample dimension words corresponding to the sample subject word are then screened by these values: the semantic similarity values are compared and the sample dimension words sorted in descending order, a preset number of the top sample dimension words are selected and determined to be reserved dimension words, and the other sample dimension words are determined to be removed dimension words. Alternatively, a similarity threshold may be set and compared against the semantic similarity values: sample dimension words whose semantic similarity value is greater than or equal to the threshold are determined to be reserved dimension words, and those whose value is smaller than the threshold are determined to be removed dimension words.
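Both screening variants described above, top-k by descending similarity or a fixed similarity threshold, can be sketched as follows. The toy vectors and names are placeholders, since the patent does not fix a particular encoder:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def screen_dimension_words(subject_vec, dim_vecs, threshold=None, top_k=None):
    """Split sample dimension words into reserved and removed sets.

    dim_vecs maps each sample dimension word to its encoding vector. Give
    either threshold (keep scores >= threshold) or top_k (keep the k best).
    """
    scored = sorted(((cosine(subject_vec, v), w) for w, v in dim_vecs.items()),
                    reverse=True)
    if top_k is not None:
        reserved = {w for _, w in scored[:top_k]}
    else:
        reserved = {w for s, w in scored if s >= threshold}
    removed = set(dim_vecs) - reserved
    return reserved, removed

subject_vec = [1.0, 0.0]
dim_vecs = {"battery": [0.9, 0.1], "weather": [0.0, 1.0]}
reserved, removed = screen_dimension_words(subject_vec, dim_vecs, threshold=0.5)
# reserved == {"battery"}, removed == {"weather"}
```

With `top_k=1` instead of a threshold, the same example keeps only the single most similar dimension word.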
And S503, carrying out loss calculation on the reserved dimension word and the removed dimension word corresponding to the same sample subject word through a loss function to obtain a predicted loss value.
Further, the loss function performs loss calculation on the reserved dimension words and removed dimension words corresponding to the same sample subject word. The difference between the semantic similarity values of the reserved and removed dimension words may be calculated, or the semantic similarity between the sample subject word and each removed dimension word may be used directly, giving a loss value for each sample subject word; aggregating the losses over all sample subject words then gives the predicted loss value. That is, in this embodiment, the overall loss value of the preset training model is calculated through the loss function, so that the predicted loss value of the preset training model is determined and the parameters of the preset training model can be adjusted.
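The patent only says the loss is built from differences of semantic similarity values; one hedged reading is a pairwise hinge on the gap between reserved and removed words, sketched below. The margin value and the hinge form are assumptions, not stated in the source:

```python
def subject_loss(reserved_sims, removed_sims, margin=0.2):
    """Per-subject-word loss sketch: penalize removed dimension words whose
    similarity comes within `margin` of a reserved word's similarity.

    reserved_sims / removed_sims: semantic similarity values of the reserved
    and removed dimension words for one sample subject word.
    """
    loss, pairs = 0.0, 0
    for r in reserved_sims:
        for n in removed_sims:
            loss += max(0.0, margin - (r - n))  # hinge on the similarity gap
            pairs += 1
    return loss / pairs if pairs else 0.0

# summing subject_loss over every sample subject word would give the
# model's overall predicted loss value
```

A well-separated split (e.g. 0.9 vs. 0.5) incurs zero loss, while a narrow gap (0.6 vs. 0.55) is penalized, which is the behavior the training step needs.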
In one embodiment, as shown in fig. 2, a keyword extraction method is provided, which includes the following steps:
S11, at least one text to be processed is obtained.
S12, calling a keyword extraction model, wherein the keyword extraction model is obtained according to the keyword extraction model training method.
And S13, extracting keywords from all the texts to be processed based on the keyword extraction model to obtain keyword extraction results.
It is understood that a text to be processed is a text from which keywords are to be extracted, and that it has already undergone word segmentation and cleaning; a news report is one example. The keyword extraction result refers to the subject words and dimension words extracted from the text to be processed.
Specifically, the texts to be processed may be acquired from different databases or different websites, or sent from a client to at least one server. The trained keyword extraction model is then called, and all texts to be processed are input into it; the model extracts keywords from each text as follows. First, the matching layer matches the preset subject words in the preset subject word store against the texts to be processed, that is, subject words identical to the preset subject words are extracted from all texts to be processed, giving the target subject words. Next, the boundary layer in the keyword extraction model extracts the content around the target subject words in each text to be processed, that is, the content text of each target subject word is determined based on the boundary distance. The mining layer then mines dimension words in the content text through the preset association rule: co-occurrence is calculated between the target subject word and the candidate words in each text to be processed, giving the co-occurrence value corresponding to each candidate word, and candidate words whose co-occurrence value is greater than the rule threshold are extracted as target dimension words. All target subject words and their corresponding target dimension words extracted from the same text to be processed are determined as the keyword extraction result, which thus contains at least one group of target subject words and target dimension words. The specific process is the same as in model training and is not described again here.
That is, in the present embodiment, by using the keyword extraction model described above, efficient and accurate keyword extraction is achieved.
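The three-layer inference pipeline (matching layer, boundary layer, mining layer) can be condensed into a short sketch. The window size, the rule store as a dictionary of mined co-occurrence values, and all names are illustrative assumptions:

```python
def extract_keywords(text_words, subject_lexicon, boundary, rules, rule_threshold):
    """End-to-end sketch: match preset subject words, cut a context window
    by boundary distance, then keep candidates whose mined co-occurrence
    value exceeds the rule threshold.

    rules maps (subject word, candidate word) pairs to co-occurrence values.
    """
    results = {}
    for i, word in enumerate(text_words):
        if word not in subject_lexicon:
            continue  # matching layer: only preset subject words pass
        # boundary layer: content text within the boundary distance
        window = text_words[max(0, i - boundary): i + boundary + 1]
        # mining layer: filter candidates by the preset association rule
        dims = {c for c in window
                if c != word and rules.get((word, c), 0) > rule_threshold}
        results.setdefault(word, set()).update(dims)
    return results

text = ["the", "phone", "has", "great", "camera"]
result = extract_keywords(text, {"phone"}, boundary=3,
                          rules={("phone", "camera"): 5, ("phone", "great"): 1},
                          rule_threshold=2)
# result == {"phone": {"camera"}}
```

Here "camera" survives because its co-occurrence value (5) clears the rule threshold, while "great" (1) does not, mirroring the screening described above.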
It should be understood that the sequence numbers of the steps in the foregoing embodiments do not imply an execution order; the execution order of each process should be determined by its function and internal logic, and should not limit the implementation of the embodiments of the present invention.
In an embodiment, a keyword extraction device is provided, where the keyword extraction device corresponds to the keyword extraction method in the above embodiment one by one. As shown in fig. 4, the keyword extraction apparatus includes an acquisition module 11, a calling module 12, and an extraction module 13. The functional modules are described in detail as follows:
an obtaining module 11, configured to obtain at least one text to be processed;
the calling module 12 is configured to call a keyword extraction model, where the keyword extraction model is obtained according to the keyword extraction model training method;
and the extraction module 13 is used for extracting keywords from all the texts to be processed based on the keyword extraction model to obtain keyword extraction results.
In an embodiment, a keyword extraction model training device is provided, where the keyword extraction model training device corresponds to the keyword extraction model training method in the above embodiment one by one. As shown in fig. 3, the keyword extraction model training apparatus includes a sample data acquisition module 10, a subject word extraction module 20, a boundary content extraction module 30, a dimension word mining module 40, a loss value prediction module 50, and a model convergence module 60. The functional modules are described in detail as follows:
The sample data acquisition module 10 is used for acquiring text data, and performing word segmentation cleaning processing on all the text data to obtain sample data corresponding to each text data;
the subject word extraction module 20 is configured to input all the sample data into a preset training model, and perform subject word matching on a preset subject word database and all the sample data through a matching layer in the preset training model to obtain sample subject words;
the boundary content extraction module 30 is configured to extract content of all the sample subject words in each sample data through boundary distances in boundary layers in the preset training model, so as to obtain boundary content;
the dimension word mining module 40 is configured to mine dimension words in the boundary content based on the sample subject words through preset association rules in a mining layer in the preset training model, so as to obtain sample dimension words corresponding to each sample subject word;
the loss value prediction module 50 is configured to calculate semantic similarity between the sample subject word and each corresponding sample dimension word, and determine a predicted loss value of a preset training model;
the model convergence module 60 is configured to determine a preset training model after convergence as a keyword extraction model when the predicted loss value reaches a preset convergence condition.
Optionally, the dimension word mining module 40 includes:
the co-occurrence calculating unit is used for calculating the co-occurrence of the sample subject word and the candidate words in the sample data through a preset association rule to obtain co-occurrence values corresponding to the candidate words;
and the dimension word screening unit is used for screening all the candidate words corresponding to each sample subject word through the co-occurrence value and a rule threshold value in the preset association rule to obtain sample dimension words.
Optionally, the co-occurrence calculating unit includes:
a term set acquisition subunit, configured to find a corresponding term set from each of the sample data; the set of items includes at least one sample subject word and at least one candidate word;
the frequent item set subunit is used for executing connection and pruning operations on all the item sets to obtain frequent item sets; each item in the set of frequent items includes a subject word and at least one candidate word;
and the frequency statistics subunit is used for counting the frequency of each item in the frequent item sets in all the sample data to obtain the frequency corresponding to each item in the frequent item sets, and determining the frequency of each item in the frequent item sets as a co-occurrence value.
Optionally, the boundary content extraction module 30 includes:
and the boundary content unit is used for extracting boundary contents within the boundary distance of the sample subject word in the sample data.
Optionally, the loss value prediction module 50 includes:
a semantic similarity value unit, configured to determine a semantic similarity value between the sample subject word and all the corresponding sample dimension words;
the target screening unit is used for screening all sample dimension words corresponding to the sample subject words based on the semantic similarity value to obtain reserved dimension words and removed dimension words;
and the loss calculation unit is used for carrying out loss calculation on the reserved dimension word and the removed dimension word corresponding to the same sample subject word through a loss function to obtain a predicted loss value.
In one embodiment, a computer device is provided, which may be a terminal, and whose internal structure may be as shown in fig. 5. The computer device includes a processor, a memory, a network interface, a display screen, and an input device connected by a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a readable storage medium and an internal memory. The readable storage medium stores an operating system and computer readable instructions, and the internal memory provides an environment for their execution. The network interface of the computer device is used to communicate with an external server via a network connection. The computer readable instructions, when executed by the processor, implement the keyword extraction model training method or the keyword extraction method described above. The readable storage medium provided by this embodiment includes non-volatile and volatile readable storage media.
In one embodiment, a computer device is provided that includes a memory, a processor, and computer readable instructions stored on the memory and executable on the processor, the processor implementing the keyword extraction model training method or the keyword extraction method when executing the computer readable instructions.
In one embodiment, one or more computer-readable storage media are provided having computer-readable instructions stored thereon; the readable storage media provided by this embodiment include non-volatile and volatile readable storage media. The computer readable instructions, when executed by one or more processors, implement the keyword extraction model training method or the keyword extraction method described above.
Those skilled in the art will appreciate that implementing all or part of the methods of the above embodiments may be accomplished by computer readable instructions, stored on a non-volatile or volatile readable storage medium, instructing the associated hardware; when executed, the instructions may include the processes of the above method embodiments. Any reference to memory, storage, database, or other medium used in the various embodiments provided herein may include non-volatile and/or volatile memory. The non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), among others.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional units and modules is illustrated, and in practical application, the above-described functional distribution may be performed by different functional units and modules according to needs, i.e. the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-described functions.
The above embodiments are only for illustrating the technical solution of the present invention, and not for limiting the same; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention, and are intended to be included in the scope of the present invention.

Claims (10)

1. The keyword extraction model training method is characterized by comprising the following steps of:
acquiring text data, and performing word segmentation cleaning processing on all the text data to obtain sample data corresponding to each text data;
Inputting all the sample data into a preset training model, and carrying out subject word matching on a preset subject word database and all the sample data through a matching layer in the preset training model to obtain sample subject words;
extracting the content of all the sample subject words in each sample data through boundary distances in boundary layers in the preset training model to obtain boundary content;
mining dimension words in the boundary content based on the sample subject words through preset association rules in a mining layer in the preset training model to obtain sample dimension words corresponding to the sample subject words;
calculating semantic similarity between the sample subject word and each corresponding sample dimension word, and determining a prediction loss value of a preset training model;
and when the predicted loss value reaches a preset convergence condition, determining a preset training model after convergence as a keyword extraction model.
2. The keyword extraction model training method of claim 1, wherein mining the dimension words in the boundary content based on the sample subject words to obtain sample dimension words corresponding to each sample subject word comprises:
Performing co-occurrence calculation on the sample subject word and the candidate words in the sample data through a preset association rule to obtain co-occurrence values corresponding to the candidate words;
and screening all candidate words corresponding to each sample subject word through the co-occurrence value and a rule threshold value in the preset association rule to obtain sample dimension words.
3. The keyword extraction model training method of claim 2, wherein the performing co-occurrence computation on the sample subject word and the candidate words in the sample data by using a preset association rule to obtain co-occurrence values corresponding to each candidate word comprises:
finding out a corresponding item set from each sample data; the set of items includes at least one sample subject word and at least one candidate word;
performing connection and pruning operations on all the item sets to obtain frequent item sets; each item in the set of frequent items includes a subject word and at least one candidate word;
counting the frequency of each item in the frequent item sets in all the sample data to obtain the frequency corresponding to each item in the frequent item sets, and determining the frequency of each item in the frequent item sets as a co-occurrence value.
4. The method for training a keyword extraction model according to claim 1, wherein the extracting content of all the sample subject words in each sample data by using boundary distances in boundary layers in the preset training model to obtain boundary content includes:
boundary content within the sample subject word boundary distance is extracted from the sample data.
5. The keyword extraction model training method of claim 1, wherein the calculating semantic similarity between the sample subject word and each corresponding sample dimension word to determine the predictive loss value of the preset training model comprises:
determining semantic similarity values between the sample subject word and all corresponding sample dimension words;
screening all sample dimension words corresponding to the sample subject words based on the semantic similarity value to obtain reserved dimension words and removed dimension words;
and carrying out loss calculation on the reserved dimension word and the removed dimension word corresponding to the same sample subject word through a loss function to obtain a predicted loss value.
6. A keyword extraction method, comprising:
Acquiring at least one text to be processed;
invoking a keyword extraction model, wherein the keyword extraction model is obtained by the keyword extraction model training method according to any one of claims 1 to 5;
and extracting keywords from all the texts to be processed based on the keyword extraction model to obtain keyword extraction results.
7. A keyword extraction model training device, characterized by comprising:
the sample data acquisition module is used for acquiring text data, and performing word segmentation cleaning treatment on all the text data to obtain sample data corresponding to each text data;
the main body word extraction module is used for inputting all the sample data into a preset training model, and carrying out main body word matching on a preset main body word database and all the sample data through a matching layer in the preset training model to obtain sample main body words;
the boundary content extraction module is used for extracting the content of all the sample subject words in each sample data through boundary distances in boundary layers in the preset training model to obtain boundary content;
the dimension word mining module is used for mining dimension words in the boundary content based on the sample subject words through preset association rules in a mining layer in the preset training model to obtain sample dimension words corresponding to the sample subject words;
The loss value prediction module is used for calculating semantic similarity between the sample subject word and each corresponding sample dimension word and determining a predicted loss value of a preset training model;
and the model convergence module is used for determining a preset training model after convergence as a keyword extraction model when the predicted loss value reaches a preset convergence condition.
8. A keyword extraction apparatus, characterized by comprising:
the acquisition module is used for acquiring at least one text to be processed;
the calling module is used for calling a keyword extraction model, wherein the keyword extraction model is obtained by the keyword extraction model training method according to any one of claims 1 to 5;
and the extraction module is used for extracting keywords from all the texts to be processed based on the keyword extraction model to obtain keyword extraction results.
9. A computer device comprising a memory, a processor and computer readable instructions stored in the memory and executable on the processor, wherein the processor, when executing the computer readable instructions, implements the keyword extraction model training method of any one of claims 1 to 5 or the keyword extraction method of claim 6.
10. One or more readable storage media storing computer-readable instructions that, when executed by one or more processors, cause the one or more processors to perform the keyword extraction model training method of any one of claims 1 to 5 or the keyword extraction method of claim 6.
CN202311238472.6A 2023-09-22 2023-09-22 Keyword extraction and model training method, device, equipment and medium thereof Pending CN117454888A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311238472.6A CN117454888A (en) 2023-09-22 2023-09-22 Keyword extraction and model training method, device, equipment and medium thereof


Publications (1)

Publication Number Publication Date
CN117454888A true CN117454888A (en) 2024-01-26

Family

ID=89578821

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311238472.6A Pending CN117454888A (en) 2023-09-22 2023-09-22 Keyword extraction and model training method, device, equipment and medium thereof

Country Status (1)

Country Link
CN (1) CN117454888A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination