CN110705282A - Keyword extraction method and device, storage medium and electronic equipment - Google Patents


Info

Publication number
CN110705282A
CN110705282A
Authority
CN
China
Prior art keywords
participle
word
text
weight value
node corresponding
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910833971.7A
Other languages
Chinese (zh)
Inventor
贾弼然
崔朝辉
赵立军
张霞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Neusoft Corp
Original Assignee
Neusoft Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Neusoft Corp filed Critical Neusoft Corp
Priority to CN201910833971.7A priority Critical patent/CN110705282A/en
Publication of CN110705282A publication Critical patent/CN110705282A/en
Pending legal-status Critical Current

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 — Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 — Information retrieval of unstructured textual data
    • G06F16/33 — Querying
    • G06F16/3331 — Query processing
    • G06F16/334 — Query execution
    • G06F16/3343 — Query execution using phonetics
    • G06F16/3344 — Query execution using natural language analysis

Abstract

The disclosure relates to a keyword extraction method, a keyword extraction device, a storage medium and electronic equipment, which are used for increasing weighted value discrimination between words in a text and enabling keyword extraction to be more accurate. The method comprises the following steps: acquiring a first text to be subjected to keyword extraction; performing word segmentation on the first text to obtain a plurality of word segments; inputting a plurality of participles into the word graph model to obtain a weight value corresponding to each participle; extracting keywords of the first text according to the weight value corresponding to each participle; the word graph model is used for determining the weight value of each participle in the following mode: acquiring a target word graph; determining a first edge weight value between a node corresponding to the first participle and a node corresponding to the second participle in the target word graph; determining a second edge weight value between a node corresponding to the first participle and a node corresponding to the second participle in a preset word graph; and determining the weight value of the node corresponding to the first word in the target word graph according to the first edge weight value and the second edge weight value.

Description

Keyword extraction method and device, storage medium and electronic equipment
Technical Field
The present disclosure relates to the field of text processing technologies, and in particular, to a keyword extraction method, an apparatus, a storage medium, and an electronic device.
Background
Keywords are words that reflect the subject or primary content of a text. Keyword extraction is an important subtask in the field of NLP (Natural Language Processing). In information retrieval, accurate keyword extraction can greatly improve efficiency; in a dialog system, a machine may understand a user's intention through keywords; keyword extraction is also very helpful in text classification.
In the related art, keyword extraction is mostly performed by determining a weight value for each word in a text according to the number of co-occurrences between words in the text, and then extracting keywords according to the weight values. However, in a short text, because the number of words is small, the co-occurrence counts between words differ little, so the weight values of the words also differ little and may even be substantially the same, which affects keyword extraction; for example, all the participles in the text may end up being selected as keywords.
Disclosure of Invention
The disclosure aims to provide a keyword extraction method, a keyword extraction device, a storage medium and electronic equipment, so as to improve the weighted value discrimination between words and enable keyword extraction to be more accurate.
In order to achieve the above object, in a first aspect, the present disclosure provides a keyword extraction method, including:
acquiring a first text to be subjected to keyword extraction;
performing word segmentation on the first text to obtain a plurality of word segments;
inputting the multiple participles into a word graph model to obtain a weight value corresponding to each participle;
extracting keywords from the first text according to the weight value corresponding to each participle;
wherein the word graph model is used for determining the weight value of each participle by the following method:
acquiring a target word graph, wherein the target word graph is established based on word segmentation in a first text;
determining a first edge weight value between a node corresponding to the first participle and a node corresponding to the second participle in the target word graph;
determining a second edge weight value between a node corresponding to the first participle and a node corresponding to the second participle in a preset word graph, wherein the preset word graph is established based on the participles in a second text, and the vocabulary amount of the second text is greater than that of the first text;
and determining the weight value of the node corresponding to the first word in the target word graph according to the first edge weight value and the second edge weight value.
Optionally, the obtaining of the first text to be subjected to keyword extraction includes:
receiving voice information input by a user;
recognizing the voice information to obtain a target text corresponding to the voice information;
and taking the target text as a first text to be subjected to keyword extraction.
Optionally, the obtaining of the first text to be subjected to keyword extraction includes:
and responding to input completion information triggered by a user, and acquiring a first text to be subjected to keyword extraction in a text box of the client.
Optionally, the determining a second edge weight value between a node corresponding to the first word segmentation and a node corresponding to the second word segmentation includes:
if the first participle in the second text is a low-frequency word and the second participle is a high-frequency word, determining a plurality of first target high-frequency words with the highest similarity to the first participle in the high-frequency words of the second text, wherein the low-frequency word is a word with the occurrence frequency lower than a preset threshold value, and the high-frequency word is a word with the occurrence frequency higher than or equal to the preset threshold value;
and determining a second edge weight value between a node corresponding to the first participle and a node corresponding to the second participle in a preset word graph according to the co-occurrence times of the first participle and the second participle in a preset time length and the average co-occurrence times of the second participle and a plurality of first target high-frequency words in the preset time length.
Optionally, the determining, according to the number of co-occurrence times of the first participle and the second participle in a preset time length and the average number of co-occurrence times of the second participle and the plurality of first target high-frequency words in the preset time length, a second edge weight value between a node corresponding to the first participle and a node corresponding to the second participle includes:
determining a second edge weight value between the node corresponding to the first participle and the node corresponding to the second participle according to the following formula:
w12-init = a · fr(w1, w2) + (b/n) · Σ_{wo1 ∈ max(sim, n)} fr(w2, wo1)
wherein w12-init represents the second edge weight value, fr(w1, w2) represents the number of co-occurrences of the first participle w1 and the second participle w2 within a preset duration, fr(w2, wo1) represents the number of co-occurrences of the second participle w2 and a first target high-frequency word wo1 within the preset duration, n represents the number of first target high-frequency words wo1, a and b are constants with a + b = 1, and max(sim, n) represents the word set formed by the n first target high-frequency words with the highest similarity to the first participle w1.
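The computation above (direct co-occurrence count blended with the average co-occurrence of the second participle and the high-frequency words most similar to the first participle) can be sketched as follows. This is an illustrative sketch, not the patent's reference implementation; the co-occurrence table, constants a and b, and word names are assumptions.

```python
# Sketch of the second edge weight for a low-frequency first participle w1
# and a high-frequency second participle w2. fr is assumed to hold
# precomputed co-occurrence counts within the preset duration.

def second_edge_weight_mixed(fr, w1, w2, similar_high_freq, a=0.5, b=0.5):
    """w12_init = a*fr(w1, w2) + b*avg(fr(w2, wo1) for wo1 in max(sim, n))."""
    assert abs(a + b - 1.0) < 1e-9  # the patent requires a + b = 1
    n = len(similar_high_freq)      # the n most similar high-frequency words
    avg = sum(fr.get((w2, wo1), 0) for wo1 in similar_high_freq) / n
    return a * fr.get((w1, w2), 0) + b * avg

# Toy co-occurrence table: fr[(x, y)] = co-occurrences of x and y.
fr = {("lake", "noise"): 1, ("noise", "pollution"): 4, ("noise", "motorboat"): 2}
w = second_edge_weight_mixed(fr, "lake", "noise", ["pollution", "motorboat"])
print(w)  # 0.5*1 + 0.5*((4+2)/2) = 2.0
```

Because the averaged term draws on the much richer statistics of the second text, the resulting edge weight differs more between word pairs than a raw short-text co-occurrence count would.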
Optionally, the determining a second edge weight value between a node corresponding to the first word segmentation and a node corresponding to the second word segmentation includes:
if the first participle and the second participle are both low-frequency words, respectively determining a plurality of first target high-frequency words with the highest similarity to the first participle and a plurality of second target high-frequency words with the highest similarity to the second participle in the high-frequency words of the second text;
and determining a second edge weight value between a node corresponding to the first participle and a node corresponding to the second participle according to the co-occurrence frequency of the first participle and the second participle in a preset time length, the average co-occurrence frequency of the second participle and the plurality of first target high-frequency words in the preset time length, and the average co-occurrence frequency of the first participle and the plurality of second target high-frequency words in the preset time length.
Optionally, the determining, according to the number of co-occurrences of the first participle and the second participle in a preset time length, the average number of co-occurrences of the second participle and the plurality of first target high-frequency words in the preset time length, and the average number of co-occurrences of the first participle and the plurality of second target high-frequency words in the preset time length, a second edge weight between a node corresponding to the first participle and a node corresponding to the second participle includes:
determining a second edge weight value between the node corresponding to the first participle and the node corresponding to the second participle according to the following formula:
w12-init = a · fr(w1, w2) + (b/n1) · Σ_{wo1 ∈ max(sim, n1)} fr(w2, wo1) + (c/n2) · Σ_{wo2 ∈ max(sim, n2)} fr(w1, wo2)
wherein w12-init represents the second edge weight value, fr(w1, w2) represents the number of co-occurrences of the first participle w1 and the second participle w2 within a preset duration, fr(w2, wo1) represents the number of co-occurrences of the second participle w2 and a first target high-frequency word wo1 within the preset duration, fr(w1, wo2) represents the number of co-occurrences of the first participle w1 and a second target high-frequency word wo2 within the preset duration, n1 represents the number of first target high-frequency words wo1, n2 represents the number of second target high-frequency words wo2, a, b and c are constants with a + b + c = 1, max(sim, n1) represents the word set formed by the n1 first target high-frequency words with the highest similarity to the first participle w1, and max(sim, n2) represents the word set formed by the n2 second target high-frequency words with the highest similarity to the second participle w2.
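For the case where both participles are low-frequency words, the same idea extends to both sides of the pair. The sketch below is illustrative; the constants a, b, c and the toy counts are assumptions, not values from the patent.

```python
# Sketch: both participles are low-frequency words, so the edge weight mixes
# the direct co-occurrence count with the average co-occurrences of each
# participle against the other participle's most similar high-frequency words.

def second_edge_weight_low_low(fr, w1, w2, sim1, sim2, a=0.4, b=0.3, c=0.3):
    assert abs(a + b + c - 1.0) < 1e-9  # the patent requires a + b + c = 1
    avg1 = sum(fr.get((w2, wo1), 0) for wo1 in sim1) / len(sim1)  # w2 vs sim(w1)
    avg2 = sum(fr.get((w1, wo2), 0) for wo2 in sim2) / len(sim2)  # w1 vs sim(w2)
    return a * fr.get((w1, w2), 0) + b * avg1 + c * avg2

fr = {("motorboat", "karaoke"): 0, ("karaoke", "noise"): 3, ("motorboat", "bar"): 2}
w = second_edge_weight_low_low(fr, "motorboat", "karaoke", ["noise"], ["bar"])
print(w)  # 0.4*0 + 0.3*3 + 0.3*2 = 1.5
```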
Optionally, the determining a second edge weight value between a node corresponding to the first word segmentation and a node corresponding to the second word segmentation includes:
and if the first participle and the second participle are both high-frequency words, taking the co-occurrence frequency of the first participle and the second participle in a preset time length as a second edge weight value between a node corresponding to the first participle and a node corresponding to the second participle in the second text.
Optionally, the determining, according to the first edge weight and the second edge weight, a weight value of a node corresponding to the first word segmentation in the target word graph includes:
determining the weight value of the node corresponding to the first word in the target word graph according to the following formula:
R(w1) = (1 − λ) + λ · Σ_{w2 ∈ in(w1)} [ (w12 · w12-init) / Σ_{w3 ∈ out(w2)} (w23 · w23-init) ] · R(w2)
wherein R(w1) represents the weight value of the node corresponding to the first participle w1 in the target word graph, R(w2) represents the weight value of the node corresponding to the second participle w2 in the target word graph, in(w1) represents the set of nodes of second participles w2 in the target word graph, w12 represents the first edge weight value between the node of the first participle w1 and the node of the second participle w2, w12-init represents the second edge weight value between those nodes in the preset word graph, out(w2) represents the set of nodes of participles w3 within the preset range of the node of the second participle w2 in the target word graph, w23 represents the third edge weight value between the node of the second participle w2 and the node of the participle w3 in the target word graph, w23-init represents the fourth edge weight value between those nodes in the preset word graph, and λ represents a preset constant.
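The node-weight update is a TextRank-style fixed-point iteration. The sketch below is a hedged approximation: how the target-graph edge weight and the preset-graph edge weight are combined (here, a product) is an assumption, since the text only lists both weights in the formula, and the damping constant and iteration count are illustrative.

```python
# Hedged sketch of the iterative node-weight update over the word graph.
# edge_w maps (u, v) -> edge weight in the target word graph;
# edge_w_init maps (u, v) -> edge weight in the preset word graph.

def rank_nodes(nodes, edge_w, edge_w_init, lam=0.85, iters=50):
    R = {n: 1.0 for n in nodes}
    # Combine target-graph and preset-graph edge weights (product is an
    # assumption; missing preset edges default to 1.0).
    combined = {e: w * edge_w_init.get(e, 1.0) for e, w in edge_w.items()}
    for _ in range(iters):
        new = {}
        for w1 in nodes:
            s = 0.0
            for w2 in nodes:
                if (w2, w1) not in combined:
                    continue  # w2 does not link to w1
                out_sum = sum(v for (u, _), v in combined.items() if u == w2)
                s += combined[(w2, w1)] / out_sum * R[w2]
            new[w1] = (1 - lam) + lam * s
        R = new
    return R

nodes = ["a", "b", "c"]
edge_w = {("a", "b"): 1.0, ("c", "b"): 1.0, ("b", "a"): 1.0}
R = rank_nodes(nodes, edge_w, {})
# "b" receives links from both "a" and "c", so it should rank highest.
```

Scaling each target-graph edge by its preset-graph counterpart is what lets the prior knowledge from the larger second text spread the node weights apart.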
Optionally, the method further comprises:
determining the preset threshold value according to the following formula:
wherein T represents the preset threshold, L represents the number of participles with an occurrence frequency of 1 in the second text, and N represents the number of different participles in the second text.
In a second aspect, the present disclosure also provides a keyword extraction apparatus, the apparatus including:
the text acquisition module is used for acquiring a first text to be subjected to keyword extraction;
the text word segmentation module is used for segmenting the first text to obtain a plurality of segmented words;
the weight determining module is used for inputting the plurality of participles into the word graph model to obtain a weight value corresponding to each participle;
the word segmentation extraction module is used for extracting keywords of the first text according to the weight value corresponding to each word segmentation;
wherein the weight determination module comprises:
the acquisition submodule is used for acquiring a target word graph, and the target word graph is established based on word segmentation in the first text;
the first determining submodule is used for determining a first edge weight value between a node corresponding to the first word segmentation and a node corresponding to the second word segmentation in the target word graph;
a second determining submodule, configured to determine, in a preset word graph, a second edge weight value between a node corresponding to the first word segmentation and a node corresponding to the second word segmentation, where the preset word graph is established based on the participles in a second text, and the vocabulary amount of the second text is greater than that of the first text;
and the third determining submodule is used for determining the weight value of the node corresponding to the first word segmentation in the target word graph according to the first edge weight value and the second edge weight value.
In a third aspect, the present disclosure also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the method of any one of the first aspect.
In a fourth aspect, the present disclosure also provides an electronic device, including:
a memory having a computer program stored thereon;
a processor for executing the computer program in the memory to implement the steps of the method of any one of the first aspect.
Through the technical scheme, the process for determining the weight value of each participle in the first text to be subjected to keyword extraction can be as follows: determining a first edge weight value between a node corresponding to a first word segmentation and a node corresponding to a second word segmentation in a target word graph, determining a second edge weight value between the node corresponding to the first word segmentation and the node corresponding to the second word segmentation in a preset word graph, and determining a weight value of the node corresponding to the first word segmentation in the target word graph according to the first edge weight value and the second edge weight value. The preset word graph is established based on the participles in the second text, and the vocabulary volume of the second text is larger than that of the first text, so that according to the method disclosed by the invention, the weight value difference between the participle nodes in the target word graph can be increased through the edge weight values between the participle nodes in the preset word graph, so that the weight value discrimination degree between the participle nodes in the target word graph is improved, the condition that all the participles are taken as keywords due to the fact that the participle weight values are more average is avoided, and the accuracy of keyword extraction can be improved.
Additional features and advantages of the disclosure will be set forth in the detailed description which follows.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure and are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description serve to explain the disclosure without limiting the disclosure. In the drawings:
FIG. 1 is a flow diagram illustrating a keyword extraction method according to an exemplary embodiment of the present disclosure;
FIG. 2 is a flow diagram illustrating a keyword extraction method according to another exemplary embodiment of the present disclosure;
fig. 3 is a block diagram illustrating a keyword extraction apparatus according to an exemplary embodiment of the present disclosure;
fig. 4 is a block diagram illustrating an electronic device according to an exemplary embodiment of the present disclosure.
Detailed Description
The following detailed description of specific embodiments of the present disclosure is provided in connection with the accompanying drawings. It should be understood that the detailed description and specific examples, while indicating the present disclosure, are given by way of illustration and explanation only, not limitation.
In the related technology, most of the keyword extraction uses word graphs constructed by adjacent relations between words, then edge weights between nodes corresponding to the words and nodes corresponding to other words are determined in the word graphs according to the number of co-occurrences of a certain word and other words within a certain time, then the weight values of the nodes are determined according to the edge weights, and finally keyword extraction is performed according to the weight values.
However, in a short text, because the number of words is small, the number of co-occurrences between words is not very different, so that edge weights between word nodes in a word graph are relatively close, the weighted value distinction degree of each word node is low, and even there may be a case where the weighted values are substantially the same, which further affects keyword extraction, for example, all the participles in the text may be used as keywords, and the like.
For example, in a scenario of keyword extraction on complaint data, different users describe complaint contents in different ways, so some complaint contents are particularly long and some are particularly short. If the keyword extraction manner in the related art is adopted, all the participles in a shorter complaint may be taken as keywords. For example, consider the complaint "Somebody plays motorboats on the lake every day, and the noise pollution is serious. The fish in the lake have all died." This complaint content is short, so according to the keyword extraction manner in the related art, the weight value of each participle is basically the same and almost all the words become keywords, with the result that the complaint content cannot be classified and filed according to the extracted keywords.
In view of this, embodiments of the present disclosure provide a keyword extraction method, apparatus, storage medium, and electronic device to increase a weight value difference between word nodes in a word graph and improve a weight value differentiation between the word nodes.
Fig. 1 is a flowchart illustrating a keyword extraction method according to an exemplary embodiment of the present disclosure. Referring to fig. 1, the method may include:
step S101, a first text to be subjected to keyword extraction is obtained.
In a possible mode, voice information input by a user can be received, then the voice information is recognized, a target text corresponding to the voice information is obtained, and finally the target text is used as a first text to be subjected to keyword extraction.
For example, in a scenario where the keyword is extracted from the complaint data, for example, when the user makes a complaint call to complaint, the voice complaint information input by the user may be obtained, and then the voice complaint information may be identified to obtain a complaint text corresponding to the voice information. In this case, the complaint text is the first text to be subjected to keyword extraction.
In other possible modes, the first text to be subjected to keyword extraction can be acquired in a text box of the client in response to input completion information triggered by a user. Similarly, a scenario of extracting keywords from the complaint data is exemplified, the user fills out the complaint content through a complaint website, and after the complaint content is filled out, the user clicks a submit button, that is, the user triggers the input completion information. In this case, the client may, in response to the input completion information triggered by the user, obtain the first text to be subjected to keyword extraction in the text box, that is, obtain the complaint content input by the user, so as to perform keyword extraction on the complaint content, thereby facilitating operations such as classifying and filing the complaint content.
Step S102, performing word segmentation on the first text to obtain a plurality of word segments.
After the first text is obtained, word segmentation processing may be performed on the first text. For example, in the above example, the client acquires the first text to be subjected to keyword extraction from the text box: "Hello. At the front entrance of the Yi Jia Yi shop in XX Town, XX City, there are two music bars on the left and right. They operate every night until 2 or 3 o'clock in the morning, which seriously affects the rest of residents in the surrounding communities. In particular, karaoke and some intoxicated people (upset or cursing) keep singing past 11 p.m. I have called 110 many times to report it, and have also made environmental complaints, citizen-hotline calls, and so on, but only obtain a moment of peace each time. I hope the relevant departments will enforce the law so that we can have a peaceful living environment. Thank you." After the word segmentation process, a plurality of participles can be obtained: hello / now / at / XX / city / XX / town / Yi / Jia / Yi / shop / entrance / left / right / there are / two / music / bar / every day / evening / operate / until / morning / 2 / to / 3 o'clock / …
It should be understood that, for the accuracy of the keyword extraction result, after the multiple segmented words are obtained, meaningless participles may be removed from them; for example, nonsense auxiliary words or adverbs such as "what" or "is" may be removed, and so on.
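The segment-then-filter step can be sketched as below. A real system would use a Chinese word segmenter (e.g. jieba); a whitespace split stands in here so the pipeline shape is visible, and the stopword list is purely illustrative.

```python
# Minimal sketch of word segmentation followed by removal of meaningless
# participles. The tokenizer and stopword list are stand-in assumptions.

STOPWORDS = {"the", "is", "what", "a", "of"}  # nonsense auxiliary words, etc.

def segment_and_filter(text):
    tokens = text.lower().split()  # stand-in for a real Chinese segmenter
    return [t for t in tokens if t not in STOPWORDS]

print(segment_and_filter("What is the noise pollution of the lake"))
# ['noise', 'pollution', 'lake']
```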
Step S103, inputting a plurality of participles into the word graph model to obtain a weight value corresponding to each participle.
The word graph model is used for determining the weight value of each participle in the following mode:
firstly, a target word graph is obtained, and the target word graph is established based on word segmentation in a first text.
It should be understood that the word graph may include a plurality of nodes, each node representing a participle, and each node may have a weight value that represents the importance of the participle in the text. The larger the weight value, the more important the participle is in the text, and the more likely it is to be a keyword; conversely, the smaller the weight value, the less important the participle is in the text, and the less likely it is to be a keyword.
In addition, the word graph may also include edges between any two nodes, and each edge may also have a corresponding edge weight. When determining the weight value of a node, the determination may be performed according to the edge weight values of the node and other nodes. That is, the edge weights between nodes may affect the weight values of the nodes, thereby affecting the final keyword extraction result.
Then, in the target word graph, a first edge weight value between a node corresponding to the first participle and a node corresponding to the second participle is determined.
For example, in the first text, the number of co-occurrences of the first participle and the second participle within a certain time length may be determined, and then the number of co-occurrences may be used as a first edge weight value between a node corresponding to the first participle and a node corresponding to the second participle in the target word graph.
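The first edge weight described above (a co-occurrence count over the segmented text) can be sketched as follows. The sliding window stands in for the "certain time length" in the text, and the window size is an illustrative choice.

```python
# Sketch of first-edge-weight computation: count co-occurrences of
# participle pairs within a sliding window over the segmented text.

from collections import Counter

def cooccurrence_counts(tokens, window=3):
    counts = Counter()
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + window, len(tokens))):
            pair = tuple(sorted((tokens[i], tokens[j])))
            if pair[0] != pair[1]:  # skip a word co-occurring with itself
                counts[pair] += 1
    return counts

tokens = ["noise", "pollution", "lake", "noise", "motorboat"]
counts = cooccurrence_counts(tokens)
print(counts[("noise", "pollution")])  # 2
```

Each count can then serve directly as the first edge weight between the two corresponding nodes of the target word graph.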
And then, in the preset word graph, determining a second edge weight value between a node corresponding to the first word segmentation and a node corresponding to the second word segmentation.
And the preset word graph is established based on the participles in the second text. Similarly, word segmentation processing may be performed on the second text, and then each segmented word is used as an individual node to construct a word graph, where the constructed word graph is the preset word graph. In addition, the vocabulary amount of the second text is larger than that of the first text, for example, the second text may include a plurality of pre-collected texts, so that the second text may include all the participles in the first text. Therefore, after the preset word graph is obtained, the node corresponding to the first participle and the node corresponding to the second participle can be found in the preset word graph, and then the second edge weight value between the node corresponding to the first participle and the node corresponding to the second participle is determined.
And finally, determining the weight value of the node corresponding to the first word segmentation in the target word graph according to the first edge weight value and the second edge weight value.
And step S104, extracting keywords of the first text according to the weight value corresponding to each participle.
That is, after the weight value corresponding to each participle is determined through steps S101 to S103, keyword extraction may be performed on the first text according to those weight values. For example, the participles may be sorted by weight value in descending order, and a preset number of participles with the largest weight values may be selected as the keywords of the first text, and so on.
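Step S104 reduces to a simple top-k selection over the weight values; the sketch below uses illustrative weights.

```python
# Sketch of step S104: sort participles by weight value in descending order
# and take a preset number of the largest as keywords.

def extract_keywords(weights, top_k=2):
    return [w for w, _ in sorted(weights.items(), key=lambda kv: -kv[1])[:top_k]]

weights = {"noise": 1.8, "lake": 1.1, "pollution": 1.5, "every": 0.3}
print(extract_keywords(weights))  # ['noise', 'pollution']
```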
In a possible manner, after the first text is subjected to keyword extraction, operations such as classifying and filing the first text according to the extracted keywords may be performed. For example, in the above example, after the user fills out the complaint content in the complaint website, the client may acquire the first text, and then perform keyword extraction on the first text by executing step S103 and step S104. Then, the client may send the first text and the keywords extracted from the first text to the server, and the server may classify and archive the first text according to the keywords, or the server may analyze the complaint content in the first text according to the keywords, so as to send the complaint content to a corresponding processing person for processing, and so on.
Through the above manner, when the weight values of the participle nodes are determined, prior knowledge can be introduced, namely the second edge weight value between the node corresponding to the first participle and the node corresponding to the second participle in the preset word graph. This increases the edge weight value difference between participle nodes, so that the weight values of the participle nodes are better distinguished, the situation in which all participles are taken as keywords because their weight values are too uniform is avoided, and operations such as classifying and filing the text according to the keywords are facilitated.
The following describes a process of inputting a plurality of segmented words into a word graph model to obtain a weight value corresponding to each segmented word.
In a possible manner, in the second text, the first participle and the second participle may both be low-frequency words, that is, the first participle and the second participle are both words whose occurrence frequency is lower than a preset threshold. Alternatively, the first participle and the second participle may both be high-frequency words, that is, the first participle and the second participle are both words whose occurrence frequency is higher than or equal to a preset threshold. Or, one of the first participle and the second participle is a low-frequency word, and the other is a high-frequency word. For the case of low-frequency words, in the embodiment of the present disclosure, the edge weight value may be calculated by determining high-frequency words close to the low-frequency words, so as to further increase the edge weight value difference between word segmentation nodes.
That is to say, the embodiment of the present disclosure may further determine the second edge weight value in different manners according to a condition that the first participle and the second participle are low-frequency words or high-frequency words, so as to further increase the edge weight value difference between the participle nodes. The following explains possible cases.
In a first case, if the first word segmentation is a low-frequency word and the second word segmentation is a high-frequency word in the second text, a plurality of first target high-frequency words with the highest similarity to the first word segmentation can be determined in the high-frequency words of the second text, and then a second edge weight value between a node corresponding to the first word segmentation and a node corresponding to the second word segmentation is determined according to the following formula:
wherein w12-init represents the second edge weight value, fr(w1, w2) represents the number of co-occurrences of the first participle w1 and the second participle w2 within a preset duration, fr(w2, wo1) represents the number of co-occurrences of the second participle w2 and a first target high-frequency word wo1 within the preset duration, n represents the number of first target high-frequency words wo1, a and b are constants satisfying a + b = 1, and max(sim, n) represents the word set composed of the n first target high-frequency words with the highest similarity to the first participle w1.
That is, if the first participle is a low-frequency word and the second participle is a high-frequency word, a plurality of first target high-frequency words with the highest similarity to the first participle can be determined among the high-frequency words of the second text, and then the second edge weight value between the node corresponding to the first participle and the node corresponding to the second participle in the preset word graph is determined according to the co-occurrence times of the first participle and the second participle in the preset time length and the average co-occurrence times of the second participle and the plurality of first target high-frequency words in the preset time length.
The low-frequency words are words with the occurrence frequency lower than a preset threshold, and the high-frequency words are words with the occurrence frequency higher than or equal to the preset threshold. In a possible manner, the preset threshold value may be determined according to the following formula:
(Formula (2), which gives the preset threshold as a function of L and N, is rendered as an image in the original and is not reproduced here.)
wherein T represents the preset threshold, L represents the number of participles that occur exactly once in the second text, and N represents the number of distinct participles in the second text.
Illustratively, the number of participles N may be determined by counting each participle in the second text: starting from the first participle of the second text, the count is increased by one whenever a participle not previously counted occurs, and is left unchanged when a participle identical to a previously counted one occurs; the final count is then taken as N. It should be understood that N represents the number of distinct participles in the second text, not the total number of participles, so N may be less than or equal to the total number of participles in the second text.
For example, if the preset threshold determined according to formula (2) is 50, then a participle whose frequency of occurrence in the second text is lower than 50 may be determined to be a low-frequency word, and a participle whose frequency of occurrence in the second text is higher than or equal to 50 may be determined to be a high-frequency word, so that all participles in the second text can be divided into low-frequency words and high-frequency words. A plurality of first target high-frequency words with the highest similarity to the first participle may then be determined in the case where the first participle is a low-frequency word and the second participle is a high-frequency word.
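The frequency split described above can be sketched minimally as follows (a non-authoritative sketch: the threshold T is passed in as a given value, since formula (2) itself is not reproduced in this text, and the function name is illustrative):

```python
from collections import Counter

def split_by_frequency(tokens, threshold):
    """Split the distinct participles of a tokenised text into low-frequency
    words (count below the threshold) and high-frequency words (count at or
    above it), also returning N, the number of distinct participles."""
    counts = Counter(tokens)
    n_distinct = len(counts)  # N: number of distinct participles
    low = {w for w, c in counts.items() if c < threshold}
    high = set(counts) - low
    return low, high, n_distinct
```
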
For example, the process of determining the plurality of first target high-frequency words with the highest similarity to the first participle may be as follows: the second text is first subjected to word segmentation to obtain a plurality of participles; participles without practical meaning, such as "what" and "is", are then removed to obtain a plurality of target participles. A Word2vec model may then be trained on these target participles to obtain the word vector corresponding to each target participle.
It should be understood that a word vector characterizes the semantic features of a participle: the more similarly two participles are used, the closer their semantics and the more similar their word vectors, so similarity can be calculated on the word vectors to determine the plurality of first target high-frequency words with the highest similarity to the first participle. Specifically, the similarity between the first participle and each high-frequency word in the second text can be calculated according to the following formula:

S(vi, vj) = (vi · vj) / (|vi| · |vj|)  (3)
wherein S(vi, vj) represents the similarity, vi represents the word vector of a participle wi, and vj represents the word vector of a participle wj. Specifically, when calculating the similarity between the first participle and all high-frequency words in the second text, vi may represent the word vector of the first participle and vj the word vector of any high-frequency word in the second text.
After the similarity between the first participle and all high-frequency words in the second text is obtained, the several high-frequency words with the highest similarity can be selected as the first target high-frequency words, and the second edge weight value between the node corresponding to the first participle and the node corresponding to the second participle can then be determined through formula (1).
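The first case can be sketched end to end as follows, assuming the cosine form of formula (3) and the averaged co-occurrence form described for formula (1); the vector store, the co-occurrence dictionary `cooc`, and the constant values a = b = 0.5 are illustrative assumptions:

```python
import math

def cosine(u, v):
    # Similarity between two word vectors (the assumed cosine form of formula (3)).
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def top_similar(word, high_freq, vectors, n):
    # max(sim, n): the n high-frequency words most similar to `word`.
    ranked = sorted(high_freq, key=lambda h: cosine(vectors[word], vectors[h]),
                    reverse=True)
    return ranked[:n]

def edge_weight_low_high(w1, w2, high_freq, vectors, cooc, n=2, a=0.5, b=0.5):
    """Second edge weight when w1 is a low-frequency word and w2 a
    high-frequency word: the direct (w1, w2) co-occurrence count plus the
    average co-occurrence of w2 with w1's n nearest high-frequency words."""
    targets = top_similar(w1, high_freq, vectors, n)
    avg = sum(cooc.get((w2, h), 0) for h in targets) / len(targets)
    return a * cooc.get((w1, w2), 0) + b * avg
```
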
In a second case, if the first participle and the second participle are both low-frequency words, in the high-frequency words of the second text, a plurality of first target high-frequency words with the highest similarity to the first participle and a plurality of second target high-frequency words with the highest similarity to the second participle are respectively determined, and then a second edge weight value between a node corresponding to the first participle and a node corresponding to the second participle is determined according to the following formula:
w12-init = a · fr(w1, w2) + b · (1/n1) · Σwo1∈max(sim,n1) fr(w2, wo1) + c · (1/n2) · Σwo2∈max(sim,n2) fr(w1, wo2)  (4)
wherein w12-init represents the second edge weight value, fr(w1, w2) represents the number of co-occurrences of the first participle w1 and the second participle w2 within a preset duration, fr(w2, wo1) represents the number of co-occurrences of the second participle w2 and a first target high-frequency word wo1 within the preset duration, fr(w1, wo2) represents the number of co-occurrences of the first participle w1 and a second target high-frequency word wo2 within the preset duration, n1 represents the number of first target high-frequency words wo1, n2 represents the number of second target high-frequency words wo2, max(sim, n1) represents the word set composed of the n1 first target high-frequency words with the highest similarity to the first participle w1, max(sim, n2) represents the word set composed of the n2 second target high-frequency words with the highest similarity to the second participle w2, and a, b and c are constants satisfying a + b + c = 1.
for example, the similarity between the first participle and all the high-frequency words in the second text can be determined according to formula (3), and then a plurality of high-frequency words with the highest similarity are selected as the first target high-frequency word. Similarly, the similarity between the second participle and all the high-frequency words in the second text can be determined according to formula (3), and then a plurality of high-frequency words with the highest similarity are selected as the second target high-frequency words.
When determining the first target high-frequency words, vi in formula (3) may represent the word vector of the first participle and vj the word vector of any high-frequency word in the second text. When determining the second target high-frequency words, vi in formula (3) may represent the word vector of the second participle and vj the word vector of any high-frequency word in the second text.
After the first target high-frequency word and the second target high-frequency word are determined, a second edge weight value between a node corresponding to the first participle and a node corresponding to the second participle in the preset word graph can be determined according to formula (4). That is to say, the second edge weight value between the node corresponding to the first participle and the node corresponding to the second participle in the preset word graph may be determined according to the number of times of co-occurrence of the first participle and the second participle within the preset time length, the average number of times of co-occurrence of the second participle and the plurality of first target high-frequency words within the preset time length, and the average number of times of co-occurrence of the first participle and the plurality of second target high-frequency words within the preset time length.
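With the two target-word lists precomputed (for example, by the similarity ranking described for formula (3)), the second case reduces to the three-term average described above. In this sketch the constants a = b = c = 1/3, which satisfy a + b + c = 1, are an illustrative choice:

```python
def edge_weight_low_low(w1, w2, targets1, targets2, cooc):
    """Second edge weight when both participles are low-frequency words:
    combines the direct (w1, w2) co-occurrence count, the average
    co-occurrence of w2 with w1's nearest high-frequency words (targets1),
    and the average co-occurrence of w1 with w2's nearest high-frequency
    words (targets2), with equal constants a = b = c = 1/3."""
    direct = cooc.get((w1, w2), 0)
    avg1 = sum(cooc.get((w2, h), 0) for h in targets1) / len(targets1)
    avg2 = sum(cooc.get((w1, h), 0) for h in targets2) / len(targets2)
    return (direct + avg1 + avg2) / 3.0
```
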
In this manner, if a low-frequency word exists among the first participle and the second participle, several high-frequency words with the highest similarity to that low-frequency word can be found in the second text, and the second edge weight value between the participles in the preset word graph is then calculated through these high-frequency words. This avoids the small second edge weight values that would result from calculating edge weights directly on low-frequency words, further increases the edge weight differences between participle nodes, and thus makes the weight values of the participle nodes better distinguished.
In a third case, if the first participle and the second participle are both high-frequency words, the number of co-occurrences of the first participle and the second participle in the second text within a preset duration may be used as the second edge weight value between the node corresponding to the first participle and the node corresponding to the second participle.
That is to say, in the embodiment of the present disclosure, if the first participle and the second participle are both high-frequency words, the number of times of co-occurrence of the first participle and the second participle within a preset time period may be directly used as the second edge weight value to perform subsequent weight value calculation, so as to improve the calculation efficiency.
After the first edge weight value and the second edge weight value are obtained, the weight value of the node corresponding to the first participle can be determined according to the first edge weight value and the second edge weight value.
In a possible manner, the weight value of the node corresponding to the first participle may be determined from the first edge weight value and the second edge weight value according to the following formula:

R(w1) = (1 − λ) + λ · Σw2∈in(w1) w12 · w12-init · R(w2)  (5)
wherein R(w1) represents the weight value of the node corresponding to the first participle w1 in the target word graph, R(w2) represents the weight value of the node corresponding to the second participle w2 in the target word graph, in(w1) represents the set of nodes corresponding to the second participles w2 in the target word graph, w12 represents the first edge weight value between the node of the first participle w1 and the node of the second participle w2, w12-init represents the second edge weight value between the node of the first participle w1 and the node of the second participle w2, and λ represents a preset constant.
Further, in order to increase the weight value differences between participle nodes, after the weight value of the node corresponding to the first participle is determined, the edge weight values between the node corresponding to the second participle and its surrounding nodes may also be determined. For example, if the node corresponding to the second participle is a node within a first preset range of the node corresponding to the first participle in the target word graph, a third edge weight value between the node corresponding to the second participle and the node corresponding to a third participle may further be determined in the target word graph, the node corresponding to the third participle being a node within a second preset range of the node corresponding to the second participle. A fourth edge weight value between the node corresponding to the second participle and the node corresponding to the third participle is then determined in the preset word graph. Finally, the weight value of the node corresponding to the first participle in the target word graph is determined according to the first, second, third and fourth edge weight values. The third edge weight value is determined in a manner similar to the first edge weight value, and the fourth edge weight value in a manner similar to the second edge weight value, which are not described again here.
That is, in a possible manner, the node corresponding to the second word segmentation may be a node in a first preset range of a node corresponding to the first word segmentation in the target word graph, and accordingly, the weight value of the node corresponding to the first word segmentation in the target word graph may be determined according to the following formula:
R(w1) = (1 − λ) + λ · Σw2∈in(w1) [ (w12 · w12-init) / Σw3∈out(w2) (w23 · w23-init) ] · R(w2)  (6)
wherein R(w1) represents the weight value of the node corresponding to the first participle w1 in the target word graph, R(w2) represents the weight value of the node corresponding to the second participle w2 in the target word graph, in(w1) represents the set of nodes corresponding to the second participles w2 in the target word graph, w12 represents the first edge weight value between the nodes of w1 and w2, w12-init represents the second edge weight value between the nodes of w1 and w2, out(w2) represents the set of nodes corresponding to the participles w3 within the second preset range of the node of w2 in the target word graph, w23 represents the third edge weight value between the nodes of w2 and w3 in the target word graph, w23-init represents the fourth edge weight value between the nodes of w2 and w3 in the preset word graph, and λ represents a preset constant.
For example, the first preset range and the second preset range may be determined according to the actual situation, which is not limited by the embodiments of the present disclosure. In a possible manner, the first preset range may include all nodes pointing to the node corresponding to the first participle in the target word graph, that is, the node corresponding to the second participle may be any node pointing to the node corresponding to the first participle. Similarly, the second preset range may include all nodes pointed to by the node corresponding to the second participle in the target word graph, that is, the node corresponding to the third participle may be any node pointed to by the node corresponding to the second participle.
Illustratively, in the initial calculation, the value of R(w2) may be determined as follows: first, the number of all participles in the first text is determined, for example N; the initial value of R(w2) may then be set to 1/N. In addition, the value of λ may be set according to the practical application, for example to 0.85, which is not limited by the present disclosure.
By the method, for any participle in the first text, the participle and the first target participle around the participle can be determined first, and the second target participle around the first target participle is determined. Then, the edge weight between the node corresponding to the participle and the node corresponding to the first target participle and the edge weight between the node corresponding to the first target participle and the node corresponding to the second target participle can be respectively calculated in the target word graph and the preset word graph.
And finally, substituting the calculated edge weight value into the formula (6) to obtain the weight value of the corresponding node of the participle in the target word graph, so as to avoid the problem of small weight value difference caused by determining the edge weight value only through the co-occurrence times, improve the weight value difference between the participle nodes in the target word graph and better distinguish the weight value between the participle nodes.
In a possible manner, after determining the weight value of the first participle in the target word graph, any other participle in the target word graph may also be used as the first participle, and the keyword extraction method in the embodiment of the present disclosure is executed again. That is to say, in the embodiment of the present disclosure, the weight value of each participle node in the target word graph may be iteratively calculated.
Further, the iteration process can be stopped when the weighted values of all the participle nodes in the target word graph are in a convergence state or the times of executing the keyword extraction method reach preset times. The preset times can be set according to actual conditions, and the embodiment of the disclosure does not limit the preset times. After the iteration is stopped, determining keywords in the text according to the weight values of the participle nodes in the target word graph. For example, a plurality of segmented words with the largest weight value may be determined as keywords, and the like, which is not limited in this disclosure.
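Assuming the normalised form reconstructed for formula (6), the iterative weight computation with both stopping criteria above (convergence, or a maximum number of iterations) can be sketched as follows; the graph encoding and all names are illustrative:

```python
def rank_nodes(nodes, in_edges, out_sum, damping=0.85, max_iter=100, tol=1e-6):
    """Iteratively compute node weights in the spirit of formula (6):
    each incoming edge contributes its combined weight w12 * w12_init,
    normalised by the source node's total outgoing combined weight.
    `in_edges[w]` is a list of (source, combined_weight) pairs and
    `out_sum[v]` the sum of combined weights on v's outgoing edges."""
    r = {w: 1.0 / len(nodes) for w in nodes}  # initial R = 1/N
    for _ in range(max_iter):
        new_r = {}
        for w in nodes:
            s = sum(cw / out_sum[v] * r[v] for v, cw in in_edges.get(w, []))
            new_r[w] = (1.0 - damping) + damping * s
        converged = max(abs(new_r[w] - r[w]) for w in nodes) < tol
        r = new_r
        if converged:  # stop when all node weights are in a convergence state
            break
    return r
```

After iteration stops, the participles whose nodes carry the largest weight values can be taken as the keywords.
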
The keyword extraction method in the present disclosure is explained below by another exemplary embodiment. Referring to fig. 2, the method may include:
step S201, responding to input completion information triggered by a user, and acquiring a first text to be subjected to keyword extraction in a text box of a client;
step S202, performing word segmentation processing on the first text, taking each word segmentation as an individual node, and establishing a target word graph.
Step S203, determining a first edge weight value between a node corresponding to the first participle and a node corresponding to the second participle in the target word graph.
Step S204, determining whether the first participle and the second participle in the second text are both low-frequency words, if so, entering step S205 and step S206, otherwise, entering step S207.
In step S205, a first target high-frequency word and a second target high-frequency word are determined in the high-frequency words of the second text. The first target high-frequency words are multiple high-frequency words with the highest similarity to the first word segmentation in the second text, and the second target high-frequency words are multiple high-frequency words with the highest similarity to the second word segmentation in the second text.
Step S206, determining a second edge weight value according to the first participle, the second participle, the first target high-frequency word and the second target high-frequency word. And the second edge weight value is an edge weight value between a node corresponding to the first participle and a node corresponding to the second participle in the preset word graph. It should be understood that, the process of determining the second edge weight value according to the first participle, the second participle, the first target high-frequency word and the second target high-frequency word is already described in the foregoing, and is not described herein again.
Step S207, determining whether a low-frequency word exists in the first participle and the second participle in the second text, if so, entering step S208 and step S209, otherwise, entering step S210.
In step S208, a target high-frequency word with the highest similarity to the low-frequency word is determined among the high-frequency words in the second text.
Step S209, a second edge weight value is determined according to the first participle, the second participle and the target high-frequency word. It should be understood that, the process of determining the second edge weight value according to the first participle, the second participle and the target high-frequency word has been described in the foregoing, and is not described in detail herein.
Step S210, in the second text, the number of co-occurrences of the first participle and the second participle within a preset time duration is used as a second edge weight.
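Steps S204 to S210 amount to a three-way dispatch on the frequency classes of the two participles, which can be sketched as follows (names and the equal-weight constants are illustrative assumptions; `nearest[w]` denotes the precomputed target high-frequency words for a low-frequency word w):

```python
def second_edge_weight(w1, w2, low_freq, nearest, cooc):
    """Dispatch over the three cases of steps S204-S210: both participles
    low-frequency (S205-S206), exactly one low-frequency (S208-S209),
    or both high-frequency (S210)."""
    direct = cooc.get((w1, w2), 0)
    if w1 in low_freq and w2 in low_freq:
        avg1 = sum(cooc.get((w2, h), 0) for h in nearest[w1]) / len(nearest[w1])
        avg2 = sum(cooc.get((w1, h), 0) for h in nearest[w2]) / len(nearest[w2])
        return (direct + avg1 + avg2) / 3.0  # a = b = c = 1/3
    if w1 in low_freq:
        avg = sum(cooc.get((w2, h), 0) for h in nearest[w1]) / len(nearest[w1])
        return 0.5 * direct + 0.5 * avg      # a = b = 1/2
    if w2 in low_freq:
        avg = sum(cooc.get((w1, h), 0) for h in nearest[w2]) / len(nearest[w2])
        return 0.5 * direct + 0.5 * avg
    return direct                            # step S210: co-occurrence count
```
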
Step S211, determining the weight value of the node corresponding to the first word in the target word graph according to the first edge weight value and the second edge weight value. It should be understood that, the process of determining the weight value of the node corresponding to the first word in the target word graph according to the first edge weight value and the second edge weight value has been described in the foregoing, and is not described herein again.
Step S212, determining whether each participle node in the target word graph is in a convergence state or whether the execution frequency reaches a preset frequency, if so, entering step S214, otherwise, entering step S213.
Step S213, taking any other participle in the target word graph as the first participle, proceeding to step S203, and executing the process of determining the weight value of the participle again.
Step S214, extracting keywords from the first text according to the weight value corresponding to each participle.
Step S215, classifying and archiving the first text according to the extracted keywords.
The detailed description of the above steps is given above for illustrative purposes, and will not be repeated here. It will also be appreciated that for simplicity of explanation, the above-described method embodiments are all presented as a series of acts or combination of acts, but those skilled in the art will recognize that the present disclosure is not limited by the order of acts or combination of acts described above. Further, those skilled in the art will also appreciate that the embodiments described above are preferred embodiments and that the steps involved are not necessarily required for the present disclosure.
By the method, the problem that the weight value difference is small due to the fact that the edge weight value is determined only through the co-occurrence times can be solved, the weight value difference between all participle nodes in the target word graph is improved, and the weight values between all participle nodes can be better distinguished.
For example, in a scenario where keywords are extracted from complaint data, a user makes a complaint call, so the voice complaint information input by the user can be acquired and then recognized to obtain the corresponding complaint text. For example, the complaint text is: "It is reported that XX company secretly dumps waste printing ink in the XX industrial park in Guangming Fucun, causing groundwater pollution, and illegally stores ink residues on the roof, causing air pollution; the company has not obtained environmental-protection qualification, and the relevant environmental-protection departments are expected to investigate the case."
After the complaint text is obtained, word segmentation can be performed on it to obtain a plurality of participles, participles without practical meaning such as "what" and "is" are removed, the remaining participles are then input into the word graph model to obtain the weight value of each participle in the complaint text, and keywords are extracted from the complaint text according to these weight values. For comparison, keyword extraction was also performed on the complaint text by the TextRank method in the related art. The resulting keywords and the weight value of each keyword are shown in Table 1:
TABLE 1
(Table 1, which compares the keywords and weight values produced by the related-art TextRank method and by the disclosed method, is rendered as an image in the original and is not reproduced here.)
Referring to Table 1, after the text is processed by the TextRank method in the related art, the two words "ink" and "building" are not distinguished, common words such as "report" and "causing" are selected, and the weights are relatively even, so it is difficult to tell which word is more important. After processing by the keyword extraction method of the embodiment of the present disclosure, words without actual meaning are eliminated, words such as "Guangming" and "Fucun" are found, the key information of the event can be extracted in the form of words, and the weights of the more important front-ranked words are increased, so keyword extraction is relatively more accurate, and classification and archiving according to the keywords are also more accurate.
Based on the same inventive concept, referring to fig. 3, an embodiment of the present disclosure further provides a keyword extraction apparatus 300. The apparatus 300 may be implemented, by software, hardware, or a combination of the two, as part or all of an electronic device, and may include:
the text acquisition module 301 is configured to acquire a first text to be subjected to keyword extraction;
a text word segmentation module 302, configured to perform word segmentation on the first text to obtain multiple word segments;
a weight determining module 303, configured to input the multiple participles into a word graph model to obtain a weight value corresponding to each participle;
a segmentation extracting module 304, configured to perform keyword extraction on the first text according to a weight value corresponding to each segmentation;
wherein the weight determining module 303 comprises:
the obtaining submodule 3031 is configured to obtain a target word graph, where the target word graph is established based on a word segmentation in the first text;
the first determining submodule 3032 is configured to determine, in the target word graph, a first edge weight between a node corresponding to the first participle and a node corresponding to the second participle;
a second determining submodule 3033, configured to determine, in a preset word graph, a second edge weight value between the node corresponding to the first participle and the node corresponding to the second participle, where the preset word graph is established based on the participles in a second text, and the second text contains more words than the first text;
the third determining submodule 3034 is configured to determine, according to the first edge weight and the second edge weight, a weight value of a node corresponding to the first term in the target word graph.
Optionally, the text obtaining module 301 is configured to:
receiving voice information input by a user;
recognizing the voice information to obtain a target text corresponding to the voice information;
and taking the target text as a first text to be subjected to keyword extraction.
Optionally, the text obtaining module 301 is configured to:
and responding to input completion information triggered by a user, and acquiring a first text to be subjected to keyword extraction in a text box of the client.
Optionally, the second determining submodule 3033 is configured to:
when the first participle in the second text is a low-frequency word and the second participle is a high-frequency word, determining a plurality of first target high-frequency words with highest similarity to the first participle in the high-frequency words of the second text, wherein the low-frequency word is a word with the occurrence frequency lower than a preset threshold value, and the high-frequency word is a word with the occurrence frequency higher than or equal to the preset threshold value;
and determining a second edge weight value between a node corresponding to the first participle and a node corresponding to the second participle in a preset word graph according to the co-occurrence times of the first participle and the second participle in a preset time length and the average co-occurrence times of the second participle and a plurality of first target high-frequency words in the preset time length.
Optionally, the second determining submodule 3033 is further configured to: determining a second edge weight value between the node corresponding to the first participle and the node corresponding to the second participle according to the following formula:
w12-init = a · fr(w1, w2) + b · (1/n) · Σwo1∈max(sim,n) fr(w2, wo1)  (1)
wherein w12-init represents the second edge weight value, fr(w1, w2) represents the number of co-occurrences of the first participle w1 and the second participle w2 within a preset duration, fr(w2, wo1) represents the number of co-occurrences of the second participle w2 and a first target high-frequency word wo1 within the preset duration, n represents the number of first target high-frequency words wo1, a and b are constants satisfying a + b = 1, and max(sim, n) represents the word set composed of the n first target high-frequency words with the highest similarity to the first participle w1.
Optionally, the second determining submodule 3033 is configured to:
if the first participle and the second participle are both low-frequency words, respectively determining a plurality of first target high-frequency words with the highest similarity to the first participle and a plurality of second target high-frequency words with the highest similarity to the second participle in the high-frequency words of the second text;
and determining a second edge weight value between a node corresponding to the first participle and a node corresponding to the second participle according to the co-occurrence frequency of the first participle and the second participle in a preset time length, the average co-occurrence frequency of the second participle and the plurality of first target high-frequency words in the preset time length, and the average co-occurrence frequency of the first participle and the plurality of second target high-frequency words in the preset time length.
Optionally, the second determining submodule 3033 is further configured to: determine the second edge weight value between the node corresponding to the first participle and the node corresponding to the second participle according to the following formula:

w12-init = a · fr(w1, w2) + b · (1/n1) · Σwo1∈max(sim,n1) fr(w2, wo1) + c · (1/n2) · Σwo2∈max(sim,n2) fr(w1, wo2)  (4)
wherein w12-init represents the second edge weight value, fr(w1, w2) represents the number of co-occurrences of the first participle w1 and the second participle w2 within a preset duration, fr(w2, wo1) represents the number of co-occurrences of the second participle w2 and a first target high-frequency word wo1 within the preset duration, fr(w1, wo2) represents the number of co-occurrences of the first participle w1 and a second target high-frequency word wo2 within the preset duration, n1 represents the number of first target high-frequency words wo1, n2 represents the number of second target high-frequency words wo2, a, b and c are constants satisfying a + b + c = 1, max(sim, n1) represents the word set composed of the n1 first target high-frequency words with the highest similarity to the first participle w1, and max(sim, n2) represents the word set composed of the n2 second target high-frequency words with the highest similarity to the second participle w2.
Optionally, the second determining submodule 3033 is configured to:
and if the first participle and the second participle are both high-frequency words, taking the co-occurrence frequency of the first participle and the second participle in a preset time length as a second edge weight value between a node corresponding to the first participle and a node corresponding to the second participle in the second text.
Optionally, the node corresponding to the second participle is a node in a preset range of the node corresponding to the first participle in the target word graph, and the third determining submodule 3034 is configured to:
determining the weight value of the node corresponding to the first word in the target word graph according to the following formula:
R(w1) = (1 − λ) + λ · Σ_{w2 ∈ in(w1)} [ (w12 + w12-init) / Σ_{w3 ∈ out(w2)} (w23 + w23-init) ] · R(w2)
wherein R(w1) represents the weight value of the node corresponding to the first participle w1 in the target word graph; R(w2) represents the weight value of the node corresponding to the second participle w2 in the target word graph; in(w1) represents the set of nodes corresponding to the second participles w2 within the preset range of the node corresponding to the first participle w1 in the target word graph; w12 represents the first edge weight value between the node of the first participle w1 and the node of the second participle w2; w12-init represents the second edge weight value between the node of the first participle w1 and the node of the second participle w2; out(w2) represents the set of nodes corresponding to the participles w3 within the preset range of the node of the second participle w2 in the target word graph; w23 represents the third edge weight value between the node of the second participle w2 and the node of the participle w3 in the target word graph; w23-init represents the fourth edge weight value between the node of the second participle w2 and the node of the participle w3 in the preset word graph; and λ represents a preset constant.
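Under these definitions, the node weights can be computed by a TextRank-style fixed-point iteration. The sketch below assumes the target-graph and preset-graph edge weights are combined by simple addition and that λ plays the usual damping role — both assumptions, since the patent's formula image is not reproduced in this text:

```python
from typing import Dict, Set, Tuple

def rank_nodes(
    nodes: Set[str],
    neighbors: Dict[str, Set[str]],          # nodes within the preset range of each node
    target_w: Dict[Tuple[str, str], float],  # first/third edge weights (target word graph)
    preset_w: Dict[Tuple[str, str], float],  # second/fourth edge weights (preset word graph)
    lam: float = 0.85,                       # λ, the preset constant
    iters: int = 50,
) -> Dict[str, float]:
    """Iterate R(w1) = (1-λ) + λ·Σ_{w2} (w12+w12_init)/Σ_{w3}(w23+w23_init) · R(w2)."""
    def w(u: str, v: str) -> float:
        # Combined edge weight from both graphs (assumed additive).
        key = tuple(sorted((u, v)))
        return target_w.get(key, 0.0) + preset_w.get(key, 0.0)

    rank = {n: 1.0 for n in nodes}
    for _ in range(iters):
        new = {}
        for w1 in nodes:
            s = 0.0
            for w2 in neighbors.get(w1, set()):
                denom = sum(w(w2, w3) for w3 in neighbors.get(w2, set()))
                if denom > 0:
                    s += w(w1, w2) / denom * rank[w2]
            new[w1] = (1 - lam) + lam * s
        rank = new
    return rank
```

The preset-graph weights act as a prior learned from the larger second text, so that sparse edges in the short first text still receive well-separated scores.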
Optionally, the apparatus 300 further comprises:
a threshold determination module, configured to determine the preset threshold according to the following formula:
Figure BDA0002191625640000242
wherein T represents the preset threshold, L represents the number of participles with an occurrence frequency of 1 in the second text, and N represents the number of different participles in the second text.
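The inputs L and N are straightforward to derive. The helper below computes them from a token list; the `donohue_boundary` function shows one classical high-frequency threshold built from L alone, included purely as an assumption because the patent's exact formula is an unreproduced image:

```python
import math
from collections import Counter
from typing import List, Tuple

def threshold_inputs(tokens: List[str]) -> Tuple[int, int]:
    """Return (L, N): participles occurring exactly once, and distinct participles."""
    freq = Counter(tokens)
    L = sum(1 for c in freq.values() if c == 1)
    N = len(freq)
    return L, N

def donohue_boundary(L: int) -> float:
    """Classical high/low-frequency boundary T = (-1 + sqrt(1 + 8L)) / 2.
    An assumption standing in for the patent's own (image-only) formula."""
    return (-1 + math.sqrt(1 + 8 * L)) / 2
```

Words whose frequency exceeds T would then be treated as the high-frequency words referenced throughout the edge-weight computations.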
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
Any of the keyword extraction apparatuses described above can alleviate the problem that weight values differ too little when edge weight values are determined from co-occurrence counts alone: the differences between the weight values of the participle nodes in the target word graph are enlarged, the weight values of the nodes are better distinguished, the accuracy of keyword extraction is improved, and subsequent operations such as classification and filing can be conveniently performed according to the extracted keywords.
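The overall flow of the apparatus — segment the text, build a co-occurrence word graph, rank the nodes, return the top-weighted participles — can be illustrated end to end. The whitespace tokenizer below stands in for a real Chinese segmenter, and this single-graph ranking omits the preset-word-graph correction for brevity:

```python
from collections import Counter, defaultdict
from typing import List

def extract_keywords(text: str, k: int = 3, window: int = 3, lam: float = 0.85) -> List[str]:
    """Toy pipeline: segment, build a co-occurrence word graph, rank, take top-k."""
    tokens = text.split()  # placeholder for a Chinese word segmenter

    # Build undirected co-occurrence edges within a sliding window.
    edges: Counter = Counter()
    for i, u in enumerate(tokens):
        for v in tokens[i + 1 : i + window]:
            if u != v:
                edges[tuple(sorted((u, v)))] += 1

    nbrs = defaultdict(set)
    for (u, v) in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)

    # Weighted TextRank-style iteration over the word graph.
    rank = {t: 1.0 for t in set(tokens)}
    for _ in range(30):
        rank = {
            u: (1 - lam) + lam * sum(
                edges[tuple(sorted((u, v)))]
                / sum(edges[tuple(sorted((v, x)))] for x in nbrs[v])
                * rank[v]
                for v in nbrs[u]
            )
            for u in rank
        }
    return sorted(rank, key=rank.get, reverse=True)[:k]
```

Frequently co-occurring participles reinforce each other's scores, so the highest-ranked nodes surface as keywords.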
Based on the same inventive concept, an embodiment of the present disclosure further provides an electronic device, including:
a memory having a computer program stored thereon;
a processor for executing the computer program in the memory to implement the steps of any of the above keyword extraction methods.
In a possible approach, a block diagram of the electronic device may be as shown in fig. 4. Referring to fig. 4, the electronic device 400 may include: a processor 401 and a memory 402. The electronic device 400 may also include one or more of a multimedia component 403, an input/output (I/O) interface 404, and a communications component 405.
The processor 401 is configured to control the overall operation of the electronic device 400 so as to complete all or part of the steps of the keyword extraction method. The memory 402 is used to store various types of data to support operation at the electronic device 400, such as instructions for any application or method operating on the electronic device 400 and application-related data such as preset word graphs and preset thresholds. The memory 402 may be implemented by any type of volatile or non-volatile memory device or a combination thereof, such as Static Random Access Memory (SRAM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Erasable Programmable Read-Only Memory (EPROM), Programmable Read-Only Memory (PROM), Read-Only Memory (ROM), magnetic memory, flash memory, magnetic disk, or optical disk. The multimedia component 403 may include a screen and an audio component, where the screen may be, for example, a touch screen, and the audio component is used for outputting and/or inputting audio signals. For example, the audio component may include a microphone for receiving external audio signals; a received audio signal may further be stored in the memory 402 or transmitted through the communication component 405. The audio component also includes at least one speaker for outputting audio signals. The I/O interface 404 provides an interface between the processor 401 and other interface modules such as a keyboard, a mouse, or buttons, which may be virtual buttons or physical buttons. The communication component 405 is used for wired or wireless communication between the electronic device 400 and other devices. The wireless communication may be, for example, Wi-Fi, Bluetooth, Near Field Communication (NFC), 2G, 3G, 4G, NB-IoT, eMTC, 5G, etc., or a combination of one or more of them, which is not limited herein. Accordingly, the communication component 405 may include a Wi-Fi module, a Bluetooth module, an NFC module, and the like.
In an exemplary embodiment, the electronic Device 400 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components for performing the keyword extraction method.
In another exemplary embodiment, there is also provided a computer readable storage medium including program instructions which, when executed by a processor, implement the steps of the keyword extraction method described above. For example, the computer readable storage medium may be the memory 402 comprising program instructions executable by the processor 401 of the electronic device 400 to perform the keyword extraction method described above.
In another exemplary embodiment, a computer program product is also provided, which comprises a computer program executable by a programmable apparatus, the computer program having code portions for performing the keyword extraction method described above when executed by the programmable apparatus.
The preferred embodiments of the present disclosure have been described in detail above with reference to the accompanying drawings; however, the present disclosure is not limited to the specific details of the above embodiments. Various simple modifications may be made to the technical solution of the present disclosure within the scope of its technical idea, and these simple modifications all fall within the protection scope of the present disclosure.
It should be noted that the various features described in the above embodiments may be combined in any suitable manner; to avoid unnecessary repetition, the possible combinations are not described separately in the present disclosure.
In addition, any combination of various embodiments of the present disclosure may be made, and the same should be considered as the disclosure of the present disclosure, as long as it does not depart from the spirit of the present disclosure.

Claims (10)

1. A keyword extraction method, characterized in that the method comprises:
acquiring a first text to be subjected to keyword extraction;
performing word segmentation on the first text to obtain a plurality of word segments;
inputting the multiple participles into a word graph model to obtain a weight value corresponding to each participle;
extracting keywords from the first text according to the weight value corresponding to each participle;
wherein the word graph model is used for determining the weight value of each participle by the following method:
acquiring a target word graph, wherein the target word graph is established based on word segmentation in a first text;
determining a first edge weight value between a node corresponding to the first participle and a node corresponding to the second participle in the target word graph;
determining a second edge weight value between a node corresponding to the first participle and a node corresponding to the second participle in a preset word graph, wherein the preset word graph is established based on the participles in a second text, and the word count of the second text is greater than that of the first text;
and determining the weight value of the node corresponding to the first word in the target word graph according to the first edge weight value and the second edge weight value.
2. The method according to claim 1, wherein the obtaining of the first text to be subjected to keyword extraction comprises:
receiving voice information input by a user;
recognizing the voice information to obtain a target text corresponding to the voice information;
and taking the target text as a first text to be subjected to keyword extraction.
3. The method according to claim 1, wherein the obtaining of the first text to be subjected to keyword extraction comprises:
and responding to input completion information triggered by a user, and acquiring a first text to be subjected to keyword extraction in a text box of the client.
4. The method according to claim 1, wherein the determining a second edge weight value between a node corresponding to the first participle and a node corresponding to the second participle according to the number of co-occurrences of the first participle and the second participle in a preset time length and the average number of co-occurrences of the second participle and a plurality of first target high-frequency words in the preset time length comprises:
determining a second edge weight value between the node corresponding to the first participle and the node corresponding to the second participle according to the following formula:
w12-init = a·fr(w1, w2) + b·(1/n)·Σ_{wo1 ∈ max(sim, n)} fr(w2, wo1)
wherein, w12-init represents the second edge weight value; fr(w1, w2) represents the number of co-occurrences of the first participle w1 and the second participle w2 within a preset duration; fr(w2, wo1) represents the number of co-occurrences of the second participle w2 and a first target high-frequency word wo1 within the preset duration; n represents the number of first target high-frequency words wo1; a and b are constants with a + b = 1; and max(sim, n) represents the word set formed by the n first target high-frequency words with the highest similarity to the first participle w1.
5. The method according to claim 1, wherein the determining a second edge weight value between a node corresponding to the first participle and a node corresponding to the second participle according to the number of co-occurrences of the first participle and the second participle in a preset time length, the average number of co-occurrences of the second participle and the plurality of first target high-frequency words in the preset time length, and the average number of co-occurrences of the first participle and the plurality of second target high-frequency words in the preset time length comprises:
determining a second edge weight value between the node corresponding to the first participle and the node corresponding to the second participle according to the following formula:
w12-init = a·fr(w1, w2) + b·(1/n1)·Σ_{wo1 ∈ max(sim, n1)} fr(w2, wo1) + c·(1/n2)·Σ_{wo2 ∈ max(sim, n2)} fr(w1, wo2)
wherein, w12-init represents the second edge weight value; fr(w1, w2) represents the number of co-occurrences of the first participle w1 and the second participle w2 within a preset duration; fr(w2, wo1) represents the number of co-occurrences of the second participle w2 and a first target high-frequency word wo1 within the preset duration; fr(w1, wo2) represents the number of co-occurrences of the first participle w1 and a second target high-frequency word wo2 within the preset duration; n1 represents the number of first target high-frequency words wo1; n2 represents the number of second target high-frequency words wo2; a, b and c are constants with a + b + c = 1; max(sim, n1) represents the word set formed by the n1 first target high-frequency words with the highest similarity to the first participle w1; and max(sim, n2) represents the word set formed by the n2 second target high-frequency words with the highest similarity to the second participle w2.
6. The method of claim 1, wherein the determining a second edge weight value between the node corresponding to the first participle and the node corresponding to the second participle comprises:
and if the first participle and the second participle are both high-frequency words, taking the number of co-occurrences of the first participle and the second participle in the second text within a preset duration as the second edge weight value between the node corresponding to the first participle and the node corresponding to the second participle.
7. The method according to any one of claims 1 to 6, wherein the determining the weighted value of the node corresponding to the first participle in the target word graph according to the first edge weighted value and the second edge weighted value includes:
determining the weight value of the node corresponding to the first word in the target word graph according to the following formula:
R(w1) = (1 − λ) + λ · Σ_{w2 ∈ in(w1)} [ (w12 + w12-init) / Σ_{w3 ∈ out(w2)} (w23 + w23-init) ] · R(w2)

wherein R(w1) represents the weight value of the node corresponding to the first participle w1 in the target word graph; R(w2) represents the weight value of the node corresponding to the second participle w2 in the target word graph; in(w1) represents the set of nodes corresponding to the second participles w2 within the preset range of the node corresponding to the first participle w1 in the target word graph; w12 represents the first edge weight value between the node of the first participle w1 and the node of the second participle w2; w12-init represents the second edge weight value between the node of the first participle w1 and the node of the second participle w2; out(w2) represents the set of nodes corresponding to the participles w3 within the preset range of the node of the second participle w2 in the target word graph; w23 represents the third edge weight value between the node of the second participle w2 and the node of the participle w3 in the target word graph; w23-init represents the fourth edge weight value between the node of the second participle w2 and the node of the participle w3 in the preset word graph; and λ represents a preset constant.
8. A keyword extraction apparatus, characterized in that the apparatus comprises:
the text acquisition module is used for acquiring a first text to be subjected to keyword extraction;
the text word segmentation module is used for segmenting the first text to obtain a plurality of segmented words;
the weight determining module is used for inputting the plurality of participles into the word graph model to obtain a weight value corresponding to each participle;
the word segmentation extraction module is used for extracting keywords of the first text according to the weight value corresponding to each word segmentation;
wherein the weight determination module comprises:
the acquisition submodule is used for acquiring a target word graph, and the target word graph is established based on word segmentation in the first text;
the first determining submodule is used for determining a first edge weight value between a node corresponding to the first word segmentation and a node corresponding to the second word segmentation in the target word graph;
a second determining submodule, configured to determine, in a preset word graph, a second edge weight value between a node corresponding to the first word segmentation and a node corresponding to the second word segmentation, wherein the preset word graph is established based on word segmentation in a second text, and the word count of the second text is greater than that of the first text;
and the third determining submodule is used for determining the weight value of the node corresponding to the first word segmentation in the target word graph according to the first edge weight value and the second edge weight value.
9. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 7.
10. An electronic device, comprising:
a memory having a computer program stored thereon;
a processor for executing the computer program in the memory to carry out the steps of the method of any one of claims 1 to 7.
CN201910833971.7A 2019-09-04 2019-09-04 Keyword extraction method and device, storage medium and electronic equipment Pending CN110705282A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910833971.7A CN110705282A (en) 2019-09-04 2019-09-04 Keyword extraction method and device, storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910833971.7A CN110705282A (en) 2019-09-04 2019-09-04 Keyword extraction method and device, storage medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN110705282A true CN110705282A (en) 2020-01-17

Family

ID=69193898

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910833971.7A Pending CN110705282A (en) 2019-09-04 2019-09-04 Keyword extraction method and device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN110705282A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111274358A (en) * 2020-01-20 2020-06-12 腾讯科技(深圳)有限公司 Text processing method and device, electronic equipment and storage medium
CN113011174A (en) * 2020-12-07 2021-06-22 红塔烟草(集团)有限责任公司 Surrounding mark string identification method based on text analysis

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105302882A (en) * 2015-10-14 2016-02-03 东软集团股份有限公司 Keyword obtaining method and apparatus
CN106469187A (en) * 2016-08-29 2017-03-01 东软集团股份有限公司 The extracting method of key word and device
CN107766318A (en) * 2016-08-17 2018-03-06 北京金山安全软件有限公司 Keyword extraction method and device and electronic equipment
CN108920466A (en) * 2018-07-27 2018-11-30 杭州电子科技大学 A kind of scientific text keyword extracting method based on word2vec and TextRank
CN109408826A (en) * 2018-11-07 2019-03-01 北京锐安科技有限公司 A kind of text information extracting method, device, server and storage medium
CN109918660A (en) * 2019-03-04 2019-06-21 北京邮电大学 A kind of keyword extracting method and device based on TextRank
CN110176230A (en) * 2018-12-11 2019-08-27 腾讯科技(深圳)有限公司 A kind of audio recognition method, device, equipment and storage medium


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111274358A (en) * 2020-01-20 2020-06-12 腾讯科技(深圳)有限公司 Text processing method and device, electronic equipment and storage medium
CN113011174A (en) * 2020-12-07 2021-06-22 红塔烟草(集团)有限责任公司 Surrounding mark string identification method based on text analysis
CN113011174B (en) * 2020-12-07 2023-08-11 红塔烟草(集团)有限责任公司 Method for identifying purse string based on text analysis

Similar Documents

Publication Publication Date Title
US11544459B2 (en) Method and apparatus for determining feature words and server
CN109635296B (en) New word mining method, device computer equipment and storage medium
KR102288249B1 (en) Information processing method, terminal, and computer storage medium
CN109471944B (en) Training method and device of text classification model and readable storage medium
CN107180084B (en) Word bank updating method and device
CN112446210B (en) User gender prediction method and device and electronic equipment
CN113254643B (en) Text classification method and device, electronic equipment and text classification program
Riadi Detection of cyberbullying on social media using data mining techniques
CN111310476A (en) Public opinion monitoring method and system using aspect-based emotion analysis method
CN115795030A (en) Text classification method and device, computer equipment and storage medium
CN113095080A (en) Theme-based semantic recognition method and device, electronic equipment and storage medium
CN108595406B (en) User state reminding method and device, electronic equipment and storage medium
CN112632248A (en) Question answering method, device, computer equipment and storage medium
CN110705282A (en) Keyword extraction method and device, storage medium and electronic equipment
CN114202443A (en) Policy classification method, device, equipment and storage medium
CN112926308B (en) Method, device, equipment, storage medium and program product for matching text
CN115858776B (en) Variant text classification recognition method, system, storage medium and electronic equipment
CN112131199A (en) Log processing method, device, equipment and medium
CN115035890B (en) Training method and device of voice recognition model, electronic equipment and storage medium
CN113095073B (en) Corpus tag generation method and device, computer equipment and storage medium
CN114610576A (en) Log generation monitoring method and device
CN103744830A (en) Semantic analysis based identification method of identity information in EXCEL document
CN112765357A (en) Text classification method and device and electronic equipment
CN113297854A (en) Method, device and equipment for mapping text to knowledge graph entity and storage medium
CN112182235A (en) Method and device for constructing knowledge graph, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200117