CN112966505B

CN112966505B - Method, device and storage medium for extracting persistent hot phrases from text corpus

Info

Publication number: CN112966505B
Application number: CN202110079692.3A
Authority: CN
Inventors: 叶东; 孙兆伟; 李晖; 赵翰墨; 高祥博; 王璐
Original assignee: Harbin Institute of Technology
Current assignee: Harbin Institute of Technology
Priority date: 2021-01-21
Filing date: 2021-01-21
Publication date: 2021-10-15
Anticipated expiration: 2041-01-21
Also published as: CN112966505A

Abstract

An embodiment of the invention discloses a method, a device and a storage medium for extracting persistent hot phrases from a text corpus. The method may comprise the following steps: dividing an original text corpus into a plurality of text sets corresponding to time intervals; constructing a frequency suffix tree for each text set based on the text suffixes contained in that text set and the occurrence frequency of each text suffix; and, based on the hot spot duration interval indicated by a query instruction and the minimum occurrence frequency threshold, traversing the frequency suffix trees corresponding to the hot spot duration interval to obtain the hot phrases whose occurrence frequency in the hot spot duration interval is not lower than the minimum occurrence frequency threshold.

Description

Method, device and storage medium for extracting persistent hot phrases from text corpus

Technical Field

The embodiment of the invention relates to the technical field of information mining, in particular to a method, a device and a storage medium for extracting persistent hot phrases from text corpora.

Background

Against the background of rapidly expanding data, a great number of knowledge base construction tasks require effective information to be extracted quickly from massive text corpora, and this has become an important research direction. Mining continuous word sequences that appear frequently in text, in the form of phrases, has become one of the effective ways for users to acquire key information and search text sets.

At present, when mining frequent word sequences over a continuous time interval, a user cannot fully grasp the data content in advance and therefore usually has to modify the query conditions over multiple iterations (i.e. interactive query) in order to understand the data comprehensively. However, most related frequent word sequence mining schemes are oriented to mining tasks and have high time complexity; they are unsuitable for exploratory queries whose conditions change frequently, and they cannot meet the requirement of fast query feedback.

Disclosure of Invention

In view of the above, embodiments of the present invention are intended to provide a method, an apparatus, and a storage medium for extracting persistent hot phrases from text corpora, which can reduce the time complexity of searching for persistent hot phrases, quickly find the phrases that remain hot spots over continuous time intervals, and meet the requirement of exploratory interactive query.

The technical scheme of the embodiment of the invention is realized as follows:

In a first aspect, an embodiment of the present invention provides a method for extracting persistent hot phrases from a text corpus, where the method includes:

dividing an original text corpus into a plurality of text sets corresponding to time intervals;

constructing a frequency suffix tree corresponding to each text set based on the text suffixes contained in each text set and the occurrence frequencies of the text suffixes;

traversing a frequency suffix tree corresponding to the hot spot duration interval based on the hot spot duration interval indicated by the query instruction and the minimum occurrence frequency threshold, and querying to obtain the hot spot phrases of which the occurrence frequency is not lower than the minimum occurrence frequency threshold in the hot spot duration interval.

In a second aspect, an embodiment of the present invention provides an apparatus for extracting persistent hot spot phrases from text corpus, the apparatus including: a dividing part, a constructing part and a query part; wherein,

the dividing part is configured to divide the original text corpus into a plurality of text sets corresponding to time intervals;

the constructing part is configured to construct a frequency suffix tree corresponding to each text set based on the text suffix contained in each text set and the frequency of occurrence of each text suffix;

the query part is configured to traverse a frequency suffix tree corresponding to a hot spot duration interval based on the hot spot duration interval indicated by the query instruction and a minimum occurrence frequency threshold value, and query and obtain hot spot phrases of which the occurrence frequency is not lower than the minimum occurrence frequency threshold value in the hot spot duration interval.

In a third aspect, an embodiment of the present invention provides a computing device, where the computing device includes: a communication interface, a memory and a processor; wherein,

the communication interface is used for receiving and sending signals in the process of receiving and sending information with other external network elements;

the memory is used for storing a computer program operable on the processor;

the processor is configured to, when running the computer program, perform the steps of the method for extracting persistent hot phrases from text corpora according to the first aspect.

In a fourth aspect, an embodiment of the present invention provides a computer storage medium, where the computer storage medium stores a program for extracting persistent hot spot phrases from a corpus of text, and the program for extracting persistent hot spot phrases from the corpus of text implements, when executed by at least one processor, the steps of the method for extracting persistent hot spot phrases from the corpus of text according to the first aspect.

The embodiments of the invention provide a method, a device and a storage medium for extracting persistent hot phrases from text corpora. For the plurality of text sets obtained by dividing an original text corpus according to time intervals, constructing frequency suffix trees avoids text frequency statistics during the query process, so that the time complexity of querying persistent hot phrases is reduced and the query efficiency is improved.

Drawings

Fig. 1 is a schematic flowchart of a method for extracting persistent hot phrases from a corpus of texts according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of a frequency suffix tree according to an embodiment of the present invention;

FIG. 3 is a diagram illustrating comparison of results of a first experimental means provided in the embodiment of the present invention;

FIG. 4 is a diagram illustrating comparison of results of a second experimental means provided in the embodiment of the present invention;

FIG. 5 is a schematic diagram illustrating an apparatus for extracting persistent hot spot phrases from text corpora according to an embodiment of the present invention;

FIG. 6 is a schematic diagram illustrating an apparatus for extracting persistent hot spot phrases from text corpora according to an embodiment of the present invention;

FIG. 7 is a schematic diagram of a specific hardware structure of a computing device according to an embodiment of the present invention.

Detailed Description

The technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention.

First, some terms related to the embodiments of the present invention are explained to facilitate understanding by those skilled in the art.

Phrase: a sequence of consecutive words that occur in order in a text set. For example, given a dictionary Σ consisting of a finite number of words a, a text set may be represented as d = {a_1, a_2, …, a_n}, and a phrase may be expressed as s(x, y) = {a_x, a_{x+1}, …, a_y}, where 1 ≤ x < y ≤ n.

Hotspot phrase: a phrase that appears with high frequency over a period of time and reflects the content that the text containing it is intended to convey. In the embodiment of the present invention, phrase popularity is preferably measured by setting a minimum occurrence frequency threshold, and a sequence of consecutive words whose occurrence frequency is not lower than the minimum occurrence frequency threshold is regarded as a hot phrase in a query.

Hotspot duration: a continuous time interval, which may consist of a plurality of minimum unit time intervals. In the embodiment of the present invention, T(x, y) = {t_x, t_{x+1}, …, t_y}, 1 ≤ x ≤ y ≤ m, represents the hotspot duration interval, where t identifies a minimum unit time interval and t_i is the i-th time interval of the data set. In addition, T(1, m) represents the complete time period containing all data sets, where m is the number of minimum unit time intervals contained in the complete time period; it is understood that T(x, y) is a subset of T(1, m).

Based on the above definition and explanation of related concepts, referring to fig. 1, a method for extracting persistent hot phrases from text corpora according to an embodiment of the present invention is shown, where the method may include:

S11: dividing an original text corpus into a plurality of text sets corresponding to time intervals;

S12: constructing a frequency suffix tree corresponding to each text set based on the text suffixes contained in each text set and the occurrence frequencies of the text suffixes;

S13: traversing a frequency suffix tree corresponding to the hot spot duration interval based on the hot spot duration interval indicated by the query instruction and the minimum occurrence frequency threshold, and querying to obtain the hot spot phrases of which the occurrence frequency is not lower than the minimum occurrence frequency threshold in the hot spot duration interval.

By the technical scheme shown in fig. 1, for a plurality of text sets obtained by dividing an original text corpus according to a time interval, text frequency statistics is avoided in a query process by constructing a frequency suffix tree, so that the time complexity of querying a persistent hot spot is reduced and the query efficiency is improved.

For the technical solution shown in fig. 1, steps S11 to S12 may be implemented offline, so that the frequency suffix trees are constructed and stored after the original text corpus is obtained; step S13 may be implemented online to complete the query of the persistent hot phrases.

In some possible implementations, the dividing the original text corpus into a plurality of text sets corresponding to time intervals includes:

dividing the time period covered by the original text corpus into a plurality of sequential time intervals, in chronological order and according to the set minimum unit time interval;

and storing each text of the original text corpus in the text set corresponding to the time interval in which that text occurred.

For the above implementation, for example, the time period over which the text content of the original text corpus occurs is set to T(1, n), and this time period may be divided in chronological order into n time intervals based on the set minimum unit time interval, i.e. T(1, n) = {t_1, t_2, …, t_n}. According to the time intervals obtained by the division, the text content of the original text corpus can be stored under the corresponding time interval t_i, so as to obtain the text set D_i corresponding to each time interval, where 1 ≤ i ≤ n.
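The division into text sets can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function name `split_by_interval`, the `(date, text)` input format and the sample documents are all assumptions made for the example, and every document is assumed to be dated on or after `start`.

```python
from collections import defaultdict
from datetime import date

def split_by_interval(docs, start, unit_days=7):
    """Group (date, text) pairs into text sets keyed by interval index i (1-based)."""
    text_sets = defaultdict(list)
    for day, text in docs:
        # Index i of the minimum unit time interval t_i that contains this text.
        i = (day - start).days // unit_days + 1
        text_sets[i].append(text)
    return dict(text_sets)

# Hypothetical sample corpus: two texts in the first week, one in the second.
docs = [
    (date(2004, 1, 1), "first post"),
    (date(2004, 1, 7), "same week"),
    (date(2004, 1, 8), "second week"),
]
text_sets = split_by_interval(docs, start=date(2004, 1, 1))
print(text_sets)
```

Here `text_sets[i]` plays the role of D_i, and `unit_days` is the minimum unit time interval expressed in days.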

In some examples, the text set D_i corresponding to each time interval may be preprocessed before the frequency suffix tree is constructed. The preprocessing may include:

for each text set, removing specified symbols and stop words from the text data in the text set, and segmenting the text at the positions of the stop words and punctuation marks to obtain a plurality of plain text data strings, which form the preprocessed text set; wherein each plain text data string is composed of a plurality of sequential words.

Following the above example, the text content of each text set is divided into a plurality of plain text data strings by removing special symbols and stop words and segmenting at positions such as stop words and punctuation marks. For D_i, for example, the preprocessed set is D_i = {s_1, s_2, …, s_m}. Each plain text data string s_j is a sequence of consecutive words, also called a phrase, consisting of a plurality of words and denoted s_j = {a_1, a_2, …, a_y}, where 1 ≤ j ≤ m, a_x represents a word, and 1 ≤ x ≤ y.
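The segmentation described above can be sketched as follows. The sketch is only an approximation under stated assumptions: the stop word list `STOP_WORDS` is a tiny hypothetical stand-in for a real list, punctuation is detected with a simple regular expression, and the function name `to_plain_strings` is invented for the example.

```python
import re

# Hypothetical, deliberately tiny stop word list for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "is", "in", "and", "to"}

def to_plain_strings(text):
    """Split text at punctuation and stop words into runs of consecutive words."""
    strings = []
    # Every run of punctuation characters is a segmentation point.
    for chunk in re.split(r"[^\w\s]+", text.lower()):
        run = []
        for word in chunk.split():
            if word in STOP_WORDS:
                # A stop word is dropped and also closes the current run.
                if run:
                    strings.append(run)
                run = []
            else:
                run.append(word)
        if run:
            strings.append(run)
    return strings

sample = to_plain_strings("The suffix tree is fast, and queries stay cheap!")
print(sample)
```

Each inner list corresponds to one plain text data string s_j. Unlike the definition in the text, this sketch also keeps single-word runs.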

It should be noted that, to speed up subsequent online queries, the words, phrases and texts may be represented by integers; for example, a dictionary data format may be set, and the words and phrases in the texts may be represented by their dictionary serial numbers.
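One possible way to realise the integer representation is sketched below. The function name `encode` and the choice of 1-based serial numbers assigned in order of first appearance are assumptions; the text does not fix how the dictionary is numbered.

```python
def encode(strings, dictionary=None):
    """Replace each word by its dictionary serial number, growing the dictionary as needed."""
    dictionary = {} if dictionary is None else dictionary
    encoded = []
    for words in strings:
        # setdefault returns the existing number, or stores the next free one.
        encoded.append([dictionary.setdefault(w, len(dictionary) + 1) for w in words])
    return encoded, dictionary

encoded, dictionary = encode([["hot", "phrase"], ["hot", "topic"]])
print(encoded, dictionary)
```

Passing the same `dictionary` to later calls keeps the numbering consistent across text sets, which matters because candidates found in one time interval are looked up in the trees of other intervals.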

After completing the preprocessing, a corresponding frequency suffix tree may be generated for each preprocessed text set, and in some examples, the constructing the frequency suffix tree corresponding to each text set based on the text suffix included in each text set and the frequency of occurrence of each text suffix includes:

adding a termination mark to the end of each plain text data string for each text set;

creating an initial frequency tree for each text set; the initial frequency tree only comprises a root node, and the frequency of the root node and the pointers of the child nodes are both null;

for each text set, inserting the text suffixes of each plain text data string into the initial frequency tree using the Ukkonen algorithm to obtain a suffix tree corresponding to each text set;

and performing a depth-first recursive traversal of the suffix tree, setting the frequency value of a leaf node to 1 when the traversal reaches the leaf node, and determining the frequency value of each node other than the leaf nodes as the sum of the frequency values of its direct child nodes, to obtain a frequency suffix tree corresponding to each text set.

Following the above example, for a text set D_i containing m plain text data strings, the frequency suffix tree is a suffix tree that contains all text suffixes together with their occurrence frequencies. Unlike a conventional suffix tree, each node of the frequency suffix tree consists of a node index i and a frequency attribute value freq, and can be denoted node(i: freq), where freq is the occurrence frequency, in the text set, of the phrase spelled out by the path from the root node to the current node i. Still taking D_i as an example, the specific construction process of the corresponding frequency suffix tree may include:

First, for each plain text data string s_j = {a_1, a_2, …, a_y}, a unique end marker $i is appended to the end of s_j, giving s_j = {a_1, a_2, …, a_y, $i}.

Next, an initial frequency tree is created that contains only the root node, where the root node's frequency and child node pointers are null.

Then, the text suffixes of each plain text data string are inserted into the initial frequency tree using the Ukkonen algorithm to obtain the suffix tree corresponding to D_i. Understandably, the Ukkonen algorithm is a commonly used and efficient suffix tree construction algorithm, with time complexity O(n).

Finally, a depth-first recursive traversal of the suffix tree is performed. When the traversal reaches a leaf node, the frequency value freq of the leaf node is set to 1 and returned to its parent node. For each internal (non-leaf) node, the values returned by all of its direct child nodes, i.e. their frequency values, are recorded, and the frequency value freq of the internal node is set to the sum of the frequency values of its direct child nodes, yielding the frequency suffix tree corresponding to D_i.
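The construction steps above can be sketched as follows. Note the substitution: for brevity this sketch inserts every suffix naively, one word per edge, which costs quadratic time per string and is not the Ukkonen algorithm named in the text. The frequency rule (leaf = 1, internal node = sum of its direct children) is the one described above. The names `Node`, `build_frequency_tree`, `assign_freq` and `phrase_freq` are hypothetical.

```python
class Node:
    def __init__(self):
        self.children = {}  # next word -> child Node
        self.freq = None    # filled in by assign_freq

def assign_freq(node):
    """Depth-first pass: a leaf gets 1, an internal node the sum of its children."""
    if not node.children:
        node.freq = 1
    else:
        node.freq = sum(assign_freq(child) for child in node.children.values())
    return node.freq

def build_frequency_tree(strings):
    root = Node()
    for j, words in enumerate(strings, start=1):
        marked = list(words) + [f"${j}"]      # unique end marker per string
        for start in range(len(marked)):      # naive insertion of every suffix
            node = root
            for word in marked[start:]:
                node = node.children.setdefault(word, Node())
    assign_freq(root)
    return root

def phrase_freq(root, phrase):
    """Occurrence frequency of a phrase: the freq of the node its path ends at."""
    node = root
    for word in phrase:
        node = node.children.get(word)
        if node is None:
            return 0
    return node.freq

tree = build_frequency_tree([["hot", "phrase", "hot", "phrase"], ["hot", "phrase"]])
print(phrase_freq(tree, ["hot", "phrase"]))
```

Because each string ends in its own marker, every suffix ends at a distinct leaf, so the sum of leaf counts below a node equals the number of times the phrase on its path occurs in the text set.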

Based on the above construction process, the data structure of the resulting frequency suffix tree is shown in fig. 2: square boxes represent leaf nodes and circular boxes represent non-leaf nodes, each labeled in the form node(i: freq), and the edges between nodes carry the continuous word sequences of the text data string formed along the path from the root node to the node.

It should be noted that the above construction process takes D_i only as an example for purposes of illustration; it is understood that the construction process is equally applicable to all other text sets, which is not described in detail in the embodiments of the present invention.

After the frequency suffix tree corresponding to each text set is obtained as described above, a corresponding serialized file may be generated for each frequency suffix tree to facilitate storing the frequency suffix trees in a computer or storage device. Based on this, in some examples, the method further comprises the following step:

for the frequency suffix tree corresponding to each text set, performing a breadth-first traversal from the root node and outputting, for each node, the node identifier, the number of child nodes, the frequency value, and the text data string recorded on the edge connecting the node to its parent node, so as to form the serialized file of each frequency suffix tree.

Following the above example, each frequency suffix tree may be traversed breadth-first from the root node, and each node is output to the file in the form <node number, number of child nodes, freq, text string recorded on the edge connecting to the parent node>, so as to form a serialized file for storage.
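A breadth-first serialization of this kind, together with the matching restoration, might look like the sketch below. The record layout follows the four fields named in the text, but holding the records in an in-memory list of tuples rather than a file, and the names `Node`, `serialize` and `deserialize`, are assumptions. The small tree is a hand-built toy, not a complete suffix tree.

```python
from collections import deque

class Node:
    def __init__(self, label=(), freq=0):
        self.label = tuple(label)  # words on the edge from the parent
        self.freq = freq
        self.children = []

def serialize(root):
    """Breadth-first dump: (node number, number of children, freq, edge label)."""
    records, queue = [], deque([root])
    while queue:
        node = queue.popleft()
        records.append((len(records), len(node.children), node.freq, node.label))
        queue.extend(node.children)
    return records

def deserialize(records):
    """Rebuild the tree; in breadth-first order each node's children follow consecutively."""
    nodes = [Node(label, freq) for _, _, freq, label in records]
    next_child = 1
    for node, (_, n_children, _, _) in zip(nodes, records):
        node.children = nodes[next_child:next_child + n_children]
        next_child += n_children
    return nodes[0]

# Hand-built toy tree with two leaves below a shared "hot" edge.
root = Node((), 2)
hot = Node(("hot",), 2)
hot.children = [Node(("phrase", "$1"), 1), Node(("$2",), 1)]
root.children = [hot]

records = serialize(root)
restored = deserialize(records)
print(records)
```

The child count is what makes restoration possible without storing parent pointers: it tells the reader how many of the following records belong to each node.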

Through the above implementations and examples, the offline stage of constructing and storing the frequency suffix trees is completed. In this stage, the text sets are first preprocessed to obtain plain text data strings without stop words and punctuation; then, for each unit time interval, the text data strings are inserted into a suffix tree and frequency statistics are performed, yielding a frequency suffix tree; finally, the node and edge information of each frequency suffix tree is stored in a serialized file by level-order traversal. Once this stage is completed, frequencies no longer need to be counted at query time, which reduces the time cost of a query and improves query efficiency.

Based on this, after the above stage is completed, the online query stage for persistent hot phrases can be triggered by receiving a query instruction. In the embodiment of the present invention, the query instruction may include at least the hotspot duration interval to be queried and a set minimum occurrence frequency threshold; that is, a phrase whose occurrence frequency in the hotspot duration interval is not lower than the minimum occurrence frequency threshold may be regarded as a persistent hot phrase.

Based on this, in some examples, the traversing a frequency suffix tree corresponding to the hot spot duration interval based on the hot spot duration interval indicated by the query instruction and a minimum occurrence frequency threshold, and querying to obtain the hot spot phrases whose occurrence frequencies in the hot spot duration interval are not lower than the minimum occurrence frequency threshold includes:

for a first time interval in the hotspot duration time intervals, restoring the serialized file of the frequency suffix tree corresponding to the first time interval into a frequency suffix tree corresponding to the first time interval;

performing an in-order traversal of the frequency suffix tree corresponding to the first time interval, starting from the root node, and judging whether the frequency value of the currently traversed node is not less than the minimum occurrence frequency threshold: if so, continuing to traverse the next-level nodes of the currently traversed node until the frequency value of a traversed node is smaller than the minimum occurrence frequency threshold;

if the frequency value of the current traversal node is not smaller than the minimum occurrence frequency threshold value and the frequency values of all child nodes of the current traversal node are smaller than the minimum occurrence frequency threshold value, determining a word sequence recorded by a path edge from the root node to the current traversal node as a candidate hot phrase in the first time interval;

for each other time interval except the first time interval in the hotspot duration time interval, restoring the serialized file of the frequency suffix tree corresponding to each other time interval into the frequency suffix tree corresponding to each other time interval;

for the frequency suffix tree corresponding to each other time interval, querying, according to the minimum occurrence frequency threshold, each candidate hot phrase in the candidate hot phrase set of the time interval immediately preceding that other time interval, or of all time intervals preceding it, to obtain the candidate hot phrases in that other time interval;

determining a candidate hotspot phrase in the last other time interval as a persistent hotspot phrase in the hotspot duration interval.

Following the above example, the hot spot duration interval in the query instruction is set to T(x, y) = {t_x, t_{x+1}, …, t_y}, 1 ≤ x ≤ y ≤ m, and the minimum occurrence frequency threshold is θ. A hot phrase lasts for a long time, i.e. it does not appear and disappear suddenly; in other words, its occurrence frequency remains high within the hot spot duration interval. Therefore, when querying hot phrases, the embodiment of the invention can use the candidate hot phrases obtained from some of the time intervals in the hot spot duration interval as the basis for querying the subsequent time intervals, and can prune the search when querying those subsequent intervals, thereby further reducing the time cost of the query and improving query efficiency. Based on this, the specific implementation process of the above example may include:

First, the serialized file of the frequency suffix tree corresponding to time interval t_x is read, and the frequency suffix tree is restored starting from the root node. For example, a new node is created from the freq and the node number and inserted into the frequency suffix tree, the edge information is restored to the corresponding edge, and the number of child nodes is used to determine how many child nodes each internal node of the frequency suffix tree has.

Then, starting from the root node, an in-order traversal of the frequency suffix tree corresponding to time interval t_x is performed, and it is judged whether the freq of the currently traversed node is greater than or equal to the minimum occurrence frequency threshold θ (i.e. whether it meets the threshold θ). If so, the traversal continues downwards along the word sequences on the edges, until the freq of a node no longer meets (i.e. is smaller than) the threshold θ; the corresponding text suffix then no longer meets the threshold θ, and the traversal backtracks to the parent node. If the currently traversed node meets the threshold θ and none of its child nodes meets the threshold θ, the word sequence recorded on the path edges from the root node to that node is a candidate hot phrase meeting the requirement. In this way, all candidate hot phrases of the frequency suffix tree corresponding to time interval t_x are obtained and may be represented as a set S_x = {s_1, s_2, …, s_n}.
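The candidate extraction for the first time interval can be sketched as below. Several points are assumptions: the tree is held as nested dictionaries with one word per edge and is built by naive suffix insertion rather than restored from a serialized file; the traversal is written as a plain depth-first walk, since an in-order traversal has no single standard meaning for nodes with more than two children; and θ is assumed to be at least 2, so that end-marker leaves (frequency 1) never qualify. The names `build` and `candidates` are hypothetical.

```python
def build(strings):
    """Naive frequency tree as nested dicts: {"freq": int, "children": {word: node}}."""
    root = {"freq": 0, "children": {}}
    for j, words in enumerate(strings, start=1):
        marked = list(words) + [f"${j}"]
        for start in range(len(marked)):
            node = root
            for word in marked[start:]:
                node = node["children"].setdefault(word, {"freq": 0, "children": {}})

    def count(node):
        # Leaf: the empty sum is 0, so the value falls back to 1.
        node["freq"] = sum(count(c) for c in node["children"].values()) or 1
        return node["freq"]

    count(root)
    return root

def candidates(root, theta):
    """Paths to nodes that meet theta while none of their children does."""
    found = []

    def visit(node, path):
        hot = [(w, c) for w, c in node["children"].items() if c["freq"] >= theta]
        if not hot and path:
            found.append(tuple(path))
        for word, child in hot:          # prune: never descend below theta
            visit(child, path + [word])

    if root["freq"] >= theta:
        visit(root, [])
    return found

week1 = build([["hot", "phrase", "list"], ["hot", "phrase"], ["hot", "topic"]])
print(candidates(week1, theta=2))
```

With θ = 2 the walk stops at "hot phrase" (frequency 2) because both continuations occur only once. The bare suffix "phrase" is reported as well, since every suffix is itself a root path in the tree.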

Then, the time intervals t_i in the time range T(x+1, y) are traversed sequentially in chronological order, and the following steps are executed during the traversal:

The serialized file of the frequency suffix tree corresponding to time interval t_i is read, and the frequency suffix tree corresponding to t_i is constructed. Then the candidate hot phrase set S_{i-1} = {s_1, s_2, …, s_p} of the (i-1)-th time interval t_{i-1} is traversed, and each candidate hot phrase in S_{i-1} is checked to see whether it still meets the threshold θ in the frequency suffix tree corresponding to t_i. For the search procedure, for example, when checking s_a (1 ≤ a ≤ p), the check starts from the root node and tests whether the freq of the current node meets the threshold θ; if it does not, the check terminates. Among the edges connecting the current node to its child nodes, an edge matching s_a is looked for; if such an edge is found, the search moves along that edge to the next node and continues checking whether the remainder of s_a meets the threshold θ. If no matching edge exists, the part that meets the threshold θ is returned as a hot phrase satisfying the condition in time interval t_i.

Finally, after the traversal has been completed for all time intervals in the hot spot duration interval, the final query result S is obtained, i.e. the persistent hot phrases in the hot spot duration interval.
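The pruned lookup in a subsequent time interval can be sketched as follows, under one reading of the text: each candidate from the previous interval is walked word by word through the current tree, and the longest leading part whose nodes all meet θ is kept. The nested-dictionary tree, the helper `node`, and the functions `surviving_prefix` and `refine` are hypothetical; the toy tree is written by hand, with end-marker leaves omitted.

```python
def node(freq, **children):
    """Hypothetical helper for writing small trees by hand."""
    return {"freq": freq, "children": children}

def surviving_prefix(tree, phrase, theta):
    """Longest leading part of phrase whose nodes all meet theta in this tree."""
    current, kept = tree, []
    for word in phrase:
        child = current["children"].get(word)
        if child is None or child["freq"] < theta:
            break                        # no matching edge, or below threshold
        kept.append(word)
        current = child
    return tuple(kept)

def refine(previous_candidates, tree, theta):
    """Check only the earlier candidates instead of traversing the whole tree."""
    result = []
    for phrase in previous_candidates:
        kept = surviving_prefix(tree, phrase, theta)
        if kept and kept not in result:
            result.append(kept)
    return result

# Toy tree for strings "hot topic", "hot topic", "hot phrase" (markers omitted).
week2 = node(9,
             hot=node(3, topic=node(2), phrase=node(1)),
             topic=node(2),
             phrase=node(1))

week1_candidates = [("hot", "phrase"), ("phrase",)]
print(refine(week1_candidates, week2, theta=2))
```

The cost of each later interval is then bounded by the total length of the earlier candidates rather than by the size of that interval's tree, which is the pruning effect the text relies on.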

Of course, in other examples, to make fuller use of the candidate hot phrases already obtained, when traversing a time interval t_i in the time range T(x+1, y), the candidate hot phrase set S_{Σ(i-1)} of all i-1 time intervals preceding t_i may be traversed, and each candidate hot phrase in S_{Σ(i-1)} is checked to see whether it still meets the threshold θ in the frequency suffix tree corresponding to t_i. For the specific search procedure, see the search described above for the set S_{i-1}, which is not described again in the embodiment of the present invention.

To illustrate the effect and effectiveness of the technical solution described in the above embodiments, the embodiment of the invention is verified through specific experiments. The experimental data used for verification comprise data set 1, consisting of 160,000 tweets extracted through the Twitter API from April to June 2009, and data set 2, consisting of 681,288 articles collected from blogger.com from January to December 2004. It will be appreciated that the two data sets are two original text corpora. With the minimum unit time interval set to one week (i.e. 7 days), data set 1 may be divided into 13 text sets and data set 2 into 53 text sets; the minimum occurrence frequency threshold θ is set to 50; and the text sets are preprocessed with the Natural Language Toolkit (NLTK), i.e. operations such as sentence segmentation and lemmatization are performed on each text set according to punctuation marks and stop words.

Based on the above setting, the technical solution described in the embodiment of the present invention is implemented by taking the data set 2 as an example, and the specific implementation process may include:

step 1, acquiring all articles from 1 month and 1 day of 2004 to 1 month and 1 day of 2005, and setting the unit time interval length as one week (7 days). All text will be divided in weeks into sets of text within 53 unit time intervals.

And 2, removing special symbols and stop words from the text content in the acquired text set by using the NLTK natural language processing packet, and performing segmentation operation at positions of the stop words, punctuations and the like, so that the text content in each text set is divided into a plurality of pure text data strings.

And 3, representing the text content of each text set by integer numbers, for example, counting words appearing in the text, adding the words into a dictionary, and representing the words in the text by dictionary serial numbers.

And 4, constructing a frequency suffix tree for the text data in the text set corresponding to each time interval. For example, the text strings in the time interval are read one by one, and the text strings are inserted into the frequency tree. It is understood that a total of 53 frequency suffix trees can be generated according to the number of time intervals.

And 5, serializing and storing the frequency suffix tree into a file. It is understood that 53 serialized files are output for storing the frequency suffix trees of the corresponding time intervals according to the number of the time intervals.

And 6, acquiring all parameters of the query instruction. For example, the hot spot duration interval is 1-5 weeks, and the minimum frequency threshold is 50. I.e., the query is a set of all hot phrases that occur more frequently than 50 in all time intervals from week 1 to week 5.

And 7, restoring the frequency suffix tree corresponding to the first week from the root node of the serialized files of the frequency suffix tree corresponding to the first week based on the query instruction. For example, a new node may be created from the freq and node number information and inserted into the frequency suffix tree, and the edge information may be restored to the corresponding edge.

And 8, performing middle-order traversal on the frequency suffix tree corresponding to the first week from the root node, and judging whether the freq of the node is more than or equal to 50. If so, recording the word sequence in the edges and continuously traversing downwards; until the freq of the node no longer meets the threshold, the corresponding text suffix no longer meets the threshold, and at the moment, the parent node is traced back upwards; if the currently traversed node meets the threshold and all its child nodes do not meet the threshold, the word sequence of the recorded path edge from the root node to the currently traversed node is a candidate hot phrase meeting the requirement.

And 9, traversing the time interval in the 2 nd to 5 th weeks in the query time range, and executing the steps 9.1 to 9.2.

And 9.1, restoring to obtain a frequency suffix tree corresponding to the ith time interval.

Step 9.2, according to the candidate hot word set S under the i-1 time interval_i-1＝{s₁,s₂,…,s_pAnd traversing, and searching whether the frequency suffix tree corresponding to the ith time interval still meets the threshold value. The search process may be: for the sequence s_aAnd when the (a is more than or equal to 1 and less than or equal to p) is checked, starting from the root node, checking whether freq of the current node meets the threshold value, and if the freq does not meet the threshold value, terminating the check. In the child node connecting edge of the current node, whether s is found_aIf the same edge is found, the next node is continuously found along the boundary point and is continuously connected with the s_aThe remainder is checked for satisfaction of the threshold θ. If there is no identical edge, return toThe part meeting the threshold is the hot phrases meeting the threshold in the frequency suffix tree corresponding to the ith time interval.

Step 10, after all time intervals in the duration range have been traversed according to step 9, the hot phrases contained in the final query result S are the persistent hot phrases from the 1st week to the 5th week of 2004.
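Steps 8 to 10 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the embodiment's implementation: each edge is assumed to carry a single word, termination marks are ignored, and the names `Node`, `candidate_phrases`, `surviving_prefix`, `persistent_phrases` and the helper `tree_from` are hypothetical. The sample frequencies are invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    freq: int = 0
    children: dict = field(default_factory=dict)  # word -> child Node (assumed layout)


def candidate_phrases(root, theta):
    """Step 8: collect maximal root paths on which every node has freq >= theta."""
    found = []

    def visit(node, path):
        extended = False
        for word, child in node.children.items():
            if child.freq >= theta:  # prune subtrees below the threshold
                extended = True
                visit(child, path + (word,))
        if not extended and path:  # no child meets theta: path is a candidate
            found.append(path)

    visit(root, ())
    return found


def surviving_prefix(root, phrase, theta):
    """Step 9.2: longest prefix of phrase that still meets theta in this tree."""
    node, kept = root, []
    for word in phrase:
        child = node.children.get(word)
        if child is None or child.freq < theta:
            break
        kept.append(word)
        node = child
    return tuple(kept)


def persistent_phrases(trees, theta):
    """Steps 8 to 10: carry candidates from the first tree through the later ones."""
    candidates = candidate_phrases(trees[0], theta)
    for tree in trees[1:]:
        survivors = []
        for phrase in candidates:
            kept = surviving_prefix(tree, phrase, theta)
            if kept and kept not in survivors:
                survivors.append(kept)
        candidates = survivors
    return candidates


def tree_from(counts):
    """Hypothetical helper: build a tree from {phrase tuple: freq}."""
    root = Node()
    for phrase, freq in counts.items():
        node = root
        for word in phrase:
            node = node.children.setdefault(word, Node())
        node.freq = freq
    return root


# Invented sample data for two consecutive weeks.
week1 = tree_from({("new",): 60, ("new", "york"): 55,
                   ("new", "york", "city"): 52, ("flu",): 51})
week2 = tree_from({("new",): 70, ("new", "york"): 58,
                   ("new", "york", "city"): 12, ("flu",): 9})
print(persistent_phrases([week1, week2], theta=50))
```

Because only the candidates from the previous interval are looked up in each later tree, the later trees are never traversed in full, which is consistent with the candidate-driven query described in step 9.2.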

The above description of the specific implementation process for data set 2 demonstrates the effectiveness of the technical solution described in the embodiment of the present invention. In addition, to verify the query efficiency of the technical solution, the embodiment of the present invention compares it against a persistent hot phrase query method implemented based on Apriori, which is denoted Apriori; the method for extracting persistent hot phrases from text corpus set forth in the foregoing technical solution is denoted F-Tree. The tests are as follows: first, the running times of Apriori and F-Tree are compared while the length of the minimum unit time interval is varied; then, the running times of Apriori and F-Tree are compared while the minimum occurrence frequency threshold θ is varied. In this way, the query efficiency of the technical solution adopted by the embodiment of the invention is verified.

For the first experiment, if the original text corpus is data set 1, the start time of the query is set to April 1, 2009, and the length of the minimum unit time interval is set to [1, 3, 6, 9, 12] weeks in turn; if the original text corpus is data set 2, the start time of the query is set to January 1, 2004, and the length of the minimum unit time interval is set to [5, 10, 20, 50] weeks in turn. The two data sets are queried using Apriori and F-Tree respectively, and the time consumed is shown in fig. 3. As can be seen from fig. 3, F-Tree takes less time than Apriori on both data sets, and the larger the query range, the more pronounced the difference in time consumed. Overall, the query efficiency of F-Tree can reach more than 100 times that of Apriori.

For the second experiment, the minimum occurrence frequency threshold θ is set to [10, 20, 50, 100, 200] in turn for data set 1 and data set 2, and the two data sets are queried using Apriori and F-Tree respectively. The time consumed by the queries is shown in fig. 4, from which it can be seen that the efficiency of F-Tree is consistently higher than that of Apriori.

Combining the two experiments above, it can be seen that the technical solution set forth in the embodiment of the invention adapts well to queries with different requirements and maintains good efficiency across different thresholds and query ranges.

Based on the same inventive concept as the foregoing technical solution, referring to fig. 5, an apparatus 50 for extracting persistent hot phrases from text corpus according to an embodiment of the present invention is shown, where the apparatus 50 includes: a dividing part 501, a constructing part 502 and a querying part 503; wherein,

the dividing part 501 is configured to divide the original text corpus into a plurality of text sets corresponding to time intervals;

the constructing part 502 is configured to construct a frequency suffix tree corresponding to each text set based on the text suffix contained in each text set and the frequency of occurrence of each text suffix;

the query part 503 is configured to traverse the frequency suffix tree corresponding to the hot spot duration interval based on the hot spot duration interval indicated by the query instruction and the minimum occurrence frequency threshold, and query for hot spot phrases whose occurrence frequencies in the hot spot duration interval are not lower than the minimum occurrence frequency threshold.
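As an illustration of how the dividing part and the constructing part might fit together, the following Python sketch groups dated texts into per-interval text sets and builds a naive, uncompressed word-level suffix tree with a frequency counter on each node. The names `divide`, `build_tree` and `Node`, the termination mark `$`, and the sample corpus are assumptions made for illustration. Preprocessing such as stop-word removal is omitted, and the construction actually used by the embodiment may differ.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from datetime import date

END = "$"  # assumed termination mark appended to each plain text data string


@dataclass
class Node:
    freq: int = 0
    children: dict = field(default_factory=dict)  # word -> child Node (assumed layout)


def divide(corpus, start, interval_days):
    """Group (date, text) pairs into text sets keyed by time-interval index.

    Assumes every date is on or after `start`.
    """
    sets = defaultdict(list)
    for day, text in corpus:
        index = (day - start).days // interval_days
        sets[index].append(text)
    return dict(sets)


def build_tree(texts):
    """Insert every word-level suffix of every text, counting visits per node."""
    root = Node()
    for text in texts:
        words = text.split() + [END]
        for i in range(len(words)):
            node = root
            for word in words[i:]:
                node = node.children.setdefault(word, Node())
                node.freq += 1  # one more occurrence of the path ending here
    return root


# Invented sample corpus: two texts in the first week, one in the second.
corpus = [
    (date(2004, 1, 1), "new york city"),
    (date(2004, 1, 3), "new york"),
    (date(2004, 1, 9), "old york"),
]
text_sets = divide(corpus, start=date(2004, 1, 1), interval_days=7)
trees = {index: build_tree(texts) for index, texts in text_sets.items()}
print(trees[0].children["new"].children["york"].freq)
```

In this naive form the freq of a node equals the number of occurrences of the word sequence spelled by the path from the root, so freq never increases along a downward path. That monotonicity is what allows a query to stop descending as soon as a node falls below the threshold.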

In some examples, the dividing section 501 is configured to:

In some examples, referring to fig. 6, the apparatus 50 further comprises: a pre-processing portion 504 configured to:

In some examples, the construction portion 502 is configured to:

In some examples, referring to fig. 6, the apparatus 50 further comprises: a serialized storage portion 505 configured to: perform a breadth-first traversal from the root node of the frequency suffix tree corresponding to each text set, and output, for each node, the node identification, the number of child nodes, the frequency value, and the text data string recorded on the edge connecting the node to its parent node, so as to form the serialized file of each frequency suffix tree.
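The breadth-first serialization described here, together with the restoration from the root node mentioned in step 7, can be sketched as follows. The record layout `(node_id, child_count, freq, edge)`, the in-memory list standing in for the serialized file, and the names `serialize` and `restore` are assumptions for illustration; the actual file format is not specified in this passage.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Node:
    freq: int = 0
    children: dict = field(default_factory=dict)  # edge string -> child Node


def serialize(root):
    """Return breadth-first records (node_id, child_count, freq, edge_from_parent)."""
    records = []
    queue = deque([(root, "")])  # the root has no edge to a parent
    next_id = 0
    while queue:
        node, edge = queue.popleft()
        records.append((next_id, len(node.children), node.freq, edge))
        next_id += 1
        for label, child in node.children.items():
            queue.append((child, label))
    return records


def restore(records):
    """Rebuild the tree from breadth-first records, starting at the root."""
    stream = iter(records)
    _, child_count, freq, _ = next(stream)
    root = Node(freq=freq)
    pending = deque([(root, child_count)])  # nodes still waiting for children
    while pending:
        parent, remaining = pending.popleft()
        for _ in range(remaining):
            _, child_count, freq, edge = next(stream)
            child = Node(freq=freq)
            parent.children[edge] = child
            pending.append((child, child_count))
    return root


# Invented sample tree.
tree = Node(children={
    "new": Node(freq=60, children={"york": Node(freq=55)}),
    "flu": Node(freq=51),
})
records = serialize(tree)
print(records)
print(restore(records) == tree)
```

Storing the child count in each record is what makes the breadth-first stream unambiguous: the reader knows how many of the following records belong to each pending parent without any explicit parent pointer.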

In some examples, the query portion 503 is configured to:

for the frequency suffix trees corresponding to each other time interval except the first time interval in the hotspot duration interval, utilizing each candidate hotspot phrase in the candidate hotspot phrase set corresponding to the previous time interval of each other time interval to query,

in some examples, referring to fig. 6, the apparatus 50 further comprises a receiving portion 506 configured to: receiving a query instruction; the query instruction at least comprises a hotspot duration interval expected to be queried and a set minimum occurrence frequency threshold value.

The above is an illustrative solution of the apparatus 50 for extracting persistent hot phrases from text corpus according to the present embodiment. It should be noted that the technical solution of the apparatus 50 belongs to the same concept as the technical solution of the method for extracting persistent hot phrases from text corpus; therefore, for details of the apparatus 50 that are not described here, reference may be made to the description of the method.

It is understood that in this embodiment, a "part" may be part of a circuit, part of a processor, part of a program or software, and so on; it may also be a unit, and it may be modular or non-modular.

In addition, each component in the embodiment may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one unit. The integrated unit can be realized in a form of hardware or a form of a software functional module.

Based on this understanding, the technical solution of the present embodiment, in essence, or the part of it that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product that is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) or a processor to execute all or part of the steps of the method of the present embodiment. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and various other media capable of storing program code.

Accordingly, the present embodiment provides a computer storage medium storing a program for extracting persistent hot spot phrases from a corpus of text, wherein the program for extracting persistent hot spot phrases from the corpus of text implements the method steps for extracting persistent hot spot phrases from the corpus of text as described in the above technical solution when executed by at least one processor.

Based on the above apparatus 50 for extracting persistent hot phrases from text corpus and the computer storage medium, referring to fig. 7, a specific hardware structure of a computing device 70 capable of implementing the apparatus 50 is shown. The computing device 70 may be a wireless device, a mobile or cellular phone (including a so-called smart phone), a Personal Digital Assistant (PDA), a video game console (including a video display, a mobile video game device, a mobile video conference unit), a laptop computer, a desktop computer, a television set-top box, a tablet computing device, an e-book reader, a fixed or mobile media player, etc. The computing device 70 includes: a communication interface 701, a memory 702, and a processor 703; the various components are coupled together by a bus system 704. It is understood that the bus system 704 is used to enable communications among the components. The bus system 704 includes a power bus, a control bus, and a status signal bus in addition to a data bus. For clarity of illustration, however, the various buses are all labeled as the bus system 704 in fig. 7. Wherein,

the communication interface 701 is configured to receive and transmit signals in a process of receiving and transmitting information with other external network elements;

the memory 702 is used for storing a computer program capable of running on the processor 703;

the processor 703 is configured to, when the computer program is run, execute the steps of the method for extracting the persistent hot phrases from the text corpus in the foregoing technical solution, which are not described herein again.

It is to be understood that the memory 702 in embodiments of the present invention may be either volatile memory or non-volatile memory, or may include both volatile and non-volatile memory. The non-volatile memory may be a Read-Only Memory (ROM), a Programmable ROM (PROM), an Erasable PROM (EPROM), an Electrically Erasable PROM (EEPROM), or a flash memory. The volatile memory may be a Random Access Memory (RAM), which acts as an external cache. By way of illustration and not limitation, many forms of RAM are available, such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), and Direct Rambus RAM (DRRAM). The memory 702 of the systems and methods described herein is intended to comprise, without being limited to, these and any other suitable types of memory.

The processor 703 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the method may be completed by hardware integrated logic circuits in the processor 703 or by instructions in the form of software. The processor 703 may be a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and may implement or perform the various methods, steps and logic blocks disclosed in the embodiments of the present invention. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present invention may be directly performed by a hardware decoding processor, or performed by a combination of hardware and software modules in the decoding processor. The software module may be located in a storage medium well known in the art, such as RAM, flash memory, ROM, PROM, EPROM, or a register. The storage medium is located in the memory 702, and the processor 703 reads the information in the memory 702 and performs the steps of the above method in combination with its hardware.

It is to be understood that the embodiments described herein may be implemented in hardware, software, firmware, middleware, microcode, or any combination thereof. For a hardware implementation, the Processing units may be implemented within one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), general purpose processors, controllers, micro-controllers, microprocessors, other electronic units configured to perform the functions described herein, or a combination thereof.

For a software implementation, the techniques described herein may be implemented with modules (e.g., procedures, functions, and so on) that perform the functions described herein. The software codes may be stored in a memory and executed by a processor. The memory may be implemented within the processor or external to the processor.

It should be noted that: the technical schemes described in the embodiments of the present invention can be combined arbitrarily without conflict.

The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims.

Claims

1. A method for extracting persistent hot phrases from a corpus of text, the method comprising:

traversing a frequency suffix tree corresponding to the hot spot duration interval based on the hot spot duration interval indicated by the query instruction and a minimum occurrence frequency threshold, and querying to obtain hot spot phrases of which the occurrence frequency is not lower than the minimum occurrence frequency threshold in the hot spot duration interval;

wherein, the constructing a frequency suffix tree corresponding to each text set based on the text suffix contained in each text set and the frequency of occurrence of each text suffix comprises:

adding a termination mark to the end of each plain text data string for each text set; removing set symbols and stop words in text data in each text set aiming at each text set, and segmenting the stop words and the punctuation positions to obtain a plurality of plain text data strings;

2. The method of claim 1, wherein the dividing the original text corpus into a plurality of text sets corresponding to time intervals comprises:

3. The method of claim 1, further comprising:

4. The method according to claim 3, wherein the step of traversing the frequency suffix tree corresponding to the hot spot duration interval based on the hot spot duration interval indicated by the query instruction and the minimum occurrence frequency threshold value, and querying to obtain the hot spot phrases whose occurrence frequencies in the hot spot duration interval are not lower than the minimum occurrence frequency threshold value comprises:

and for the frequency suffix trees corresponding to other time intervals except the first time interval in the hotspot duration interval, utilizing each candidate hotspot phrase in the candidate hotspot phrase set corresponding to the previous time interval of the other time intervals to query.

5. The method of claim 1, further comprising:

receiving a query instruction; the query instruction at least comprises a hotspot duration interval expected to be queried and a set minimum occurrence frequency threshold value.

6. An apparatus for extracting persistent hot spot phrases from a corpus of text, the apparatus comprising: a dividing part, a constructing part and a query part; wherein,

the query part is configured to traverse a frequency suffix tree corresponding to a hot spot duration interval based on the hot spot duration interval indicated by a query instruction and a minimum occurrence frequency threshold value, and query to obtain hot spot phrases of which the occurrence frequency is not lower than the minimum occurrence frequency threshold value in the hot spot duration interval;

wherein the construction section is configured to: adding a termination mark to the end of each plain text data string for each text set; removing set symbols and stop words in text data in each text set aiming at each text set, and segmenting the stop words and the punctuation positions to obtain a plurality of plain text data strings;

7. A computing device, wherein the computing device comprises: a communication interface, a memory and a processor; wherein,

the memory for storing a computer program operable on the processor;

the processor, when executing the computer program, is configured to perform the steps of the method for extracting persistent hot phrases from text corpus as claimed in any one of claims 1 to 5.

8. A computer storage medium, characterized in that the computer readable medium stores a program for extracting persistent hot spot phrases from a corpus of text, which program, when executed by at least one processor, performs the steps of the method for extracting persistent hot spot phrases from a corpus of text as claimed in any one of claims 1 to 5.