CN111898010A - New keyword mining method and device and electronic equipment - Google Patents
- Publication number
- CN111898010A (application number CN202010664165.4A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/335—Filtering based on additional data, e.g. user or group profiles
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/355—Class or cluster creation or modification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9535—Search customisation based on user profiles and personalisation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
Abstract
The application relates to a new keyword mining method, a device, and electronic equipment. The new keyword mining method comprises: obtaining Internet text; enumerating a plurality of segmentation segments from the Internet text while simultaneously extracting the left and right adjacent characters of each segmentation segment in its current context; calculating the cohesion degree of each segmentation segment; calculating the left and right entropy of each segmentation segment from its left and right adjacent characters; and associating the cohesion degree with the left-right entropy of each segmentation segment and outputting a new keyword result table. The method and the device can rapidly mine new keywords from massive Internet text data, and the mining results are more accurate.
Description
Technical Field
The application belongs to the technical field of information processing, and particularly relates to a new keyword mining method and device and electronic equipment.
Background
With the vigorous development of the Internet industry, Internet marketing has gradually emerged. Internet marketing is a new kind of marketing activity that is based on Internet platforms and uses information technology and tools to support the exchange of concepts, products, and services between companies and customers; it creates, publicizes, and conveys customer value through online activities and manages customer relationships to achieve a given marketing purpose. In industry marketing in the Internet era, almost every brand owner pays close attention to the latest changes in industry dynamics. These changes include newly emerging brands and offerings, the pain points and user needs currently emphasized in the industry, the attention-grabbing techniques used by peers, and so on. To capture these important industry trends in the shortest time, marketing technology companies that serve brands use various technical analysis and mining means to provide brand owners with the latest industry intelligence as promptly as possible.
The traditional new keyword mining method collects recent Internet texts, uses a word segmentation tool to filter and screen the texts of the relevant industry categories in the text set, mines the latest industry feature words (brands, demands, pain points, topics, and the like), and passes them on to subsequent model analysis and business judgment. However, because industry feature words are newly coined, traditional word segmentation tools have difficulty segmenting them accurately. Moreover, with the rapid growth of text data, new keywords often need to be mined from much larger text sets, whereas the traditional method is suitable only for small samples and cannot handle large text sets.
Disclosure of Invention
The invention provides a new keyword mining method, a device, and electronic equipment, which aim to overcome two problems of the traditional new keyword mining method: it has difficulty segmenting text accurately, and, as text data grows rapidly and new keywords must be mined from ever larger text sets, it cannot handle the task of processing large text sets.
In a first aspect, the present application provides a new keyword mining method, including:
acquiring an internet text;
enumerating a plurality of segmentation segments according to the Internet text, and simultaneously extracting left and right adjacent characters of the segmentation segments in the current context;
calculating the cohesion degree of each segmentation segment;
calculating the left entropy and the right entropy of each segmentation segment according to the left adjacent character and the right adjacent character;
and associating the cohesion degree and the left-right entropy of each segmentation segment, and outputting a new keyword result table.
Further, the method further comprises:
and repairing the new keywords in the new keyword result table.
Further, the repairing of the new keywords in the new keyword result table includes:
setting a left-right entropy difference threshold and a character length threshold;
acquiring the left-right entropy difference and the character length of the current segmentation segment;
identifying an erroneous segmentation segment according to how the left-right entropy difference and the character length of the current segmentation segment compare with the set left-right entropy difference threshold and character length threshold;
and repairing the erroneous segmentation segment.
Further, the repairing of the erroneous segmentation segment includes:
performing substring segmentation on the erroneous segmentation segment to obtain sub-segmentation segments;
performing an association search for each sub-segmentation segment in the new keyword result table;
if a corresponding new keyword is found, replacing the erroneous segmentation segment with the sub-segmentation segment;
and/or:
extending the erroneous segmentation segment outward to obtain extended segmentation segments;
performing an association search for each extended segmentation segment in the new keyword result table;
and if a corresponding new keyword is found, replacing the erroneous segmentation segment with the extended segmentation segment.
Further, the method further comprises:
obtaining an internet text with industry category labels and a segmentation result corresponding to the internet text;
extracting representative new feature words of each industry from the segmentation result;
carrying out semantic clustering and classification processing on the new feature words and the new keywords in the new keyword result table;
and outputting a new keyword with industry representativeness according to the processing result.
Further, the calculating the cohesion degree of each segmented segment comprises:
acquiring the character length of each segmentation segment;
performing secondary segmentation on each segmentation segment with the character length larger than 1 to obtain sub-segmentation segments corresponding to the segmentation segments;
acquiring the occurrence frequency of the segmentation segments and the occurrence frequency of the sub-segmentation segments;
and calculating the cohesion degree of the segmentation segments according to the ratio of the occurrence frequency of the segmentation segments to the occurrence frequency of the sub-segmentation segments.
Further, the new keyword result table is a Hive table, and the obtaining of the occurrence frequency of the segmentation segments and the occurrence frequency of the sub-segmentation segments includes:
counting the occurrence frequency of the segmentation segments and the occurrence frequency of the sub-segmentation segments through distributed computing APIs on the Hive table;
the distributed computing APIs include select, group by, and join.
Further, the calculating left and right entropies of each segmented segment according to the left and right adjacent characters includes:
traversing the segmentation fragments in a distributed mode;
counting the occurrence frequency of left and right adjacent characters of each segmentation segment;
and calculating the left entropy and the right entropy of each segmentation segment according to the left adjacent character and the right adjacent character and the occurrence frequency of the left adjacent character and the right adjacent character.
In a second aspect, the present application provides a new keyword mining apparatus, including:
the acquisition module is used for acquiring the Internet text;
the extraction module is used for enumerating a plurality of segmentation segments according to the Internet text and simultaneously extracting left and right adjacent characters of the segmentation segments in the current context;
the first calculation module is used for calculating the cohesion of each segmentation segment;
the second calculation module is used for calculating the left entropy and the right entropy of each segmentation segment according to the left adjacent character and the right adjacent character;
and the association module is used for associating the cohesion and the left-right entropy of each segmentation segment and outputting a new keyword result table.
In a third aspect, the present application provides an electronic device, comprising:
a processor; and
a memory having stored thereon computer readable instructions which, when executed by the processor, implement the new keyword mining method as claimed in any one of the first aspects.
The technical scheme provided by the embodiment of the application can have the following beneficial effects:
According to the new keyword mining method, device, and electronic equipment provided by the embodiments of the invention, a plurality of segmentation segments are enumerated from the Internet text while the left and right adjacent characters of each segmentation segment in its current context are extracted at the same time; the cohesion degree of each segmentation segment is calculated; the left and right entropy of each segmentation segment is calculated from its left and right adjacent characters; and the cohesion degree and the left-right entropy of each segmentation segment are associated and a new keyword result table is output. New keywords can thus be mined quickly from massive Internet text data, and the mining results are more accurate.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application.
Fig. 1 is a flowchart of a new keyword mining method according to an embodiment of the present application.
Fig. 2 is a flowchart of a new keyword mining method according to another embodiment of the present application.
Fig. 3 is a flowchart of a new keyword mining method according to another embodiment of the present application.
Fig. 4 is a functional block diagram of a new keyword mining apparatus according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the technical solutions of the present application will be described in detail below. It is to be understood that the embodiments described are only a few embodiments of the present application and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the examples given herein without making any creative effort, shall fall within the protection scope of the present application.
Fig. 1 is a flowchart of a new keyword mining method according to an embodiment of the present application, and as shown in fig. 1, the new keyword mining method includes:
S11: acquiring an internet text;
S12: enumerating a plurality of segmentation segments according to the Internet text, and simultaneously extracting left and right adjacent characters of the segmentation segments in the current context;
S13: calculating the cohesion degree of each segmentation segment;
S14: calculating the left entropy and the right entropy of each segmentation segment according to the left adjacent character and the right adjacent character;
S15: associating the cohesion degree and the left-right entropy of each segmentation segment, and outputting a new keyword result table.
The traditional new keyword mining method collects recent Internet texts, uses a word segmentation tool to filter and screen the texts of the relevant industry categories in the text set, mines the latest industry feature words (brands, demands, pain points, topics, and the like), and passes them on to subsequent model analysis and business judgment. However, because industry feature words are newly coined, traditional word segmentation tools have difficulty segmenting them accurately. Moreover, with the rapid growth of text data, new keywords often need to be mined from much larger text sets, and the traditional method cannot handle the task of processing large text sets.
In this embodiment, the cohesion degree and the left-right entropy of each segmentation segment are calculated separately and then associated, and a new keyword result table is output, which makes the new keywords in the table more accurate. Because the cohesion degree and the left-right entropy are calculated per segmentation segment, the number of files is not limited, so the method is better suited to mining new keywords from massive Internet text.
Rather than first traversing all texts to cut out all possible segmentation segments and then searching for all adjacent characters of each segment, this embodiment extracts the left and right adjacent characters of each segmentation segment in its current context directly while the possible segmentation segments are being cut out. This greatly reduces the computation otherwise spent searching for the adjacent characters of each segmentation segment.
The traditional segmentation process uses a single-machine computing strategy, which consumes a large amount of computing time and memory; this embodiment instead uses the streaming processing technique of a distributed cluster (Hadoop Streaming) to speed up processing.
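As an illustration of how the streaming step might be organized, the following is a minimal sketch of a mapper and reducer in the Hadoop Streaming style, where each stage reads text records and emits tab-separated key/value lines. The function names `mapper` and `reducer`, the default maximum length of 4, and the assumption that input lines contain no tab characters are illustrative choices, not details taken from the application.

```python
from itertools import groupby


def mapper(lines, max_len=4):
    """Emit one 'segment<TAB>1' record for every substring of up to max_len characters."""
    for line in lines:
        text = line.strip()
        n = len(text)
        for i in range(n):
            for j in range(i + 1, min(i + max_len, n) + 1):
                yield f"{text[i:j]}\t1"


def reducer(sorted_records):
    """Sum the counts of records that share a key; input must be sorted by key."""
    pairs = (rec.rstrip("\n").split("\t") for rec in sorted_records)
    for seg, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{seg}\t{sum(int(v) for _, v in group)}"
```

In an actual streaming job each function would typically live in its own script that reads `sys.stdin` and prints to standard output, with the framework performing the sort between the two stages.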
In this embodiment, a plurality of segmentation segments are enumerated from the Internet text while the left and right adjacent characters of each segmentation segment in its current context are extracted at the same time; the cohesion degree of each segmentation segment is calculated; the left and right entropy of each segmentation segment is calculated from its left and right adjacent characters; and the cohesion degree and the left-right entropy of each segmentation segment are associated and a new keyword result table is output. New keywords can thus be mined quickly from massive Internet text data, and the mining results are more accurate.
An embodiment of the present invention provides another new keyword mining method which, as shown in the flowchart of Fig. 2, includes:
S21: acquiring an internet text;
In some embodiments, the Internet text is acquired periodically according to a preset acquisition period. The shorter the acquisition period, the more up to date the acquired Internet text. The present application does not limit the acquisition period; a person skilled in the art can set it as needed.
S22: enumerating a plurality of segmentation segments according to the Internet text, and simultaneously extracting left and right adjacent characters of the segmentation segments in the current context;
Enumerating a plurality of segmentation segments from the Internet text includes presetting a maximum segmentation length and enumerating as many segmentation segments as possible within that maximum length.
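Step S22 can be sketched as a single pass that enumerates every substring up to the maximum segmentation length and records its left and right neighbors in the same loop. This is a minimal single-machine sketch; the function name `enumerate_segments`, the default `max_len=4`, and the use of in-memory counters are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def enumerate_segments(texts, max_len=4):
    """Count every substring of up to max_len characters and its adjacent characters."""
    freq = Counter()               # segment -> occurrence count
    left = defaultdict(Counter)    # segment -> counts of the character to its left
    right = defaultdict(Counter)   # segment -> counts of the character to its right
    for text in texts:
        n = len(text)
        for i in range(n):
            for j in range(i + 1, min(i + max_len, n) + 1):
                seg = text[i:j]
                freq[seg] += 1
                # Neighbors are recorded in the same pass, so no second scan is needed.
                if i > 0:
                    left[seg][text[i - 1]] += 1
                if j < n:
                    right[seg][text[j]] += 1
    return freq, left, right
```

Segments at the very start or end of a text simply contribute no left or right neighbor in this sketch.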
S23: calculating the cohesion degree of each segmentation segment;
Some embodiments of calculating the cohesion degree of each segmentation segment include, but are not limited to, the following method:
S231: acquiring the character length of each segmentation segment;
S232: performing secondary segmentation on each segmentation segment with the character length larger than 1 to obtain sub-segmentation segments corresponding to the segmentation segments;
S233: acquiring the occurrence frequency of the segmentation segments and the occurrence frequency of the sub-segmentation segments;
S234: calculating the cohesion degree of the segmentation segments according to the ratio of the occurrence frequency of the segmentation segments to the occurrence frequency of the sub-segmentation segments.
In some embodiments, the new keyword result table is a Hive table, and the obtaining of the occurrence frequency of the segmentation segments and the occurrence frequency of the sub-segmentation segments includes:
counting the occurrence frequency of the segmentation segments and the occurrence frequency of the sub-segmentation segments through distributed computing APIs on the Hive table;
the distributed computing APIs include, but are not limited to, select, group by, and join.
For example, for a segmentation segment ABCD with a character length of 4, all of its binary splits are (A, BCD), (AB, CD), and (ABC, D). The sub-segmentation segments produced by these splits were already enumerated in step S22. At this point, distributed computing APIs of the Hive script such as select, group by, and join are all that is needed to count and associate the frequency of each segmentation segment with the frequencies of all its sub-segmentation segments and to compute the ratio between them, which gives the cohesion degree of the current segmentation segment. Because query, counting, and association operations are what distributed framework scripts handle best, both the processing scale and the computational efficiency are greatly improved compared with single-machine computing.
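The application states only that cohesion is derived from the ratio of the segment's frequency to the frequencies of its sub-segments. One common way to make that concrete, assumed here for illustration, is to take the smallest value of p(segment) / (p(left part) × p(right part)) over all binary splits. The function name `cohesion` and the `total` normalizer are hypothetical.

```python
from collections import Counter


def cohesion(segment, freq, total):
    """Smallest ratio p(segment) / (p(left) * p(right)) over all binary splits."""
    if len(segment) < 2:
        return 0.0  # a single character has no binary split
    p_seg = freq[segment] / total
    ratios = []
    for k in range(1, len(segment)):
        p_left = freq[segment[:k]] / total
        p_right = freq[segment[k:]] / total
        if p_left == 0 or p_right == 0:
            continue  # a sub-segment missing from the counts cannot be scored
        ratios.append(p_seg / (p_left * p_right))
    return min(ratios) if ratios else 0.0
```

For ABCD this evaluates the three splits (A, BCD), (AB, CD), and (ABC, D) and keeps the weakest one, so a segment scores highly only if no split explains it as a chance combination of two frequent parts.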
S24: calculating the left entropy and the right entropy of each segmentation segment according to the left adjacent character and the right adjacent character;
In some embodiments, calculating the left and right entropy of each segmentation segment from its left and right adjacent characters includes:
traversing the segmentation segments in a distributed manner;
counting the occurrence frequency of the left and right adjacent characters of each segmentation segment;
and calculating the left entropy and the right entropy of each segmentation segment according to the left and right adjacent characters and their occurrence frequencies.
It should be noted that the segmentation segments and the left and right adjacent characters of each segmentation segment in its context were already obtained by distributed traversal in step S22. Here, the output of step S22 is used directly: the occurrence frequency of each left and right adjacent character of each segmentation segment is counted from that output, and the left and right entropy of each segmentation segment is calculated from the adjacent characters and their occurrence frequencies.
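The entropy calculation can be sketched with the standard Shannon entropy over the distribution of adjacent characters, H = -Σ p(c) log p(c). The function name `neighbor_entropy` and the choice of the natural logarithm are assumptions; the application does not specify a logarithm base.

```python
import math
from collections import Counter


def neighbor_entropy(neighbor_counts):
    """Shannon entropy (natural log) of a segment's adjacent-character counts."""
    total = sum(neighbor_counts.values())
    if total == 0:
        return 0.0  # the segment was never seen with a neighbor on this side
    return -sum((c / total) * math.log(c / total) for c in neighbor_counts.values())
```

Calling it once with a segment's left-neighbor counts and once with its right-neighbor counts gives the left entropy and the right entropy respectively.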
S25: associating the cohesion degree and the left-right entropy of each segmentation segment, and outputting a new keyword result table;
Traditional data mining methods also use indicators such as the cohesion degree or the left-right entropy. However, they use only one of the two indicators and store the mining results in separate result tables, so many files are output and the accuracy of the results in each file is low. In the present method, the cohesion degree and the left-right entropy of each segmentation segment are associated and the associated results are presented in a single new keyword result table, which facilitates subsequent repair processing and improves the accuracy of new keyword mining.
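The association step amounts to joining the per-segment cohesion values with the per-segment entropies on the segment key. The sketch below performs that join in memory with dictionaries; the function name `build_result_table`, the row field names, and the sort order are illustrative assumptions rather than the schema of the application's result table.

```python
def build_result_table(cohesion_by_seg, left_entropy, right_entropy):
    """Inner-join three per-segment mappings into one list of result rows."""
    common = cohesion_by_seg.keys() & left_entropy.keys() & right_entropy.keys()
    rows = [
        {
            "segment": seg,
            "cohesion": cohesion_by_seg[seg],
            "left_entropy": left_entropy[seg],
            "right_entropy": right_entropy[seg],
        }
        for seg in common
    ]
    # Highest cohesion first; ties are broken alphabetically for a stable order.
    rows.sort(key=lambda r: (-r["cohesion"], r["segment"]))
    return rows
```

Keeping both indicators in the same row is what lets a later step compare them for a single segment without consulting a second table.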
S26: and repairing the new keywords in the new keyword result table.
In actual mining, erroneous segmentation segments (for example, segments with missing or redundant characters) often rank near the top of the new keyword result table. Further error identification and repair of the new keywords is therefore required.
In some embodiments, repairing the new keywords in the new keyword result table includes:
setting a left-right entropy difference threshold and a character length threshold;
acquiring the left-right entropy difference and the character length of the current segmentation segment;
identifying an erroneous segmentation segment according to how the left-right entropy difference and the character length of the current segmentation segment compare with the set left-right entropy difference threshold and character length threshold;
and repairing the erroneous segmentation segment.
As an optional implementation manner of the present invention, repairing the erroneous segmentation segment includes:
performing substring segmentation on the erroneous segmentation segment to obtain sub-segmentation segments;
performing an association search for each sub-segmentation segment in the new keyword result table;
and if a corresponding new keyword is found, replacing the erroneous segmentation segment with the sub-segmentation segment.
As another optional implementation manner of the present invention, repairing the erroneous segmentation segment includes:
extending the erroneous segmentation segment outward to obtain extended segmentation segments;
performing an association search for each extended segmentation segment in the new keyword result table;
and if a corresponding new keyword is found, replacing the erroneous segmentation segment with the extended segmentation segment.
For example, data with high cohesion and with left and right entropies greater than a preset threshold are first obtained through efficient screening by a distributed script. Because the left and right entropies represent the uncertainty of the left and right adjacent characters, the larger they are, the more likely it is that the current segmentation segment constitutes a keyword on its own.
Then, for a segmentation segment whose character length is smaller than the character length threshold (e.g., 5) and whose absolute left-right entropy difference is larger than the left-right entropy difference threshold (e.g., 0.2), it is very likely that characters are missing on the left or right side of the segment (for example, the segment "lanko small black" is missing a character; the complete segment should be "lanko small black bottle"). To find a replacement for such a short, possibly truncated segment, substring segmentation can be performed on the segmentation segments whose character length is not less than 5 and whose absolute left-right entropy difference is not greater than 0.2, and the resulting substrings can be associated with the shorter segment to determine whether a better, longer segmentation segment exists. (For example, the substrings of "lanko small black bottle" include "lanko small black" and "lanko small", among others.)
Similarly, a segmentation segment whose character length is greater than the character length threshold and whose absolute left-right entropy difference is also greater than the left-right entropy difference threshold is likely to have redundant characters on its left or right side (e.g., "ashira"). By performing substring segmentation on such a segment and running an association search against segmentation segments that are shorter and have a smaller absolute left-right entropy difference, it can be determined whether the current longer segment can be replaced by a shorter and possibly more accurate one.
It should be noted that a segmentation segment for which the association search finds no better match can be retained, and it can then be analyzed whether insufficient context data caused the identification deviation.
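The identification and repair rules described above can be sketched as follows, using the example thresholds of 0.2 and 5 from the text. The function names, the row field names, and the tie-breaking choices (shortest containing segment for a truncated segment, longest contained segment for an over-long one) are assumptions added for illustration; the application does not specify how to choose among several matches.

```python
def entropy_gap(row):
    """Absolute difference between a row's left and right entropy."""
    return abs(row["left_entropy"] - row["right_entropy"])


def repair_segments(rows, gap_threshold=0.2, length_threshold=5):
    """Return one (possibly replaced) segment per input row, in input order."""
    # Segments with a small entropy gap serve as trusted replacement candidates.
    stable = [r["segment"] for r in rows if entropy_gap(r) <= gap_threshold]
    repaired = []
    for row in rows:
        seg = row["segment"]
        replacement = seg  # unmatched segments are retained for later analysis
        if entropy_gap(row) > gap_threshold:
            if len(seg) < length_threshold:
                # Likely missing characters: look for a trusted longer segment containing it.
                longer = [s for s in stable if len(s) >= length_threshold and seg in s]
                if longer:
                    replacement = min(longer, key=len)
            elif len(seg) > length_threshold:
                # Likely redundant characters: look for a trusted shorter substring.
                shorter = [s for s in stable if len(s) < len(seg) and s in seg]
                if shorter:
                    replacement = max(shorter, key=len)
        repaired.append(replacement)
    return repaired
```

The linear scan over `stable` stands in for the association search; at scale that lookup would more plausibly be a join against the result table.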
In this embodiment, a plurality of segmentation segments are enumerated from the Internet text while the left and right adjacent characters of each segmentation segment in its current context are extracted at the same time, the cohesion degree and the left-right entropy of each segmentation segment are associated, and a new keyword result table is output. Practical verification shows that, for new keyword mining over text volumes in the tens of millions, the computation time can be shortened from several days to 1-2 hours and the mining results are more accurate and reliable. Identifying and repairing erroneous segmentation segments further improves the accuracy of the output new keywords.
Fig. 3 is a flowchart of a new keyword mining method according to another embodiment of the present application, and as shown in fig. 3, on the basis of the foregoing embodiment, the new keyword mining method further includes:
S31: obtaining an internet text with industry category labels and a segmentation result corresponding to the internet text;
S32: extracting representative new feature words of each industry from the segmentation result;
The methods for extracting the representative new feature words of each industry include statistical methods such as the chi-square test and information gain.
S33: carrying out semantic clustering and classification processing on the new feature words and the new keywords in the new keyword result table;
S34: outputting a new keyword with industry representativeness according to the processing result.
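For the feature word extraction in step S32, the chi-square statistic of a word against an industry category can be computed from a 2×2 contingency table of document counts. This is a minimal sketch of that standard formula; the function name `chi_square` and the argument naming are illustrative, and the application does not say which variant of the statistic it uses.

```python
def chi_square(n11, n10, n01, n00):
    """Chi-square statistic for a word/category 2x2 table of document counts.

    n11: documents in the category that contain the word
    n10: documents outside the category that contain the word
    n01: documents in the category that do not contain the word
    n00: documents outside the category that do not contain the word
    """
    n = n11 + n10 + n01 + n00
    denom = (n11 + n01) * (n11 + n10) * (n10 + n00) * (n01 + n00)
    if denom == 0:
        return 0.0  # a degenerate table carries no evidence of association
    return n * (n11 * n00 - n10 * n01) ** 2 / denom
```

Words with the highest statistic for a category are the candidates most strongly associated with that industry; a word spread evenly across categories scores zero.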
In the embodiment, the brand owner can be helped to more efficiently and accurately dig out the latest industry characteristics by combining with the industry characteristic words; and the method plays an important promoting role in tracking the dynamic state of the industry and improving the accuracy and integrity of data reports.
An embodiment of the present invention provides a new keyword mining device. As shown in the functional structure diagram of Fig. 4, the new keyword mining device includes:
an obtaining module 41, configured to obtain an internet text;
the extraction module 42 is configured to enumerate a plurality of segmentation segments according to the internet text, and simultaneously extract left and right adjacent characters of the segmentation segments in the current context;
a first calculating module 43, configured to calculate a cohesion degree of each segmented segment;
the second calculation module 44 is configured to calculate left and right entropies of each segmented segment according to the left and right adjacent characters;
and the associating module 45 is configured to associate the cohesion and the left-right entropy of each segment and output a new keyword result table.
In some embodiments, the device further includes a repair module 46, configured to repair the new keywords in the new keyword result table.
In some embodiments, the device further includes an output module 47, configured to obtain internet text labeled with industry categories and the segmentation result corresponding to that internet text, extract representative new feature words of each industry from the segmentation result, perform semantic clustering and classification processing on the new feature words and the new keywords in the new keyword result table, and output industry-representative new keywords according to the processing result.
In this embodiment, the obtaining module obtains the internet text; the extraction module enumerates a plurality of segmentation segments according to the internet text and simultaneously extracts the left and right adjacent characters of the segmentation segments in the current context; the first calculation module calculates the cohesion of each segmentation segment; the second calculation module calculates the left and right entropies of each segmentation segment according to the left and right adjacent characters; and the association module associates the cohesion and the left and right entropies of each segmentation segment and outputs a new keyword result table. New keywords can thus be mined quickly from massive internet text data, and the mining result is more accurate.
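The association performed by the association module might, as one illustrative possibility, be a join of the per-segment scores keyed by segment. The function name `build_result_table`, the row layout, and the optional `min_cohesion` and `min_entropy` filters are assumptions of this sketch; this section does not specify how, or whether, the associated scores are thresholded.

```python
def build_result_table(cohesion, left_entropy, right_entropy,
                       min_cohesion=0.0, min_entropy=0.0):
    """Join three per-segment score mappings on the segment key and keep
    rows that pass the (assumed) thresholds, highest cohesion first."""
    rows = []
    # Only segments that have all three scores take part in the join.
    for seg in cohesion.keys() & left_entropy.keys() & right_entropy.keys():
        le, re_ = left_entropy[seg], right_entropy[seg]
        if cohesion[seg] >= min_cohesion and min(le, re_) >= min_entropy:
            rows.append({
                "segment": seg,
                "cohesion": cohesion[seg],
                "left_entropy": le,
                "right_entropy": re_,
            })
    rows.sort(key=lambda r: (-r["cohesion"], r["segment"]))
    return rows

# Hypothetical scores for two segments; "bc" has no right entropy.
table = build_result_table(
    cohesion={"ab": 3.0, "bc": 1.0},
    left_entropy={"ab": 0.7, "bc": 0.0},
    right_entropy={"ab": 0.7},
)
```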
The present embodiment provides an electronic device, including:
a processor; and
a memory having computer readable instructions stored thereon which, when executed by the processor, implement the new keyword mining method as in any one of the above embodiments.
It should be noted that the new keyword mining method, the new keyword mining device and the electronic device belong to a general inventive concept, and the contents in the embodiments of the new keyword mining method, the new keyword mining device and the electronic device are mutually applicable.
It is understood that the same or similar parts in the above embodiments may be mutually referred to, and the same or similar parts in other embodiments may be referred to for the content which is not described in detail in some embodiments.
It should be noted that, in the description of the present application, the terms "first", "second", etc. are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. Further, in the description of the present application, the meaning of "a plurality" means at least two unless otherwise specified.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and the scope of the preferred embodiments of the present application includes other implementations in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present application.
It should be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
It will be understood by those skilled in the art that all or part of the steps of the methods of the above embodiments may be implemented by a program instructing the relevant hardware. The program may be stored in a computer readable storage medium and, when executed, performs one or a combination of the steps of the method embodiments.
In addition, functional units in the embodiments of the present application may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc.
In the description herein, reference to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the application. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
Although embodiments of the present application have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present application, and that variations, modifications, substitutions and alterations may be made to the above embodiments by those of ordinary skill in the art within the scope of the present application.
It should be noted that the present invention is not limited to the above-mentioned preferred embodiments. Those skilled in the art may derive other products in various forms in light of the present invention; however, regardless of any change in shape or structure, any technical solution that is the same as or similar to that of the present application falls within the scope of protection of the present invention.
Claims (10)
1. A new keyword mining method is characterized by comprising the following steps:
acquiring an internet text;
enumerating a plurality of segmentation segments according to the Internet text, and simultaneously extracting left and right adjacent characters of the segmentation segments in the current context;
calculating the cohesion degree of each segmentation segment;
calculating the left entropy and the right entropy of each segmentation segment according to the left adjacent character and the right adjacent character;
and associating the cohesion degree and the left-right entropy of each segmentation segment, and outputting a new keyword result table.
2. The method of claim 1, further comprising:
and repairing the new keywords in the new keyword result table.
3. The method of claim 2, wherein the repairing new keywords in the new keyword result table comprises:
setting a left-right entropy difference threshold value and a character length threshold value;
acquiring a left-right entropy difference value and a character length of a current segmentation segment;
identifying an error segmentation segment according to the relation between the left-right entropy difference value and the character length of the current segmentation segment and the set left-right entropy difference value threshold and character length threshold;
and repairing the error segmentation segment.
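A minimal sketch of the identification step of claim 3 is given below for illustration only. The claim states that the left-right entropy difference and the character length are compared against set thresholds but does not, in this section, fix the direction of either comparison. The rule used here (a large entropy imbalance on a segment at least `length_threshold` characters long) and the default threshold values are therefore assumptions.

```python
def is_error_segment(segment, left_entropy, right_entropy,
                     diff_threshold=1.0, length_threshold=3):
    """Flag a segment as a likely segmentation error.

    ASSUMPTION: a segment is flagged when the absolute difference between
    its left and right entropy exceeds diff_threshold and its character
    length is at least length_threshold. The source does not state this
    rule; both comparisons are illustrative.
    """
    diff = abs(left_entropy - right_entropy)
    return diff > diff_threshold and len(segment) >= length_threshold

# Hypothetical values: a lopsided segment and a balanced one.
flagged = is_error_segment("keywords", left_entropy=2.5, right_entropy=0.3)
balanced = is_error_segment("keyword", left_entropy=2.1, right_entropy=1.9)
```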
4. The method for mining new keywords according to claim 3, wherein the repairing the error segmentation segment comprises:
performing substring segmentation on the error segmentation segment to obtain sub-segmentation segments;
performing association search on each sub-segmentation segment in the new keyword result table;
if the corresponding new keyword is found, replacing the error segmentation segment with a sub-segmentation segment;
and/or;
performing outward concatenation on the error segmentation segment to obtain outward-concatenated segmentation segments;
performing association search on each outward-concatenated segmentation segment in the new keyword result table;
and if the corresponding new keyword is found, replacing the error segmentation segment with the outward-concatenated segmentation segment.
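The two repair strategies of claim 4 (substring segmentation, and extension of the segment outward by adjacent characters) might be sketched as below. The helper names, the `known_keywords` set standing in for the new keyword result table, the longest-first search order, and trying substrings before outward extension are all assumptions of this sketch; the claim leaves the order open ("and/or").

```python
def sub_segments(segment, min_len=2):
    """Yield proper substrings of the segment, longest first."""
    n = len(segment)
    for length in range(n - 1, min_len - 1, -1):
        for start in range(n - length + 1):
            yield segment[start:start + length]

def extended_segments(segment, left_chars, right_chars):
    """Yield the segment extended by one adjacent character on either side."""
    for ch in left_chars:
        yield ch + segment
    for ch in right_chars:
        yield segment + ch

def repair(segment, known_keywords, left_chars=(), right_chars=()):
    """Return a replacement found in known_keywords, or None if no
    candidate matches (the caller may then retain the segment)."""
    for cand in sub_segments(segment):
        if cand in known_keywords:
            return cand
    for cand in extended_segments(segment, left_chars, right_chars):
        if cand in known_keywords:
            return cand
    return None

# Hypothetical result-table contents.
known = {"keyword", "mining"}
fixed_by_substring = repair("keywords", known)
fixed_by_extension = repair("minin", known, right_chars=("g",))
unmatched = repair("xyz", known)
```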
5. The method for mining new keywords according to any one of claims 1 to 4, further comprising:
obtaining an internet text with industry category labels and a segmentation result corresponding to the internet text;
extracting representative new feature words of each industry from the segmentation result;
carrying out semantic clustering and classification processing on the new feature words and the new keywords in the new keyword result table;
and outputting a new keyword with industry representativeness according to the processing result.
6. The method of claim 1, wherein the calculating the cohesion of each segmented segment comprises:
acquiring the character length of each segmentation segment;
performing secondary segmentation on each segmentation segment with the character length larger than 1 to obtain sub-segmentation segments corresponding to the segmentation segments;
acquiring the occurrence frequency of the segmentation segments and the occurrence frequency of the sub-segmentation segments;
and calculating the cohesion degree of the segmentation segments according to the ratio of the occurrence frequency of the segmentation segments to the occurrence frequency of the sub-segmentation segments.
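One common reading of the frequency ratio in claim 6, offered here only as an illustrative sketch, divides the probability of the segment by the product of the probabilities of its two sub-segments and takes the minimum over all split points. The exact formula, the function name `cohesion`, and the `total` normalizer are assumptions and are not confirmed by the claim wording.

```python
import math

def cohesion(segment, freq, total):
    """Minimum over all binary splits of
    p(segment) / (p(left part) * p(right part)),
    computed from raw counts to avoid intermediate rounding."""
    if len(segment) < 2:
        return None                  # only segments longer than 1 are split
    best = math.inf
    for k in range(1, len(segment)):
        f_left = freq.get(segment[:k], 0)
        f_right = freq.get(segment[k:], 0)
        if f_left == 0 or f_right == 0:
            continue                 # sub-segment was never counted
        ratio = freq[segment] * total / (f_left * f_right)
        best = min(best, ratio)
    return best

# Hypothetical counts: "ab" occurs 2 times, "a" and "b" 2 times each,
# out of 6 counted segments in total.
score = cohesion("ab", {"ab": 2, "a": 2, "b": 2}, total=6)
```

A ratio well above 1 suggests that the characters co-occur more often than chance would predict, which is the intuition behind treating cohesion as evidence of a word.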
7. The method of claim 6, wherein the new keyword result table is a hive table, and the obtaining of the occurrence frequency of the segmentation segments and the occurrence frequency of the sub-segmentation segments comprises:
counting the occurrence frequency of the segmentation segments and the occurrence frequency of the sub-segmentation segments through the distributed computing API of the hive table;
wherein the distributed computing API includes select, group by, and join.
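The select, group by, and join operations named in claim 7 can be illustrated with an in-memory SQLite database standing in for Hive. This substitution is an assumption made so that the sketch runs locally: the table names `segments` and `splits` and their columns are hypothetical, and a real Hive deployment would execute a comparable query in a distributed manner.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One row per occurrence of a segment in the corpus.
conn.execute("CREATE TABLE segments (segment TEXT)")
# One row per (segment, sub-segment) pair produced by secondary segmentation.
conn.execute("CREATE TABLE splits (segment TEXT, sub_segment TEXT)")
conn.executemany(
    "INSERT INTO segments VALUES (?)",
    [("ab",), ("ab",), ("a",), ("a",), ("a",), ("b",), ("b",)],
)
conn.executemany(
    "INSERT INTO splits VALUES (?, ?)",
    [("ab", "a"), ("ab", "b")],
)

query = """
WITH freq AS (
    SELECT segment, COUNT(*) AS cnt
    FROM segments
    GROUP BY segment
)
SELECT s.segment, f_seg.cnt, s.sub_segment, f_sub.cnt
FROM splits AS s
JOIN freq AS f_seg ON f_seg.segment = s.segment
JOIN freq AS f_sub ON f_sub.segment = s.sub_segment
ORDER BY s.sub_segment
"""
# Each row pairs a segment's frequency with one sub-segment's frequency.
rows = conn.execute(query).fetchall()
conn.close()
```

Expressing the counting as a grouped aggregation followed by joins lets the engine, rather than application code, decide how to partition the work.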
8. The method for mining new keywords according to claim 1, wherein the calculating left and right entropy of each segmented segment according to the left and right adjacent characters comprises:
traversing the segmentation segments in a distributed manner;
counting the occurrence frequency of left and right adjacent characters of each segmentation segment;
and calculating the left entropy and the right entropy of each segmentation segment according to the left adjacent character and the right adjacent character and the occurrence frequency of the left adjacent character and the right adjacent character.
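For illustration, the left or right entropy of a segment can be computed as the Shannon entropy of the distribution of its adjacent characters on that side. The function name `neighbor_entropy` and the use of the natural logarithm are assumptions of this sketch; the claim does not fix a logarithm base.

```python
import math
from collections import Counter

def neighbor_entropy(neighbor_counts):
    """Shannon entropy (natural log) of a neighbor-character distribution.

    neighbor_counts maps each adjacent character to its occurrence count.
    """
    total = sum(neighbor_counts.values())
    if total == 0:
        return 0.0
    return sum((c / total) * math.log(total / c)
               for c in neighbor_counts.values())

# Hypothetical counts: the right side varies, the left side never does.
right_entropy = neighbor_entropy(Counter({"c": 1, "d": 1}))
left_entropy = neighbor_entropy(Counter({"c": 2}))
```

A segment that is followed or preceded by many different characters has high entropy on that side, which indicates a free boundary; a value near zero indicates that the segment is probably part of a longer unit.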
9. A new keyword mining device, comprising:
the acquisition module is used for acquiring the Internet text;
the extraction module is used for enumerating a plurality of segmentation segments according to the Internet text and simultaneously extracting left and right adjacent characters of the segmentation segments in the current context;
the first calculation module is used for calculating the cohesion of each segmentation segment;
the second calculation module is used for calculating the left entropy and the right entropy of each segmentation segment according to the left adjacent character and the right adjacent character;
and the association module is used for associating the cohesion and the left-right entropy of each segmentation segment and outputting a new keyword result table.
10. An electronic device, comprising:
a processor; and
a memory having stored thereon computer readable instructions which, when executed by the processor, implement the new keyword mining method of any one of claims 1 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010664165.4A CN111898010A (en) | 2020-07-10 | 2020-07-10 | New keyword mining method and device and electronic equipment |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111898010A (en) | 2020-11-06 |
Family
ID=73192331
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010664165.4A Pending CN111898010A (en) | 2020-07-10 | 2020-07-10 | New keyword mining method and device and electronic equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111898010A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112527909A (en) * | 2020-12-24 | 2021-03-19 | 东软睿驰汽车技术(沈阳)有限公司 | Data slice processing method and device and related products |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20060206306A1 (en) * | 2005-02-09 | 2006-09-14 | Microsoft Corporation | Text mining apparatus and associated methods |
CN102930055A (en) * | 2012-11-18 | 2013-02-13 | 浙江大学 | New network word discovery method in combination with internal polymerization degree and external discrete information entropy |
CN104102658A (en) * | 2013-04-09 | 2014-10-15 | 腾讯科技(深圳)有限公司 | Method and device for mining text contents |
CN105260362A (en) * | 2015-10-30 | 2016-01-20 | 小米科技有限责任公司 | New word extraction method and device |
CN106021230A (en) * | 2016-05-19 | 2016-10-12 | 无线生活(杭州)信息科技有限公司 | Word segmentation method and word segmentation apparatus |
CN107515849A (en) * | 2016-06-15 | 2017-12-26 | 阿里巴巴集团控股有限公司 | It is a kind of into word judgment model generating method, new word discovery method and device |
CN109635296A (en) * | 2018-12-08 | 2019-04-16 | 广州荔支网络技术有限公司 | Neologisms method for digging, device computer equipment and storage medium |
CN109885831A (en) * | 2019-01-30 | 2019-06-14 | 广州杰赛科技股份有限公司 | Key Term abstracting method, device, equipment and computer readable storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US9418144B2 (en) | Similar document detection and electronic discovery | |
CN110874530B (en) | Keyword extraction method, keyword extraction device, terminal equipment and storage medium | |
US8630972B2 (en) | Providing context for web articles | |
US8285745B2 (en) | User query mining for advertising matching | |
CN107423278B (en) | Evaluation element identification method, device and system | |
CN113590645B (en) | Searching method, searching device, electronic equipment and storage medium | |
US20160217158A1 (en) | Image search method, image search system, and information recording medium | |
CN104504109A (en) | Image search method and device | |
WO2021175009A1 (en) | Early warning event graph construction method and apparatus, device, and storage medium | |
US20150032708A1 (en) | Database analysis apparatus and method | |
CN104881458A (en) | Labeling method and device for web page topics | |
CN104573130A (en) | Entity resolution method based on group calculation and entity resolution device based on group calculation | |
CN111158964B (en) | Disk failure prediction method, system, device and storage medium | |
CN111552800A (en) | Abstract generation method and device, electronic equipment and medium | |
CN106980639B (en) | Short text data aggregation system and method | |
CN115146062A (en) | Intelligent event analysis method and system fusing expert recommendation and text clustering | |
CN116881430B (en) | Industrial chain identification method and device, electronic equipment and readable storage medium | |
CN111538903B (en) | Method and device for determining search recommended word, electronic equipment and computer readable medium | |
CN111898010A (en) | New keyword mining method and device and electronic equipment | |
CN110263345B (en) | Keyword extraction method, keyword extraction device and storage medium | |
CN112685374B (en) | Log classification method and device and electronic equipment | |
CN117009518A (en) | Similar event judging method integrating basic attribute and text content and application thereof | |
US20210089539A1 (en) | Associating user-provided content items to interest nodes | |
CN116225848A (en) | Log monitoring method, device, equipment and medium | |
Shan et al. | Automatic Generation of Piano Score Following Videos. |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||