CN111898010A

CN111898010A - New keyword mining method and device and electronic equipment

Info

Publication number: CN111898010A
Application number: CN202010664165.4A
Authority: CN
Inventors: 唐亮; 赵伟
Original assignee: Social Touch Beijing Technology Co ltd
Current assignee: Social Touch Beijing Technology Co ltd
Priority date: 2020-07-10
Filing date: 2020-07-10
Publication date: 2020-11-06

Abstract

The application relates to a new keyword mining method, a device, and electronic equipment. The new keyword mining method comprises: obtaining an internet text; enumerating a plurality of segmentation segments according to the internet text, and simultaneously extracting the left and right adjacent characters of the segmentation segments in the current context; calculating the cohesion degree of each segmentation segment; calculating the left entropy and the right entropy of each segmentation segment according to the left and right adjacent characters; and associating the cohesion degree and the left and right entropy of each segmentation segment, and outputting a new keyword result table. The method and the device can rapidly mine new keywords from massive internet text data, and the mining result is more accurate.

Description

New keyword mining method and device and electronic equipment

Technical Field

The application belongs to the technical field of information processing, and particularly relates to a new keyword mining method and device and electronic equipment.

Background

With the vigorous development of the internet industry, internet marketing has gradually emerged. Internet marketing is a new form of marketing that is based on internet platforms and uses information technology and tools to support the exchange of concepts, products, and services between companies and customers; it creates, publicizes, and conveys customer value through online activities, and manages customer relationships to achieve a given marketing purpose. In industry marketing in the internet era, almost every brand owner pays close attention to the latest changes in the industry. These changes include newly emerging branded products, the pain points and user needs currently emphasized in the industry, and the attention-grabbing techniques used by competitors, among others. To capture these important industry trends in the shortest possible time, marketing technology companies that serve brands use various technical analysis and mining methods to provide brand owners with the latest industry intelligence as promptly as possible.

The traditional new keyword mining method collects recent internet texts, uses a word segmentation tool to filter the internet text set by relevant industry category, mines the latest industry feature words (brands, demands, pain points, topics, and the like), and passes them on to subsequent model analysis and business judgment. However, because these industry feature words are newly coined, a traditional word segmentation tool has difficulty segmenting them accurately. Moreover, with the rapid growth of text data, new keywords often have to be mined from ever larger text sets, whereas the traditional method is only suitable for small sample data and cannot handle the processing of large text sets.

Disclosure of Invention

The invention provides a new keyword mining method, a device, and electronic equipment, and aims to overcome the problems that the traditional new keyword mining method has difficulty segmenting text accurately and, as text data grows rapidly and new keywords must be mined from ever larger text sets, cannot handle the processing of large text sets.

In a first aspect, the present application provides a new keyword mining method, including:

acquiring an internet text;

enumerating a plurality of segmentation segments according to the Internet text, and simultaneously extracting left and right adjacent characters of the segmentation segments in the current context;

calculating the cohesion degree of each segmentation segment;

calculating the left entropy and the right entropy of each segmentation segment according to the left adjacent character and the right adjacent character;

and associating the cohesion degree and the left and right entropy of each segmentation segment, and outputting a new keyword result table.

Further, the method further comprises:

and repairing the new keywords in the new keyword result table.

Further, the repairing the new keyword in the new keyword result table includes:

setting a left-right entropy difference threshold value and a character length threshold value;

acquiring a left-right entropy difference value and a character length of a current segmentation segment;

identifying an error segmentation segment according to the relation between the left-right entropy difference value and the character length of the current segmentation segment and the set left-right entropy difference value threshold and character length threshold;

and repairing the error segmentation segment.

Further, the repairing the error segmentation segment includes:

performing substring segmentation on the error segmentation segment to obtain sub-segmentation segments;

performing an association search for each sub-segmentation segment in the new keyword result table;

if a corresponding new keyword is found, replacing the error segmentation segment with the sub-segmentation segment;

and/or

performing outward concatenation on the error segmentation segment to obtain extended segmentation segments;

performing an association search for each extended segmentation segment in the new keyword result table;

and if a corresponding new keyword is found, replacing the error segmentation segment with the extended segmentation segment.

Further, the method further comprises:

obtaining an internet text with industry category labels and a segmentation result corresponding to the internet text;

extracting representative new feature words of each industry from the segmentation result;

carrying out semantic clustering and classification processing on the new feature words and the new keywords in the new keyword result table;

and outputting a new keyword with industry representativeness according to the processing result.

Further, the calculating the cohesion degree of each segmented segment comprises:

acquiring the character length of each segmentation segment;

performing secondary segmentation on each segmentation segment with the character length larger than 1 to obtain sub-segmentation segments corresponding to the segmentation segments;

acquiring the occurrence frequency of the segmentation segments and the occurrence frequency of the sub-segmentation segments;

and calculating the cohesion degree of the segmentation segments according to the ratio of the occurrence frequency of the segmentation segments to the occurrence frequency of the sub-segmentation segments.

Further, the new keyword result table is a Hive table, and the obtaining of the occurrence frequency of the segmentation segments and the occurrence frequency of the sub-segmentation segments includes:

counting the occurrence frequency of the segmentation segments and the occurrence frequency of the sub-segmentation segments through the distributed computing APIs of the Hive table;

the distributed computing APIs include select, group by, and join.

Further, the calculating left and right entropies of each segmented segment according to the left and right adjacent characters includes:

traversing the segmentation segments in a distributed manner;

counting the occurrence frequency of left and right adjacent characters of each segmentation segment;

and calculating the left entropy and the right entropy of each segmentation segment according to the left adjacent character and the right adjacent character and the occurrence frequency of the left adjacent character and the right adjacent character.

In a second aspect, the present application provides a new keyword mining apparatus, including:

the acquisition module is used for acquiring the Internet text;

the extraction module is used for enumerating a plurality of segmentation segments according to the Internet text and simultaneously extracting left and right adjacent characters of the segmentation segments in the current context;

the first calculation module is used for calculating the cohesion of each segmentation segment;

the second calculation module is used for calculating the left entropy and the right entropy of each segmentation segment according to the left adjacent character and the right adjacent character;

and the association module is used for associating the cohesion and the left-right entropy of each segmentation segment and outputting a new keyword result table.

In a third aspect, the present application provides an electronic device, comprising:

a processor; and

a memory having stored thereon computer readable instructions which, when executed by the processor, implement the new keyword mining method as claimed in any one of the first aspects.

The technical scheme provided by the embodiment of the application can have the following beneficial effects:

According to the new keyword mining method, device, and electronic equipment provided by the embodiments of the present invention, a plurality of segmentation segments are enumerated according to the internet text while the left and right adjacent characters of the segmentation segments in the current context are extracted; the cohesion degree of each segmentation segment is calculated; the left and right entropy of each segmentation segment is calculated according to the left and right adjacent characters; and the cohesion degree and the left and right entropy of each segmentation segment are associated and a new keyword result table is output. In this way, new keywords can be quickly mined from massive internet text data, and the mining result is more accurate.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application.

Fig. 1 is a flowchart of a new keyword mining method according to an embodiment of the present application.

Fig. 2 is a flowchart of a new keyword mining method according to another embodiment of the present application.

Fig. 3 is a flowchart of a new keyword mining method according to another embodiment of the present application.

Fig. 4 is a functional block diagram of a new keyword mining apparatus according to an embodiment of the present application.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the technical solutions of the present application will be described in detail below. It is to be understood that the embodiments described are only a few embodiments of the present application and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the examples given herein without making any creative effort, shall fall within the protection scope of the present application.

Fig. 1 is a flowchart of a new keyword mining method according to an embodiment of the present application, and as shown in fig. 1, the new keyword mining method includes:

S11: acquiring an internet text;

S12: enumerating a plurality of segmentation segments according to the internet text, and simultaneously extracting left and right adjacent characters of the segmentation segments in the current context;

S13: calculating the cohesion degree of each segmentation segment;

S14: calculating the left entropy and the right entropy of each segmentation segment according to the left and right adjacent characters;

S15: associating the cohesion degree and the left and right entropy of each segmentation segment, and outputting a new keyword result table.

The traditional new keyword mining method collects recent internet texts, uses a word segmentation tool to filter the internet text set by relevant industry category, mines the latest industry feature words (brands, demands, pain points, topics, and the like), and passes them on to subsequent model analysis and business judgment. However, because these industry feature words are newly coined, a traditional word segmentation tool has difficulty segmenting them accurately. Moreover, with the rapid growth of text data, new keywords often have to be mined from ever larger text sets, and the traditional method cannot handle the processing of large text sets.

In this embodiment, the cohesion degree and the left and right entropy of each segmentation segment are calculated separately and then associated, and a new keyword result table is output, which makes the new keywords in the result table more accurate. Because the cohesion degree and the left and right entropy are calculated per segmentation segment, the number of files is not limited, so the method is better suited to mining new keywords from massive internet text.

The method traverses all texts and cuts out all possible segmentation segments. Rather than searching for the adjacent characters of each segmentation segment in a separate pass, the left and right adjacent characters of each segmentation segment in its current context are extracted directly while the segments are being cut, which greatly reduces the computation spent on searching for the adjacent characters of each segment.
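As a rough illustration of the single-pass idea in the preceding paragraph, the following Python sketch enumerates every segment up to a maximum length and records the left and right neighbor of each occurrence inside the same loop. It is a single-machine sketch rather than the distributed implementation; the function name `enumerate_segments`, the `max_len` parameter, and the boundary markers `^` and `$` are assumptions made for illustration only.

```python
from collections import Counter, defaultdict


def enumerate_segments(text, max_len=4):
    """Enumerate all segments of up to max_len characters in one pass,
    recording the left and right neighbor of every occurrence."""
    freq = Counter()               # segment -> number of occurrences
    left = defaultdict(Counter)    # segment -> counts of left neighbors
    right = defaultdict(Counter)   # segment -> counts of right neighbors
    n = len(text)
    for i in range(n):
        for j in range(i + 1, min(i + max_len, n) + 1):
            seg = text[i:j]
            freq[seg] += 1
            # "^" and "$" are hypothetical markers for the text boundaries.
            left[seg][text[i - 1] if i > 0 else "^"] += 1
            right[seg][text[j] if j < n else "$"] += 1
    return freq, left, right


freq, left, right = enumerate_segments("abcabd", max_len=2)
print(freq["ab"], dict(left["ab"]), dict(right["ab"]))
```

Because the neighbors are collected while the segments are cut, no second scan of the text is needed to find them.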

The traditional segmentation process uses a single-machine computing strategy, which consumes a large amount of computing time and memory; in this embodiment, the stream processing technique of a distributed cluster (Hadoop Streaming) is adopted to speed up processing.

In the embodiment, a plurality of segmentation segments are enumerated according to an internet text, left and right adjacent characters of the segmentation segments in the current context are extracted at the same time, the cohesion degree of each segmentation segment is calculated, left and right entropies of each segmentation segment are calculated according to the left and right adjacent characters, the cohesion degree and the left and right entropies of each segmentation segment are associated, and a new keyword result table is output; new keywords can be quickly mined from massive Internet text data, and the mining result is more accurate.

An embodiment of the present invention provides another new keyword mining method which, as shown in the flowchart of fig. 2, includes:

S21: acquiring an internet text;

In some embodiments, the internet text is acquired periodically according to a preset acquisition period. It can be understood that the shorter the acquisition period, the more up to date the acquired internet text. It should be noted that the present application does not limit the acquisition period; a person skilled in the art can set it as needed.

S22: enumerating a plurality of segmentation segments according to the internet text, and simultaneously extracting left and right adjacent characters of the segmentation segments in the current context;

The enumeration uses a preset maximum segmentation length, and as many segmentation segments as possible are enumerated within that maximum segmentation length.

S23: calculating the cohesion degree of each segmentation segment;

Some embodiments of calculating the cohesion degree of each segmentation segment include, but are not limited to, the following method:

S231: acquiring the character length of each segmentation segment;

S232: performing secondary segmentation on each segmentation segment with a character length larger than 1 to obtain the sub-segmentation segments corresponding to that segmentation segment;

S233: acquiring the occurrence frequency of the segmentation segments and the occurrence frequency of the sub-segmentation segments;

S234: calculating the cohesion degree of the segmentation segments according to the ratio of the occurrence frequency of the segmentation segments to the occurrence frequency of the sub-segmentation segments.

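In some embodiments, the new keyword result table is a Hive table, and the obtaining of the occurrence frequency of the segmentation segments and the occurrence frequency of the sub-segmentation segments includes:

counting the occurrence frequency of the segmentation segments and the occurrence frequency of the sub-segmentation segments through the distributed computing APIs of the Hive table;

the distributed computing APIs include, but are not limited to, select, group by, and join.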

For example, for a segmentation segment with a character length of 4, ABCD, all of its two-way splits are (A, BCD), (AB, CD), and (ABC, D). These sub-segmentation segments have already been produced in step S22. At this point, only the distributed computing APIs of a Hive script, such as select, group by, and join, are needed to count and associate the frequency of each segmentation segment and of all its sub-segmentation segments, and to take the ratio between them, in order to compute the cohesion degree of the current segmentation segment. Because query, counting, and association operations are what a distributed framework script handles best, the processing scale and computing efficiency are greatly improved compared with single-machine computing.
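The ABCD example can be sketched in Python as follows. The text only states that the cohesion degree is derived from the ratio of the segment frequency to the sub-segment frequencies; the specific formula used here (the smallest value of p(segment) / (p(left part) × p(right part)) over all two-way splits) is a common choice and is an assumption, as are the function name `cohesion` and the counts in the example.

```python
from collections import Counter


def cohesion(segment, freq, total):
    """Smallest ratio p(segment) / (p(left) * p(right)) over all two-way splits.

    Assumed formula: the source only says a frequency ratio is used.
    Every sub-segment is expected to be present in freq, which holds when
    all segments up to the maximum length have been enumerated.
    """
    if len(segment) < 2:
        return None  # a single character has no two-way split
    p_seg = freq[segment] / total
    ratios = []
    for k in range(1, len(segment)):
        p_left = freq[segment[:k]] / total
        p_right = freq[segment[k:]] / total
        ratios.append(p_seg / (p_left * p_right))
    return min(ratios)


# Hypothetical counts for "ABCD" and the parts of its three splits.
freq = Counter({"ABCD": 4, "A": 8, "BCD": 4, "AB": 5, "CD": 10, "ABC": 4, "D": 16})
score = cohesion("ABCD", freq, total=100)
print(round(score, 2))
```

Taking the minimum means a segment scores high only when none of its splits is explained by two independently frequent parts.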

S24: calculating the left entropy and the right entropy of each segmentation segment according to the left adjacent character and the right adjacent character;

In some embodiments, calculating the left and right entropy of each segmentation segment according to the left and right adjacent characters comprises:

traversing the segmentation segments in a distributed manner;

counting the occurrence frequency of the left and right adjacent characters of each segmentation segment;

and calculating the left entropy and the right entropy of each segmentation segment according to the left and right adjacent characters and their occurrence frequencies.

It should be noted that the distributed traversal of the segmentation segments and the left and right adjacent characters of the context of each segmentation segment have already been obtained in step S22. Here, the output of step S22 is used directly: the occurrence frequency of each left and right adjacent character of each segmentation segment is counted from that output, and the left and right entropy of each segmentation segment is calculated from the adjacent characters and their occurrence frequencies.
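A minimal Python sketch of the entropy calculation is given below, assuming the usual Shannon entropy over the neighbor-character distribution, H = -Σ p(c) log p(c). The function name `neighbor_entropy`, the natural logarithm, and the example counts are assumptions; the text does not specify the logarithm base.

```python
import math
from collections import Counter


def neighbor_entropy(neighbor_counts):
    """Shannon entropy (natural log) of a neighbor-character frequency table."""
    total = sum(neighbor_counts.values())
    return -sum((c / total) * math.log(c / total) for c in neighbor_counts.values())


# Hypothetical neighbor counts for one segmentation segment.
left_counts = Counter({"x": 2, "y": 2})   # two equally likely left neighbors
right_counts = Counter({"z": 4})          # always the same right neighbor

left_entropy = neighbor_entropy(left_counts)
right_entropy = neighbor_entropy(right_counts)
print(left_entropy, right_entropy)
```

A segment that is always followed by the same character has a right entropy of zero, which suggests that it is only part of a longer word.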

S25: associating the cohesion degree and the left-right entropy of each segmentation segment, and outputting a new keyword result table;

Although traditional data mining methods also use indicators such as the cohesion degree or the left and right entropy, they use only one of the two, store the mining results in separate result tables, output a large number of files, and achieve low accuracy within each file. In the present application, the cohesion degree and the left and right entropy of each segmentation segment are associated, and the associated result is presented in a single new keyword result table, which facilitates subsequent repair processing and improves the accuracy of new keyword mining.
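The association step amounts to joining the per-segment metrics on the segment itself. The sketch below shows this in plain Python with in-memory dictionaries instead of a Hive join; the function name `build_result_table`, the column names, and the ranking rule are assumptions made for illustration.

```python
def build_result_table(cohesion_by_seg, left_entropy_by_seg, right_entropy_by_seg):
    """Inner-join the three per-segment metrics into one list of rows."""
    rows = []
    for seg, coh in cohesion_by_seg.items():
        if seg in left_entropy_by_seg and seg in right_entropy_by_seg:
            rows.append({
                "segment": seg,
                "cohesion": coh,
                "left_entropy": left_entropy_by_seg[seg],
                "right_entropy": right_entropy_by_seg[seg],
            })
    # Assumed ranking: highest cohesion first, ties broken by the weaker entropy side.
    rows.sort(
        key=lambda r: (r["cohesion"], min(r["left_entropy"], r["right_entropy"])),
        reverse=True,
    )
    return rows


# Hypothetical metric values; "ef" has no entropy values and is dropped by the join.
table = build_result_table(
    {"ab": 6.0, "cd": 9.0, "ef": 1.0},
    {"ab": 0.7, "cd": 0.5},
    {"ab": 0.6, "cd": 0.9},
)
print([row["segment"] for row in table])
```

Keeping all three metrics in one row per segment is what allows the later repair step to compare left and right entropy for the same candidate.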

S26: and repairing the new keywords in the new keyword result table.

In actual mining, erroneous segmentation segments (for example, segments with missing or redundant characters) often rank near the top of the new keyword result table. Further error identification and repair of the new keywords is therefore required.

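In some embodiments, repairing the new keywords in the new keyword result table includes:

setting a left-right entropy difference threshold and a character length threshold;

acquiring the left-right entropy difference and the character length of the current segmentation segment;

identifying an error segmentation segment according to the relation between the left-right entropy difference and character length of the current segmentation segment and the set left-right entropy difference threshold and character length threshold;

and repairing the error segmentation segment.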

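As an optional implementation of the present invention, repairing the error segmentation segment includes:

performing substring segmentation on the error segmentation segment to obtain sub-segmentation segments;

performing an association search for each sub-segmentation segment in the new keyword result table;

and if a corresponding new keyword is found, replacing the error segmentation segment with the sub-segmentation segment.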

For example, data with a high cohesion degree and with left and right entropy greater than a preset threshold is first obtained through efficient screening by a distributed script. Since the left and right entropy represents the uncertainty of the left and right adjacent characters, a larger left and right entropy indicates a higher probability that the current segmentation segment forms a keyword on its own.

Then, for a segmentation segment whose character length is smaller than the character length threshold (e.g., 5), when the absolute value of its left-right entropy difference is larger than the left-right entropy difference threshold (e.g., 0.2), it is highly likely that characters are missing on the left or right side of the current segment (e.g., the segment "Lancôme small black" is missing a character, and the complete segment should be "Lancôme small black bottle"). In this case, substring segmentation can be performed on segmentation segments whose character length is not less than 5 and whose absolute left-right entropy difference is not greater than 0.2, and the resulting substrings can be associated with the shorter, possibly incomplete segment to determine whether a better, longer segmentation segment exists to replace it. (For example, the substrings of "Lancôme small black bottle" include "Lancôme small black", "Lancôme small", and so on.)

Similarly, a segmentation segment whose character length is greater than the character length threshold and whose absolute left-right entropy difference is greater than the left-right entropy difference threshold is likely to have redundant characters on its left or right side (e.g., "ashira"). By performing substring segmentation on such a segment and running an association search against segmentation segments with a shorter character length and a smaller absolute left-right entropy difference, it can be determined whether the current longer segment can be replaced by a shorter and possibly more accurate one.

It should be noted that, during the association search, a segmentation segment for which no better match is found may be retained, and it can then be analyzed whether insufficient context data caused an identification deviation for that segment.
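The identification and repair logic described above can be sketched in Python as follows. This is a simplified, single-machine illustration: the function names, the row layout, the use of simple substring containment for the association search, the handling of segments whose length equals the threshold, and the choice of the longest candidate are all assumptions, and the example segments are placeholder letters rather than real data. The thresholds 0.2 and 5 are the example values given in the text.

```python
def entropy_gap(row):
    """Absolute difference between left and right entropy."""
    return abs(row["left_entropy"] - row["right_entropy"])


def repair(rows, gap_threshold=0.2, length_threshold=5):
    """Replace suspect segments with a balanced segment from the same table, if any."""
    balanced = [r["segment"] for r in rows if entropy_gap(r) <= gap_threshold]
    repaired = []
    for row in rows:
        seg = row["segment"]
        if entropy_gap(row) <= gap_threshold:
            repaired.append(seg)  # entropies are balanced, keep the segment
            continue
        if len(seg) < length_threshold:
            # Possibly missing characters: look for a longer segment containing it.
            candidates = [b for b in balanced if len(b) > len(seg) and seg in b]
        else:
            # Possibly redundant characters: look for a shorter segment inside it.
            candidates = [b for b in balanced if len(b) < len(seg) and b in seg]
        # Keep the original segment when no better match is found.
        repaired.append(max(candidates, key=len) if candidates else seg)
    return repaired


# Hypothetical result-table rows.
rows = [
    {"segment": "ABC", "left_entropy": 1.0, "right_entropy": 0.4},
    {"segment": "ABCDE", "left_entropy": 1.0, "right_entropy": 0.9},
    {"segment": "XABCDEY", "left_entropy": 0.2, "right_entropy": 0.9},
    {"segment": "QRS", "left_entropy": 1.2, "right_entropy": 0.3},
]
print(repair(rows))
```

In this example the truncated "ABC" and the over-long "XABCDEY" are both replaced by "ABCDE", while "QRS" is retained because no better match exists.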

In this embodiment, a plurality of segmentation segments are enumerated according to the internet text while the left and right adjacent characters of the segmentation segments in the current context are extracted, the cohesion degree and the left and right entropy of each segmentation segment are associated, and a new keyword result table is output. Practical verification shows that, for new keyword mining over text volumes on the order of tens of millions, the computation time can be shortened from several days to 1-2 hours, and the mining result is more accurate and reliable. The accuracy of the output new keywords can be further improved through the identification and repair of error segmentation segments.

Fig. 3 is a flowchart of a new keyword mining method according to another embodiment of the present application, and as shown in fig. 3, on the basis of the foregoing embodiment, the new keyword mining method further includes:

S31: obtaining an internet text with industry category labels and a segmentation result corresponding to the internet text;

S32: extracting representative new feature words of each industry from the segmentation result;

The methods for extracting the representative new feature words of each industry include statistical methods such as the chi-square test and information gain.

S33: carrying out semantic clustering and classification processing on the new feature words and the new keywords in the new keyword result table;

S34: outputting new keywords with industry representativeness according to the processing result.
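As one possible illustration of the statistical extraction mentioned for step S32, the sketch below computes the standard chi-square statistic for a 2x2 contingency table of one word against one industry category. The function name `chi_square` and the document counts are assumptions; the text does not state how the statistic is applied or thresholded.

```python
def chi_square(n11, n10, n01, n00):
    """Chi-square statistic for a 2x2 word/category contingency table.

    n11: texts in the category that contain the word
    n10: texts outside the category that contain the word
    n01: texts in the category that do not contain the word
    n00: texts outside the category that do not contain the word
    """
    n = n11 + n10 + n01 + n00
    numerator = n * (n11 * n00 - n10 * n01) ** 2
    denominator = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    return numerator / denominator if denominator else 0.0


# Hypothetical counts: the word is concentrated in the category.
print(chi_square(40, 10, 10, 40))
# Hypothetical counts: the word is spread evenly across categories.
print(chi_square(25, 25, 25, 25))
```

A higher score indicates that the word's presence depends more strongly on the category, which is the property a representative industry feature word is expected to have.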

In this embodiment, combining industry feature words helps brand owners mine the latest industry characteristics more efficiently and accurately, and plays an important role in tracking industry trends and improving the accuracy and completeness of data reports.

An embodiment of the present invention provides a new keyword mining device. As shown in the functional block diagram of fig. 4, the new keyword mining device includes:

an obtaining module 41, configured to obtain an internet text;

the extraction module 42 is configured to enumerate a plurality of segmentation segments according to the internet text, and simultaneously extract left and right adjacent characters of the segmentation segments in the current context;

a first calculating module 43, configured to calculate a cohesion degree of each segmented segment;

the second calculation module 44 is configured to calculate left and right entropies of each segmented segment according to the left and right adjacent characters;

and the associating module 45 is configured to associate the cohesion and the left-right entropy of each segment and output a new keyword result table.

In some embodiments, the device further includes a repair module 46, configured to repair the new keywords in the new keyword result table.

In some embodiments, the device further includes an output module 47, configured to obtain an internet text labeled with an industry category and the segmentation result corresponding to that internet text, extract representative new feature words of each industry from the segmentation result, perform semantic clustering and classification processing on the new feature words and the new keywords in the new keyword result table, and output new keywords with industry representativeness according to the processing result.

In this embodiment, the obtaining module obtains the internet text; the extraction module enumerates a plurality of segmentation segments according to the internet text and simultaneously extracts the left and right adjacent characters of the segmentation segments in the current context; the first calculation module calculates the cohesion degree of each segmentation segment; the second calculation module calculates the left and right entropy of each segmentation segment according to the left and right adjacent characters; and the association module associates the cohesion degree and the left and right entropy of each segmentation segment and outputs a new keyword result table. In this way, new keywords can be quickly mined from massive internet text data, and the mining result is more accurate.

The present embodiment provides an electronic device, including:

a processor; and

a memory having computer readable instructions stored thereon which, when executed by the processor, implement the new keyword mining method as in any one of the above embodiments.

It should be noted that the new keyword mining method, the new keyword mining device and the electronic device belong to a general inventive concept, and the contents in the embodiments of the new keyword mining method, the new keyword mining device and the electronic device are mutually applicable.

It is understood that the same or similar parts in the above embodiments may be mutually referred to, and the same or similar parts in other embodiments may be referred to for the content which is not described in detail in some embodiments.

It should be noted that, in the description of the present application, the terms "first", "second", etc. are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. Further, in the description of the present application, the meaning of "a plurality" means at least two unless otherwise specified.

Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and the scope of the preferred embodiments of the present application includes other implementations in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present application.

It should be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.

It will be understood by those skilled in the art that all or part of the steps of the methods of the above embodiments may be implemented by a program instructing the relevant hardware. The program may be stored in a computer readable storage medium and, when executed, performs one or a combination of the steps of the method embodiments.

In addition, functional units in the embodiments of the present application may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer readable storage medium.

The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc.

In the description herein, reference to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the application. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.

Although embodiments of the present application have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present application, and that variations, modifications, substitutions and alterations may be made to the above embodiments by those of ordinary skill in the art within the scope of the present application.

It should be noted that the present invention is not limited to the above preferred embodiments. Those skilled in the art can derive other products in various forms in light of the present invention; however, any change in shape or structure that yields a technical solution identical or similar to that of the present application falls within the scope of protection of the present invention.

Claims

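1. A new keyword mining method, characterized by comprising the following steps:

acquiring an internet text;

enumerating a plurality of segmentation segments according to the internet text, and simultaneously extracting left and right adjacent characters of the segmentation segments in the current context;

calculating the cohesion degree of each segmentation segment;

calculating the left entropy and the right entropy of each segmentation segment according to the left and right adjacent characters;

and associating the cohesion degree and the left and right entropy of each segmentation segment, and outputting a new keyword result table.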

2. The method of claim 1, further comprising:

and repairing the new keywords in the new keyword result table.

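3. The method of claim 2, wherein the repairing new keywords in the new keyword result table comprises:

setting a left-right entropy difference threshold and a character length threshold;

acquiring the left-right entropy difference and the character length of the current segmentation segment;

identifying an error segmentation segment according to the relation between the left-right entropy difference and character length of the current segmentation segment and the set left-right entropy difference threshold and character length threshold;

and repairing the error segmentation segment.

4. The method for mining new keywords according to claim 3, wherein the repairing the error segmentation segment comprises:

performing substring segmentation on the error segmentation segment to obtain sub-segmentation segments;

performing an association search for each sub-segmentation segment in the new keyword result table;

if a corresponding new keyword is found, replacing the error segmentation segment with the sub-segmentation segment;

and/or

performing outward concatenation on the error segmentation segment to obtain extended segmentation segments;

performing an association search for each extended segmentation segment in the new keyword result table;

and if a corresponding new keyword is found, replacing the error segmentation segment with the extended segmentation segment.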

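5. The method for mining new keywords according to any one of claims 1 to 4, further comprising:

obtaining an internet text with industry category labels and a segmentation result corresponding to the internet text;

extracting representative new feature words of each industry from the segmentation result;

carrying out semantic clustering and classification processing on the new feature words and the new keywords in the new keyword result table;

and outputting new keywords with industry representativeness according to the processing result.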

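6. The method of claim 1, wherein the calculating the cohesion degree of each segmentation segment comprises:

acquiring the character length of each segmentation segment;

performing secondary segmentation on each segmentation segment with a character length larger than 1 to obtain the sub-segmentation segments corresponding to that segmentation segment;

acquiring the occurrence frequency of the segmentation segments and the occurrence frequency of the sub-segmentation segments;

and calculating the cohesion degree of the segmentation segments according to the ratio of the occurrence frequency of the segmentation segments to the occurrence frequency of the sub-segmentation segments.

7. The method of claim 6, wherein the new keyword result table is a Hive table, and the obtaining of the occurrence frequency of the segmentation segments and the occurrence frequency of the sub-segmentation segments comprises:

counting the occurrence frequency of the segmentation segments and the occurrence frequency of the sub-segmentation segments through the distributed computing APIs of the Hive table;

the distributed computing APIs include select, group by, and join.

8. The method for mining new keywords according to claim 1, wherein the calculating left and right entropy of each segmentation segment according to the left and right adjacent characters comprises:

traversing the segmentation segments in a distributed manner;

counting the occurrence frequency of the left and right adjacent characters of each segmentation segment;

and calculating the left entropy and the right entropy of each segmentation segment according to the left and right adjacent characters and their occurrence frequencies.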

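9. A new keyword mining device, comprising:

the acquisition module is used for acquiring the internet text;

the extraction module is used for enumerating a plurality of segmentation segments according to the internet text and simultaneously extracting left and right adjacent characters of the segmentation segments in the current context;

the first calculation module is used for calculating the cohesion degree of each segmentation segment;

the second calculation module is used for calculating the left entropy and the right entropy of each segmentation segment according to the left and right adjacent characters;

and the association module is used for associating the cohesion degree and the left and right entropy of each segmentation segment and outputting a new keyword result table.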

10. An electronic device, comprising:

a processor; and

a memory having stored thereon computer readable instructions which, when executed by the processor, implement the new keyword mining method of any one of claims 1 to 8.