CN105095288B

CN105095288B - Data analysis method and data analysis device

Info

Publication number: CN105095288B
Application number: CN201410204300.1A
Authority: CN
Inventors: 温春龙; 陈妍; 梁璟彪; 骆玘; 黄利贤; 樊中一; 吕虹; 刘敏
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd; Tencent Cloud Computing Beijing Co Ltd
Priority date: 2014-05-14
Filing date: 2014-05-14
Publication date: 2020-02-07
Anticipated expiration: 2034-05-14
Also published as: CN105095288A

Abstract

The invention provides a data analysis method and a data analysis device. The method comprises: establishing a product tag library according to input text content; obtaining, from the text content, a subject modified by a word-of-mouth word, wherein the word-of-mouth word is obtained by performing word segmentation on the text content and screening the words that reach a preset frequency against a pre-stored word bank; matching the subject with the tags in the product tag library; and generating, according to the tags matched with the subject, a result tag tree reflecting the common problems in the text content. The method can collect comment content comprehensively and in real time, simplify existing data analysis, and improve the accuracy of data analysis.

Description

Data analysis method and data analysis device

Technical Field

The present invention relates to internet technologies, and in particular, to a data analysis method and a data analysis device.

Background

At present, after collecting user feedback on a product, some enterprises classify it manually according to the text content, judging whether a comment refers to a specific aspect of the product (such as a function or a bug) and what the emotional polarity (positive or negative) of the comment is.

That is, the word-of-mouth of the product and the points on which praise and criticism concentrate are judged manually. Reviewers read each comment to decide whether the emotion it expresses is positive, negative or neutral, and which dimension (such as performance, function or price) the evaluated object belongs to. The comments are then classified manually and finally counted and sorted to find the dimensions on which the praise and criticism of the product mainly concentrate.

However, with large amounts of data, heavy human involvement leads to repetitive labor and inefficiency, and the classification and summarization lack systematicness and consistency, resulting in high labor cost and poor timeliness.

For this reason, a Taobao-style attribute-pair classification has also appeared in the prior art, in which, for example, preset attribute words and emotion words are matched one by one and the summarized results are counted.

However, the Taobao-style attribute-pair classification has shortcomings: first, the classification of the data is not comprehensive; second, it does not incorporate an analysis of word-of-mouth, so only a conclusion about one aspect of the reviews can be seen.

For this reason, a method capable of performing data analysis comprehensively in real time is required.

Disclosure of Invention

In order to overcome the defects in the prior art, the invention provides a data analysis method and a data analysis device, which are used for comprehensively collecting comment contents in real time, simplifying the existing data analysis mode and improving the accuracy of data analysis.

In a first aspect, an embodiment of the present invention provides a data analysis method, including:

establishing a product label library according to the input text content;

obtaining a subject word modified by a public praise word according to the text content, wherein the public praise word is obtained by performing word segmentation processing on the text content and screening words reaching preset frequency after the word segmentation processing through a pre-stored word bank;

matching the subject with the labels in the product label library;

and generating a result label tree reflecting the commonality problem in the text content according to the label matched with the subject.

With reference to the first aspect, in a first possible implementation manner, the creating a product tag library according to input text content includes:

establishing a dynamic label library according to the input text content;

establishing a special label library according to the product category corresponding to the text content;

and generating the product label library by using the dynamic label library, the special label library and a preset general label library.

With reference to the first possible implementation manner of the first aspect, in a second possible implementation manner, the establishing a dynamic tag library according to the input text content includes:

acquiring nouns in the text content;

judging whether the frequency of occurrence of the nouns is greater than a preset threshold value or not;

if the frequency of occurrence of the noun is larger than a preset threshold, determining whether the noun is repeated with the tags in the special tag library and the tags in the general tag library;

and if the noun is not repeated with the tags in the special tag library and the tags in the general tag library, the noun is used as the tags to generate the dynamic tag library.

With reference to the first possible implementation manner of the first aspect, in a third possible implementation manner, the establishing a special tag library according to the product category corresponding to the text content includes:

obtaining a custom label to which the product belongs according to the product category corresponding to the text content;

searching for synonyms and near-synonyms of the custom label;

and generating the special label library of the text content from the custom label and its synonyms and near-synonyms.

With reference to the second possible implementation manner of the first aspect, in a fourth possible implementation manner, the obtaining a noun in the text content includes:

and performing word segmentation processing on the text content according to a user-defined word bank to obtain a noun of the text content.

With reference to the first aspect and the foregoing possible implementation manners of the first aspect, in a fifth possible implementation manner, the obtaining a subject modified by a word-of-mouth according to the text content includes:

obtaining a subject and/or an implied subject modified by a word-of-mouth word in the text content;

matching the subject with the tags in the product tag library, including:

and matching the subject and/or the hidden subject with the labels in the product label library respectively.

With reference to the first aspect and the first to fourth possible implementation manners of the first aspect, in a sixth possible implementation manner, before the step of generating, according to the tag matched with the subject, a result tag tree that reflects a commonality problem in the text content, the method further includes:

acquiring extended word-of-mouth words for the labels in the product label library;

matching, in the text content, the extended word-of-mouth words together with their corresponding labels;

the generating a result label tree reflecting the commonality problem in the text content according to the label matched with the subject includes:

generating a result label tree reflecting the common problems of the text content according to the label matched with the subject and the result of matching the extended word-of-mouth words with their corresponding labels.

With reference to the first aspect and the first to fourth possible implementation manners of the first aspect, in a seventh possible implementation manner, after the step of establishing a product tag library according to the input text content, the method further includes:

establishing a multi-level label tree according to the membership relationship among the labels in the product label library;

matching the subject with the tags in the product tag library, including:

and matching the subject with the bottom layer label in the multi-level label tree.

With reference to the seventh possible implementation manner of the first aspect, in an eighth possible implementation manner,

if the subject is matched with the bottom layer label, recording the position of the bottom layer label;

and propagating the recorded results of the bottom-layer labels upward to the positions of their upper-layer labels in the multi-level label tree, to obtain a result label tree reflecting the common problems in the text content.

With reference to the first possible implementation manner of the first aspect, in a ninth possible implementation manner, the method further includes:

if the subject is not matched with the label, obtaining the similarity and the importance of the subject according to the semantic similarity and importance calculation rule;

and if the similarity of the subject is greater than or equal to a first preset value and/or the importance of the subject is greater than or equal to a second preset value, adding the subject serving as a label into the dynamic label library.
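
The threshold test in this implementation manner can be sketched as follows. This is only an illustration under stated assumptions: the source does not define the semantic similarity or importance rules, so the sketch substitutes a character-level ratio from `difflib.SequenceMatcher` (measured against existing tags) for semantic similarity and a subject's share of all extracted subjects for importance. The function name `should_add_subject` and both threshold values are hypothetical, and the "and/or" condition is implemented as "or".

```python
from difflib import SequenceMatcher

def should_add_subject(subject, tags, subject_count, total_subjects,
                       min_similarity=0.8, min_importance=0.05):
    """Decide whether an unmatched subject should become a dynamic tag."""
    # Stand-in similarity (assumption): best character-level ratio
    # against the existing tags, not a true semantic measure.
    similarity = max(
        (SequenceMatcher(None, subject, tag).ratio() for tag in tags),
        default=0.0,
    )
    # Stand-in importance (assumption): share of all extracted subjects.
    importance = subject_count / total_subjects if total_subjects else 0.0
    return similarity >= min_similarity or importance >= min_importance

tags = {"interface", "network speed"}
add_similar = should_add_subject("interfaces", tags, subject_count=1, total_subjects=1000)
add_frequent = should_add_subject("song recognition", tags, subject_count=80, total_subjects=1000)
add_neither = should_add_subject("banner", tags, subject_count=1, total_subjects=1000)
```

Here "interfaces" passes on similarity, "song recognition" passes on importance, and "banner" passes on neither, so only the first two would be added to the dynamic label library.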

In a second aspect, an embodiment of the present invention provides a data analysis apparatus, including:

the product label library establishing unit is used for establishing a product label library according to the input text content;

the subject acquiring unit is used for acquiring a subject modified by public praise words according to the text content, wherein the public praise words are obtained by performing word segmentation processing on the text content and screening words reaching preset frequency after the word segmentation processing through a pre-stored word bank;

the matching unit is used for matching the subject acquired by the subject acquisition unit with the tags in the product tag library established by the product tag library establishing unit;

and the result label tree generating unit is used for generating a result label tree reflecting the commonality problem in the text content according to the label matched with the subject in the matching unit.

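With reference to the second aspect, in a first possible implementation manner, the product tag library establishing unit is specifically configured to:

establish a dynamic tag library according to the input text content;

establish a special tag library according to the product category corresponding to the text content;

and generate the product tag library from the dynamic tag library, the special tag library and a preset general tag library.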

With reference to the first possible implementation manner of the second aspect, in a second possible implementation manner, the product tag library establishing unit is specifically configured to:

acquire nouns in the text content;

judge whether the frequency of occurrence of the noun is greater than a preset threshold;

if the frequency of occurrence of the noun is greater than the preset threshold, determine whether the noun duplicates a tag in the special tag library or a tag in the general tag library;

and if the noun duplicates neither a tag in the special tag library nor a tag in the general tag library, generate the dynamic tag library by taking the noun as a tag.

With reference to the first possible implementation manner of the second aspect, in a third possible implementation manner, the product tag library establishing unit is specifically configured to:

obtain the custom tag to which the product belongs according to the product category corresponding to the text content;

search for synonyms and near-synonyms of the custom tag;

and generate the special tag library of the text content from the custom tag and its synonyms and near-synonyms.

With reference to the second possible implementation manner of the second aspect, in a fourth possible implementation manner, the product tag library establishing unit is specifically configured to:

perform word segmentation on the text content according to a user-defined word bank to obtain the nouns of the text content.

With reference to the second aspect and the foregoing possible implementation manners of the second aspect, in a fifth possible implementation manner, the subject acquiring unit is specifically configured to:

obtain a subject and/or an implied subject modified by a word-of-mouth word in the text content;

the matching unit is specifically configured to:

match the subject and/or the implied subject acquired by the subject acquiring unit with the tags in the product tag library established by the product tag library establishing unit, respectively.

With reference to the second aspect and the first to fourth possible implementation manners of the second aspect, in a sixth possible implementation manner, the apparatus further includes:

an extended word-of-mouth word acquiring unit, configured to acquire extended word-of-mouth words for the tags in the product tag library established by the product tag library establishing unit;

the matching unit is further configured to:

match, in the text content, the extended word-of-mouth words acquired by the extended word-of-mouth word acquiring unit together with their corresponding tags;

the result tag tree generating unit is specifically configured to:

generate a result tag tree reflecting the common problems of the text content according to the tag matched with the subject and the result of matching the extended word-of-mouth words with their corresponding tags.

With reference to the second aspect and the first to fourth possible implementation manners of the second aspect, in a seventh possible implementation manner, the apparatus further includes:

the multi-level label tree establishing unit is used for establishing a multi-level label tree according to the membership relationship among all labels in the product label library established by the product label library establishing unit;

the matching unit is specifically configured to:

match the subject acquired by the subject acquiring unit with the bottom-layer tags in the multi-level tag tree established by the multi-level tag tree establishing unit.

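With reference to the seventh possible implementation manner of the second aspect, in an eighth possible implementation manner, the result tag tree generating unit is specifically configured to:

if the subject matches a bottom-layer tag, record the position of the bottom-layer tag;

and propagate the recorded results of the bottom-layer tags upward to the positions of their upper-layer tags in the multi-level tag tree, to obtain a result tag tree reflecting the common problems in the text content.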

With reference to the first possible implementation manner of the second aspect, in a ninth possible implementation manner, the apparatus further includes:

a subject similarity obtaining unit, configured to obtain, according to the matching result of the matching unit, similarity and importance of the subject according to the semantic similarity and importance calculation rule when the subject and the tag are not matched;

and the subject processing unit is used for adding the subject as a label into the dynamic label library when the similarity of the subject acquired by the subject similarity acquisition unit is greater than or equal to a first preset value and/or the importance of the subject is greater than or equal to a second preset value.

According to the above technical solutions, the data analysis method and data analysis device provided by the embodiments of the invention establish a comprehensive product tag library, obtain the subject modified by a word-of-mouth word, match the subject with the tags in the product tag library, and, when the subject matches a tag, generate a result tag tree reflecting the common problems. The comment content in the text content can thus be collected comprehensively and in real time, the existing data analysis mode is simplified, and the accuracy of data analysis is improved.

Drawings

FIG. 1 is a schematic flow chart of a data analysis method according to an embodiment of the present invention;

FIG. 2 is a schematic flow chart of a data analysis method according to another embodiment of the present invention;

FIG. 3 is a diagram illustrating multi-level tag tree generation according to an embodiment of the present invention;

FIG. 4A is a diagram illustrating a multi-level tag tree according to an embodiment of the present invention;

FIG. 4B is a diagram of a result tag tree according to an embodiment of the present invention;

FIG. 5 is a schematic structural diagram of a data analysis apparatus according to an embodiment of the present invention;

FIG. 6A and FIG. 6B are schematic structural diagrams of a data analysis apparatus according to another embodiment of the present invention.

Detailed Description

In the embodiments of the invention, a tag refers to the specific object a user comments on when reviewing a product. For example, in the comment "the interface of the new version of XX video is junk", the specific object of the comment is the interface, and the word "interface" constitutes a tag.

The embodiments of the invention provide a data analysis method and a data analysis device that automatically locate the dimensions on which the positive and negative word-of-mouth of a product is concentrated. They mainly address the following problems: judging the emotional polarity (positive, negative or neutral) of user comments; dynamically and automatically classifying, counting and ranking the common problems under each polarity; displaying, by primary and secondary level, the aspects of the product that users praise, criticize and discuss most; and tracking how these change over time. For example, the data analysis method of the embodiments can be applied as follows:

First, locating the main dimensions that influence word-of-mouth: by acquiring a large amount of user feedback (from microblogs, third-party application markets and forums) together with positive and negative word-of-mouth words, the dimensions on which the positive and negative word-of-mouth of a product concentrate are analyzed automatically, comprehensively and in real time; the underlying reasons that influence users' word-of-mouth are mined through semantic analysis; and the main points of criticism and problems of the product are located quickly, helping the product team identify what to improve.

Second, analyzing the change in word-of-mouth for each dimension of the product: the word-of-mouth of each dimension and its trend are analyzed automatically, for example the word-of-mouth before and after a new release, the word-of-mouth of a new function, or the change in word-of-mouth for interface design, and the results are displayed visually. Members of a product team who are responsible for different modules care about different things; for example, developers may focus on performance while designers may focus on the interface and style. Breaking down the word-of-mouth change by dimension meets the needs of these different parties.

Third, classifying feedback hotspots: user comment hotspots are analyzed and merged into modules by combining synonyms and near-synonyms, making the classification results more accurate and practical.

FIG. 1 is a schematic flow chart of a data analysis method according to an embodiment of the present invention. As shown in FIG. 1, the data analysis method of this embodiment is as follows.

101. Establishing a product label library according to the input batch text content.

For example, the product tag library in the present embodiment may include a dynamic tag library, a dedicated tag library, and a general tag library.

The dynamic label library is established according to input batch text contents, and the special label library is established according to product categories corresponding to the batch text contents.

The universal tag library may be augmented by manual categorization in advance.

102. Acquiring a subject modified by a word-of-mouth word according to the batch text content.

For example, the word-of-mouth word may be obtained by performing word segmentation on the text content and screening the words that reach a preset frequency after segmentation against a pre-stored word bank.

It can be understood that the original information about the product that users comment on (corresponding to the batch text content; the product being identified by its name, the aliases of its series, or the names of some of its key function modules) is collected from microblogs and/or forums through an application programming interface (API) or a web crawler. After the original information is cleaned, an existing Chinese lexical analysis system can be used to analyze word-of-mouth tendency and count the positive or negative word-of-mouth words.

For example, cleaning the original information may be understood as removing repeated and invalid information from it, that is, filtering the original data. The filtered information is then segmented into words by a Chinese lexical analysis system, and the word-of-mouth tendency of the segmented words is analyzed and screened using a pre-stored word bank.
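
The cleaning and screening described above can be sketched as follows. This is a minimal illustration, assuming whitespace tokenization of English text in place of a Chinese lexical analysis system; the names `clean`, `screen_sentiment_words` and `SENTIMENT_LEXICON`, the lexicon entries and the frequency threshold are all hypothetical.

```python
from collections import Counter

# Hypothetical pre-stored word bank: word -> polarity.
SENTIMENT_LEXICON = {"junk": "negative", "laggy": "negative", "great": "positive"}

def clean(comments):
    """Drop empty comments and exact duplicates, preserving order."""
    seen = set()
    cleaned = []
    for text in comments:
        text = text.strip()
        if text and text not in seen:
            seen.add(text)
            cleaned.append(text)
    return cleaned

def screen_sentiment_words(comments, min_freq=2):
    """Return lexicon words whose frequency reaches min_freq, with polarity."""
    counts = Counter(
        token
        for text in comments
        for token in text.lower().split()  # stand-in for word segmentation
    )
    return {
        word: SENTIMENT_LEXICON[word]
        for word, freq in counts.items()
        if freq >= min_freq and word in SENTIMENT_LEXICON
    }

raw = [
    "the interface is junk",
    "the interface is junk",   # duplicate, removed by clean()
    "",                        # invalid, removed by clean()
    "new version is junk",
    "playback is laggy",
    "great sound quality",
]
cleaned = clean(raw)
sentiment_words = screen_sentiment_words(cleaned, min_freq=2)
```

With this sample input, only "junk" reaches the frequency threshold after cleaning, so it is the only word-of-mouth word retained.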

103. Matching the subject with the labels in the product label library.

104. Generating a result label tree reflecting the common problems in the batch text content according to the labels matched with the subject, as shown in FIG. 4B.

According to this data analysis method, a comprehensive product label library is established, the subject modified by a word-of-mouth word is obtained, the subject is matched with the labels in the product label library, and when the subject matches a label, a result label tree reflecting the common problems is generated. The comment content in the batch text content can thus be collected comprehensively and in real time, the existing data analysis mode is simplified, and the accuracy of data analysis is improved.

FIG. 2 is a schematic flow chart of a data analysis method according to an embodiment of the present invention, and FIG. 3 is a schematic diagram of the generation of a multi-level label tree according to an embodiment of the present invention. As shown in FIG. 2 and FIG. 3, the data analysis method of this embodiment is as follows.

201. Establishing a dynamic label library according to the input batch text content.

At different times, the hotspots that users comment on may shift or new labels may appear, so a dynamic label library needs to be established to keep the analysis timely and accurate. For example, XX music newly introduces a function called "listen to songs and identify songs" that quickly becomes a focus of user attention, but the existing label library does not contain the label "listen to songs and identify songs"; a mechanism for identifying and adding new labels is then needed to ensure the completeness and timeliness of the label library. A dynamic label may be understood as a hot word or new word that appears at a certain time.

In the process of establishing the label library, the comprehensiveness of the labels can be further improved by means of algorithms such as word-sense similarity and near-synonym discovery. For example, the label "interface" may be expressed in similar ways (synonyms or near-synonyms) in user comments, such as: page, panel, appearance, layout, skin, desktop, etc., and these need to be grouped as synonyms.

For example, this step 201 may also include the following sub-steps not shown in the figures:

A2011. Acquiring the nouns in the batch text content.

For example, taking into account the common nouns of the product and/or of competing products, word segmentation is performed on the batch text content by a Chinese lexical analysis system to obtain the nouns and/or word-of-mouth words of the batch text content.

For example, the ICTCLAS segmentation algorithm can be called through the segmentation interface provided by the ICTCLAS system to segment the batch text content.

The ICTCLAS system needs to call a custom word bank. The custom word bank contains specific words and their part-of-speech tags and acts as a submodule of the word segmentation system; based on it, the segmentation algorithm can split a sentence into individual words. The comprehensiveness of the custom word bank affects the accuracy of word segmentation, so the custom word bank should be updatable, able to accumulate entries over time, and suited to the language used on microblogs and forums.

A2012. Judging whether the frequency of occurrence of the noun is greater than a preset threshold.

A2013. If the frequency of occurrence of the noun is greater than the preset threshold, determining whether the noun duplicates a tag in the special tag library or a tag in the general tag library.

Of course, when the frequency of occurrence of the noun is less than or equal to the preset threshold, the noun may be ignored or discarded.

A2014. If the noun duplicates neither a tag in the special tag library nor a tag in the general tag library, adding the noun as a tag to generate the dynamic tag library.

Of course, if the noun duplicates a tag in the special tag library, or duplicates a tag in the general tag library, the noun is discarded. Therefore, the tags in the dynamic tag library, the special tag library and the general tag library do not repeat one another.

In this embodiment, the dynamic tag library may be a tag library composed of nouns.
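
Sub-steps A2012 to A2014 amount to a frequency filter followed by a duplicate check, which can be sketched as follows. This is a minimal sketch that assumes the nouns have already been extracted by word segmentation; the function name `build_dynamic_tags`, the sample data and the threshold value are hypothetical.

```python
from collections import Counter

def build_dynamic_tags(nouns, special_tags, general_tags, threshold):
    """Keep nouns whose frequency exceeds threshold and that are not already tags."""
    counts = Counter(nouns)
    existing = set(special_tags) | set(general_tags)
    return {
        noun
        for noun, freq in counts.items()
        if freq > threshold and noun not in existing
    }

# Sample nouns as they might come out of segmentation (illustrative only).
nouns = ["song recognition"] * 5 + ["interface"] * 4 + ["banner"] * 1
dynamic_tags = build_dynamic_tags(
    nouns,
    special_tags={"sound quality", "resources"},
    general_tags={"interface", "bug"},
    threshold=3,
)
```

In this example "interface" is frequent but already present in the general tag library, and "banner" is too rare, so only "song recognition" becomes a dynamic tag.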

202. Establishing a special label library according to the product category corresponding to the batch text content.

For example, if a certain text content is "XX video interface is very junk", the product category corresponding to the text content may be an interface category in the computer, and at this time, the dedicated tag library may be a tag library corresponding to the interface category, where the tag library may include: interface, appearance, layout, skin, desktop, etc.

The term "special tags" is understood to mean the terms commonly used in the field to which a certain text content belongs, and the tags in the special tag library belong to specific fields respectively.

For example, this step 202 may also include the following sub-steps not shown in the figures:

A2021. Obtaining the custom label to which the product belongs according to the product category corresponding to the batch text content.

A2022. Searching for synonyms and near-synonyms of the custom label.

For example, synonyms and near-synonyms of the custom label can be found according to a word-sense similarity rule.

A2023. Generating the special label library of the batch text content from the custom label and its synonyms and near-synonyms.

That is, due to differences of product categories, a custom tag lexicon needs to be established for different products to ensure the accuracy of semantic analysis. For example, users of music products pay attention to "sound quality, resources, download speed" and the like, while products of e-commerce pay attention to "price, logistics, service attitude" and the like, and a dedicated tag library needs to be established according to different products.
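
One way to picture the special label library is as a lookup from every custom label and each of its synonyms or near-synonyms to the canonical custom label. The sketch below assumes the synonyms come from a hand-written table rather than a word-sense similarity algorithm; `CUSTOM_TAGS`, `SYNONYMS`, `build_special_tags` and the synonym entries are hypothetical.

```python
# Hypothetical data: custom tags per product category and a synonym table.
CUSTOM_TAGS = {
    "music": ["sound quality", "resources", "download speed"],
    "e-commerce": ["price", "logistics", "service attitude"],
}
SYNONYMS = {
    "logistics": ["delivery", "shipping"],
    "price": ["cost"],
}

def build_special_tags(category):
    """Map every custom tag and each of its synonyms to the canonical custom tag."""
    library = {}
    for tag in CUSTOM_TAGS.get(category, []):
        library[tag] = tag
        for synonym in SYNONYMS.get(tag, []):
            library[synonym] = tag
    return library

special_tags = build_special_tags("e-commerce")
```

Storing the canonical label alongside each synonym lets later matching steps count "delivery" and "shipping" under the same label as "logistics".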

203. Generating the product label library from the dynamic label library, the special label library and a preset general label library.

In this embodiment, the general tag library, also called a public tag library, needs to be established to save time and labor consumption in consideration of product commonality. For example, in the user feedback of all products, the comment object relates to tags such as 'bug, network speed, interface, performance, charge', and the like, and the tags have common attributes and can be added into a general tag library.

Since the current general tag library and the special tag library can satisfy the basic coverage but cannot hit all tags, a dynamic tag library is also provided in this embodiment. The product label library composed by the above steps 201 to 203 may have real-time and comprehensive properties.
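
Combining the three libraries in steps 201 to 203 can be sketched as a simple merge that remembers where each label came from. The function name `build_product_tag_library` and the sample labels are hypothetical; because the three libraries are kept free of duplicates, the precedence order used here is only a safeguard.

```python
def build_product_tag_library(dynamic_tags, special_tags, general_tags):
    """Merge the three libraries, recording which library each tag came from."""
    library = {}
    # setdefault keeps the first source seen, so general > special > dynamic.
    for source, tags in (
        ("general", general_tags),
        ("special", special_tags),
        ("dynamic", dynamic_tags),
    ):
        for tag in tags:
            library.setdefault(tag, source)
    return library

product_tags = build_product_tag_library(
    dynamic_tags={"song recognition"},
    special_tags={"sound quality", "download speed"},
    general_tags={"bug", "interface", "performance"},
)
```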

204. Establishing a multi-level label tree according to the membership relationships among the labels in the product label library.

After the product label library is completed, the hierarchical (membership) relationships among its labels need to be established, that is, a multi-level label tree is built. At the top level, user comment labels for a product cover different dimensions such as "overall, function, design, performance, content resources, campaigns and advertisements". Each of these large dimensions may be further divided into more detailed second-level dimensions; for example, "performance" includes "flash back, crash, black screen, playback speed/lag, upgrade, installation problems" and so on. When users talk about the second-level dimension "playback speed/lag", they use different expressions (synonyms or near-synonyms) such as "speed, network speed, network loading"; all of these labels describe "playback speed" and are located in the bottom-layer label library, that is, they are the labels as users actually phrase them, as shown in FIG. 4A.
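
The multi-level label tree can be represented as nested mappings, with an index from each bottom-layer label back to its ancestors. The following is a sketch under the assumption of exactly three levels; the tree contents, `TAG_TREE` and `index_bottom_tags` are hypothetical and are not taken from FIG. 4A.

```python
# Hypothetical three-level tree: dimension -> sub-dimension -> bottom-layer tags.
TAG_TREE = {
    "performance": {
        "playback speed": ["speed", "network speed", "loading"],
        "crash": ["crash", "flash back"],
    },
    "design": {
        "interface": ["interface", "layout", "skin"],
    },
}

def index_bottom_tags(tree):
    """Map each bottom-layer tag to its (dimension, sub-dimension) path."""
    index = {}
    for dimension, subdimensions in tree.items():
        for subdimension, bottom_tags in subdimensions.items():
            for tag in bottom_tags:
                index[tag] = (dimension, subdimension)
    return index

bottom_index = index_bottom_tags(TAG_TREE)
```

Matching only has to look up subjects in `bottom_index`; the stored path is what later allows counts to be attributed to the upper-layer labels.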

205. Acquiring the subject and/or implied subject modified by the word-of-mouth words in the batch text content.

For example, the subject and/or implied subject modified by a word-of-mouth word may be obtained from the batch text content according to preset grammar rules.

It is understood that implied subjects may be obtained for all of the word-of-mouth words or for only some of them.

For example, analyzing the subject modified by an ordinary negative word-of-mouth word:

The negative word-of-mouth words appearing in negative microblog posts are extracted, the subjects they modify are analyzed, and those subjects are extracted. For example, in the negative comment "the interface of the new version of XX video is very junk", "junk" is a negative word-of-mouth word and the subject it modifies is "interface", so the pair (interface, junk) is extracted; depending on the level of the subject, it may also be extracted as (XX video-new version-interface, junk).

Analyzing negative word-of-mouth words with an implied subject:

Some negative comments contain a negative word-of-mouth word but no obvious subject. For example, in the comment "XX video is beautiful in appearance but is too laggy", "laggy" is recognized as a negative word-of-mouth word, but the subject it modifies is hidden: what the user means is that the speed of XX video is too laggy. For a negative word-of-mouth word with such concrete meaning, if no obvious subject can be found in the comment, the system automatically calls the corresponding subject library for matching, and the result is extracted as (XX video-speed, laggy).
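
The two cases above (explicit subject, then fallback to an implied subject) can be sketched with a toy rule. The source does not specify its grammar rules, so this sketch assumes a simple backward scan within the same clause over English tokens; `NEGATIVE_WORDS`, `KNOWN_TAGS`, `IMPLIED_SUBJECTS`, `CLAUSE_BREAKS` and `extract_pairs` are all hypothetical.

```python
import re

NEGATIVE_WORDS = {"junk", "laggy"}
KNOWN_TAGS = {"interface", "speed", "appearance"}
# Hypothetical subject library: word-of-mouth word -> subject it usually implies.
IMPLIED_SUBJECTS = {"laggy": "speed"}
CLAUSE_BREAKS = {"but", "and"}

def extract_pairs(comment):
    """Return (subject, word-of-mouth word) pairs found in one comment."""
    tokens = re.findall(r"[a-z]+", comment.lower())
    pairs = []
    for i, token in enumerate(tokens):
        if token not in NEGATIVE_WORDS:
            continue
        subject = None
        # Toy grammar rule: scan backwards within the same clause for a known tag.
        for previous in reversed(tokens[:i]):
            if previous in CLAUSE_BREAKS:
                break
            if previous in KNOWN_TAGS:
                subject = previous
                break
        if subject is None:
            # No explicit subject: fall back to the implied-subject library.
            subject = IMPLIED_SUBJECTS.get(token)
        if subject is not None:
            pairs.append((subject, token))
    return pairs

explicit = extract_pairs("The interface of the new version of XX video is very junk")
implied = extract_pairs("XX video is beautiful in appearance but is too laggy")
```

In the second comment the clause break at "but" prevents "appearance" from being taken as the subject of "laggy", so the implied subject "speed" is used instead.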

206. Acquiring the extended word-of-mouth words of the labels in the product label library.

For example, word co-occurrence rules and/or manual classification may be used to obtain the extended word-of-mouth words of a label. It can be understood that an extended word-of-mouth word is not an emotion word in the strict sense; it carries emotional meaning only when paired with a specific label. For example, in "the logistics is very fast", "fast" is an extended word-of-mouth word, and it has actual emotional meaning only when paired with "logistics".

It should be understood that a word co-occurrence rule refers to an algorithm that calculates the probability that two words or phrases appear together.
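
As a rough illustration of such a rule, the probability that two words appear together can be estimated as the fraction of comments containing both. The source does not specify the exact formula, so this estimator, the function name `cooccurrence_probability` and the sample comments are assumptions.

```python
def cooccurrence_probability(comments, word_a, word_b):
    """Fraction of comments in which both words appear."""
    if not comments:
        return 0.0
    both = 0
    for text in comments:
        tokens = set(text.lower().split())  # stand-in for word segmentation
        if word_a in tokens and word_b in tokens:
            both += 1
    return both / len(comments)

comments = [
    "logistics very fast",
    "logistics fast and cheap",
    "traffic usage too fast",
    "interface looks fine",
]
p = cooccurrence_probability(comments, "logistics", "fast")
```

A label and a candidate word with a high estimate could then be proposed as a "label + extended word-of-mouth word" pair, subject to manual review.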

For example, analyzing the subject modified by a negative extended word:

Some user comments contain no obvious negative word-of-mouth word but still express negative emotion. For example, in "the speed of XX video is slow and it uses up traffic fast", neither "slow" nor "fast" is a negative word-of-mouth word in itself (treating them as such would cause a large number of misjudgments), but when paired with a specific subject (label) each expresses negative emotion. In this case, a negative extended word bank and the corresponding grammar rules need to be established, and such cases are analyzed and extracted as (speed, slow) and (traffic, fast).
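
A negative extended word bank can be sketched as a set of (label, word) pairs that count only when both parts occur together. This sketch deliberately omits the grammar rules the text calls for and only checks that both words appear in the same comment, so it would misfire on a comment such as "speed is fast, traffic is slow"; `NEGATIVE_EXTENSIONS` and `match_extended_pairs` are hypothetical.

```python
# Hypothetical negative extension library: (tag, extended word) pairs.
NEGATIVE_EXTENSIONS = {("speed", "slow"), ("traffic", "fast")}

def match_extended_pairs(comment):
    """Return the (tag, extended word) pairs whose two parts both occur in the comment."""
    tokens = set(comment.lower().replace(",", " ").split())
    return sorted(
        (tag, word)
        for tag, word in NEGATIVE_EXTENSIONS
        if tag in tokens and word in tokens
    )

hits = match_extended_pairs("XX video speed is slow, traffic goes fast")
no_hits = match_extended_pairs("logistics is fast")
```

Note that "fast" alone produces no match in the second comment, which reflects the point that an extended word carries polarity only in combination with its label.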

207. Match the subject and/or the hidden subject with the bottom-layer labels in the multi-level label tree, and match the extended word-of-mouth words together with their corresponding labels in the batch text content.

In this embodiment, extended word-of-mouth words and ordinary word-of-mouth words may sit at the same level, but an extended word-of-mouth word has practical meaning only when collocated with a unique label (e.g. a bottom-layer label in the multi-level label tree).

Because each extended word-of-mouth word is obtained for a label in the product label library, the subject it modifies is already determined: the subject is the label itself. What remains is to count the correct occurrences of "label + extended word-of-mouth word" in the batch text.

208. Generate a result label tree reflecting the common problems in the batch text content according to the labels matched with the subject (such as bottom-layer labels) and the matching result of the extended word-of-mouth words and their corresponding labels.

For example, if the subject matches a bottom-layer label, the match is recorded at the position of that bottom-layer label, and

when an extended word-of-mouth word and its corresponding label are matched in the batch text content, the match is recorded at the position of that label;

the recorded results of the bottom-layer labels can then be reversely pulled to the positions of the upper-layer labels in the multi-level label tree to obtain a result label tree containing the matching results.

That is to say, in this embodiment, a word-of-mouth word is first found in the batch text, and the subject it modifies is searched for. If a subject is found, it is matched with the bottom-layer labels in the multi-level label tree, and on success the count of that bottom-layer label is incremented by 1. If no subject is found, the hidden subject of the word-of-mouth word can be looked up and matched with the bottom-layer labels, and on success the count of that bottom-layer label is incremented by 1. In addition, each "extended word-of-mouth word + label" combination is traversed and matched in the batch text content, and on success the count of the label corresponding to the extended word-of-mouth word is incremented by 1.
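The counting of subject matches described above can be sketched as follows. The label set, the hidden-subject table and the function name are illustrative assumptions; the sketch covers only the subject and hidden-subject branches, not the traversal of extended word-of-mouth words.

```python
from collections import Counter

# Hypothetical bottom-layer labels and hidden-subject library.
BOTTOM_LABELS = {"speed", "loading", "advertisement"}
IMPLICIT_SUBJECTS = {"laggy": "speed"}


def count_matches(extracted):
    """Count bottom-layer label hits.

    `extracted` is a list of (subject or None, word-of-mouth word) pairs.
    """
    counts = Counter()
    for subject, word in extracted:
        if subject is None:
            # No explicit subject: fall back to the hidden subject, if any.
            subject = IMPLICIT_SUBJECTS.get(word)
        if subject in BOTTOM_LABELS:
            counts[subject] += 1  # "bottom-layer label + 1"
    return counts


extracted = [("speed", "bad"), (None, "laggy"), ("price", "bad"), (None, "ugly")]
print(count_matches(extracted))  # Counter({'speed': 2})
```

Subjects that match no bottom-layer label, such as "price" here, are simply not counted in this sketch.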

Generally, matching against the multi-level label tree is counted upward from the bottom layer of the tree, whereas a user views the result label tree from top to bottom, i.e. starting from the top layer, for example "xx video" → "performance" → "laggy playback speed", as shown in fig. 4B. When the user then needs to view the specific text under a module, the matched results need to be "pulled" (i.e. pulled in the reverse direction).

In this embodiment, the system can pull the records of successful matches, mark their positions in the text, and highlight or emphasize them; after the reverse pull, the result label tree containing the matching results is obtained.

In fig. 4B, the number after each label is the number of successful matches, i.e. the number of common problems under that module. It can be seen that, in the negative feedback on xx video, complaints are concentrated mostly on functions (300), design (250) and performance (240), and the detailed classification result is clearly readable.

It should be noted that a bottom-layer label is the lowest label of each branch, such as speed, network speed, networking and loading in fig. 4B. If there are no further labels under "campaign" and "advertisement", these two also count as bottom-layer labels.

This embodiment illustrates the common problems under negative word-of-mouth. In other embodiments, the automatic classification of common problems under positive word-of-mouth can be obtained through the same process, which is not described in detail here.

Optionally, when the subject does not match any label, the similarity and the importance of the subject are calculated according to the semantic similarity and importance calculation rule;

and when the similarity of the subject is greater than or equal to a first preset value and/or the importance of the subject is greater than or equal to a second preset value, the subject is added to the dynamic label library as a label.

That is, after the subject modified by a negative word-of-mouth word is analyzed, the subject is matched with the dynamic label library and the matching result is recorded. If the subject is not matched, it is given priority for entry into the dynamic label library according to the semantic similarity and importance calculation rule.
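The threshold check for admitting an unmatched subject can be sketched as below. The source does not say how similarity and importance are computed, so they are taken as given inputs; the default preset values and the `require_both` flag (modelling the "and/or" wording) are assumptions made for the sketch.

```python
def maybe_add_label(subject, similarity, importance, dynamic_library,
                    first_preset=0.8, second_preset=0.5, require_both=False):
    """Add an unmatched subject to the dynamic label library if it qualifies."""
    similar = similarity >= first_preset
    important = importance >= second_preset
    # "and/or": either both conditions are required, or one is enough.
    accepted = (similar and important) if require_both else (similar or important)
    if accepted and subject not in dynamic_library:
        dynamic_library.append(subject)
    return accepted


library = []
maybe_add_label("battery", similarity=0.9, importance=0.1, dynamic_library=library)
print(library)  # ['battery']
```

With `require_both=True`, the same subject would be rejected because its importance is below the second preset value.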

The data analysis method in the above embodiment may classify and combine the results of labels of the same type according to the matching result, remove duplicates, and count the results. Classification statistics are then carried out upward layer by layer along the label tree until all labels have been processed, yielding the final result label tree.
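The layer-by-layer upward statistics can be sketched as a recursive roll-up in which each upper-layer label receives the sum of its children. The tree shape and the counts below are invented for illustration and are not the figures of fig. 4B.

```python
# Hypothetical multi-level label tree: parent label -> child labels.
TREE = {
    "xx video": ["function", "design", "performance"],
    "performance": ["speed", "loading"],
}


def roll_up(label, bottom_counts, result):
    """Fill `result` with per-label totals, summing children into parents."""
    children = TREE.get(label, [])
    if not children:
        total = bottom_counts.get(label, 0)  # bottom-layer label
    else:
        total = sum(roll_up(child, bottom_counts, result) for child in children)
    result[label] = total
    return total


bottom_counts = {"function": 3, "design": 2, "speed": 4, "loading": 1}
result = {}
roll_up("xx video", bottom_counts, result)
print(result["performance"], result["xx video"])  # 5 10
```

The resulting dictionary can then be read from the top layer downward, which corresponds to how a user browses the result label tree.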

Fig. 5 is a schematic structural diagram of a data analysis apparatus according to an embodiment of the present invention, and as shown in fig. 5, the data analysis apparatus in the embodiment includes: a product label library establishing unit 51, a subject acquiring unit 52, a matching unit 53 and a result label tree generating unit 54;

the product label library establishing unit 51 is configured to establish a product label library according to input batch text content;

the subject obtaining unit 52 is configured to obtain a subject modified by a public praise word according to the batch text content, where the public praise word is obtained by performing word segmentation processing on the batch text content and screening words reaching a preset frequency after the word segmentation processing through a pre-stored word bank;

the matching unit 53 is configured to match the subject acquired by the subject acquiring unit 52 with the tag in the product tag library established by the product tag library establishing unit;

the result tag tree generating unit 54 is configured to generate a result tag tree reflecting the commonality problem in the batch text content according to the tag matched with the subject in the matching unit 53.

For example, the aforementioned product label library establishing unit 51 is used for

Establishing a dynamic label library according to the input batch text content;

establishing a special label library according to the product category corresponding to the batch text content;

In a first optional application scenario, the aforementioned product label library establishing unit 51 is configured to:

acquire nouns in the batch text content, for example by performing word segmentation on the batch text content according to a custom word bank to obtain the nouns of the batch text content;

and, if a noun does not duplicate any tag in the special tag library or the general tag library, generate a dynamic tag library by taking the noun as a tag.
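The duplicate check against the special and general tag libraries can be sketched as follows. The nouns are assumed to come from a prior word segmentation step, which is not shown; the function name and sample data are illustrative.

```python
def build_dynamic_library(nouns, special_tags, general_tags):
    """Keep nouns that appear in neither the special nor the general library."""
    existing = set(special_tags) | set(general_tags)
    dynamic = []
    for noun in nouns:
        # Skip nouns already covered elsewhere, and avoid duplicates.
        if noun not in existing and noun not in dynamic:
            dynamic.append(noun)
    return dynamic


nouns = ["speed", "subtitle", "price", "subtitle", "danmaku"]
print(build_dynamic_library(nouns, special_tags=["speed"], general_tags=["price"]))
# ['subtitle', 'danmaku']
```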

In a second optional application scenario, the aforementioned product tag library establishing unit 51 can also be used for

Obtaining a custom label to which the product belongs according to the product category corresponding to the batch text content;

searching for synonyms and near-synonyms of the custom label;

and generating a special label library of the batch text content from the custom label together with its synonyms and near-synonyms.
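One possible representation of such a special label library maps every surface form (the custom label itself, plus each synonym or near-synonym) back to its custom label. The synonym table below is a hypothetical stand-in for whatever thesaurus lookup is actually used.

```python
# Hypothetical synonym table for custom labels of a video product.
SYNONYMS = {
    "speed": ["velocity", "pace"],
    "advertisement": ["ad", "commercial"],
}


def build_special_library(custom_labels, synonyms):
    """Map each label, synonym and near-synonym to its custom label."""
    library = {}
    for label in custom_labels:
        library[label] = label
        for word in synonyms.get(label, []):
            # Keep the first mapping if a synonym is shared by two labels.
            library.setdefault(word, label)
    return library


special = build_special_library(["speed", "advertisement"], SYNONYMS)
print(special["ad"])  # advertisement
```

Storing the mapping in this direction lets a subject found in a comment be normalized to its custom label with a single lookup.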

In a third optional application scenario, the subject obtaining unit 52 is configured to

Obtaining a subject and/or a hidden subject modified by a word-of-mouth in the batch text content;

the matching unit 53 is used for

In a fourth optional application scenario, the apparatus may further include an extended word-of-mouth acquiring unit 55 shown in fig. 6A:

the extended public praise word acquiring unit 55 is configured to acquire an extended public praise word of a tag in the product tag library established by the product tag library establishing unit 51;

the matching unit 53 is further configured to match the extended public praise words acquired by the extended public praise word acquiring unit 55 with the tags corresponding to the extended public praise words in the batch of text contents;

the result tag tree generation unit 54 is configured to

generate a result label tree reflecting the common problems in the batch text content according to the labels matched with the subject and the matching result of the extended word-of-mouth words and their corresponding labels.

In a fifth optional application scenario, the apparatus may further include a multi-level label tree establishing unit 56, as shown in fig. 6B:

the multi-level label tree establishing unit 56 is configured to establish a multi-level label tree according to the membership relationship between the labels in the product label library established by the product label library establishing unit 51;

the matching unit 53 is used for

In a sixth optional application scenario, the result label tree generation unit 54 is configured to

and reversely pulling the recording result of the bottom layer label at the position of the upper layer label in the multi-level label tree to obtain a result label tree reflecting the common problem in the batch text.

That is, the result tag tree generating unit 54 is configured to record the matching result at the position of the corresponding bottom-level tag in the multi-level tag tree when, according to the matching result of the matching unit 53, the subject and the tag are correctly matched,

and to reversely pull the successful matching results of the bottom-level tags to the positions of the upper-level tags in the multi-level tag tree to obtain a result tag tree reflecting the common problems in the batch text.

In a seventh optional application scenario, the apparatus further includes a subject similarity obtaining unit 57 and a subject processing unit 58, which are not shown in the figure:

the subject similarity obtaining unit 57 is configured to obtain, according to the matching result of the matching unit 53, the similarity and the importance of the subject according to the semantic similarity and importance calculation rule when the subject and the tag are not matched;

a subject processing unit 58, configured to add the subject as a tag into a dynamic tag library when the similarity of the subject obtained by the subject similarity obtaining unit 57 is greater than or equal to a first preset value, and/or the importance of the subject is greater than or equal to a second preset value.

The data analysis device may execute the technical solution of any one of the method embodiments shown in fig. 1 to fig. 3, and the implementation principle and the technical effect are similar, which are not described herein again.

The data analysis device in the above embodiment makes data processing intelligent: the emotional polarity of the data is judged automatically from the data characteristics, and the dimensions on which good and bad evaluations concentrate are classified automatically. It is efficient: after a one-time configuration, the whole process completes automatically, which greatly reduces labor. It is systematic: it avoids the differing subjective standards and incomplete frameworks of different people performing the classification. It is real-time: the latest developments of the product are reflected promptly, and real-time display of results is supported.

Those of ordinary skill in the art will understand that all or part of the steps of the above method embodiments may be implemented by hardware under the control of program instructions. The program may be stored in a computer-readable storage medium. When executed, the program performs the steps of the method embodiments described above. The storage medium includes various media that can store program code, such as ROM, RAM, magnetic disks or optical disks.

Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims

1. A method of data analysis, comprising:

establishing a dynamic label library according to the input text content;

generating the product label library by the dynamic label library, the special label library and a preset general label library;

matching the subject with a bottom layer label in the multi-level label tree;

acquiring an extended word-of-mouth word of a label in the product label library;

2. The method of claim 1, wherein the building a dynamic tag library according to the input text content comprises:

acquiring nouns in the text content;

3. The method according to claim 1, characterized in that a dedicated label library is established according to the product category corresponding to the text content;

searching for synonyms and near-synonyms of the custom label;

4. The method of claim 2, wherein the obtaining the noun in the text content comprises:

5. The method according to any one of claims 1 to 4, wherein obtaining a subject of word-of-mouth modification from the text content comprises:

matching the subject with the tags in the product tag library, including:

6. The method of claim 1, wherein generating a result tag tree reflecting a commonality problem in the textual content based on tags matching the subject comprises:

7. The method of claim 1, further comprising:

if the subject is not matched with the label, acquiring the similarity and the importance of the subject according to a semantic similarity and importance calculation rule;

8. A data analysis apparatus, comprising:

the product label library establishing unit is used for establishing a dynamic label library according to the input text content; establishing a special label library according to the product category corresponding to the text content; generating the product label library by the dynamic label library, the special label library and a preset general label library;

the matching unit is used for matching the subject with a bottom layer label in the multi-level label tree;

the matching unit is further configured to match the extended public praise words acquired by the extended public praise word acquisition unit with the tags corresponding to the extended public praise words in the text content;

and the result label tree generating unit is used for generating a result label tree reflecting the common problems in the text content according to the label matched with the subject and the matching result of the extended word-of-mouth words and their corresponding labels.

9. The apparatus of claim 8, wherein the product tag library creating unit is configured to create a library of product tags

Acquiring nouns in the text content;

10. The apparatus of claim 8, wherein the product tag library creating unit is configured to create a library of product tags

searching for synonyms and near-synonyms of the custom label;

11. The apparatus of claim 9, wherein the product tag library creating unit is configured to create a library of product tags

12. The apparatus according to any one of claims 8 to 11, wherein the subject obtaining unit is configured to obtain the subject

the matching unit is used for

13. The apparatus of claim 8, wherein the result tag tree generation unit is configured to generate the result tag tree

14. The apparatus of claim 8, further comprising:

a subject similarity obtaining unit, configured to obtain, according to the matching result of the matching unit, similarity and importance of the subject according to a semantic similarity and importance calculation rule when the subject and the tag are not matched;

15. A storage medium having stored thereon computer-executable instructions which, when loaded and executed by a processor, carry out a method of data analysis according to any one of claims 1 to 7.