CN105095288B - Data analysis method and data analysis device - Google Patents

Data analysis method and data analysis device

Info

Publication number
CN105095288B
Authority
CN
China
Prior art keywords
label
subject
library
text content
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201410204300.1A
Other languages
Chinese (zh)
Other versions
CN105095288A (en)
Inventor
温春龙
陈妍
梁璟彪
骆玘
黄利贤
樊中一
吕虹
刘敏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Tencent Cloud Computing Beijing Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN201410204300.1A
Publication of CN105095288A
Application granted
Publication of CN105095288B
Anticipated expiration

Abstract

The invention provides a data analysis method and a data analysis device. The method comprises: establishing a product label library according to input text content; obtaining, from the text content, a subject modified by a word-of-mouth word, wherein the word-of-mouth word is obtained by performing word segmentation on the text content and using a pre-stored word bank to screen the segmented words that reach a preset frequency; matching the subject with the labels in the product label library; and generating, according to the labels matched with the subject, a result label tree reflecting the common problems in the text content. The method can collect comment content comprehensively and in real time, simplify existing data analysis, and improve the accuracy of data analysis.

Description

Data analysis method and data analysis device
Technical Field
The present invention relates to internet technologies, and in particular, to a data analysis method and a data analysis device.
Background
At present, after collecting user feedback on a product, some enterprises classify it manually according to the text content, judging which specific aspects of the product (such as functions or bugs) a comment refers to and the emotional polarity (positive or negative) of the comment.
That is, the word-of-mouth of the product and the points on which good and bad reviews concentrate are judged manually. By reading the comments, a person judges whether the emotion expressed is positive, negative or neutral, and which dimension (such as performance, function or price) the evaluated object belongs to. The comments are then classified manually, and finally counted and sorted to show on which dimensions the good and bad reviews of the product are mainly concentrated.
However, with a large amount of data, heavy human involvement leads to repetitive labor and low efficiency, and the classification and summarization lack systematicness and consistency, resulting in high labor cost and a lack of real-time capability.
For this reason, Taobao attribute-pair classification has also appeared in the prior art, in which, for example, preset attribute words and emotion words are matched one by one and the summarized results are counted.
However, the shortcomings of Taobao attribute-pair classification include: first, the classification of data lacks comprehensiveness; second, it does not incorporate an analysis of word-of-mouth, so only the conclusion of reviews on a single aspect can be seen.
For this reason, a method capable of performing data analysis comprehensively and in real time is required.
Disclosure of Invention
In order to overcome the defects in the prior art, the invention provides a data analysis method and a data analysis device, which are used for comprehensively collecting comment contents in real time, simplifying the existing data analysis mode and improving the accuracy of data analysis.
In a first aspect, an embodiment of the present invention provides a data analysis method, including:
establishing a product label library according to the input text content;
obtaining a subject word modified by a public praise word according to the text content, wherein the public praise word is obtained by performing word segmentation processing on the text content and screening words reaching preset frequency after the word segmentation processing through a pre-stored word bank;
matching the subject with the labels in the product label library;
and generating a result label tree reflecting the commonality problem in the text content according to the label matched with the subject.
With reference to the first aspect, in a first possible implementation manner, the creating a product tag library according to input text content includes:
establishing a dynamic label library according to the input text content;
establishing a special label library according to the product category corresponding to the text content;
and generating the product label library by using the dynamic label library, the special label library and a preset general label library.
With reference to the first possible implementation manner of the first aspect, in a second possible implementation manner, the establishing a dynamic tag library according to the input text content includes:
acquiring nouns in the text content;
judging whether the frequency of occurrence of the nouns is greater than a preset threshold value or not;
if the frequency of occurrence of the noun is larger than a preset threshold, determining whether the noun is repeated with the tags in the special tag library and the tags in the general tag library;
and if the noun is not repeated with the tags in the special tag library and the tags in the general tag library, the noun is used as the tags to generate the dynamic tag library.
With reference to the first possible implementation manner of the first aspect, in a third possible implementation manner, the establishing a special tag library according to the product category corresponding to the text content includes:
obtaining a custom label of the product according to the product category corresponding to the text content;
searching for synonyms and near-synonyms of the custom label;
and adding the custom label, together with its synonyms and near-synonyms, to the special tag library of the text content.
With reference to the second possible implementation manner of the first aspect, in a fourth possible implementation manner, the obtaining a noun in the text content includes:
and performing word segmentation processing on the text content according to a user-defined word bank to obtain a noun of the text content.
With reference to the first aspect and the foregoing possible implementation manners of the first aspect, in a fifth possible implementation manner, the obtaining a subject modified by a word-of-mouth word according to the text content includes:
obtaining a subject and/or an implied subject modified by a word-of-mouth word in the text content;
and the matching the subject with the tags in the product tag library includes:
matching the subject and/or the implied subject with the labels in the product label library, respectively.
With reference to the first aspect and the first to fourth possible implementation manners of the first aspect, in a sixth possible implementation manner, before the step of generating, according to the tag matched with the subject, a result tag tree that reflects a commonality problem in the text content, the method further includes:
acquiring extended word-of-mouth words of the labels in the product label library;
matching, in the text content, the extended word-of-mouth words with their corresponding labels;
and the generating, according to the label matched with the subject, a result label tree reflecting the common problems in the text content includes:
generating the result label tree reflecting the common problems in the text content according to the labels matched with the subject and the results of matching the extended word-of-mouth words with their corresponding labels.
With reference to the first aspect and the first to fourth possible implementation manners of the first aspect, in a seventh possible implementation manner, after the step of establishing a product tag library according to the input text content, the method further includes:
establishing a multi-level label tree according to the membership relationship among the labels in the product label library;
matching the subject with the tags in the product tag library, including:
and matching the subject with the bottom layer label in the multi-level label tree.
With reference to the seventh possible implementation manner of the first aspect, in an eighth possible implementation manner,
generating a result label tree reflecting the commonality problem in the text content according to the label matched with the subject, wherein the result label tree comprises:
if the subject is matched with the bottom layer label, recording the position of the bottom layer label;
and tracing the recorded results of the bottom layer labels back up to the positions of the upper layer labels in the multi-level label tree, to obtain a result label tree reflecting the common problems in the text content.
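The recording and upward tracing of bottom layer matches can be illustrated with a minimal sketch. It is not the patent's implementation: the `PATHS` table, the label names in it, and the function `roll_up` are hypothetical, and the sketch assumes each bottom layer label has exactly one chain of upper layer labels.

```python
from collections import Counter

# Hypothetical table: bottom layer label -> chain of upper layer labels.
PATHS = {
    "network speed": ("performance", "playback speed"),
    "loading": ("performance", "playback speed"),
    "skin": ("design", "interface"),
}

def roll_up(matched_leaves):
    """Count each matched bottom layer label, then add it to every ancestor."""
    counts = Counter()
    for leaf in matched_leaves:
        path = PATHS.get(leaf)
        if path is None:
            continue  # subject did not match any bottom layer label
        counts[path + (leaf,)] += 1          # record the bottom layer position
        for depth in range(1, len(path) + 1):
            counts[path[:depth]] += 1        # propagate to each upper layer label
    return counts

tree_counts = roll_up(["network speed", "loading", "network speed", "skin", "unknown"])
```

Each key of `tree_counts` is a path from the top of the tree, so the totals at upper layers show where comments concentrate while the full-length keys keep the detail.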
With reference to the first possible implementation manner of the first aspect, in a ninth possible implementation manner, the method further includes:
if the subject is not matched with the label, obtaining the similarity and the importance of the subject according to the semantic similarity and importance calculation rule;
and if the similarity of the subject is greater than or equal to a first preset value and/or the importance of the subject is greater than or equal to a second preset value, adding the subject serving as a label into the dynamic label library.
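The threshold test for promoting an unmatched subject to a dynamic label might look like the sketch below. The text does not specify the semantic similarity or importance calculation rules, so both are stand-in assumptions here: similarity is approximated by character-level string similarity from Python's `difflib`, importance by the subject's share of all subject occurrences, and the "and/or" condition is implemented as "or". The function names and threshold values are hypothetical.

```python
from difflib import SequenceMatcher

def similarity(subject, tags):
    """Stand-in for semantic similarity: best string similarity to any tag."""
    return max((SequenceMatcher(None, subject, t).ratio() for t in tags), default=0.0)

def importance(subject, subject_counts):
    """Stand-in for importance: share of all subject occurrences."""
    total = sum(subject_counts.values())
    return subject_counts.get(subject, 0) / total if total else 0.0

def maybe_add(subject, tags, subject_counts, dynamic, sim_min=0.8, imp_min=0.3):
    """Add the subject to the dynamic library if either threshold is reached."""
    if similarity(subject, tags) >= sim_min or importance(subject, subject_counts) >= imp_min:
        dynamic.add(subject)
        return True
    return False

dynamic = set()
counts = {"song recognition": 8, "weather": 1, "interface": 11}
added = maybe_add("song recognition", {"interface", "bug"}, counts, dynamic)
skipped = maybe_add("weather", {"interface", "bug"}, counts, dynamic)
```

A real system would replace `similarity` with a genuine word-sense measure; the control flow around the two preset values would stay the same.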
In a second aspect, an embodiment of the present invention provides a data analysis apparatus, including:
the product label library establishing unit is used for establishing a product label library according to the input text content;
the subject acquiring unit is used for acquiring a subject modified by public praise words according to the text content, wherein the public praise words are obtained by performing word segmentation processing on the text content and screening words reaching preset frequency after the word segmentation processing through a pre-stored word bank;
the matching unit is used for matching the subject acquired by the subject acquisition unit with the tags in the product tag library established by the product tag library establishing unit;
and the result label tree generating unit is used for generating a result label tree reflecting the commonality problem in the text content according to the label matched with the subject in the matching unit.
With reference to the second aspect, in a first possible implementation manner, the product tag library establishing unit is specifically configured to:
establish a dynamic label library according to the input text content;
establish a special label library according to the product category corresponding to the text content;
and generate the product label library from the dynamic label library, the special label library and a preset general label library.
With reference to the first possible implementation manner of the second aspect, in a second possible implementation manner, the product tag library establishing unit is specifically configured to:
acquire the nouns in the text content;
judge whether the frequency of occurrence of a noun is greater than a preset threshold;
if the frequency of occurrence of the noun is greater than the preset threshold, determine whether the noun duplicates a tag in the special tag library or a tag in the general tag library;
and if the noun duplicates no tag in the special tag library or the general tag library, generate the dynamic tag library with the noun as a tag.
With reference to the first possible implementation manner of the second aspect, in a third possible implementation manner, the product tag library establishing unit is specifically configured to:
obtain a custom label of the product according to the product category corresponding to the text content;
search for synonyms and near-synonyms of the custom label;
and add the custom label, together with its synonyms and near-synonyms, to the special tag library of the text content.
With reference to the second possible implementation manner of the second aspect, in a fourth possible implementation manner, the product tag library establishing unit is specifically configured to:
perform word segmentation on the text content according to a custom word bank to obtain the nouns of the text content.
With reference to the second aspect and the foregoing possible implementation manners of the second aspect, in a fifth possible implementation manner, the subject acquiring unit is specifically configured to:
obtain a subject and/or an implied subject modified by a word-of-mouth word in the text content;
and the matching unit is specifically configured to:
match the subject and/or the implied subject acquired by the subject acquiring unit with the tags in the product tag library established by the product tag library establishing unit, respectively.
With reference to the second aspect and the first to fourth possible implementation manners of the second aspect, in a sixth possible implementation manner, the apparatus further includes:
an extended word-of-mouth word acquiring unit, configured to acquire extended word-of-mouth words of the tags in the product tag library established by the product tag library establishing unit;
the matching unit is further configured to:
match, in the text content, the extended word-of-mouth words acquired by the extended word-of-mouth word acquiring unit with their corresponding labels;
and the result label tree generating unit is specifically configured to:
generate a result label tree reflecting the common problems in the text content according to the labels matched with the subject and the results of matching the extended word-of-mouth words with their corresponding labels.
With reference to the second aspect and the first to fourth possible implementation manners of the second aspect, in a seventh possible implementation manner, the apparatus further includes:
a multi-level label tree establishing unit, configured to establish a multi-level label tree according to the membership relationships among the labels in the product label library established by the product label library establishing unit;
and the matching unit is specifically configured to:
match the subject acquired by the subject acquiring unit with the bottom layer labels in the multi-level label tree established by the multi-level label tree establishing unit.
With reference to the seventh possible implementation manner of the second aspect, in an eighth possible implementation manner, the result label tree generating unit is specifically configured to:
if the subject matches a bottom layer label, record the position of the bottom layer label;
and trace the recorded results of the bottom layer labels back up to the positions of the upper layer labels in the multi-level label tree, to obtain a result label tree reflecting the common problems in the text content.
With reference to the first possible implementation manner of the second aspect, in a ninth possible implementation manner, the apparatus further includes:
a subject similarity obtaining unit, configured to obtain, according to the matching result of the matching unit, similarity and importance of the subject according to the semantic similarity and importance calculation rule when the subject and the tag are not matched;
and the subject processing unit is used for adding the subject as a label into the dynamic label library when the similarity of the subject acquired by the subject similarity acquisition unit is greater than or equal to a first preset value and/or the importance of the subject is greater than or equal to a second preset value.
According to the above technical solutions, the data analysis method and data analysis device of the embodiments of the invention establish a comprehensive product tag library, obtain the subject modified by a word-of-mouth word, match the subject with the tags in the product tag library, and, when they match, generate a result tag tree reflecting the common problems. The comment content in the text content can thus be collected comprehensively and in real time, the existing data analysis approach is simplified, and the accuracy of data analysis is improved.
Drawings
FIG. 1 is a schematic flow chart of a data analysis method according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart of a data analysis method according to another embodiment of the present invention;
FIG. 3 is a diagram illustrating multi-level tag tree generation according to an embodiment of the present invention;
FIG. 4A is a diagram illustrating a multi-level tag tree according to an embodiment of the present invention;
FIG. 4B is a diagram of a result tag tree according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of a data analysis apparatus according to an embodiment of the present invention;
FIG. 6A and FIG. 6B are schematic structural diagrams of a data analysis apparatus according to another embodiment of the present invention.
Detailed Description
In the embodiments of the invention, a label refers to the specific object a user comments on when reviewing a product. For example, in the comment "the interface of the new version of XX video is very junk", the specific object of the comment is the interface, so the word "interface" forms a label.
The embodiments of the invention provide a data analysis method and a data analysis device for automatically locating the dimensions on which the positive and negative word-of-mouth of a product is concentrated. They mainly address the following: judging the emotional polarity (positive, negative or neutral) of user comments; dynamically and automatically classifying, counting and ranking the common problems under each emotional polarity; displaying, in order of importance, the aspects of the product that users praise, criticize and discuss most; and tracking how these change over time. For example, the data analysis method of the embodiments can be used as follows:
First, locating the main dimensions that influence word-of-mouth: by acquiring a large amount of user feedback (microblogs, third-party application markets and forums) together with positive and negative word-of-mouth words, the dimensions on which the positive and negative word-of-mouth of a product is concentrated are analyzed automatically, comprehensively and in real time; the underlying reasons that influence users' word-of-mouth are mined through semantic analysis; and the main points of criticism and problems of the product are located quickly, helping the product team identify what to improve.
Second, analyzing the word-of-mouth changes in each dimension of the product: the word-of-mouth of each dimension and its trend are analyzed automatically, for example a comparison of word-of-mouth before and after a new version is released, the word-of-mouth of a new function, or changes in the word-of-mouth of the interface design, and the results are displayed visually. Members of a product team who are responsible for different modules care about different things: developers may focus on performance, while designers may focus on interface and style. Breaking down the word-of-mouth changes by dimension meets the needs of these different parties.
Third, classifying feedback hotspots: the hotspots in user comments are analyzed, and user feedback hotspots are merged into modules by combining synonyms and near-synonyms, making the classification results more accurate and practical.
FIG. 1 is a schematic flow chart of a data analysis method according to an embodiment of the present invention. As shown in FIG. 1, the data analysis method of this embodiment is as follows.
101. Establish a product label library according to the input batch text content.
For example, the product tag library in the present embodiment may include a dynamic tag library, a dedicated tag library, and a general tag library.
The dynamic label library is established according to input batch text contents, and the special label library is established according to product categories corresponding to the batch text contents.
The general tag library may be built up in advance through manual classification.
102. Acquire a subject modified by a word-of-mouth word according to the batch text content.
For example, the word-of-mouth words may be obtained by performing word segmentation on the text content and using a pre-stored word bank to screen the segmented words that reach a preset frequency.
It can be understood that original information (corresponding to the batch text content) in which users comment on the product (including the name of the product, its series aliases, or the names of some key functional modules) is collected by crawling microblogs and/or forums through an application programming interface (API) or a web crawler. After the original information is cleaned, an existing Chinese lexical analysis system can be used to analyze word-of-mouth tendency and count the positive or negative word-of-mouth words.
For example, cleaning the original information can be understood as removing repeated and invalid information from it, that is, filtering the original data. The filtered information is then segmented by a Chinese lexical analysis system, and the segmented words are analyzed for word-of-mouth tendency and screened using the pre-stored word bank.
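The cleaning and screening just described can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the `LEXICON` contents, the function names, and the frequency threshold are hypothetical, and a lowercase whitespace split stands in for a Chinese lexical analysis system.

```python
from collections import Counter

# Hypothetical pre-stored word bank: word -> polarity.
LEXICON = {"junk": "negative", "laggy": "negative", "great": "positive"}

def clean(posts):
    """Drop empty posts and exact duplicates, preserving order."""
    seen, kept = set(), []
    for post in posts:
        text = post.strip()
        if text and text not in seen:
            seen.add(text)
            kept.append(text)
    return kept

def tokenize(text):
    """Stand-in for a real word segmenter: lowercase whitespace split."""
    return text.lower().split()

def screen_word_of_mouth(posts, min_count=2):
    """Keep lexicon words whose frequency reaches min_count, with polarity."""
    counts = Counter(tok for post in posts for tok in tokenize(post))
    return {w: LEXICON[w] for w, c in counts.items() if w in LEXICON and c >= min_count}

posts = ["interface is junk", "interface is junk", "  ", "playback is junk", "sound is great"]
cleaned = clean(posts)
words = screen_word_of_mouth(cleaned)
```

Because duplicates are removed before counting, a repeated post cannot inflate the frequency of a word on its own.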
103. Match the subject with the labels in the product label library.
104. Generate, according to the labels matched with the subject, a result label tree reflecting the common problems in the batch text content, as shown in FIG. 4B.
In this data analysis method, a comprehensive product label library is established, the subject modified by a word-of-mouth word is obtained, the subject is matched with the labels in the product label library, and when they match, a result label tree reflecting the common problems is generated. The comment content in batch text content can thus be collected comprehensively and in real time, existing data analysis is simplified, and the accuracy of data analysis is improved.
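Steps 101 to 104 can be tied together in a short end-to-end sketch. Everything here is an assumption for illustration: the tag library is taken as given, the subject is found by a naive "<subject> is <word>" rule on English tokens rather than by real grammar analysis, and a flat tally per label and polarity stands in for the result label tree.

```python
from collections import Counter

TAG_LIBRARY = {"interface", "speed", "price"}  # step 101, assumed already built
WORD_OF_MOUTH = {"junk": "negative", "slow": "negative", "great": "positive"}

def extract_pairs(comment):
    """Step 102 (naive rule): in '<subject> is <word>', take the subject."""
    tokens = comment.lower().split()
    pairs = []
    for i, tok in enumerate(tokens):
        if tok in WORD_OF_MOUTH and i >= 2 and tokens[i - 1] == "is":
            pairs.append((tokens[i - 2], tok))
    return pairs

def analyze(comments):
    """Steps 103 and 104: match subjects to tags, tally by tag and polarity."""
    result = Counter()
    for comment in comments:
        for subject, word in extract_pairs(comment):
            if subject in TAG_LIBRARY:
                result[(subject, WORD_OF_MOUTH[word])] += 1
    return result

summary = analyze([
    "The interface is junk",
    "interface is great",
    "the speed is slow",
    "the weather is great",
])
```

The comment about the weather is dropped at the matching step because "weather" is not a label, which is the filtering role the product label library plays.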
FIG. 2 is a schematic flow chart of a data analysis method according to another embodiment of the present invention, and FIG. 3 is a schematic diagram of the generation of a multi-level label tree according to an embodiment of the present invention. As shown in FIG. 2 and FIG. 3, the data analysis method of this embodiment is as follows.
201. Establish a dynamic label library according to the input batch text content.
At different times, the hotspots of user evaluation may shift or new labels may appear, so a dynamic label library needs to be established to keep the analysis real-time and accurate. For example, XX Music launches a new function called "listen to songs and identify songs" that quickly becomes a focus of user attention, but this label is not in the existing label library; a mechanism for identifying and adding new labels is then needed to keep the label library complete and up to date. A dynamic label can be understood as a hot word or new word that appears at a certain time.
In the process of establishing the label library, the comprehensiveness of the labels can be further improved by means of algorithms such as word-sense similarity or near-synonym discovery. For example, the label "interface" may be expressed in similar ways (synonyms or near-synonyms) in user evaluations, such as: pages, panels, appearance, layout, skin, desktop, etc., which need to be grouped as synonyms.
For example, this step 201 may also include the following sub-steps not shown in the figures:
A2011. Acquire the nouns in the batch text content.
For example, using the common nouns of the product and/or competing products, word segmentation is performed on the batch text content by a Chinese lexical analysis system to obtain the nouns and/or word-of-mouth words of the batch text content.
For example, the ICTCLAS segmentation algorithm can be called through the segmentation interface provided by the ICTCLAS system to segment the batch text content.
The ICTCLAS system needs to call a custom word bank. The custom word bank contains specific words with their part-of-speech tags and acts as a submodule of the word segmentation system; based on it, the word segmentation algorithm can split a sentence into separate words. The comprehensiveness of the custom word bank affects the accuracy of word segmentation, so the word bank should be updatable, able to accumulate entries over time, and suited to the language of microblogs and forums.
A2012. Judge whether the frequency of occurrence of a noun is greater than a preset threshold.
A2013. If the frequency of occurrence of the noun is greater than the preset threshold, determine whether the noun duplicates a tag in the dedicated tag library or a tag in the general tag library.
Of course, when the frequency of occurrence of the noun is less than or equal to the preset threshold, the noun may be ignored or discarded.
A2014. If the noun duplicates no tag in the dedicated tag library or the general tag library, generate the dynamic tag library with the noun as a tag.
Of course, if a noun duplicates a tag in the dedicated tag library, the noun is discarded. Likewise, if a noun duplicates a tag in the general tag library, the noun is discarded. The labels in the dynamic tag library, the dedicated tag library and the general tag library therefore do not repeat one another.
In this embodiment, the dynamic tag library may be a tag library composed of nouns.
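Sub-steps A2012 to A2014 amount to a frequency filter followed by a duplicate check, which the following sketch illustrates. The function name, the threshold value and the sample nouns are assumptions; the nouns are taken as already extracted by the word segmentation step.

```python
from collections import Counter

def build_dynamic_tags(nouns, dedicated, general, threshold=2):
    """Keep nouns seen more than `threshold` times that are not already tags."""
    counts = Counter(nouns)
    return {
        noun
        for noun, count in counts.items()
        if count > threshold            # A2012/A2013: frequency strictly above threshold
        and noun not in dedicated       # A2013/A2014: no duplicate in dedicated library
        and noun not in general         # A2013/A2014: no duplicate in general library
    }

nouns = ["song-recognition"] * 3 + ["interface"] * 5 + ["lyrics"]
dynamic = build_dynamic_tags(
    nouns,
    dedicated={"sound quality"},
    general={"interface", "bug"},
)
```

Here "interface" is frequent but already a general tag, and "lyrics" is too rare, so only the new hot word survives.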
202. Establish a dedicated tag library according to the product category corresponding to the batch text content.
For example, if a certain text content is "the XX video interface is very junk", the product category corresponding to the text content may be the interface category in the computer field, and the dedicated tag library may then be the tag library corresponding to the interface category, which may include: interface, appearance, layout, skin, desktop, etc.
Dedicated tags can be understood as the terms commonly used in the field to which a given text content belongs; each tag in the dedicated tag library belongs to a specific field.
For example, this step 202 may also include the following sub-steps not shown in the figures:
A2021. Obtain a custom label of the product according to the product category corresponding to the batch text content.
A2022. Search for synonyms and near-synonyms of the custom label.
For example, synonyms and near-synonyms of the custom label can be found according to a word-sense similarity rule.
A2023. Add the custom label, together with its synonyms and near-synonyms, to the dedicated tag library of the batch text content.
That is, because product categories differ, a custom tag lexicon needs to be established for each kind of product to ensure the accuracy of semantic analysis. For example, users of music products pay attention to "sound quality, resources, download speed" and the like, while users of e-commerce products pay attention to "price, logistics, service attitude" and the like, so a dedicated tag library needs to be established for each product.
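Sub-steps A2021 to A2023 can be sketched as a lookup of custom labels by category followed by synonym expansion. The custom labels below come from the examples in the text; the `SYNONYMS` table, its entries ("shipping", "delivery", "cost") and the function name are hypothetical, and a fixed table stands in for a word-sense similarity rule.

```python
# Custom labels per product category (examples taken from the text).
CUSTOM_LABELS = {
    "music": ["sound quality", "resources", "download speed"],
    "e-commerce": ["price", "logistics", "service attitude"],
}

# Hypothetical synonym table standing in for a word-sense similarity rule.
SYNONYMS = {"logistics": ["shipping", "delivery"], "price": ["cost"]}

def build_dedicated_library(category):
    """Map every label and synonym to its canonical custom label."""
    library = {}
    for label in CUSTOM_LABELS.get(category, []):
        library[label] = label
        for alt in SYNONYMS.get(label, []):
            library[alt] = label
    return library

lib = build_dedicated_library("e-commerce")
```

Storing the canonical label as the value lets later matching treat "shipping" and "delivery" as hits on "logistics".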
203. Generate the product label library from the dynamic label library, the dedicated tag library and a preset general tag library.
In this embodiment, the general tag library, also called a public tag library, is established to save time and labor by exploiting what products have in common. For example, in the user feedback of all products, the objects commented on involve tags such as "bug, network speed, interface, performance, charges"; these tags are common to all products and can be added to the general tag library.
Since the general tag library and the dedicated tag library provide basic coverage but cannot hit all tags, a dynamic tag library is also provided in this embodiment. The product label library formed through steps 201 to 203 is therefore both real-time and comprehensive.
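Combining the three libraries in step 203 is essentially a union. The sketch below also records which library each tag came from, which is an added assumption rather than something the text requires; the function name and sample tags are hypothetical.

```python
def merge_libraries(general, dedicated, dynamic):
    """Union of the three libraries; the first library listing a tag owns it."""
    merged = {}
    for source, tags in (("general", general), ("dedicated", dedicated), ("dynamic", dynamic)):
        for tag in tags:
            merged.setdefault(tag, source)  # keep the earliest source on overlap
    return merged

product_library = merge_libraries(
    general={"bug", "interface"},
    dedicated={"sound quality", "interface"},
    dynamic={"song recognition"},
)
```

If the libraries were built as described, with duplicates discarded, no tag appears twice and the overlap rule never triggers; it is kept here only as a safeguard.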
204. Establish a multi-level label tree according to the membership relationships among the labels in the product label library.
After the product label library is completed, the hierarchical or membership relationships among its labels need to be established, that is, a multi-level label tree is built. The labels in user comments on a product first fall into different top-level dimensions such as "overall, function, design, performance, content resources, campaigns and advertisements". Each of these broad dimensions can be further divided into more detailed second-level dimensions; for example, "performance" includes "unexpected exits, crashes, black screens, slow or stuttering playback, upgrade and installation problems". When users talk about the second-level dimension of playback speed, they use different expressions (synonyms and the like) such as "speed, network speed, network loading". All of these labels describe "playback speed" and sit in the bottom layer of the label library, that is, they are the labels as users actually phrase them, as shown in FIG. 4A.
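One way to picture the multi-level label tree is as a nested mapping from top-level dimension to second-level dimension to bottom layer labels. The sketch below is an assumed representation, not the structure of FIG. 4A: the three-level depth, the `LABEL_TREE` contents (including the label "freeze") and the function name are illustrative.

```python
# Hypothetical three-level tree: dimension -> sub-dimension -> bottom layer labels.
LABEL_TREE = {
    "performance": {
        "playback speed": ["speed", "network speed", "loading"],
        "crash": ["crash", "freeze"],
    },
    "design": {
        "interface": ["interface", "page", "layout", "skin"],
    },
}

def index_bottom_labels(tree):
    """Build a lookup from each bottom layer label to its upper layer path."""
    index = {}
    for top, children in tree.items():
        for second, leaves in children.items():
            for leaf in leaves:
                index[leaf] = (top, second)
    return index

leaf_index = index_bottom_labels(LABEL_TREE)
```

Matching a subject against the bottom layer then reduces to a dictionary lookup in `leaf_index`, and the returned path says which upper layer labels the hit belongs to.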
205. Acquire the subject and/or implied subject modified by a word-of-mouth word in the batch text content.
For example, the subject and/or implied subject modified by a word-of-mouth word may be obtained from the batch text content according to preset grammar rules.
It is understood that implied subjects may be obtained for all of the word-of-mouth words or only for some of them.
For example, subjects modified by ordinary negative word-of-mouth words are analyzed as follows:
Negative word-of-mouth words appearing in negative microblog posts are extracted, the subject that each one modifies is analyzed, and that subject is extracted. For example, in the negative comment "the interface of the new version of XX video is very junk", "junk" is a negative word-of-mouth word and the subject it modifies is "interface", so the result is extracted as (interface, junk). Depending on the level of the subject, it can of course also be extracted as (XX video - new version - interface, junk).
Negative word-of-mouth words with implied subjects are analyzed as follows:
Some negative comments contain a negative word-of-mouth word but no obvious subject. For example, in "xx video looks nice but is too laggy", "laggy" is recognized as a negative word-of-mouth word, but the subject it modifies is hidden: what the user means is that the speed of xx video is too laggy. For such negative word-of-mouth words that carry real meaning, if no obvious subject can be found in the comment, the system automatically calls the corresponding subject library for matching and extracts the result as (xx video - speed, laggy).
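The two cases above (an explicit subject, and an implied subject looked up from a subject library) can be sketched as follows. The embodiment does not spell out its grammar rules, so the rule used here, taking the nearest known subject that precedes the word-of-mouth word, is an assumption, as are the names `extract_pairs`, `KNOWN_SUBJECTS` and `IMPLIED_SUBJECT`, the whitespace tokenization, and the use of "laggy" for the stuttering complaint in the second example.

```python
NEGATIVE_WORDS = {"junk", "laggy"}       # negative word-of-mouth words
KNOWN_SUBJECTS = {"interface", "speed"}  # candidate subjects (labels)
IMPLIED_SUBJECT = {"laggy": "speed"}     # subject library for hidden subjects


def extract_pairs(tokens):
    """Return (subject, word-of-mouth word) pairs found in one comment."""
    pairs = []
    for i, token in enumerate(tokens):
        if token not in NEGATIVE_WORDS:
            continue
        # Assumed rule: the nearest known subject before the word.
        subject = next(
            (t for t in reversed(tokens[:i]) if t in KNOWN_SUBJECTS), None
        )
        if subject is None:
            # No explicit subject: fall back to the implied-subject library.
            subject = IMPLIED_SUBJECT.get(token)
        if subject is not None:
            pairs.append((subject, token))
    return pairs


explicit = extract_pairs(
    "the interface of the new version of xx video is very junk".split()
)
implied = extract_pairs("xx video looks nice but is too laggy".split())
```

Real Chinese text would first need word segmentation; splitting on spaces is used here only to keep the example self-contained.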
206. Acquire the extended word-of-mouth words of the labels in the product label library.
For example, word co-occurrence rules and/or manual classification may be used to obtain the extended word-of-mouth words of a label. It can be understood that an extended word-of-mouth word is not a sentiment word in the strict sense; it carries actual meaning only when paired with a specific label. For example, in "the logistics is very fast", "fast" is an extended word-of-mouth word, and it has actual emotional meaning only in combination with "logistics".
It should be understood that the word co-occurrence rule refers to an algorithm that calculates the probability that two characters or words appear together.
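The embodiment does not give a formula for this probability. One common choice, shown here purely as an assumption, is the conditional frequency: among the comments that mention a label, the fraction that also contain the candidate word. The function name `cooccurrence_probability` and the sample comments are hypothetical.

```python
def cooccurrence_probability(comments, label, word):
    """Estimate P(word appears | label appears) over a batch of comments.

    Each comment is given as a set of already segmented words.
    """
    with_label = [c for c in comments if label in c]
    if not with_label:
        return 0.0
    both = sum(1 for c in with_label if word in c)
    return both / len(with_label)


comments = [
    {"logistics", "fast"},
    {"logistics", "fast", "packaging"},
    {"logistics", "slow"},
    {"traffic", "fast"},
]

p_fast_given_logistics = cooccurrence_probability(comments, "logistics", "fast")
```

A word whose estimate for a given label exceeds some chosen cutoff could then be proposed as an extended word-of-mouth word for that label, with manual classification used to confirm it.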
For example, subjects modified by negative extended word-of-mouth words are analyzed as follows:
Some user comments contain no obvious negative word-of-mouth word but still express negative emotion. For example, in "the speed of the xx video is slow and the traffic is used up fast", neither "slow" nor "fast" is a negative word-of-mouth word in its own right (treating them as such would cause a large number of misjudgments), but when each is paired with a specific subject (label), it expresses negative emotion. In this case a negative extended word bank and corresponding grammar rules need to be established, and the comment is analyzed and extracted as (speed, slow) and (traffic, fast).
207. Match the subject and/or implied subject with the bottom-layer labels in the multi-level label tree, and match each extended word-of-mouth word together with its corresponding label in the batch text content.
In this embodiment, extended word-of-mouth words are at the same level as ordinary word-of-mouth words, but an extended word-of-mouth word carries actual meaning only when paired with a specific label (e.g. a bottom-layer label in the multi-level label tree).
Because each extended word-of-mouth word is obtained for a particular label in the product label library, the subject it modifies is already determined: it is that label. What remains is to match and count the occurrences of "label + extended word-of-mouth word" in the batch text.
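Matching "label + extended word-of-mouth word" can be sketched as a search for the extended word within a short window after its label. The window-based rule, the window size, the table `EXTENDED_WORDS` and the function name `match_extended` are all assumptions made for illustration; the embodiment only states that grammar rules are established for this purpose.

```python
# Extended word-of-mouth words, keyed by the label that gives them meaning.
EXTENDED_WORDS = {"speed": {"slow"}, "traffic": {"fast"}}


def match_extended(tokens, window=6):
    """Return (label, extended word) hits found in one segmented comment.

    A hit is recorded when an extended word of a label occurs within
    `window` tokens after that label.
    """
    hits = []
    for i, token in enumerate(tokens):
        words = EXTENDED_WORDS.get(token)
        if not words:
            continue
        for nearby in tokens[i + 1 : i + 1 + window]:
            if nearby in words:
                hits.append((token, nearby))
    return hits


hits = match_extended(
    "the speed of the xx video is slow and the traffic is fast".split()
)
```

Because "fast" is only registered under "traffic", a comment such as "the logistics is fast" produces no hit here, which reflects the point that the word has no polarity on its own.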
208. Generate a result label tree reflecting the common problems in the batch text content according to the labels matched with the subject (e.g. bottom-layer labels) and the matching results for the extended word-of-mouth words and their corresponding labels.
For example, if the subject matches a bottom-layer label, the position of that bottom-layer label is recorded; and
when an extended word-of-mouth word and its corresponding label are matched in the batch text content, the position of that label is recorded;
the recorded results of the bottom-layer labels can then be pulled back to the positions of the upper-layer labels in the multi-level label tree to obtain a result label tree containing the matching results.
That is to say, in this embodiment a word-of-mouth word is first found in the batch text and the subject it modifies is searched for. If a subject is found, it is matched against the bottom-layer labels in the multi-level label tree, and on a successful match the count of that bottom-layer label is incremented by 1. If no subject is found, the implied subject of the word-of-mouth word can be looked up and matched against the bottom-layer labels, and on a successful match the count of that bottom-layer label is incremented by 1. Likewise, "extended word-of-mouth word + label" combinations are traversed and matched in the batch text content, and on a successful match the count of the label corresponding to the extended word-of-mouth word is incremented by 1.
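The counting flow just described can be sketched as follows. This is a minimal illustration under assumed inputs: each extraction is a (subject, word-of-mouth word) pair in which the subject is `None` when no explicit subject was found, and the names `count_matches`, `BOTTOM_LABELS` and `IMPLIED_SUBJECT` are hypothetical.

```python
from collections import Counter

BOTTOM_LABELS = {"interface", "speed", "loading"}
IMPLIED_SUBJECT = {"laggy": "speed"}  # subject library for hidden subjects


def count_matches(extractions):
    """Count successful matches per bottom-layer label.

    extractions: iterable of (subject or None, word-of-mouth word).
    """
    counts = Counter()
    for subject, word in extractions:
        if subject is None:
            # No explicit subject: try the implied subject instead.
            subject = IMPLIED_SUBJECT.get(word)
        if subject in BOTTOM_LABELS:
            counts[subject] += 1  # the "+1" on the matched label
    return counts


counts = count_matches([
    ("interface", "junk"),
    (None, "laggy"),      # implied subject resolves to "speed"
    ("speed", "slow"),
    ("price", "junk"),    # "price" is not a bottom-layer label: unmatched
])
```

Subjects that fail to match, such as "price" here, are the candidates for the optional dynamic label library step.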
Generally, matching against the multi-level label tree is counted upwards from the bottom layer of the tree, but a user views the result label tree from top to bottom, i.e. starting from the top layer, for example "xx video" → "performance" → "playback stuttering", as shown in fig. 4B. When the user then wants to see the specific text under a module, the matched results need to be "pulled" back (i.e. pulled in the reverse direction).
In this embodiment, the system can pull the records of successful matches, mark their positions in the text, and highlight or emphasize them; after this reverse pull, the result label tree containing the matching results is obtained.
In fig. 4B, the number after each label is the number of successful matches, i.e. the number of common problems under that module. It can be seen that the complaints in the negative feedback on xx video are concentrated on function (300), design (250) and performance (240), and the detailed classification result is clear at a glance.
It should be noted that a bottom-layer label is the lowest label of each branch, such as speed, network speed, networking and loading in fig. 4B. If there are no further labels under "campaign and advertisement", then "campaign and advertisement" is also a bottom-layer label.
This embodiment uses the common problems under negative word-of-mouth as its example. In other embodiments, the common problems under positive word-of-mouth can be classified automatically through the same process, which is not described in detail here.
Optionally, when the subject does not match any label, the similarity and the importance of the subject are calculated according to a semantic similarity and importance calculation rule;
and when the similarity of the subject is greater than or equal to a first preset value and/or the importance of the subject is greater than or equal to a second preset value, the subject is added to the dynamic label library as a label.
That is, after the subject modified by a negative word-of-mouth word has been analyzed, the subject is matched against the dynamic label library and the matching result is recorded. If the subject is not matched, it is selectively admitted to the dynamic label library according to the semantic similarity and importance calculation rule.
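The admission test can be sketched as a pair of threshold checks. How the similarity and importance scores are computed is not specified in the embodiment, so the sketch takes them as inputs; the function name `maybe_promote`, the default thresholds, and the `require_both` switch (which selects between the "and" and "or" readings of "and/or") are assumptions.

```python
def maybe_promote(subject, similarity, importance, dynamic_labels,
                  min_similarity=0.8, min_importance=0.5,
                  require_both=False):
    """Add an unmatched subject to the dynamic label library if it qualifies.

    similarity, importance: scores supplied by the caller.
    min_similarity, min_importance: the first and second preset values.
    Returns True if the subject was added.
    """
    passes_similarity = similarity >= min_similarity
    passes_importance = importance >= min_importance
    if require_both:
        qualifies = passes_similarity and passes_importance
    else:
        qualifies = passes_similarity or passes_importance
    if qualifies:
        dynamic_labels.add(subject)
    return qualifies


dynamic_labels = set()
added = maybe_promote("subtitles", 0.9, 0.1, dynamic_labels)
rejected = maybe_promote("mascot", 0.9, 0.1, dynamic_labels, require_both=True)
```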
The data analysis method in the above embodiment may classify and merge the results of labels of the same type according to the matching results, remove duplicates, and count them. Classification statistics are then carried out upwards, layer by layer, along the label tree until all labels have been processed, yielding the final result label tree.
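The layer-by-layer statistics with de-duplication can be sketched by keeping, for every label, the set of comment identifiers matched anywhere beneath it; the size of the set is the number shown next to the label, and the set itself is what gets pulled back when a user opens that node. The nested-dict tree, the use of comment ids, and the name `roll_up` are assumptions made for illustration, and label names are assumed to be unique within the tree.

```python
def roll_up(tree, matched_ids):
    """Aggregate matches from bottom-layer labels up to the root.

    tree: nested dict, label -> children (empty dict for a bottom label).
    matched_ids: bottom label -> set of ids of comments that matched it.
    Returns label -> set of comment ids matched at or below that label.
    """
    result = {}

    def visit(label, children):
        ids = set(matched_ids.get(label, ()))
        for child, grandchildren in children.items():
            ids |= visit(child, grandchildren)  # union removes duplicates
        result[label] = ids
        return ids

    for root, children in tree.items():
        visit(root, children)
    return result


tree = {
    "xx video": {
        "performance": {
            "playback stuttering": {"speed": {}, "loading": {}},
            "crash": {},
        },
        "design": {"interface": {}},
    }
}
matched = {"speed": {1, 2}, "loading": {2, 3}, "crash": {4}, "interface": {5}}

result = roll_up(tree, matched)
totals = {label: len(ids) for label, ids in result.items()}
```

Comment 2 matches both "speed" and "loading" but is counted once under "playback stuttering", which is the effect of removing duplicates before counting.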
Fig. 5 is a schematic structural diagram of a data analysis apparatus according to an embodiment of the present invention, and as shown in fig. 5, the data analysis apparatus in the embodiment includes: a product label library establishing unit 51, a subject acquiring unit 52, a matching unit 53 and a result label tree generating unit 54;
the product label library establishing unit 51 is configured to establish a product label library according to input batch text content;
the subject obtaining unit 52 is configured to obtain a subject modified by a public praise word according to the batch text content, where the public praise word is obtained by performing word segmentation processing on the batch text content and screening words reaching a preset frequency after the word segmentation processing through a pre-stored word bank;
the matching unit 53 is configured to match the subject acquired by the subject acquiring unit 52 with the tag in the product tag library established by the product tag library establishing unit;
the result tag tree generating unit 54 is configured to generate a result tag tree reflecting the commonality problem in the batch text content according to the tag matched with the subject in the matching unit 53.
For example, the aforementioned product label library establishing unit 51 is configured to:
establish a dynamic label library according to the input batch text content;
establish a special label library according to the product category corresponding to the batch text content;
and generate the product label library from the dynamic label library, the special label library and a preset general label library.
In a first optional application scenario, the aforementioned product label library establishing unit 51 is configured to:
acquire nouns in the batch text content, for example by performing word segmentation on the batch text content according to a custom word bank to obtain its nouns;
judge whether the frequency of occurrence of each noun is greater than a preset threshold;
if the frequency of occurrence of the noun is greater than the preset threshold, determine whether the noun duplicates a label in the special label library or a label in the general label library;
and if the noun duplicates neither a label in the special label library nor a label in the general label library, generate the dynamic label library with the noun as a label.
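The frequency filter and duplicate check performed by this unit can be sketched as below. Word segmentation and part-of-speech filtering are assumed to have already produced the list of nouns; the function name `build_dynamic_labels`, the threshold value and the sample nouns are hypothetical.

```python
from collections import Counter


def build_dynamic_labels(nouns, special_labels, general_labels, threshold=2):
    """Return nouns that are frequent enough and not already labels.

    nouns: every noun occurrence extracted from the batch text content.
    threshold: a noun qualifies only if it occurs MORE than this many times.
    """
    frequency = Counter(nouns)
    return {
        noun
        for noun, count in frequency.items()
        if count > threshold
        and noun not in special_labels
        and noun not in general_labels
    }


nouns = ["subtitles"] * 3 + ["interface"] * 5 + ["mascot"]
dynamic = build_dynamic_labels(
    nouns, special_labels={"playback"}, general_labels={"interface"}
)
```

Here "interface" is frequent but already in the general library and "mascot" is too rare, so only "subtitles" becomes a dynamic label.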
In a second optional application scenario, the aforementioned product label library establishing unit 51 can also be configured to:
obtain the custom labels of the product according to the product category corresponding to the batch text content;
search for synonyms and near-synonyms of the custom labels;
and build the custom labels, together with their synonyms and near-synonyms, into the special label library for the batch text content.
In a third optional application scenario, the subject obtaining unit 52 is configured to obtain the subject and/or implied subject modified by a word-of-mouth word in the batch text content;
the matching unit 53 is configured to match the subject and/or implied subject acquired by the subject acquisition unit with the labels in the product label library established by the product label library establishing unit.
In a fourth optional application scenario, the apparatus may further include an extended word-of-mouth acquiring unit 55 shown in fig. 6A:
the extended public praise word acquiring unit 55 is configured to acquire an extended public praise word of a tag in the product tag library established by the product tag library establishing unit 51;
the matching unit 53 is further configured to match the extended public praise words acquired by the extended public praise word acquiring unit 55 with the tags corresponding to the extended public praise words in the batch of text contents;
the result tag tree generation unit 54 is configured to generate a result label tree reflecting the common problems in the batch text content according to the labels matched with the subject and the matching results for the extended public praise words and their corresponding labels.
In a fifth optional application scenario, the apparatus may further include a multi-level label tree establishing unit 56, as shown in fig. 6B:
the multi-level label tree establishing unit 56 is configured to establish a multi-level label tree according to the membership relationship between the labels in the product label library established by the product label library establishing unit 51;
the matching unit 53 is configured to match the subject acquired by the subject acquisition unit with the bottom-layer labels in the multi-level label tree established by the multi-level label tree establishing unit.
In a sixth optional application scenario, the result label tree generation unit 54 is configured to:
record the position of the bottom-layer label if the subject matches that bottom-layer label;
and pull the recorded results of the bottom-layer labels back to the positions of the upper-layer labels in the multi-level label tree to obtain a result label tree reflecting the common problems in the batch text.
That is, according to the matching result of the matching unit 53, when the subject is correctly matched with a label, the result tag tree generating unit 54 records the matching result at the position of the corresponding bottom-layer label in the multi-level label tree,
and pulls the successful matching results of the bottom-layer labels back to the positions of the upper-layer labels in the multi-level label tree to obtain a result label tree reflecting the common problems in the batch text.
In a seventh optional application scenario, the apparatus further includes a subject similarity obtaining unit 57 and a subject processing unit 58, which are not shown in the figure:
the subject similarity obtaining unit 57 is configured to obtain, according to the matching result of the matching unit 53, the similarity and the importance of the subject according to the semantic similarity and importance calculation rule when the subject and the tag are not matched;
a subject processing unit 58, configured to add the subject as a tag into a dynamic tag library when the similarity of the subject obtained by the subject similarity obtaining unit 57 is greater than or equal to a first preset value, and/or the importance of the subject is greater than or equal to a second preset value.
The data analysis device may execute the technical solution of any one of the method embodiments shown in fig. 1 to fig. 3, and the implementation principle and the technical effect are similar, which are not described herein again.
The data analysis device in the above embodiment offers the following advantages. Intelligence: the emotional polarity of the data is judged automatically from the data characteristics, and the dimensions on which positive and negative evaluations concentrate are classified automatically. High efficiency: after a one-time configuration, the whole process runs automatically, which greatly reduces labor. Systematicness: the problems of differing subjective standards and incomplete frameworks among the people who classify the data are avoided. Timeliness: the latest developments of the product are fed back promptly, and real-time display of results is supported.
Those of ordinary skill in the art will understand that: all or a portion of the steps of implementing the above-described method embodiments may be performed by hardware associated with program instructions. The program may be stored in a computer-readable storage medium. When executed, the program performs steps comprising the method embodiments described above; and the aforementioned storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (15)

1. A method of data analysis, comprising:
establishing a dynamic label library according to the input text content;
establishing a special label library according to the product category corresponding to the text content;
generating the product label library by the dynamic label library, the special label library and a preset general label library;
establishing a multi-level label tree according to the membership relationship among the labels in the product label library;
obtaining a subject word modified by a public praise word according to the text content, wherein the public praise word is obtained by performing word segmentation processing on the text content and screening words reaching preset frequency after the word segmentation processing through a pre-stored word bank;
matching the subject with a bottom layer label in the multi-level label tree;
acquiring an expanded public praise word of a label in the product label library;
matching the expanded public praise words with the labels corresponding to the expanded public praise words in the text content;
and generating a result label tree reflecting the common problem of the text content according to the label matched with the subject and the matching result for the expanded public praise word and its corresponding label.
2. The method of claim 1, wherein the building a dynamic tag library according to the input text content comprises:
acquiring nouns in the text content;
judging whether the frequency of occurrence of the nouns is greater than a preset threshold value or not;
if the frequency of occurrence of the noun is larger than a preset threshold, determining whether the noun is repeated with the tags in the special tag library and the tags in the general tag library;
and if the noun is not repeated with the tags in the special tag library and the tags in the general tag library, generating the dynamic tag library by taking the noun as the tags.
3. The method according to claim 1, wherein establishing a special label library according to the product category corresponding to the text content comprises:
obtaining a custom label to which the product belongs according to the product category corresponding to the text content;
searching for synonyms and near-synonyms of the custom label;
and building the custom label, the synonyms of the custom label and the near-synonyms of the custom label into the special label library of the text content.
4. The method of claim 2, wherein the obtaining the noun in the text content comprises:
and performing word segmentation processing on the text content according to a user-defined word bank to obtain a noun of the text content.
5. The method according to any one of claims 1 to 4, wherein obtaining a subject of word-of-mouth modification from the text content comprises:
obtaining a subject and/or an implied subject modified by a public praise word in the text content;
matching the subject with the tags in the product tag library, including:
and matching the subject and/or the hidden subject with the labels in the product label library respectively.
6. The method of claim 1, wherein generating a result tag tree reflecting a commonality problem in the textual content based on tags matching the subject comprises:
if the subject is matched with the bottom layer label, recording the position of the bottom layer label;
and reversely pulling the recording result of the bottom layer label at the position of the upper layer label in the multi-level label tree to obtain a result label tree reflecting the common problem in the text.
7. The method of claim 1, further comprising:
if the subject is not matched with the label, acquiring the similarity and the importance of the subject according to a semantic similarity and importance calculation rule;
and if the similarity of the subject is greater than or equal to a first preset value and/or the importance of the subject is greater than or equal to a second preset value, adding the subject serving as a label into the dynamic label library.
8. A data analysis apparatus, comprising:
the product label library establishing unit is used for establishing a dynamic label library according to the input text content; establishing a special label library according to the product category corresponding to the text content; generating the product label library by the dynamic label library, the special label library and a preset general label library;
the multi-level label tree establishing unit is used for establishing a multi-level label tree according to the membership relationship among all labels in the product label library established by the product label library establishing unit;
the subject acquiring unit is used for acquiring a subject modified by public praise words according to the text content, wherein the public praise words are obtained by performing word segmentation processing on the text content and screening words reaching preset frequency after the word segmentation processing through a pre-stored word bank;
the matching unit is used for matching the subject with a bottom layer label in the multi-level label tree;
an extended public praise word acquiring unit, configured to acquire an extended public praise word of a tag in the product tag library established by the product tag library establishing unit;
the matching unit is further configured to match the extended public praise words acquired by the extended public praise word acquisition unit with the tags corresponding to the extended public praise words in the text content;
and the result label tree generating unit is used for generating a result label tree reflecting the common problem in the text content according to the label matched with the subject and the matching result for the expanded public praise word and its corresponding label.
9. The apparatus of claim 8, wherein the product tag library creating unit is configured to:
acquire nouns in the text content;
judge whether the frequency of occurrence of the nouns is greater than a preset threshold value;
if the frequency of occurrence of the noun is greater than the preset threshold, determine whether the noun duplicates a tag in the special tag library or a tag in the general tag library;
and if the noun duplicates neither a tag in the special tag library nor a tag in the general tag library, generate the dynamic tag library by taking the noun as a tag.
10. The apparatus of claim 8, wherein the product tag library creating unit is configured to:
obtain a custom label to which the product belongs according to the product category corresponding to the text content;
search for synonyms and near-synonyms of the custom label;
and build the custom label, the synonyms of the custom label and the near-synonyms of the custom label into the special label library of the text content.
11. The apparatus of claim 9, wherein the product tag library creating unit is configured to
perform word segmentation processing on the text content according to a user-defined word bank to obtain nouns of the text content.
12. The apparatus according to any one of claims 8 to 11, wherein the subject obtaining unit is configured to
obtain a subject and/or an implied subject modified by a public praise word in the text content;
and the matching unit is configured to
match the subject and/or the implied subject acquired by the subject acquisition unit with the tags in the product tag library established by the product tag library establishing unit respectively.
13. The apparatus of claim 8, wherein the result tag tree generation unit is configured to:
record the position of the bottom layer label if the subject is matched with the bottom layer label;
and reversely pull the recording result of the bottom layer label to the position of the upper layer label in the multi-level label tree to obtain a result label tree reflecting the common problem in the text.
14. The apparatus of claim 8, further comprising:
a subject similarity obtaining unit, configured to obtain, according to the matching result of the matching unit, similarity and importance of the subject according to a semantic similarity and importance calculation rule when the subject and the tag are not matched;
and the subject processing unit is used for adding the subject as a label into the dynamic label library when the similarity of the subject acquired by the subject similarity acquisition unit is greater than or equal to a first preset value and/or the importance of the subject is greater than or equal to a second preset value.
15. A storage medium having stored thereon computer-executable instructions which, when loaded and executed by a processor, carry out a method of data analysis according to any one of claims 1 to 7.
CN201410204300.1A 2014-05-14 2014-05-14 Data analysis method and data analysis device Active CN105095288B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410204300.1A CN105095288B (en) 2014-05-14 2014-05-14 Data analysis method and data analysis device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410204300.1A CN105095288B (en) 2014-05-14 2014-05-14 Data analysis method and data analysis device

Publications (2)

Publication Number Publication Date
CN105095288A CN105095288A (en) 2015-11-25
CN105095288B true CN105095288B (en) 2020-02-07

Family

ID=54575741

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410204300.1A Active CN105095288B (en) 2014-05-14 2014-05-14 Data analysis method and data analysis device

Country Status (1)

Country Link
CN (1) CN105095288B (en)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106156041B (en) * 2015-03-26 2019-05-28 科大讯飞股份有限公司 Hot information finds method and system
CN105824898A (en) * 2016-03-14 2016-08-03 苏州大学 Label extracting method and device for network comments
CN106021433B (en) * 2016-05-16 2019-05-10 北京百分点信息科技有限公司 A kind of the public praise analysis method and device of comment on commodity data
CN106250420A (en) * 2016-07-21 2016-12-21 深圳市辣妈帮科技有限公司 Label correlating method and device
CN107918778B (en) * 2016-10-11 2022-03-15 阿里巴巴集团控股有限公司 Information matching method and related device
CN107153641B (en) * 2017-05-08 2021-01-12 北京百度网讯科技有限公司 Comment information determination method, comment information determination device, server and storage medium
CN108510285A (en) * 2017-05-17 2018-09-07 苏州纯青智能科技有限公司 A kind of evaluation method based on trade order
CN107391480A (en) * 2017-06-23 2017-11-24 广州市万隆证券咨询顾问有限公司 A kind of stock invester's personality characters analysis method and system based on stock invester's market sentiment
CN107861944A (en) * 2017-10-24 2018-03-30 广东亿迅科技有限公司 A kind of text label extracting method and device based on Word2Vec
CN107918667B (en) * 2017-11-28 2020-09-04 杭州有赞科技有限公司 Method, system and device for generating text label words
CN108009715A (en) * 2017-11-28 2018-05-08 邢加和 It is a kind of automatically analyze index fluctuation root because method
CN108153856B (en) * 2017-12-22 2022-09-06 北京百度网讯科技有限公司 Method and apparatus for outputting information
CN109325148A (en) * 2018-08-03 2019-02-12 百度在线网络技术(北京)有限公司 The method and apparatus for generating information
CN109145301B (en) * 2018-08-29 2023-01-24 上海汽车集团股份有限公司 Information classification method and device and computer readable storage medium
CN113505192A (en) * 2021-05-25 2021-10-15 平安银行股份有限公司 Data tag library construction method and device, electronic equipment and computer storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101216842A (en) * 2008-01-07 2008-07-09 华为技术有限公司 Method for obtaining page key words and page information processing apparatus
CN101968788A (en) * 2009-07-27 2011-02-09 富士通株式会社 Method and device for extracting product attribute information
CN102982076A (en) * 2012-10-30 2013-03-20 新华通讯社 Multi-dimensionality content labeling method based on semanteme label database
CN103399916A (en) * 2013-07-31 2013-11-20 清华大学 Internet comment and opinion mining method and system on basis of product features
CN103679462A (en) * 2012-08-31 2014-03-26 阿里巴巴集团控股有限公司 Comment data processing method and device and searching method and system
CN103761264A (en) * 2013-12-31 2014-04-30 浙江大学 Concept hierarchy establishing method based on product review document set

Also Published As

Publication number Publication date
CN105095288A (en) 2015-11-25

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20231227

Address after: 518000 Tencent Building, No. 1 High-tech Zone, Nanshan District, Shenzhen City, Guangdong Province, 35 Floors

Patentee after: TENCENT TECHNOLOGY (SHENZHEN) Co.,Ltd.

Patentee after: TENCENT CLOUD COMPUTING (BEIJING) Co.,Ltd.

Address before: 2, 518044, East 403 room, SEG science and Technology Park, Zhenxing Road, Shenzhen, Guangdong, Futian District

Patentee before: TENCENT TECHNOLOGY (SHENZHEN) Co.,Ltd.