CN108364199B - Data analysis method and system based on Internet user comments - Google Patents

Info

Publication number
CN108364199B
Authority
CN
China
Prior art keywords
information
comment
category
word
emotion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810167403.3A
Other languages
Chinese (zh)
Other versions
CN108364199A (en
Inventor
周通
张绪玲
于潇潇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sohu New Media Information Technology Co Ltd
Original Assignee
Beijing Sohu New Media Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sohu New Media Information Technology Co Ltd filed Critical Beijing Sohu New Media Information Technology Co Ltd
Priority to CN201810167403.3A priority Critical patent/CN108364199B/en
Publication of CN108364199A publication Critical patent/CN108364199A/en
Application granted granted Critical
Publication of CN108364199B publication Critical patent/CN108364199B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/02 Marketing; Price estimation or determination; Fundraising
    • G06Q30/0201 Market modelling; Market analysis; Collecting market data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Abstract

The application discloses a data analysis method based on Internet user comments. The method comprises: obtaining comment information of users on the Internet; segmenting the comment information into words to obtain comment word information of the comment information; comparing preset classification keywords with the comment word information, and assigning the comment word information that matches a classification keyword to the category of that keyword; comparing preset emotion marking words with the classified comment word information to generate emotion information of the classified comment word information, wherein the emotion information comprises positive emotion information and/or negative emotion information; and generating word-of-mouth information of the category based on the emotion information and the comment word information. Because the corpora of each website are crawled automatically, the labor cost is relatively low and data collection is easy; because users' own behavior truly reflects their preferences, authenticity can be ensured; and because big data covering the whole user population is analyzed, rather than the whole being inferred from a sample, the analysis accuracy is high.

Description

Data analysis method and system based on Internet user comments
Technical Field
The application relates to the technical field of data analysis, in particular to a data analysis method and system based on internet user comments.
Background
With the development of the Internet and the mobile Internet in recent years, marketing has gradually shifted from being product-centered to being user-centered. By obtaining user comments about brands, on the one hand, the standing, strengths, and weaknesses of a brand in users' minds can be determined and targeted marketing interaction can be carried out; on the other hand, new expectations of potential customers can be discovered, and social data can be used to meet consumer needs.
Most existing marketing word-of-mouth analysis relies on traditional survey methods, usually questionnaires. Questionnaire surveys are limited by questionnaire design, sample size, survey method, questionnaire collection, and time and labor costs. For example, a questionnaire is difficult to design scientifically and reasonably; the survey results are broad but shallow; because users fill in the answers themselves, the quality of the results cannot be well guaranteed; and most of the collected data are small data, so the whole is estimated from a sample and the accuracy of the result is low. Enterprises using this approach therefore face difficult questionnaire design, survey results of uncertain authenticity, and low accuracy when inferring the whole from a sample.
Therefore, how to ensure the authenticity of marketing word-of-mouth analysis and improve the accuracy of the analysis results has become a problem that those skilled in the art urgently need to solve.
Disclosure of Invention
In view of the above, the application provides a data analysis method based on Internet user comments. Compared with the prior art, the method performs word-of-mouth analysis on users' Internet comments. Because the corpora of each website are crawled automatically, the labor cost is relatively low and data collection is easy; because users' own behavior truly reflects their preferences, authenticity can be ensured; and because big data covering the whole user population is analyzed, rather than the whole being inferred from a sample, the analysis accuracy is high.
The application provides a data analysis method based on internet user comments, which comprises the following steps:
obtaining comment information of a user on the Internet;
segmenting the comment information into words to obtain comment word information of the comment information;
comparing preset classification keywords with the comment word information, and classifying the comment word information which accords with the classification keywords into the category of the classification keywords;
comparing a preset emotion marking word with the classified comment word information to generate emotion information of the classified comment word information, wherein the emotion information comprises positive emotion information and/or negative emotion information;
generating word-of-mouth information for the category based on the sentiment information and the comment word information.
Preferably, before segmenting the comment information into words to obtain the comment word information, the method further includes:
cleaning the comment information to remove impurities from the comment information.
Preferably, the impurities include any one or more of: comment information whose text is displayed as blank, comment information whose corpus length exceeds a preset threshold, and comment information that is not user-generated content.
Preferably, the classification keywords include any one or more of industry keywords, product-category keywords, and brand keywords; the categories include any one or more of industry categories, product categories, and brand categories; the word-of-mouth information includes category popularity information, category image information, and a category positive index; and generating the word-of-mouth information of the category based on the emotion information and the comment word information includes:
generating popularity information of the category corresponding to the comment word information based on the number of pieces of comment word information;
and generating the category positive index and/or the category image information of the category based on the emotional direction and the quantity of the emotion information.
Preferably, the classification keywords further include focus keywords; the word-of-mouth information further includes industry focus information, product-category focus information, or brand focus information; and generating the word-of-mouth information of the category based on the emotion information and the comment word information further includes:
generating category focus information corresponding to a focus keyword based on the number of pieces of comment word information corresponding to that focus keyword.
The application further provides a data analysis system based on Internet user comments, which includes a comment acquisition module, a word segmentation module, a classification module, an emotion marking module, and a report making module, wherein:
the comment acquisition module is used for acquiring comment information of a user on the Internet;
the word segmentation module is used for segmenting the comment information into words to obtain comment word information of the comment information;
the classification module is used for comparing preset classification keywords with the comment word information and classifying the comment word information which accords with the classification keywords into the category of the classification keywords;
the emotion marking module is used for comparing a preset emotion marking word with the classified comment word information to generate emotion information of the classified comment word information, and the emotion information comprises positive emotion information and/or negative emotion information;
the report making module is used for generating public praise information of the category based on the emotion information and the comment word information.
Preferably, the system further includes an information cleaning module, wherein:
the information cleaning module is used for cleaning the comment information and removing impurities from the comment information.
Preferably, the impurities include any one or more of: comment information whose text is displayed as blank, comment information whose corpus length exceeds a preset threshold, and comment information that is not user-generated content.
Preferably, the classification keywords include any one or more of industry keywords, product-category keywords, and brand keywords; the categories include any one or more of industry categories, product categories, and brand categories; and the word-of-mouth information includes category popularity information, category image information, and a category positive index; the report making module comprises a popularity unit and an image unit, wherein:
the popularity unit is used for generating popularity information of the category corresponding to the comment word information based on the number of pieces of comment word information;
the image unit is used for generating the category positive index and/or the category image information of the category based on the emotional direction and the quantity of the emotion information.
Preferably, the classification keywords further include focus keywords, and the word-of-mouth information further includes industry focus information, product-category focus information, or brand focus information; the report making module further comprises a focus unit, wherein:
the focus unit is configured to generate category focus information corresponding to a focus keyword based on the number of pieces of comment word information corresponding to that focus keyword.
In summary, the application discloses a data analysis method based on Internet user comments, which includes: obtaining comment information of users on the Internet; segmenting the comment information into words to obtain comment word information of the comment information; comparing preset classification keywords with the comment word information, and assigning the comment word information that matches a classification keyword to the category of that keyword; comparing preset emotion marking words with the classified comment word information to generate emotion information of the classified comment word information, wherein the emotion information includes positive emotion information and/or negative emotion information; and generating word-of-mouth information of the category based on the emotion information and the comment word information. Compared with the prior art, the method performs word-of-mouth analysis on users' Internet comments. Because the corpora of each website are crawled automatically, the labor cost is relatively low and data collection is easy; because users' own behavior truly reflects their preferences, authenticity can be ensured; and because big data covering the whole user population is analyzed, rather than the whole being inferred from a sample, the analysis accuracy is high.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a flowchart of embodiment 1 of the data analysis method based on Internet user comments disclosed herein;
FIG. 2 is a flowchart of embodiment 2 of the data analysis method based on Internet user comments disclosed herein;
FIG. 3 is a flowchart of embodiment 3 of the data analysis method based on Internet user comments disclosed herein;
FIG. 4 is a schematic structural diagram of embodiment 1 of the data analysis system based on Internet user comments disclosed herein;
FIG. 5 is a schematic structural diagram of embodiment 2 of the data analysis system based on Internet user comments disclosed herein;
FIG. 6 is a schematic structural diagram of embodiment 3 of the data analysis system based on Internet user comments disclosed herein.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
As shown in fig. 1, a flowchart of an embodiment 1 of the data analysis method based on internet user comments disclosed in the present application includes:
S101, obtaining comment information of a user on the Internet;
The user comment information is acquired automatically by a crawler, which can be either a hand-written crawler or a mature crawler framework such as Scrapy. The sources of the user comment information include:
comprehensive communities with vertical channels, characterized in that the website covers a wide range of topics and vertical industries are clearly separated by sub-forums, channels, and the like;
vertical communities, characterized in that the website concentrates on discussion of a certain product category (or even a certain brand), such as Autohome (automobile industry) and the Zhongguancun mobile phone forum (mobile phones);
professional review websites, which only collect comment data for a certain product category and do not directly sell goods of that category, such as the Zhongguancun mobile phone reviews;
e-commerce comment areas, that is, e-commerce websites that provide product comment areas, such as JD.com (Jingdong) and Yihaodian.
S102, segmenting the comment information to obtain comment word information of the comment information;
The obtained comment information is segmented as follows: the corpus is split into sentences at punctuation marks, each sentence is segmented into words, and the frequency of every word that appears is counted. It is suggested that words kept after segmentation be limited to between 2 and 5 Chinese characters in length.
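The segmentation step can be sketched as follows. This is only an illustrative sketch, not the segmenter used by the application: the punctuation set, the function names, and the whitespace tokenizer are assumptions, and a real system would pass in a proper Chinese word segmenter as the `tokenizer` argument.

```python
import re
from collections import Counter

# Assumed set of sentence and clause punctuation (Chinese and ASCII).
_SENTENCE_BREAK = re.compile(r"[。！？；，、,.!?;\n]+")


def split_sentences(text):
    """Split one comment into sentences at punctuation marks."""
    return [s.strip() for s in _SENTENCE_BREAK.split(text) if s.strip()]


def whitespace_tokenizer(sentence):
    """Placeholder tokenizer (assumption): splits on whitespace only.

    A real deployment would plug in a Chinese word segmenter here.
    """
    return sentence.split()


def count_words(comments, tokenizer=whitespace_tokenizer, min_len=2, max_len=5):
    """Count word frequencies, keeping words of min_len..max_len characters."""
    counts = Counter()
    for comment in comments:
        for sentence in split_sentences(comment):
            for word in tokenizer(sentence):
                if min_len <= len(word) <= max_len:
                    counts[word] += 1
    return counts


counts = count_words(["油耗 很低，空间 大。", "油耗 高！外观 好看"])
```

The 2 to 5 character limit from the text is exposed as `min_len` and `max_len` so it can be tuned per corpus.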
S103, comparing preset classification keywords with comment word information, and classifying the comment word information which accords with the classification keywords into the category of the classification keywords;
A basic classification system needs to be established. Classification can be performed by industry, product category, brand, and so on, with different keywords for different categories. When the basic system is established, overlap among keywords needs to be considered; in general, it is not recommended that the same keyword belong to multiple categories.
The word segmentation results, that is, the comment word information, are sorted in descending order of occurrence count or frequency; the number of keywords to be classified is determined; stop words are removed; the meaningful words are classified; and a classification threshold is determined. For example, if ten thousand words are obtained after segmenting a certain piece of comment information, the comment information is assigned to the category of a keyword only if that keyword appears in the comment more than 10 times.
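A minimal sketch of this keyword-based classification, under stated assumptions, is given here. The keyword table, the stop-word set, and the function name are hypothetical; only the "appears more than 10 times" threshold comes from the example in the text.

```python
from collections import Counter

# Hypothetical keyword table: each keyword maps to exactly one category,
# following the advice that a keyword should not belong to several categories.
KEYWORD_TO_CATEGORY = {
    "car": "automobile industry",
    "sedan": "automobile industry",
    "phone": "mobile phone industry",
}


def classify_comment(words, keyword_to_category, stop_words=frozenset(), threshold=10):
    """Return the categories whose keywords occur more than `threshold` times."""
    counts = Counter(w for w in words if w not in stop_words)
    categories = {
        keyword_to_category[word]
        for word, n in counts.items()
        if word in keyword_to_category and n > threshold
    }
    return sorted(categories)


words = ["car"] * 11 + ["phone"] * 10 + ["the"] * 50
result = classify_comment(words, KEYWORD_TO_CATEGORY, stop_words={"the"})
```

In this example only "car" exceeds the threshold, so the comment is assigned to the automobile industry category and not to the mobile phone category.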
S104, comparing the preset emotion label words with the classified comment word information to generate emotion information of the classified comment word information, wherein the emotion information comprises positive emotion information and/or negative emotion information;
The specific method for generating the emotion information is as follows: part of the corpus is manually labeled to form positive and negative emotion training sets, that is, the emotion marking words; a model is trained on the training sets using a classification algorithm; and the remaining data are processed with the trained model.
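The text does not name a particular classification algorithm. Purely as an illustration, the sketch here uses a small multinomial naive Bayes classifier with add-one smoothing; the class name, the label strings, and the toy training data are assumptions and are not taken from the application.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesSentiment:
    """Multinomial naive Bayes with add-one smoothing over word tokens."""

    def fit(self, samples):
        # samples: iterable of (list_of_words, label) pairs from manual labeling.
        self.label_counts = Counter()
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for words, label in samples:
            self.label_counts[label] += 1
            self.word_counts[label].update(words)
            self.vocab.update(words)
        return self

    def predict(self, words):
        total_docs = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.label_counts.items():
            total_words = sum(self.word_counts[label].values())
            score = math.log(doc_count / total_docs)  # log prior
            for word in words:
                if word not in self.vocab:
                    continue  # ignore words never seen during training
                score += math.log(
                    (self.word_counts[label][word] + 1)
                    / (total_words + len(self.vocab))
                )
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy manually labeled training set (hypothetical data).
training_set = [
    (["great", "comfortable"], "positive"),
    (["love", "great"], "positive"),
    (["noisy", "expensive"], "negative"),
    (["bad", "noisy"], "negative"),
]
model = NaiveBayesSentiment().fit(training_set)
label = model.predict(["great", "seats"])
```

Any other supervised text classifier could be substituted; the point is only that a small labeled set trains a model that then labels the remaining corpus.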
S105, generating word-of-mouth information of the category based on the emotional information and the comment word information;
According to the emotion information and comment word information of the comment information, word-of-mouth information of the category corresponding to the comment information can be generated, for example: social media popularity, brand/product-category positive index, brand/product-category focus analysis, brand image, and the like.
In summary, the application discloses a data analysis method based on Internet user comments, which includes: obtaining comment information of users on the Internet; segmenting the comment information into words to obtain comment word information of the comment information; comparing preset classification keywords with the comment word information, and assigning the comment word information that matches a classification keyword to the category of that keyword; comparing preset emotion marking words with the classified comment word information to generate emotion information of the classified comment word information, wherein the emotion information includes positive emotion information and/or negative emotion information; and generating word-of-mouth information of the category based on the emotion information and the comment word information. Compared with the prior art, the method performs word-of-mouth analysis on users' Internet comments. Because the corpora of each website are crawled automatically, the labor cost is relatively low and data collection is easy; because users' own behavior truly reflects their preferences, authenticity can be ensured; and because big data covering the whole user population is analyzed, rather than the whole being inferred from a sample, the analysis accuracy is high.
As shown in fig. 2, a flowchart of an embodiment 2 of the data analysis method based on internet user comments disclosed in the present application includes:
S201, obtaining comment information of a user on the Internet;
The user comment information is acquired automatically by a crawler, which can be either a hand-written crawler or a mature crawler framework such as Scrapy. The sources of the user comment information include:
comprehensive communities with vertical channels, characterized in that the website covers a wide range of topics and vertical industries are clearly separated by sub-forums, channels, and the like;
vertical communities, characterized in that the website concentrates on discussion of a certain product category (or even a certain brand), such as Autohome (automobile industry) and the Zhongguancun mobile phone forum (mobile phones);
professional review websites, which only collect comment data for a certain product category and do not directly sell goods of that category, such as the Zhongguancun mobile phone reviews;
e-commerce comment areas, that is, e-commerce websites that provide product comment areas, such as JD.com (Jingdong) and Yihaodian.
S202, cleaning the comment information, and removing impurities in the comment information;
The impurities comprise any one or more of the following: comment information whose text is blank (blank values are removed; some corpus items are picture-only comments whose text content is displayed as blank, and these can be removed); comment information whose corpus length falls outside the preset thresholds (items that are too short or too long are removed; an item shorter than 5 generally contains no useful information, and an item longer than 200 is suspect and affects subsequent analysis, so both can be removed); and comment information that is not user-generated content (non-UGC content such as sign-in posts and activity posts is removed).
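The three cleaning rules can be sketched as a single filter. The thresholds 5 and 200 come from the text; the function name and the `is_non_ugc` predicate are hypothetical, and the text does not specify how sign-in or activity posts are recognized, so the check in the usage example is only a stand-in.

```python
def clean_comments(comments, min_len=5, max_len=200, is_non_ugc=lambda text: False):
    """Drop blank, too-short, too-long, and non-UGC comments."""
    cleaned = []
    for text in comments:
        stripped = (text or "").strip()
        if not stripped:
            continue  # blank text, e.g. a picture-only comment
        if len(stripped) < min_len or len(stripped) > max_len:
            continue  # outside the preset length thresholds
        if is_non_ugc(stripped):
            continue  # e.g. sign-in posts or activity posts
        cleaned.append(stripped)
    return cleaned


raw = [
    "",
    "   ",
    "ok",
    "The fuel consumption is low",
    "x" * 201,
    "daily check-in post here",
]
kept = clean_comments(raw, is_non_ugc=lambda text: "check-in" in text)
```

Passing the non-UGC test as a callable keeps the site-specific detection logic separate from the generic length and blank checks.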
S203, segmenting the comment information to obtain comment word information of the comment information;
The obtained comment information is segmented as follows: the corpus is split into sentences at punctuation marks, each sentence is segmented into words, and the frequency of every word that appears is counted. It is suggested that words kept after segmentation be limited to between 2 and 5 Chinese characters in length.
S204, comparing preset classification keywords with comment word information, and classifying the comment word information which accords with the classification keywords into the category of the classification keywords;
A basic classification system needs to be established. Classification can be performed by industry, product category, brand, and so on, with different keywords for different categories. When the basic system is established, overlap among keywords needs to be considered; in general, it is not recommended that the same keyword belong to multiple categories.
The word segmentation results, that is, the comment word information, are sorted in descending order of occurrence count or frequency; the number of keywords to be classified is determined; stop words are removed; the meaningful words are classified; and a classification threshold is determined. For example, if ten thousand words are obtained after segmenting a certain piece of comment information, the comment information is assigned to the category of a keyword only if that keyword appears in the comment more than 10 times.
S205, comparing the preset emotion label words with the classified comment word information to generate emotion information of the classified comment word information, wherein the emotion information comprises positive emotion information and/or negative emotion information;
The specific method for generating the emotion information is as follows: part of the corpus is manually labeled to form positive and negative emotion training sets, that is, the emotion marking words; a model is trained on the training sets using a classification algorithm; and the remaining data are processed with the trained model.
S206, generating word-of-mouth information of the category based on the emotion information and the comment word information;
According to the emotion information and comment word information of the comment information, word-of-mouth information of the category corresponding to the comment information can be generated, for example: social media popularity, brand/product-category positive index, brand/product-category focus analysis, brand image, and the like.
As shown in fig. 3, a flowchart of embodiment 3 of the data analysis method based on internet user comments disclosed in the present application includes:
S301, obtaining comment information of a user on the Internet;
The user comment information is acquired automatically by a crawler, which can be either a hand-written crawler or a mature crawler framework such as Scrapy. The sources of the user comment information include:
comprehensive communities with vertical channels, characterized in that the website covers a wide range of topics and vertical industries are clearly separated by sub-forums, channels, and the like;
vertical communities, characterized in that the website concentrates on discussion of a certain product category (or even a certain brand), such as Autohome (automobile industry) and the Zhongguancun mobile phone forum (mobile phones);
professional review websites, which only collect comment data for a certain product category and do not directly sell goods of that category, such as the Zhongguancun mobile phone reviews;
e-commerce comment areas, that is, e-commerce websites that provide product comment areas, such as JD.com (Jingdong) and Yihaodian.
S302, segmenting the comment information to obtain comment word information of the comment information;
The obtained comment information is segmented as follows: the corpus is split into sentences at punctuation marks, each sentence is segmented into words, and the frequency of every word that appears is counted. It is suggested that words kept after segmentation be limited to between 2 and 5 Chinese characters in length.
S303, comparing preset classification keywords with comment word information, and classifying the comment word information which accords with the classification keywords into the category of the classification keywords;
A basic classification system needs to be established. Classification can be performed by industry, product category, brand, and so on, with different keywords for different categories. When the basic system is established, overlap among keywords needs to be considered; in general, it is not recommended that the same keyword belong to multiple categories.
The word segmentation results, that is, the comment word information, are sorted in descending order of occurrence count or frequency; the number of keywords to be classified is determined; stop words are removed; the meaningful words are classified; and a classification threshold is determined. For example, if ten thousand words are obtained after segmenting a certain piece of comment information, the comment information is assigned to the category of a keyword only if that keyword appears in the comment more than 10 times.
S304, comparing the preset emotion annotation words with the classified comment word information to generate emotion information of the classified comment word information, wherein the emotion information comprises positive emotion information and/or negative emotion information;
The specific method for generating the emotion information is as follows: part of the corpus is manually labeled to form positive and negative emotion training sets, that is, the emotion marking words; a model is trained on the training sets using a classification algorithm; and the remaining data are processed with the trained model.
S305, generating corresponding category popularity information based on the number of pieces of comment word information;
In this embodiment, the classification keywords include any one or more of industry keywords, product-category keywords, and brand keywords; the categories include any one or more of industry categories, product categories, and brand categories; and the word-of-mouth information includes category popularity information, category image information, and a category positive index.
Since the comment information and the comment word information have already been classified by keyword, the popularity of a category can be determined from the number of pieces of comment word information in that category that match its keywords.
S306, generating the category positive index and/or the category image information of the category based on the emotional direction and the quantity of the emotion information;
Because the comment information and the comment word information have been classified by keyword, and the emotion information of the comment word information has been determined from the emotion marking words, the positive index or the image information of a category can be calculated from the emotional direction and the quantity of the emotion information. For example, if each piece of positive comment word information is scored +1 and each piece of negative comment word information is scored -1, the image information and positive index of each category can be obtained from the final total score.
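The popularity count and the +1/-1 scoring can be sketched together. The record layout, the function name, and the field names `popularity` and `net_score` are assumptions made for illustration; the text only states that counts give popularity and that the summed score gives the positive index and image information.

```python
from collections import defaultdict

# +1 for each positive item and -1 for each negative item, as in the text.
SENTIMENT_SCORE = {"positive": 1, "negative": -1}


def summarize_categories(records):
    """Aggregate (category, sentiment) pairs into per-category totals.

    records: one pair per piece of classified comment word information.
    """
    summary = defaultdict(lambda: {"popularity": 0, "net_score": 0})
    for category, sentiment in records:
        entry = summary[category]
        entry["popularity"] += 1  # popularity: number of items in the category
        entry["net_score"] += SENTIMENT_SCORE.get(sentiment, 0)
    return dict(summary)


records = [
    ("brand A", "positive"),
    ("brand A", "positive"),
    ("brand A", "negative"),
    ("brand B", "negative"),
]
report = summarize_categories(records)
```

Here brand A has popularity 3 and a net score of +1, while brand B has popularity 1 and a net score of -1.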
To further optimize the present embodiment, the present embodiment further includes:
S307, generating category focus point information corresponding to the focus point keywords based on the number of pieces of comment word information corresponding to the focus point keywords;
the classification keywords further comprise focus keywords, and the public praise information further comprises industry focus information, category focus information or brand focus information. For example, if the industry keyword is an automobile, the focus keyword may include fuel consumption, configuration, appearance, space, cost performance, comfort, etc. Correspondingly, the attention point information of the industry can be generated according to the number of the comment word information matched with each attention point in each industry.
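For the automobile example, a minimal sketch of the focus point count could look like this. It assumes the comment words of one industry are already available as a list; the function name `focus_counts` and the sample words are hypothetical.

```python
from collections import Counter

# Focus point keywords for the automobile industry, as listed in the example above.
FOCUS_KEYWORDS = {
    "fuel consumption", "configuration", "appearance",
    "space", "cost performance", "comfort",
}

def focus_counts(comment_words, focus_keywords=FOCUS_KEYWORDS):
    """Count how often each focus point keyword occurs among the comment words."""
    return Counter(w for w in comment_words if w in focus_keywords)

words = ["space", "fuel consumption", "space", "engine", "comfort"]
counts = focus_counts(words)
ranking = counts.most_common()  # focus points ordered by attention
```

Ranking the counts gives the points users in that industry, category, or brand discuss most.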
In the present invention, the quality of the comment information acquired from different websites differs. For example, the comment information may be divided into a stable level, an available level and a caution level. Different preprocessing may be applied to comment information of different levels, and different weights may be assigned when calculating public praise.
Stable level: e-commerce comment areas and professional review websites;
Available level: vertical communities, Douban, and question-and-answer websites;
Caution level: Tieba (post bars), which contain too much irrelevant information and need to have filler content removed before analysis.
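One way to apply level-dependent weights is sketched below. The description only says that different weights may be given, so the weight values, the source labels, and the function name `weighted_score` are all hypothetical.

```python
# Mapping from source type to quality level (source labels are hypothetical).
SOURCE_LEVEL = {
    "ecommerce_review": "stable",
    "professional_review": "stable",
    "vertical_community": "available",
    "qa_site": "available",
    "post_bar": "caution",
}

# Hypothetical weights; the description does not specify any values.
LEVEL_WEIGHT = {"stable": 1.0, "available": 0.7, "caution": 0.3}

def weighted_score(labeled_comments):
    """Sum polarity (+1 or -1) weighted by the quality level of each source."""
    return sum(
        LEVEL_WEIGHT[SOURCE_LEVEL[source]] * polarity
        for source, polarity in labeled_comments
    )

sample = [("ecommerce_review", 1), ("post_bar", -1), ("qa_site", 1)]
total = weighted_score(sample)
```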
As shown in fig. 4, a schematic structural diagram of an embodiment 1 of the data analysis system based on internet user comments disclosed in the present application includes a comment acquisition module 101, a word segmentation module 102, a classification module 103, an emotion labeling module 104, and a report making module 105, where:
the comment acquisition module 101 is used for acquiring comment information of a user on the internet;
The user comment information is collected automatically by a crawler, which may be hand-written or built on a mature crawler framework such as Scrapy. The sources of the user comment information include:
comprehensive communities with vertical channels, which are characterized by covering a wide range of topics while clearly separating vertical industries by means of sub-forums, channels and the like;
vertical communities, which are characterized by concentrating on the discussion of a certain category (or even a certain brand), such as Autohome (automobile industry) and the Zhongguancun mobile phone forum (mobile phones);
professional review aggregation websites, which only collect review data of a certain category and do not directly sell goods of that category, such as the Zhongguancun mobile phone reviews;
e-commerce comment areas, that is, e-commerce websites that provide a product comment area, such as JD.com and Yihaodian.
The word segmentation module 102 is configured to segment words of the comment information to obtain comment word information of the comment information;
The acquired comment information may be segmented as follows: the corpus is split into sentences at punctuation marks, word segmentation is performed on each sentence, and the frequency of every word that appears is tallied. It is suggested that the length of a segmented word be limited to between 2 and 5 Chinese characters.
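The segmentation and counting steps can be sketched as follows. The description does not name a segmentation algorithm, so this sketch assumes a toy forward-maximum-matching segmenter over a small dictionary; in practice a dedicated Chinese word segmentation library would take its place. The names `make_dict_segmenter` and `word_frequencies`, the punctuation set, and the sample dictionary are hypothetical.

```python
import re
from collections import Counter

# Punctuation (Chinese and ASCII) and whitespace used to split the corpus into sentences.
SENTENCE_DELIMS = re.compile(r"[。！？；，、,.!?;\s]+")

def make_dict_segmenter(dictionary, max_len=5):
    """Build a toy forward-maximum-matching segmenter (an assumed stand-in)."""
    def segment(sentence):
        words, i = [], 0
        while i < len(sentence):
            # Try the longest candidate first; fall back to a single character.
            for size in range(min(max_len, len(sentence) - i), 0, -1):
                piece = sentence[i:i + size]
                if size == 1 or piece in dictionary:
                    words.append(piece)
                    i += size
                    break
        return words
    return segment

def word_frequencies(corpus, segment, min_len=2, max_len=5):
    """Split into sentences, segment each one, and count words of 2 to 5 characters."""
    counts = Counter()
    for sentence in SENTENCE_DELIMS.split(corpus):
        if not sentence:
            continue
        for word in segment(sentence):
            if min_len <= len(word) <= max_len:
                counts[word] += 1
    return counts

segment = make_dict_segmenter({"油耗", "空间", "满意"})
freq = word_frequencies("油耗很低，空间很大。油耗满意！", segment)
```

The length filter drops single characters such as 很, which rarely carry meaning on their own.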
The classification module 103 is configured to compare a preset classification keyword with the comment word information, and classify the comment word information that meets the classification keyword into a category of the classification keyword;
A basic classification system needs to be established first. Classification may be performed by industry, product category, brand and the like, with different keywords for different categories. When the system is built, overlap between keywords must be considered; in general it is not recommended that the same keyword belong to more than one category.
The word segmentation results, that is, the comment word information, are sorted in descending order of occurrence count or frequency; the number of keywords to be classified is determined; stop words are removed; the meaningful words are classified; and a classification threshold is determined. For example, if ten thousand words are obtained after segmenting a certain piece of comment information, the comment information is classified into the category of a keyword only if that keyword appears in the comment more than 10 times.
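A minimal sketch of the threshold rule, assuming each category has its own keyword set and the example threshold of 10 occurrences applies to all of them. The function name `classify_comment`, the category labels, and the sample data are hypothetical.

```python
from collections import Counter

def classify_comment(comment_words, category_keywords, stop_words=frozenset(), threshold=10):
    """Return the categories whose keyword appears more than `threshold` times."""
    # Remove stop words, then rank the remaining words by descending count.
    counts = Counter(w for w in comment_words if w not in stop_words)
    ranked = counts.most_common()
    matched = []
    for category, keywords in category_keywords.items():
        if any(count > threshold for word, count in ranked if word in keywords):
            matched.append(category)
    return matched

# Hypothetical segmented comment and keyword system.
words = ["Audi"] * 11 + ["BMW"] * 3 + ["the"] * 40
keywords = {"brand:Audi": {"Audi"}, "brand:BMW": {"BMW"}}
result = classify_comment(words, keywords, stop_words={"the"})
```

Keeping the keyword sets disjoint, as the text recommends, means a given word can only ever count toward one category.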
The emotion labeling module 104 is configured to compare a preset emotion label word with the classified comment word information, and generate emotion information of the classified comment word information, where the emotion information includes positive emotion information and/or negative emotion information;
The emotion information may be generated as follows: a portion of the corpus is manually labeled to build positive and negative training sets of emotion annotation words; a model is trained on these training sets with a classification algorithm; and the trained model is then used to process the remaining data.
The report making module 105 is used for generating word-of-mouth information of categories based on the emotional information and the comment word information;
according to the emotion information and comment word information of the comment information, word-of-mouth information of a category corresponding to the comment information can be generated, for example: social media popularity, brand/category positive index, brand/category focus analysis, brand image, and the like.
In summary, the application discloses a data analysis system based on Internet user comments, which works as follows: comment information of users on the Internet is acquired; the comment information is segmented into words to obtain comment word information; preset classification keywords are compared with the comment word information, and the comment word information that matches a classification keyword is classified into the category of that keyword; preset emotion annotation words are compared with the classified comment word information to generate emotion information of the classified comment word information, the emotion information comprising positive emotion information and/or negative emotion information; and word-of-mouth (public praise) information of the category is generated based on the emotion information and the comment word information. Compared with the prior art, analyzing public praise from users' Internet comments has the following advantages: the corpus of each website is crawled automatically, so labor cost is relatively low and data collection is easy; the comments are users' own behavior and truly reflect their preferences, so authenticity can be ensured; and the analysis accuracy is high because the full body of user data is analyzed rather than the whole being estimated from a sample.
As shown in fig. 5, a schematic structural diagram of an embodiment 2 of the data analysis system based on internet user comments disclosed in the present application includes an information cleansing module 202, a comment acquisition module 201, a word segmentation module 203, a classification module 204, an emotion labeling module 205, and a report making module 206, where:
the comment acquisition module 201 is used for acquiring comment information of a user on the internet;
The user comment information is collected automatically by a crawler, which may be hand-written or built on a mature crawler framework such as Scrapy. The sources of the user comment information include:
comprehensive communities with vertical channels, which are characterized by covering a wide range of topics while clearly separating vertical industries by means of sub-forums, channels and the like;
vertical communities, which are characterized by concentrating on the discussion of a certain category (or even a certain brand), such as Autohome (automobile industry) and the Zhongguancun mobile phone forum (mobile phones);
professional review aggregation websites, which only collect review data of a certain category and do not directly sell goods of that category, such as the Zhongguancun mobile phone reviews;
e-commerce comment areas, that is, e-commerce websites that provide a product comment area, such as JD.com and Yihaodian.
The information cleaning module 202 is used for cleaning the comment information and removing impurities in the comment information;
The impurities include any one or more of the following: comment information whose text is blank (some corpus entries are picture-only comments whose text content is displayed as blank, and these can be removed); comment information whose corpus length falls outside preset thresholds (entries that are too short or too long are removed: an entry shorter than 5 generally contains no effective information, while an entry longer than 200 is suspected of being filler content and may affect subsequent analysis, so both can be removed); and non-user-generated-content comment information (non-UGC content such as sign-in posts and activity posts is removed).
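The three cleaning rules can be sketched as a single filter. The length limits of 5 and 200 come from the text (measured here in characters, which is an assumption); the marker strings used to detect non-UGC posts and the function names are hypothetical, since the description does not say how such posts are recognized.

```python
# Hypothetical substrings that flag non-UGC posts such as sign-in or activity posts.
NON_UGC_MARKERS = ("sign-in", "check-in", "lucky draw")

def is_impurity(text, min_len=5, max_len=200, markers=NON_UGC_MARKERS):
    """Return True if a comment should be removed before analysis."""
    stripped = text.strip()
    if not stripped:                      # blank text, e.g. picture-only comments
        return True
    if len(stripped) < min_len or len(stripped) > max_len:
        return True                       # too short or too long
    lowered = stripped.lower()
    return any(marker in lowered for marker in markers)  # non-UGC content

def clean_comments(comments):
    """Keep only the comments that pass all three cleaning rules."""
    return [c for c in comments if not is_impurity(c)]

raw = [
    "",
    "ok",
    "Daily sign-in post, day 12",
    "The cabin is quiet and the seats are comfortable.",
    "x" * 201,
]
cleaned = clean_comments(raw)
```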
The word segmentation module 203 is used for segmenting the comment information to obtain comment word information of the comment information;
The acquired comment information may be segmented as follows: the corpus is split into sentences at punctuation marks, word segmentation is performed on each sentence, and the frequency of every word that appears is tallied. It is suggested that the length of a segmented word be limited to between 2 and 5 Chinese characters.
The classification module 204 is configured to compare a preset classification keyword with the comment word information, and classify the comment word information that meets the classification keyword into a category of the classification keyword;
A basic classification system needs to be established first. Classification may be performed by industry, product category, brand and the like, with different keywords for different categories. When the system is built, overlap between keywords must be considered; in general it is not recommended that the same keyword belong to more than one category.
The word segmentation results, that is, the comment word information, are sorted in descending order of occurrence count or frequency; the number of keywords to be classified is determined; stop words are removed; the meaningful words are classified; and a classification threshold is determined. For example, if ten thousand words are obtained after segmenting a certain piece of comment information, the comment information is classified into the category of a keyword only if that keyword appears in the comment more than 10 times.
The emotion labeling module 205 is configured to compare a preset emotion label word with the classified comment word information, and generate emotion information of the classified comment word information, where the emotion information includes positive emotion information and/or negative emotion information;
The emotion information may be generated as follows: a portion of the corpus is manually labeled to build positive and negative training sets of emotion annotation words; a model is trained on these training sets with a classification algorithm; and the trained model is then used to process the remaining data.
The report making module 206 is configured to generate word-of-mouth information of a category based on the emotion information and the comment word information;
according to the emotion information and comment word information of the comment information, word-of-mouth information of a category corresponding to the comment information can be generated, for example: social media popularity, brand/category positive index, brand/category focus analysis, brand image, and the like.
As shown in fig. 6, a schematic structural diagram of an embodiment 3 of the data analysis system based on internet user comments disclosed in the present application includes a comment acquisition module 301, a word segmentation module 302, a classification module 303, an emotion labeling module 304, and a report creation module 305, where the report creation module 305 includes a hotness unit 306 and an image unit 307, where:
the comment acquisition module 301 is used for acquiring comment information of a user on the internet;
The user comment information is collected automatically by a crawler, which may be hand-written or built on a mature crawler framework such as Scrapy. The sources of the user comment information include:
comprehensive communities with vertical channels, which are characterized by covering a wide range of topics while clearly separating vertical industries by means of sub-forums, channels and the like;
vertical communities, which are characterized by concentrating on the discussion of a certain category (or even a certain brand), such as Autohome (automobile industry) and the Zhongguancun mobile phone forum (mobile phones);
professional review aggregation websites, which only collect review data of a certain category and do not directly sell goods of that category, such as the Zhongguancun mobile phone reviews;
e-commerce comment areas, that is, e-commerce websites that provide a product comment area, such as JD.com and Yihaodian.
The word segmentation module 302 is configured to segment words of the comment information to obtain comment word information of the comment information;
The acquired comment information may be segmented as follows: the corpus is split into sentences at punctuation marks, word segmentation is performed on each sentence, and the frequency of every word that appears is tallied. It is suggested that the length of a segmented word be limited to between 2 and 5 Chinese characters.
The classification module 303 is configured to compare a preset classification keyword with the comment word information, and classify the comment word information that matches the classification keyword into a category of the classification keyword;
A basic classification system needs to be established first. Classification may be performed by industry, product category, brand and the like, with different keywords for different categories. When the system is built, overlap between keywords must be considered; in general it is not recommended that the same keyword belong to more than one category.
The word segmentation results, that is, the comment word information, are sorted in descending order of occurrence count or frequency; the number of keywords to be classified is determined; stop words are removed; the meaningful words are classified; and a classification threshold is determined. For example, if ten thousand words are obtained after segmenting a certain piece of comment information, the comment information is classified into the category of a keyword only if that keyword appears in the comment more than 10 times.
The emotion labeling module 304 is configured to compare a preset emotion label word with the classified comment word information, and generate emotion information of the classified comment word information, where the emotion information includes positive emotion information and/or negative emotion information;
The emotion information may be generated as follows: a portion of the corpus is manually labeled to build positive and negative training sets of emotion annotation words; a model is trained on these training sets with a classification algorithm; and the trained model is then used to process the remaining data.
The popularity unit 306 is used for generating popularity information of a category corresponding to the number of the comment word information based on the number of the comment word information;
In this embodiment, the classification keywords include any one or more of industry keywords, product-category keywords, and brand keywords; the categories include any one or more of industry categories, product categories, and brand categories; and the public praise information includes category popularity information, category image information, and a category positive index.
Since the comment information and the comment word information have already been classified by keyword, the heat (popularity) of a category can be determined from the number of pieces of comment word information in that category that match its keyword.
The image unit 307 is used for generating a category positive index and/or category image information of the category based on the emotion direction and the quantity of the emotion information;
Because the comment information and the comment word information have been classified by keyword, and the emotion information of the comment word information has been determined from the emotion annotation words, the positive index or the category image information of a category can be calculated from the emotion direction and the quantity of the emotion information. For example, if each piece of positive comment word information is scored +1 and each piece of negative comment word information is scored -1, the image information and positive index of each category can be obtained from the final total score.
To further optimize the present embodiment, the report making module 305 further includes a point of interest unit 308, where the point of interest unit 308 is configured to generate category point of interest information corresponding to the point of interest keyword based on the number of comment word information corresponding to the point of interest keyword;
the classification keywords further comprise focus keywords, and the public praise information further comprises industry focus information, category focus information or brand focus information. For example, if the industry keyword is an automobile, the focus keyword may include fuel consumption, configuration, appearance, space, cost performance, comfort, etc. Correspondingly, the attention point information of the industry can be generated according to the number of the comment word information matched with each attention point in each industry.
In the present invention, the quality of the comment information acquired from different websites differs. For example, the comment information may be divided into a stable level, an available level and a caution level. Different preprocessing may be applied to comment information of different levels, and different weights may be assigned when calculating public praise.
Stable level: e-commerce comment areas and professional review websites;
Available level: vertical communities, Douban, and question-and-answer websites;
Caution level: Tieba (post bars), which contain too much irrelevant information and need to have filler content removed before analysis.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (8)

1. A data analysis method based on Internet user comments is characterized by comprising the following steps:
obtaining comment information of a user on the Internet;
cutting words of the comment information to obtain comment word information of the comment information;
performing descending order arrangement according to the occurrence frequency of the comment word information, comparing a preset classification keyword with the arranged comment word information, if the occurrence frequency of the arranged comment word information is greater than or equal to a threshold value corresponding to the classification keyword, judging that the arranged comment word information accords with the classification keyword, and classifying the arranged comment word information into the category of the classification keyword;
comparing a preset emotion marking word with the classified comment word information to generate emotion information of the classified comment word information, wherein the emotion information comprises positive emotion information and/or negative emotion information;
generating public praise information of the category based on the emotion information and the comment word information;
the classification keywords comprise any one or more of industry keywords, category keywords and brand keywords, the categories comprise any one or more of industry categories, category categories and brand categories, the public praise information comprises category popularity information, category image information and category positive indexes, and the generation of the public praise information of the categories based on the emotion information and the comment word information comprises the following steps:
generating popularity information of a category corresponding to the comment word information based on the number of the comment word information;
and generating the category positive index and/or the category image information of the category based on the emotional direction and the quantity of the emotional information.
2. The method of claim 1, wherein before the tokenizing the comment information to obtain comment word information of the comment information, further comprising:
and cleaning the comment information to remove impurities in the comment information.
3. The method of claim 2, wherein the impurities include any one or more of comment information in which text is displayed as blank, comment information in which corpus length exceeds a preset threshold value, and non-user-generated content comment information in the comment information.
4. The method of claim 1, wherein the classification keywords further comprise point of interest keywords, the public praise information further comprises industry point of interest information, category point of interest information, or brand point of interest information, and the generating public praise information of the category based on the emotion information and the comment word information further comprises:
generating category focus information corresponding to the focus keyword based on the number of comment word information corresponding to the focus keyword.
5. A data analysis system based on Internet user comments, characterized by comprising a comment acquisition module, a word segmentation module, a classification module, an emotion labeling module and a report making module, wherein:
the comment acquisition module is used for acquiring comment information of a user on the Internet;
the word cutting module is used for cutting words of the comment information to obtain comment word information of the comment information;
the classification module is used for performing descending order arrangement according to the occurrence times of the comment word information, comparing preset classification keywords with the arranged comment word information, and judging that the arranged comment word information conforms to the classification keywords and classifying the arranged comment word information into the category of the classification keywords if the occurrence times of the arranged comment word information is greater than or equal to a threshold value corresponding to the classification keywords;
the emotion marking module is used for comparing a preset emotion marking word with the classified comment word information to generate emotion information of the classified comment word information, and the emotion information comprises positive emotion information and/or negative emotion information;
the report making module is used for generating public praise information of the category based on the emotion information and the comment word information;
the classification keywords comprise any one or more of industry keywords, category keywords and brand keywords, the category comprises any one or more of industry categories, category categories and brand categories, and the public praise information comprises category popularity information, category image information and category positive indexes; the report making module comprises a heat unit and an image unit, wherein:
the popularity unit is used for generating popularity information of a category corresponding to the number of the comment word information based on the number of the comment word information;
the image unit is used for generating the category positive index and/or the category image information of the category based on the emotional direction and the quantity of the emotional information.
6. The system of claim 5, further comprising an information cleansing module, wherein:
the information cleaning module is used for cleaning the comment information and removing impurities in the comment information.
7. The system of claim 6, wherein the impurities include any one or more of comment information in which text is displayed as blank, comment information in which corpus length exceeds a preset threshold value, and non-user-generated content comment information in the comment information.
8. The system of claim 5, wherein the classification keywords further comprise point of interest keywords, the public praise information further comprises industry point of interest information, category point of interest information, or brand point of interest information; the report making module further comprises a point of interest unit, wherein:
the point of interest unit is configured to generate category point of interest information corresponding to a point of interest keyword based on the number of comment word information corresponding to the point of interest keyword.
CN201810167403.3A 2018-02-28 2018-02-28 Data analysis method and system based on Internet user comments Active CN108364199B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810167403.3A CN108364199B (en) 2018-02-28 2018-02-28 Data analysis method and system based on Internet user comments

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810167403.3A CN108364199B (en) 2018-02-28 2018-02-28 Data analysis method and system based on Internet user comments

Publications (2)

Publication Number Publication Date
CN108364199A CN108364199A (en) 2018-08-03
CN108364199B true CN108364199B (en) 2021-08-13

Family

ID=63002799

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810167403.3A Active CN108364199B (en) 2018-02-28 2018-02-28 Data analysis method and system based on Internet user comments

Country Status (1)

Country Link
CN (1) CN108364199B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210150594A1 (en) * 2019-11-15 2021-05-20 Midea Group Co., Ltd. System, Method, and User Interface for Facilitating Product Research and Development
CN110991838A (en) * 2019-11-21 2020-04-10 中国联合网络通信集团有限公司 Method and device for determining competitiveness index of communication operator
CN111523923B (en) * 2020-04-06 2023-09-29 北京三快在线科技有限公司 Merchant comment management system, merchant comment management method, merchant comment management server and storage medium
CN111444434A (en) * 2020-04-22 2020-07-24 郭庆涛 Method and system for generating Internet feedback comments
CN111724196A (en) * 2020-05-14 2020-09-29 天津大学 Method for improving quality of automobile product based on user experience
CN112053080A (en) * 2020-09-15 2020-12-08 上海唐硕信息科技有限公司 Brand scoring method based on user experience perception
CN112257439B (en) * 2020-10-30 2024-04-12 上海明略人工智能(集团)有限公司 Method and device for mining hot root words through public opinion data
CN112419029B (en) * 2020-11-27 2021-11-12 诺丁汉(宁波保税区)区块链有限公司 Similar financial institution risk monitoring method, risk simulation system and storage medium
CN113744068B (en) * 2021-11-08 2022-07-12 深圳市路演中网络科技有限公司 Financial investment data evaluation method and system
CN114925308B (en) * 2022-04-29 2023-10-03 北京百度网讯科技有限公司 Webpage processing method and device of website, electronic equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101071443A (en) * 2007-06-26 2007-11-14 腾讯科技(深圳)有限公司 Content-related advertising identifying method and content-related advertising server
CN105868185A (en) * 2016-05-16 2016-08-17 南京邮电大学 Part-of-speech-tagging-based dictionary construction method applied in shopping comment emotion analysis
CN106294425A (en) * 2015-05-26 2017-01-04 富泰华工业(深圳)有限公司 The automatic image-text method of abstracting of commodity network of relation article and system

Also Published As

Publication number Publication date
CN108364199A (en) 2018-08-03

Similar Documents

Publication Publication Date Title
CN108364199B (en) Data analysis method and system based on Internet user comments
Bansal et al. On predicting elections with hybrid topic based sentiment analysis of tweets
CN107291780B (en) User comment information display method and device
CN109325165B (en) Network public opinion analysis method, device and storage medium
CN107391493B (en) Public opinion information extraction method and device, terminal equipment and storage medium
CN105095288B (en) Data analysis method and data analysis device
CN113837531A (en) Product quality problem finding and risk assessment method based on network comments
EP2618296A1 (en) Social media data analysis system and method
CN107544988B (en) Method and device for acquiring public opinion data
CN103336766A (en) Short text garbage identification and modeling method and device
KR20120109943A (en) Emotion classification method for analysis of emotion immanent in sentence
CN105095179B (en) The method and device that user's evaluation is handled
CN110706028A (en) Commodity evaluation emotion analysis system based on attribute characteristics
Awrahman et al. Sentiment analysis and opinion mining within social networks using konstanz information miner
CN111695357A (en) Text labeling method and related product
KR20190048781A (en) System for crawling and analyzing online reviews about merchandise or service
Hasanati et al. Implementation of support vector machine with lexicon based for sentiment analysis on Twitter
CN104462083A (en) Content comparison method and device and information processing system
CN113282704A (en) Method and device for judging and screening comment usefulness
CN107291686B (en) Method and system for identifying emotion identification
CN111882224A (en) Method and device for classifying consumption scenes
Deitrick et al. Enhancing sentiment analysis on twitter using community detection
CN112182244A (en) Brand knowledge graph construction method and device and terminal
CN115687790B (en) Advertisement pushing method and system based on big data and cloud platform
JP6509590B2 (en) User's emotion analysis device and program for goods

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant