CN115829657A - Data acquisition method and device applied to data statistics and storage medium - Google Patents

Publication number
CN115829657A
Authority
CN
China
Prior art keywords
data
evaluation
commodity
parameter
statistics
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211710953.8A
Other languages
Chinese (zh)
Inventor
林鸿熙
金华松
林琦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Putian University
Original Assignee
Putian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Putian University
Priority to CN202211710953.8A
Publication of CN115829657A
Legal status: Pending

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a data acquisition method, a device and a storage medium applied to data statistics. The method comprises: acquiring a first data source used for data acquisition in the current data statistics and a first target parameter of the current data statistics; taking the first target parameter, together with similar parameters in the same technical field, as a first parameter set; acquiring historical expression content related to the first parameter set, where the historical expression content is data-statistics content whose target parameter belongs to the first parameter set; filtering the first data source according to the self value coefficient of the historical expression content and the data sources cited in that content, to obtain a filtered second data source; and collecting the data required for the current data statistics through the second data source. By filtering out low-value data sources, the invention ensures the value of the collected data.

Description

Data acquisition method and device applied to data statistics and storage medium
Technical Field
The invention relates to the technical field of data statistics, in particular to a data acquisition method and device applied to data statistics and a storage medium.
Background
An important factor in the accuracy of data statistics is whether the collected data is comprehensive and valuable. Traditional data acquisition methods include questionnaire surveys, consulting reference materials, field investigation, experiments and the like. These are all manual methods suited to scenarios that require little data, so they are not suitable for big data scenarios in which the data volume increases sharply.
For this reason, data acquisition in big data scenarios is generally automated. If data arising in real-world scenes is required, such as fingerprints, faces, video, temperature and humidity, an Internet of Things system built from various sensors aggregates the data for use in data statistics. If data arising in network scenarios is required, the usage data of each application can be obtained through embedded tracking points, public data published on websites can be obtained through web crawlers, or data can be exchanged between software systems through open databases, software interfaces and similar means.
Meanwhile, existing automatically collected data is usually only subjected to data cleaning, which deletes malformed data such as incomplete, erroneous and duplicate records and finally yields data that meets the format requirements. In practical use, however, the following problem often occurs: although enough data is collected, the results obtained when it is applied to data statistics are not ideal.
Disclosure of Invention
In order to solve the above problems in the prior art, the present invention provides a data collecting method, device and storage medium for data statistics, so as to improve the quality of collected data, thereby ensuring the effect of applying the data to data statistics.
In order to achieve the purpose, the invention adopts the technical scheme that:
in a first aspect, the present invention provides a data collecting method applied to data statistics, including:
acquiring a first data source for data acquisition of current data statistics and a first target parameter of the current data statistics;
taking the first target parameter and similar parameters in the same technical field as the first target parameter as a first parameter set;
acquiring historical expression content related to the first parameter set, wherein the historical expression content is data statistical content and a target parameter of the historical expression content belongs to the first parameter set;
filtering the first data source according to the self value coefficient of the historical expression content and the data source referenced in the content to obtain a filtered second data source;
and collecting data required by the current data statistics through the second data source.
The invention has the following beneficial effects: by also acquiring other parameters in the same technical field as the target parameter the user wants to analyze, the parameter range is expanded and the comprehensiveness of the historical expression content is ensured; the data sources involved in past data statistics are then assessed by the value coefficient of the expression content in which they appear, so that the value of each data source can be judged and low-value data sources filtered out. Data is collected only from high-value data source channels, which reduces the amount of data collected while ensuring its value, and thus ensures the effect of applying the collected data to data statistics.
Optionally, when the historical expression content is in the form of a paper, the filtering the first data source according to the self value coefficient of the historical expression content and the data source referenced in the content comprises:
acquiring the author, publication venue, citation count, citing venues and cited data sources of the historical expression content;
comprehensively obtaining the self value coefficient of the historical expression content based on the value levels of the author, publication venue, citation count and citing venues in their respective evaluation systems, and assigning the self value coefficient to the cited data sources;
and summarizing the comprehensive value coefficient of each data source in the first data source, and filtering the first data source according to a preset filtering rule.
As can be seen from the above description, when the historical expression content is a paper, the value coefficient of the paper is judged from the social status of its author, the standing of the journal in which it is published and how it is cited; this gives the value coefficient of the data sources cited in the paper. The comprehensive value coefficient of each data source in the first data source is then summarized so that lower-value data sources are filtered out and the value of the collected data is ensured.
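The assignment and summarizing steps above can be sketched as follows. This is only an illustrative sketch: the patent does not say how the four value levels are combined or how coefficients are summarized per data source, so the equal-weight mean and the per-source average used here are assumptions, as are the function names, the field names and the sample data.

```python
from collections import defaultdict
from statistics import mean

def paper_value(author, venue, citation_count, citing_venues):
    """Combine four value levels (each assumed to lie in [0, 1]) into one coefficient.
    Equal weighting is an assumption; the source only says 'comprehensively'."""
    return mean([author, venue, citation_count, citing_venues])

def summarize_sources(papers):
    """Assign each paper's coefficient to every data source it cites,
    then average per source (averaging is an assumption)."""
    per_source = defaultdict(list)
    for paper in papers:
        value = paper_value(*paper["scores"])
        for source in paper["cited_sources"]:
            per_source[source].append(value)
    return {source: mean(values) for source, values in per_source.items()}

# Hypothetical papers and data source names, for illustration only.
papers = [
    {"scores": (0.8, 0.6, 0.7, 0.9), "cited_sources": ["source_a", "source_b"]},
    {"scores": (0.4, 0.2, 0.3, 0.3), "cited_sources": ["source_b"]},
]
print(summarize_sources(papers))
```

A data source cited only by high-value papers ends up with a high comprehensive coefficient, while one that is also cited by low-value papers is pulled down.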
Optionally, when the historical expression content is in the form of a paper, the evaluation of the author comprises:
obtaining an initial evaluation of the author according to the author's social status;
acquiring the publication venue, citation count and citing venues of the author's historical expression content in the data statistics direction to obtain an overall field evaluation of the author;
acquiring the publication venue, citation count and citing venues of the author's historical expression content that is in the data statistics direction and whose target parameter belongs to the first parameter set, to obtain a subdivided field evaluation of the author;
and calibrating the initial evaluation with the overall field evaluation and the subdivided field evaluation to obtain a final evaluation of the author, wherein the influence coefficient of the subdivided field evaluation in the calibration is larger than that of the overall field evaluation.
As can be seen from the above description, whereas an author's value is currently usually evaluated only from the author's social status, this method additionally makes an overall field evaluation based on the author's historical expression content in the data statistics direction, further makes a subdivided field evaluation restricted to content with a consistent target parameter, and uses these two field evaluations to calibrate the initial evaluation derived from social status, so that the value evaluation of the author is more reasonable and accurate.
Optionally, the preset filtering rule comprises filtering according to a preset number, or filtering according to whether the value coefficient is higher than a preset value.
As can be seen from the above description, filtering by a preset number guarantees the number of data sources, while filtering by a preset value guarantees their quality. On this basis, filtering by the preset value can also be applied subject to a minimum number, so that both the number and the quality of the data sources are guaranteed.
Optionally, when the historical expression content is in video form, the value coefficient of the historical expression content is measured by the author, the publication venue, the click rate and the evaluation tendency.
Optionally, the method further comprises:
acquiring a first influence parameter capable of influencing the first target parameter;
taking each first influence parameter together with its synonymous parameters as the synonym set of that first influence parameter, wherein the synonym set is used to assign the corresponding self value coefficient to the first influence parameter when an influence parameter appearing in the historical expression content belongs to the synonym set;
filtering the first influence parameter according to the self value coefficient of the historical expression content and the influence parameters involved in that content, to obtain a filtered second influence parameter;
the collecting data required for the current data statistics by the second data source comprises:
data relating to the second impact parameter is collected via the second data source.
As can be seen from the above description, the influence parameters required for data statistics are also filtered by value, so that subsequent collection gathers only high-value influence parameters from high-value data sources. Collecting fewer low-value influence parameters ensures the value of the collected data, reduces the processing load and improves processing timeliness.
Optionally, when the second influence parameter includes commodity evaluation data, the method further includes:
acquiring a collected commodity evaluation data set, and dividing all commodity evaluation data from the same data source into a plurality of similar user sets UA according to the users' shopping tendency SO and scoring tendency ST;
dividing all commodity evaluation data of the same commodity into a plurality of evaluation sets EA according to the similar user sets UA; clustering all the evaluation data in each evaluation set $EA_i$ and sorting the clusters from high to low by the amount of data they contain; retaining, for each evaluation set $EA_i$, the evaluation data in the first N clusters whose combined share exceeds a preset proportion as a credible evaluation data set; and taking all commodity evaluation data of the same commodity other than the credible evaluation data as a suspicious evaluation data set;
for the credible evaluation data set, judging whether each piece of credible evaluation data for the commodity is real according to the evaluation behavior tendency, across all commodities, of the user who submitted it; if so, retaining the credible evaluation data judged to be real, and otherwise deleting the credible evaluation data judged to be unreal;
and for the suspicious evaluation data set, judging whether each piece of suspicious evaluation data is real according to its evaluation elements, whether the evaluation is specific and whether the merchant has replied; if so, retaining the suspicious evaluation data judged to be real as real evaluation data of the commodity, and otherwise deleting the suspicious evaluation data judged to be unreal.
As can be seen from the above description, trend analysis of users' purchasing and scoring behavior yields sets of similar users; cluster analysis then identifies evaluation data that clearly falls outside the main clusters of a similar user set, which is treated as suspected fake data. Screening both the suspicious and the credible data a second time further safeguards the accuracy of the real/false classification, and thus the authenticity and validity of the collected commodity evaluation data.
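The split into credible and suspicious evaluation data within one evaluation set can be sketched as below. The sketch assumes that cluster labels have already been produced by some clustering algorithm (the patent does not name one), and it reads the retention rule as "keep the largest clusters until their combined share of the set exceeds the preset proportion". The function name and the sample labels are hypothetical.

```python
from collections import Counter

def split_credible(cluster_labels, min_share):
    """cluster_labels[i] is the cluster id of review i within one evaluation set.
    Returns (credible_indices, suspicious_indices)."""
    if not cluster_labels:
        return [], []
    total = len(cluster_labels)
    kept, covered = set(), 0
    # Walk clusters from largest to smallest until the kept clusters
    # together cover more than min_share of the evaluation set.
    for cluster_id, size in Counter(cluster_labels).most_common():
        if covered / total > min_share:
            break
        kept.add(cluster_id)
        covered += size
    credible = [i for i, c in enumerate(cluster_labels) if c in kept]
    suspicious = [i for i, c in enumerate(cluster_labels) if c not in kept]
    return credible, suspicious

# Ten reviews in four clusters of sizes 4, 3, 2 and 1 (illustrative labels).
labels = [0, 0, 0, 0, 1, 1, 1, 2, 2, 3]
credible, suspicious = split_credible(labels, min_share=0.6)
print(credible, suspicious)
```

With a preset proportion of 0.6, the two largest clusters cover 70% of the set and are kept as credible; the reviews in the two small clusters become the suspicious set that is screened separately.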
Optionally, the acquiring the collected commodity evaluation data set further includes:
filtering the collected commodity evaluation data set according to the suspicious users identified by the data source;
and filtering out, from the commodity evaluation data set remaining after the suspicious users have been filtered, commodity evaluation data whose evaluation elements exceed a preset element threshold and whose similarity exceeds a preset similarity threshold, to obtain an available commodity evaluation data set, wherein the evaluation elements comprise text, pictures or video.
As can be seen from the above description, a user whom the data source itself regards as essentially posting paid fake reviews ("brushing") is treated as a suspicious user. If the similarity between pieces of commodity evaluation data is too high, they are considered to have been produced from the same template and can be filtered out directly, which further reduces the collection of false evaluations and safeguards the accuracy of data statistics.
In a second aspect, the present invention provides a data acquisition apparatus for data statistics, comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement the method of the first aspect.
In a third aspect, the present invention provides a computer readable storage medium, wherein a computer program is stored on the computer readable storage medium, and when executed, the computer program implements the method of the first aspect.
For the technical effects of the data acquisition device applied to data statistics provided in the second aspect and of the computer-readable storage medium provided in the third aspect, refer to the description of the data acquisition method applied to data statistics provided in the first aspect.
Drawings
Fig. 1 is a schematic main flow chart of a data acquisition method applied to data statistics according to an embodiment of the present invention;
Fig. 2 is a schematic overall flow chart of a data acquisition method applied to data statistics according to an embodiment of the present invention;
Fig. 3 is a schematic structural diagram of a data acquisition device applied to data statistics according to an embodiment of the present invention.
[ description of reference ]
1: a data acquisition device for data statistics;
2: a processor;
3: a memory.
Detailed Description
In order to better understand the above technical solutions, exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the invention are shown in the drawings, it should be understood that the invention can be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.
Example one
In the current big data era, more and more application scenarios require data statistics, such as a whole-network evaluation of a certain commodity or a consumption profile of a certain user. This embodiment applies to data statistics in a network scenario. Its data acquisition modes include embedded tracking points, web crawlers, open databases and software interfaces, where a software interface provides legally obtained data by connecting to a source channel after that channel has granted permission, for example the sales data and evaluation data of commodities in the flagship store of a mobile phone brand.
Referring to fig. 1 to 2, a data collecting method applied to data statistics includes:
S1, acquiring a first data source used for data acquisition in the current data statistics and a first target parameter of the current data statistics;
the first data source is the set of source channels preset for the data statistics, and the first target parameter is the conclusion that the current data statistics aims to obtain.
In this embodiment, the current data statistics concerns the market reaction to a certain mobile phone model. The source channels include six e-commerce channels, namely Taobao, JD.com (Jingdong), Shuduo, Suning.com (Suning Yigou), Gome (Guomei) and the mobile phone brand's official website, from which sales data and evaluation data are obtained, and five news websites, namely People's Daily Online, Xinhuanet, Phoenix News, China Daily Online and Sohu News, from which report data is obtained. The first target parameter is the market reaction effect.
S2, taking the first target parameter and similar parameters in the same technical field as the first target parameter as a first parameter set;
similar parameters in the same technical field as the market reaction effect include market feedback, market evaluation, market sales, market public praise, market ranking, market recommendation and the like; that is, synonyms, hypernyms and hyponyms within the same subdivided field are regarded as similar parameters of the same technical field.
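As a minimal sketch of step S2, the first parameter set can be built from a lookup of related terms. The lexicon below is hypothetical and hand-written from the examples in this embodiment; in practice the synonyms, hypernyms and hyponyms would come from some domain vocabulary that the patent does not specify.

```python
# Hypothetical lexicon: target parameter -> related terms in the same subdivided field.
RELATED_TERMS = {
    "market reaction effect": [
        "market feedback", "market evaluation", "market sales",
        "market public praise", "market ranking", "market recommendation",
    ],
}

def build_parameter_set(target, lexicon=RELATED_TERMS):
    """Return the first parameter set: the target plus its similar parameters."""
    return {target, *lexicon.get(target, [])}

def belongs_to_set(candidate, parameter_set):
    """Check whether a target parameter found in historical content is in the set."""
    return candidate.strip().lower() in parameter_set

first_parameter_set = build_parameter_set("market reaction effect")
print(belongs_to_set("Market Feedback", first_parameter_set))
```

Step S3 would then keep only historical expression content whose target parameter passes `belongs_to_set`.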
S3, acquiring historical expression content related to the first parameter set, wherein the historical expression content is data statistical content and a target parameter of the historical expression content belongs to the first parameter set;
the historical expression content is content published in the past, whose carrier forms include text, pictures, video and the like, where text includes papers, news reports, posts, patents and the like.
Step S3 thus restricts the historical expression content to content that belongs to the field of data statistics and whose research direction is consistent with that of the current data statistics.
S4, filtering the first data source according to the self value coefficient of the historical expression content and the data source quoted in the content to obtain a filtered second data source;
as shown in fig. 2, this embodiment evaluates the value coefficient differently for different forms of historical expression content.
When the historical expression content is in video form, its value coefficient is measured by the author, the publication venue, the click rate and the evaluation tendency. For example, a recommendation video about a mobile phone brand on a short-video website is regarded as a market reaction to that brand. The relevant measures are the number of followers of the video's author, the traffic of the short-video website, the click count of the video, and the evaluation tendency, which covers the number of likes, the number of shares, the sentiment of the comments and the like and represents how well the video's audience accepts it. Together these correctly reflect the value coefficient of the recommendation video.
When the historical expression content is in picture form, its value coefficient is likewise measured by the author, the publication venue, the click rate and the evaluation tendency, in the same way as for video, so the description of the video form applies.
When the historical expression content is in the form of a paper, filtering the first data source according to the self value coefficient of the historical expression content and the data source referenced in the content comprises the following steps:
S41, acquiring the author, publication venue, citation count, citing venues and cited data sources of the historical expression content;
S42, comprehensively obtaining the self value coefficient of the historical expression content based on the value levels of the author, publication venue, citation count and citing venues in their respective evaluation systems, and assigning the self value coefficient to the cited data sources;
the publication venue is the journal in which the content is published. Journals are generally rated by impact factor, so the value coefficient of the publication venue can be evaluated from the impact factor; if the content is merely a graduation thesis that has not been submitted to any journal, the lowest value is assigned.
The impact factor reflects how often the journal's articles are cited and emphasizes the journal's overall influence, whereas the citation count and citing venues describe how the historical expression content itself is cited, the citing venue being the journal of the article that cites it; a value coefficient can therefore be assigned according to this citation record.
In the prior art, an author's value is usually evaluated only from the author's social status, for example academic honors such as academician, Thousand Talents Program or Ten Thousand Talents Program scholar, chief scientist of a 973 Program or 863 Program project, Changjiang Scholar, Distinguished Young Scholar (Jieqing) and Excellent Young Scholar (Youqing); degree types such as doctorate, master's and bachelor's; and professional title grades such as senior, intermediate and junior. In this embodiment, to make the value evaluation of the author more reasonable and accurate, the evaluation of the author comprises:
S421, obtaining an initial evaluation a of the author according to the author's social status;
S422, acquiring the publication venue, citation count and citing venues of the author's historical expression content in the data statistics direction to obtain the overall field evaluation of the author;
authors with the same social status can differ greatly in professional direction. Since this embodiment concerns the data statistics direction, the author's influence in that direction must be evaluated; this is the overall field evaluation.
S423, acquiring the publication venue, citation count and citing venues of the author's historical expression content that is in the data statistics direction and whose target parameter belongs to the first parameter set, to obtain the subdivided field evaluation of the author;
this embodiment also evaluates the author's influence in the subdivided direction, here market feedback on mobile phone products; this is the subdivided field evaluation.
The publication venue, citation count and citing venues referred to in steps S422 and S423 are as explained above.
S424, calibrating the initial evaluation with the overall field evaluation and the subdivided field evaluation to obtain the final evaluation of the author, wherein the influence coefficient of the subdivided field evaluation in the calibration is larger than that of the overall field evaluation.
In this embodiment, a calibration coefficient $b = b_1 p + b_2 (1 - p)$ is obtained from the overall field evaluation $b_1$ and the subdivided field evaluation $b_2$, where $p$ is the influence coefficient of the overall field evaluation; the initial evaluation $a$ is then calibrated with this coefficient to give the final evaluation $= 2ab/(a + b)$.
In this embodiment, the influence coefficient of the subdivided field evaluation is 0.64 and that of the overall field evaluation is 0.36. If the initial evaluation, overall field evaluation and subdivided field evaluation of a certain author are 0.6, 0.8 and 0.9 respectively, the calibration coefficient is $b = 0.8 \times 0.36 + 0.9 \times 0.64 = 0.864$ and the final evaluation is $2ab/(a + b) = 2 \times 0.6 \times 0.864 / (0.6 + 0.864) \approx 0.71$, so the value coefficient of the author is 0.71.
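The calibration can be checked with a few lines of code. The formula is the one given in this embodiment (a weighted field evaluation followed by a harmonic mean with the initial evaluation); the function name, the default value of p and the input checks are illustrative assumptions.

```python
def calibrate_author(a, b1, b2, p=0.36):
    """Final author evaluation from the initial evaluation a, the overall field
    evaluation b1 and the subdivided field evaluation b2.
    p is the influence coefficient of the overall field evaluation; it must stay
    below 0.5 so that the subdivided field evaluation (1 - p) weighs more."""
    if not 0.0 <= p < 0.5:
        raise ValueError("p must be in [0, 0.5)")
    b = b1 * p + b2 * (1 - p)        # calibration coefficient
    if a + b == 0:
        return 0.0                   # avoid division by zero for all-zero inputs
    return 2 * a * b / (a + b)       # harmonic mean of a and b

final = calibrate_author(0.6, 0.8, 0.9)
print(round(final, 2))  # 0.71
```

Because the harmonic mean is dominated by the smaller of its two inputs, a high social status cannot fully compensate for weak field evaluations, and vice versa.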
S43, summarizing the comprehensive value coefficient of each data source in the first data source, and filtering the first data source according to a preset filtering rule.
In this embodiment, the preset filtering rule is to filter by whether the comprehensive value coefficient is higher than a preset value, subject to a preset minimum number, where the preset number is 6 and the preset value is 0.6. If six or more source channels have a coefficient above 0.6, for example 8 or 10 channels, all of them are retained; if fewer than 6 do, for example 4 or 5 channels, the top 6 channels ranked from high to low are retained. In other embodiments, filtering may be performed by the preset number alone or by the preset value alone.
In this embodiment, after the comprehensive value coefficients of the eleven data sources are summarized, seven data sources have a comprehensive value coefficient greater than 0.6; these seven are retained and the remaining four are filtered out.
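The filtering rule of this embodiment (threshold with a guaranteed minimum number) can be sketched as follows. The function name, the source names and the coefficient values are hypothetical; only the preset number 6, the preset value 0.6 and the eleven-source, seven-retained outcome come from the text.

```python
def filter_sources(coefficients, min_count=6, threshold=0.6):
    """Keep every source whose coefficient exceeds the threshold; if fewer than
    min_count qualify, fall back to the min_count highest-ranked sources."""
    ranked = sorted(coefficients.items(), key=lambda item: item[1], reverse=True)
    above = [name for name, value in ranked if value > threshold]
    if len(above) >= min_count:
        return above
    return [name for name, _ in ranked[:min_count]]

# Eleven hypothetical sources, seven of which score above 0.6.
values = [0.9, 0.85, 0.8, 0.75, 0.7, 0.65, 0.62, 0.6, 0.5, 0.4, 0.3]
coefficients = {f"source_{i + 1}": v for i, v in enumerate(values)}
second_data_source = filter_sources(coefficients)
print(len(second_data_source))
```

Setting `threshold` to a value below every coefficient gives pure threshold-free behavior, and slicing `ranked` alone corresponds to the "preset number" variant mentioned for other embodiments.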
S5, collecting the data required for the current data statistics through the second data source.
In this way, low-value data sources are filtered out and data is collected only from high-value data source channels, which reduces the amount of data collected while ensuring its value, and thus ensures the effect of applying the collected data to data statistics.
Example two
Referring to fig. 1 to fig. 2, in a data collecting method applied to data statistics based on Example one, step S1 further includes:
acquiring a first influence parameter capable of influencing a first target parameter;
wherein the first target parameter is the market reaction effect, and the first influence parameters comprise the total sales quantity, the sales quantity trend, the sales price trend and the evaluation data. The sales quantity trend may be the change in quantity by month, by quarter or over a user-defined period, and the sales price trend is defined in the same way.
The step S2 further includes:
each first influence parameter together with its synonymous parameters is taken as the synonym set of that first influence parameter, and the synonym set is used to assign the corresponding self value coefficient to the first influence parameter when an influence parameter appearing in the historical expression content belongs to the synonym set;
for the synonymous parameters, refer to the similar parameters in Example one; that is, the various names that practitioners in the industry may use are collected so that no relevant content is missed.
Step S4 further includes:
filtering the first influence parameter according to the self value coefficient of the historical expression content and the influence parameter related to the historical expression content in the content to obtain a second influence parameter after filtering;
step S5 is replaced by:
data relating to the second influencing parameter is collected by a second data source.
The second data source has had the lower-value data sources filtered out, and the second influence parameter has had the lower-value influence parameters filtered out, which reduces the data acquisition load and safeguards the accuracy and timeliness of data statistics.
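The influence-parameter filtering of this example can be sketched in the same spirit as the data source filtering. The synonym sets below are hypothetical, and averaging the coefficients per parameter and filtering by a fixed threshold are assumptions; the patent only states that the coefficient of the content is assigned to a first influence parameter when a mentioned parameter belongs to its synonym set.

```python
from collections import defaultdict

# Hypothetical synonym sets for the four first influence parameters.
SYNONYMS = {
    "total sales quantity": {"total sales quantity", "total sales volume", "units sold"},
    "sales quantity trend": {"sales quantity trend", "sales volume trend"},
    "sales price trend": {"sales price trend", "price trend"},
    "evaluation data": {"evaluation data", "review data", "user reviews"},
}

def score_influence_parameters(contents, synonyms=SYNONYMS):
    """contents: list of (self value coefficient, mentioned parameter names).
    A content item's coefficient is credited to every first influence parameter
    whose synonym set contains one of the mentioned names."""
    scores = defaultdict(list)
    for value, mentioned in contents:
        mentioned = set(mentioned)
        for name, synonym_set in synonyms.items():
            if synonym_set & mentioned:
                scores[name].append(value)
    # Averaging per parameter is an assumption.
    return {name: sum(vals) / len(vals) for name, vals in scores.items()}

def filter_parameters(scores, threshold):
    """Keep the influence parameters whose score exceeds the threshold."""
    return [name for name, score in scores.items() if score > threshold]

contents = [
    (0.8, ["units sold", "user reviews"]),
    (0.4, ["price trend"]),
    (0.6, ["review data"]),
]
scores = score_influence_parameters(contents)
print(filter_parameters(scores, threshold=0.5))
```

The retained names play the role of the second influence parameter, and only data for those parameters would then be collected from the second data source.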
When the second influence parameter includes commodity evaluation data, the method further includes:
S6, acquiring a collected commodity evaluation data set, and dividing all commodity evaluation data from the same data source into a plurality of similar user sets UA according to the users' shopping tendency SO and scoring tendency ST;
the acquiring of the collected commodity evaluation data set in step S6 further includes:
S61, filtering the collected commodity evaluation data set according to the suspicious users identified by the data source;
the data source itself also screens for fake-order ("brushing") data. Because the number of users is a crucial metric for every website, a website does not lightly mark a user as suspicious; a user who is nevertheless marked as suspicious is therefore very likely a professional fake-order account, and this judgment is highly credible. The commodity evaluation data of suspicious users is thus treated as fake-order data and filtered out.
S62, filtering out, from the commodity evaluation data set remaining after the suspicious users have been filtered, commodity evaluation data whose evaluation elements exceed a preset element threshold and whose similarity exceeds a preset similarity threshold, to obtain an available commodity evaluation data set, wherein the evaluation elements comprise text, pictures or video.
Judging from current order-brushing data, the basic elements include text together with one or both of pictures and videos, that is, text and pictures, text and videos, or text, pictures and videos. The text in order-brushing data usually has a fairly large number of words, so the preset element threshold refers to the word count of the text, the number of pictures and the number of videos. For example, if the word-count part of the preset element threshold is 30 words, commodity evaluation data with more than 30 words undergoes similarity screening to obtain the available commodity evaluation data set. The preset similarity threshold can be adjusted according to the specific commodity and user requirements.
For the comparison against the preset similarity threshold, text similarity may use string similarity, simhash similarity, word2vec similarity and the like. Picture similarity may use histogram comparison, a perceptual hash algorithm and the like. Video similarity can be compared by the Hamming distance between uniquely identifying fingerprint codes, where only the first 3-5 frames are considered. Text similarity is checked first: if the text similarity is greater than a preset text threshold, the similarity of the two commodity evaluation data is considered to be greater than the preset similarity threshold. Pictures are checked next, and videos last.
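The text-first, then picture, then video comparison described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: `difflib` string similarity stands in for the simhash or word2vec options, picture and video fingerprints are assumed to be precomputed 64-bit integers, and the function names, dictionary keys and threshold values are hypothetical. It also assumes that exceeding the threshold in any one modality is enough to treat two evaluations as similar.

```python
from difflib import SequenceMatcher


def text_similarity(a: str, b: str) -> float:
    """String similarity in [0, 1]; a stand-in for simhash or word2vec similarity."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def hamming_similarity(h1: int, h2: int, bits: int = 64) -> float:
    """1 minus the normalized Hamming distance between two fixed-width fingerprints."""
    return 1.0 - bin(h1 ^ h2).count("1") / bits


def is_similar(rev_a: dict, rev_b: dict,
               text_thr: float = 0.8, hash_thr: float = 0.9) -> bool:
    """Compare text first, then picture fingerprints, then video fingerprints."""
    if text_similarity(rev_a["text"], rev_b["text"]) > text_thr:
        return True
    # Hypothetical keys: precomputed perceptual hash / video fingerprint, if present.
    for key in ("picture_hash", "video_hash"):
        ha, hb = rev_a.get(key), rev_b.get(key)
        if ha is not None and hb is not None and hamming_similarity(ha, hb) > hash_thr:
            return True
    return False
```

Checking text first keeps the cheaper comparison on the common path; the fingerprint comparisons only run when the text check does not already decide the result.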
In this embodiment, step S62 specifically includes:
screening out, from all commodity evaluation data of the suspicious users, the commodity evaluation data whose evaluation elements are larger than a preset element threshold, to obtain a suspected order-brushing evaluation data set, wherein the evaluation elements comprise texts, pictures or videos;
and sequentially comparing each suspected order-brushing evaluation data in the suspected order-brushing evaluation data set with each commodity evaluation data in the commodity evaluation data set after the suspicious users are filtered, to obtain the commodity evaluation data whose similarity to the suspected order-brushing evaluation data is larger than a preset similarity threshold, and filtering that commodity evaluation data out of the commodity evaluation data set to obtain an available commodity evaluation data set.
In this way, the suspected order-brushing evaluation data of the suspicious users is used as the reference for similarity judgment, and no pairwise similarity judgment over all commodity evaluation data is needed, which reduces the workload of similarity screening while maintaining its accuracy.
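A minimal sketch of step S62 as described above: evaluations from suspicious users that exceed the element threshold form the reference set, and only the remaining evaluations are compared against that reference set. The function name, the `user`/`text` keys, whitespace word counting and the `difflib` similarity measure are assumptions made for illustration; only the text element is handled here.

```python
from difflib import SequenceMatcher


def filter_by_suspects(reviews, suspicious_users, min_words=30, sim_thr=0.8):
    """Return the usable reviews after S61/S62-style filtering (text only)."""
    # Suspected order-brushing set: long reviews written by suspicious users.
    suspect_texts = [
        r["text"] for r in reviews
        if r["user"] in suspicious_users and len(r["text"].split()) > min_words
    ]
    # S61: drop every review written by a suspicious user.
    remaining = [r for r in reviews if r["user"] not in suspicious_users]
    # S62: drop remaining reviews that closely match any suspected review.
    usable = []
    for r in remaining:
        matches_suspect = any(
            SequenceMatcher(None, r["text"], s, autojunk=False).ratio() > sim_thr
            for s in suspect_texts
        )
        if not matches_suspect:
            usable.append(r)
    return usable
```

With S suspected reviews and R remaining reviews this performs S x R comparisons instead of comparing every pair among all reviews, which is the workload reduction the paragraph above refers to.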
S7, dividing all commodity evaluation data of the same commodity into a plurality of groups of evaluation sets EA according to the plurality of groups of similar user sets UA; clustering all the evaluation data in each group of evaluation sets EA_i and sorting the clusters from high to low by the amount of data they contain; retaining, for each group of evaluation sets EA_i, the evaluation data in the first N clusters whose cumulative share exceeds a preset proportion as a credible evaluation data set; and taking the evaluation data other than the credible evaluation data among all commodity evaluation data of the same commodity as a suspicious evaluation data set;
Here, the division of all commodity evaluation data of the same data source into a plurality of groups of similar user sets UA according to the shopping tendency SO and the grading tendency ST of the users is made with reference to existing user consumption portraits.
This embodiment concerns the market response to a certain mobile phone model, so the commodity evaluation data collected in this embodiment is all evaluation data of that mobile phone model. "All evaluation data of the same commodity" refers to all evaluation data of the same version of that model: versions that differ in memory configuration (such as 8+128 and 12+256), color, processor and so on are treated as different commodities, so that when the method is applied to data statistics, market trends and user preferences can be reflected per version.
Within a similar user set, evaluations of the same commodity are expected to point in the same direction. Cluster analysis is therefore used to find evaluation data that clearly deviates from the group within a similar user set, and that data is regarded as suspicious, possibly falsified data. For example, if most scores are 4 stars and 5 stars, a 2-star score is a clear outlier.
The preset proportion used when retaining the evaluation data in the first N clusters of each group of evaluation sets EA_i as the credible evaluation data set is 60%-90%, and is 80% in this embodiment. That is, if there are five clusters whose shares are 47%, 31%, 14%, 5% and 3% respectively, the first three clusters are taken as the credible evaluation data set, because neither the first cluster alone (47%) nor the first two together (78%) exceeds the preset proportion of 80%, while the first three together (92%) do. In this embodiment the share of each retained cluster must also be greater than or equal to a minimum proportion, which is 10% in this embodiment, and a cluster ranked later whose share is below the minimum proportion is also dropped. For example, if the five clusters have shares of 47%, 31%, 9%, 7% and 6% respectively, only the first two clusters are retained.
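The cluster-retention rule can be sketched as below, using the 80% preset proportion and 10% minimum proportion of this embodiment. The function name and the representation of clusters as a list of sizes are assumptions for illustration; the clustering itself is not shown.

```python
def select_trusted_clusters(cluster_sizes, preset_ratio=0.8, min_ratio=0.1):
    """Return the indices of the clusters kept as the credible evaluation data set."""
    total = sum(cluster_sizes)
    if total == 0:
        return []
    # Rank cluster indices by size, largest first.
    ranked = sorted(range(len(cluster_sizes)),
                    key=lambda i: cluster_sizes[i], reverse=True)
    kept, cumulative = [], 0
    for idx in ranked:
        size = cluster_sizes[idx]
        # A cluster below the minimum proportion is dropped, and so are all later ones.
        if size / total < min_ratio:
            break
        kept.append(idx)
        cumulative += size
        # Stop once the cumulative share exceeds the preset proportion.
        if cumulative / total > preset_ratio:
            break
    return kept


print(select_trusted_clusters([47, 31, 14, 5, 3]))  # [0, 1, 2]
print(select_trusted_clusters([47, 31, 9, 7, 6]))   # [0, 1]
```

Everything outside the returned clusters would go to the suspicious evaluation data set and be examined in step S9.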
S8, for the credible evaluation data set, judging whether each credible evaluation data of the commodity is real according to the evaluation behavior tendency, across all commodities, of the user corresponding to that credible evaluation data; if so, retaining the credible evaluation data judged to be real, and otherwise deleting the credible evaluation data judged to be unreal;
The evaluation behavior tendency includes the evaluation elements and the corresponding number of elements. For the credible evaluation data, a large number of order-brushing users and highly similar evaluation data have already been screened out in the preceding steps, so what remains is a small portion of order-brushing evaluation data written in the brusher's own words. Most order-brushing evaluations are favorable, whereas in normal shopping most satisfied users either do not comment at all or leave only a short comment. Therefore, when such a user leaves one of a few long favorable comments, the comment does not match the user's own evaluation behavior tendency and the favorable comment is regarded as not genuine. Judging this small portion of self-worded order-brushing evaluation data in this way is fairly accurate, and the data is deleted. In other words, the evaluation behavior tendency is used to judge whether a favorable comment is consistent with the user's usual behavior, and credible evaluation data whose favorable comments are not consistent is filtered out;
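One way to express the consistency check of step S8 is sketched below. The source does not give a concrete rule, so everything here is an assumption for illustration: the function names, the use of comment length as the only behavior feature, the 4-star cutoff for a favorable comment, and the `long_factor` and `min_long_len` thresholds.

```python
from statistics import median


def matches_behavior_tendency(review_len, history_lens,
                              long_factor=3.0, min_long_len=30):
    """True if the comment length is in line with the user's past comments."""
    if not history_lens:
        return True  # assumption: with no history there is nothing to contradict
    typical = median(history_lens)
    unusually_long = (review_len >= min_long_len
                      and review_len > long_factor * max(typical, 1))
    return not unusually_long


def keep_credible(review, history_lens):
    """Keep a credible review unless it is a favorable comment that is out of character."""
    is_favorable = review["stars"] >= 4  # hypothetical cutoff
    return (not is_favorable) or matches_behavior_tendency(review["length"], history_lens)
```

For a user whose past comments have lengths `[0, 5, 8, 0, 6]`, a 120-word 5-star comment would be dropped, while a 10-word 5-star comment would be kept.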
S9, for the suspicious evaluation data set, judging whether each suspicious evaluation data in the suspicious evaluation data set is real according to its evaluation elements, whether the evaluation itself is specific, and whether the merchant has replied; if so, retaining the suspicious evaluation data judged to be real as real evaluation data of the commodity, and otherwise deleting the suspicious evaluation data judged to be unreal.
Genuine evaluations among the outlier evaluation data may be caused by incidents such as delivery problems or damaged goods. Such genuine outlier evaluations usually state specific reasons, attach pictures, videos and the like, and are very likely to receive a reply from the merchant. They are therefore judged by the number of evaluation elements, whether the evaluation itself is specific, and whether the merchant has replied.
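The three signals of step S9 can be combined as in the sketch below. The source names the signals but not how they are weighed, so the two-of-three rule, the text-length proxy for "specific", the dictionary keys and the thresholds are all hypothetical choices made for illustration.

```python
def is_real_outlier(review, min_elements=2, min_text_len=15):
    """Judge a suspicious (outlier) review as real if at least two signals agree."""
    # Signal 1: number of evaluation elements present (text, pictures, videos).
    elements = sum(bool(review.get(k)) for k in ("text", "pictures", "videos"))
    # Signal 2: crude proxy for a specific evaluation, based on text length only.
    specific = len(review.get("text", "")) >= min_text_len
    # Signal 3: the merchant replied to the evaluation.
    replied = bool(review.get("merchant_reply"))
    return sum([elements >= min_elements, specific, replied]) >= 2
```

A production system would likely replace the length proxy with a check for concrete reasons in the text, for example mentions of delivery or damage.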
In the embodiment, the purchasing behavior and the scoring behavior of the user are subjected to trend analysis to obtain a set of similar users, suspicious and credible data are distinguished according to the behavior consistency of the set of similar users, and then the suspicious and credible data are screened again, so that the classification accuracy of real evaluation and false evaluation can be further ensured, and the truth and effectiveness of the collected commodity evaluation data are ensured.
EXAMPLE III
Referring to fig. 3, a data collecting apparatus 1 for data statistics includes a memory 3, a processor 2, and a computer program stored in the memory 3 and running on the processor 2, wherein the processor 2 implements the steps of the first or second embodiments when executing the computer program.
Since the apparatus/device described in the above embodiments of the present invention is an apparatus/device used for implementing the method of the above embodiments of the present invention, a person skilled in the art can understand the specific structure and modification of the apparatus/device based on the method described in the above embodiments of the present invention, and thus the detailed description is omitted here. All the devices/apparatuses adopted in the method of the above embodiments of the present invention are within the intended protection scope of the present invention.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, apparatus or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention has been described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (devices) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions.
It should be noted that in the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the terms first, second, third and the like is for convenience only and does not denote any order. These words are to be understood as part of the name of the component.
Furthermore, it should be noted that in the description of the present specification, the description of the term "one embodiment", "some embodiments", "examples", "specific examples" or "some examples", etc., means that a specific feature, structure, material or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined by one skilled in the art without contradiction.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, the claims should be construed to include preferred embodiments and all changes and modifications that fall within the scope of the invention.
It will be apparent to those skilled in the art that various modifications and variations can be made in the present invention without departing from the spirit or scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention should also include such modifications and variations.

Claims (10)

1. A data acquisition method applied to data statistics is characterized by comprising the following steps:
acquiring a first data source for data acquisition of current data statistics and a first target parameter of the current data statistics;
taking the first target parameter and similar parameters in the same technical field as the first target parameter as a first parameter set;
acquiring historical expression content related to the first parameter set, wherein the historical expression content is data statistical content and a target parameter of the historical expression content belongs to the first parameter set;
filtering the first data source according to the self value coefficient of the historical expression content and the data source referenced in the content to obtain a filtered second data source;
data required for the current data statistics is collected via the second data source.
2. The data collection method of claim 1, wherein the filtering the first data source according to the self-value coefficient of the historical expression content and the data source referenced in the historical expression content comprises:
acquiring an author, a publication position, a quoted number, a quoted place and a referenced data source of the historical expression content;
comprehensively obtaining the self value coefficient of the historical expression content based on the value levels of the author, the publication position, the quoted number and the quoted place in their respective evaluation systems, and assigning the self value coefficient to the referenced data source;
and summarizing the comprehensive value coefficient of each data source in the first data sources, and filtering the first data sources according to a preset filtering rule.
3. The data collection method for data statistics as claimed in claim 2, wherein when the historical expression content is in the form of a paper, the evaluation of the author includes:
obtaining an initial evaluation of the author according to the social status of the author;
obtaining the publication position, the quoted number and the quoted place of the author's historical expression content in the data statistical direction to obtain the overall field evaluation of the author;
obtaining the publication position, the quoted number and the quoted place of the author's historical expression content that is in the data statistical direction and whose target parameter belongs to the first parameter set, to obtain the subdivided field evaluation of the author;
and calibrating the initial evaluation through the overall field evaluation and the subdivided field evaluation to obtain the final evaluation of the author, wherein the influence coefficient of the subdivided field evaluation in the calibration is larger than that of the overall field evaluation.
4. The data collecting method for data statistics as claimed in claim 2, wherein the predetermined filtering rule includes filtering according to a predetermined number or filtering according to whether the number is higher than a predetermined value.
5. The data collection method for data statistics as claimed in claim 2, wherein when the historical expression content is in a video form, the self-value coefficient of the historical expression content is measured by author, publication location, click-through rate, and evaluation trend.
6. The data collection method applied to data statistics as claimed in claim 1, further comprising:
acquiring a first influence parameter capable of influencing the first target parameter;
taking the first influence parameters and the synonym parameters of the first influence parameters as a synonym set of each first influence parameter, wherein the synonym set is used for endowing the corresponding self-value coefficient to the first influence parameters when the influence parameters appearing in the historical expression content belong to the synonym set;
filtering the first influence parameter according to the self value coefficient of the historical expression content and the influence parameter related to the historical expression content in the content to obtain a second influence parameter after filtering;
the collecting data required for the current data statistics by the second data source comprises:
data relating to the second impact parameter is collected by the second data source.
7. The data collecting method for data statistics as claimed in claim 6, wherein when the second influence parameter includes commodity evaluation data, the method further includes:
acquiring a collected commodity evaluation data set, and dividing all commodity evaluation data of the same data source into a plurality of groups of similar user sets UA according to the shopping tendency SO and the grading tendency ST of a user;
dividing all commodity evaluation data of the same commodity into a plurality of groups of evaluation sets EA according to the plurality of groups of similar user sets UA; clustering all the evaluation data in each group of evaluation sets EA_i and sorting the clusters from high to low by the amount of data they contain; retaining, for each group of evaluation sets EA_i, the evaluation data in the first N clusters whose cumulative share exceeds a preset proportion as a credible evaluation data set; and taking the evaluation data other than the credible evaluation data among all commodity evaluation data of the same commodity as a suspicious evaluation data set;
for the credible evaluation data set, judging whether the credible evaluation data of the commodity is real according to the evaluation behavior tendency of the user corresponding to each credible evaluation data in the credible evaluation data set in all commodities, if so, retaining the credible evaluation data judged to be real, and otherwise, deleting the credible evaluation data judged to be unreal;
and for the suspicious evaluation data set, judging whether each suspicious evaluation data in the suspicious evaluation data set is real according to its evaluation elements, whether the evaluation itself is specific and whether a merchant has replied; if so, retaining the suspicious evaluation data judged to be real as real evaluation data of the commodity, and otherwise deleting the suspicious evaluation data judged to be unreal.
8. The data collecting method applied to data statistics as claimed in claim 7, wherein the acquiring the collected commodity evaluation data set further comprises:
filtering the collected commodity evaluation data set according to suspicious users provided by data sources;
and filtering commodity evaluation data of which the evaluation elements are larger than a preset element threshold and the similarity is larger than a preset similarity threshold from the commodity evaluation data set after the suspicious user is filtered to obtain an available commodity evaluation data set, wherein the evaluation elements comprise texts, pictures or videos.
9. A data acquisition device for data statistics, comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method according to any one of claims 1 to 8 when executing the computer program.
10. A computer-readable storage medium, having stored thereon a computer program which, when executed, implements the method of any one of claims 1 to 8.
CN202211710953.8A 2022-12-29 2022-12-29 Data acquisition method and device applied to data statistics and storage medium Pending CN115829657A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211710953.8A CN115829657A (en) 2022-12-29 2022-12-29 Data acquisition method and device applied to data statistics and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211710953.8A CN115829657A (en) 2022-12-29 2022-12-29 Data acquisition method and device applied to data statistics and storage medium

Publications (1)

Publication Number Publication Date
CN115829657A true CN115829657A (en) 2023-03-21

Family

ID=85519373

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211710953.8A Pending CN115829657A (en) 2022-12-29 2022-12-29 Data acquisition method and device applied to data statistics and storage medium

Country Status (1)

Country Link
CN (1) CN115829657A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117633326A (en) * 2023-12-04 2024-03-01 北京曜志科技有限公司 Data monitoring method for Internet mass data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination