US20140108409A1 - System and method for detecting personal experience event reports from user generated internet content - Google Patents
- Publication number
- US20140108409A1
- Authority
- US
- United States
- Prior art keywords
- segment
- term
- post
- categories
- terms
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Abandoned
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9535—Search customisation based on user profiles and personalisation
- G06F17/30867—
Definitions
- the present invention relates to Internet search engines generally and to customized search engines for user generated experience reports in particular.
- the Internet contains a plethora of reports that are at least somewhat related to consumer products and services.
- the sources for these reports are varied. For example, manufacturer/providers may provide information as part of their marketing efforts. Their competitors may provide conflicting information to promote competing products and services.
- Nominally disinterested parties provide independent reviews, although such reviews are often prejudiced by concerns not readily apparent to the reader.
- Such products and services are also often mentioned “by the way” as background for other subjects, making it difficult to weed out “true” reports from a multitude of “hits” received when using conventional Internet search engines.
- the Internet also contains “forum” sites where users can post opinions and discuss various issues of interest. Some of the user posts on such sites constitute “personal experience” reports wherein consumers discuss their actual personal experiences using products and services. A typical such personal experience would be something like: “I used product X and my digestion improved immediately.” In such manner, forum sites may provide valuable firsthand information from actual consumers of products and services.
- a method implementable in a computing device for detecting in an Internet post a shortest relevant text product experience may include detecting a pair of anchors from two anchor categories, where the anchor categories are also term categories associated with a predefined search subject and represent two essential relevant components of experience reports.
- the method may additionally include defining a basic segment as a shortest section of text between the pair of anchors.
- the experience reports may include personal experience reports.
- the method may also include detecting when the shortest section of text does not include at least one term from each of a minimum number of term categories.
- the method may additionally include expanding the basic segment to extend beyond the shortest section of text to include at least one term from each of the minimum number of term categories.
- a segment analyzer for detecting in an Internet post the shortest relevant text segment with user generated personal product experience.
- the segment analyzer may include an anchor detection module to access an anchor database to detect at least a pair of anchors in the Internet posts that pass through a post filtering module, where the pair of anchors represents one term from a name category associated with a pre-defined search subject, and one term from a second category associated with the pre-defined search subject.
- the segment analyzer may additionally include a basic segmentation unit to define basic post segments, where the basic post segments contain at least the pair of anchors and expressions from additional categories of terms located in a term database.
- the segment analyzer may additionally include a density calculator to calculate a cumulative density of the expressions in the basic post segments, where each of the expressions is weighted to indicate its contribution to the density.
- the segment analyzer may additionally include a segment optimizer to expand the basic post segments in accordance with a maximum cumulative density as calculated by the density calculator.
- the density calculator recalculates the density value for the expanded basic segment.
- the segment optimizer iteratively repeats the expanding and the recalculating until the recalculated density value is less than a previously calculated density value.
- FIG. 1 is a block diagram of a novel user-generated personal experience retrieval system 100, designed and operative in accordance with a preferred embodiment of the present invention;
- FIG. 2 is a block diagram of the segment analyzer of the embodiment of FIG. 1 ;
- FIG. 3 is a block diagram of a novel process to be performed by the system of FIG. 1 ;
- FIG. 4 is an illustration of an exemplary Internet post to be analyzed and processed by the system of FIG. 1 ;
- FIGS. 5-7B are illustrations of exemplary scoring tables to be used during the process of FIG. 3 ;
- FIG. 8 is a schematic diagram of a novel forum website selection utility, constructed and operative in accordance with a preferred embodiment of the present invention.
- FIG. 9 is a block diagram of a novel process to be performed by the system of FIG. 8 .
- An Internet user generated personal experience event report may be a statement written by users on an Internet platform (such as a message board), referring to their own experience with regard to a specific product or service.
- a specialized search process may be configured to identify such reports related to a specific field of products and/or services in order to filter out “false hits” and extraneous information that may typically be retrieved by a search engine.
- System 100 may comprise post collector 50 in communication with forums 20 on Internet 10 .
- System 100 may also comprise segment analyzer 200 , scoring engine 300 and user search interface 350 .
- system 100 may be configured to identify user-generated personal experience event reports that may be related to pharmaceutical products.
- a typical subject for which there may be demand for collating and analyzing user-generated personal experience event reports may be pharmaceuticals.
- potential users of pharmaceuticals may understandably wish to study personal experience event reports prior to beginning a treatment.
- system 100 and its methods of operation may therefore be described hereinbelow in the context of a pharmaceutical based configuration.
- the present invention may be configured for any suitable subject for which personal experience event reports may be posted on the Internet, for example, automobiles, airline travel, banking services, food and beverages, etc.
- Post collector 50 may periodically collect posts from a “collection list” of chat forums 20 on Internet 10 .
- the collected posts may be forwarded to segment analyzer 200 to identify segments of forum posts that may be likely to contain personal experience event reports regarding the subject for which system 100 may be configured.
- segment analyzer 200 may identify post segments that may be likely to contain personal experience event reports regarding the use of pharmaceuticals.
- These segments may be forwarded to scoring engine 300 which may “score” the segments in terms of their likely relevance as personal reports. Scored segments may then be stored in personal experience database 110 along with addressing information, such as a uniform resource locator (URL) for the original post. Users may then use user search interface 350 to search database 110 for user-generated personal experience event reports regarding the products/services for which system 100 may be configured. For example, a user may search for event reports relating to “Drug A” in order to find out if anyone that had personally used Drug A had reported regarding its success and/or any side effects suffered when using it. The output of such a search may consist of a list of chat posts, sorted according to the score assigned by scoring engine 300 . It will be appreciated that the present invention may include any suitable implementation for user search interface 350 , such as, for example, a browser based utility for inputting search parameters and displaying links to related user generated personal experience event reports.
- the collection list used by post collector 50 may include chat forums 20 deemed to be relevant to the subject for which system 100 may be configured. For example, if system 100 is configured for personal reports on pharmaceutical products, the collection list may include a list of chat forums 20 on which it may be likely that users may post personal experience event reports relating to their use of pharmaceutical products. It will be appreciated that post collector 50 may be configured to include any suitable method such as known in the art for “scraping” forum posts from the collection list. It will similarly be appreciated that post collector 50 may be configured to perform such “scraping” on an incremental basis to avoid reprocessing older posts.
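The incremental collection mentioned above could be sketched as follows. This is only an illustration, not the method of the application: it assumes each forum exposes posts with increasing integer ids, and the names `collect_new_posts`, `last_seen` and the post dictionary layout are hypothetical.

```python
def collect_new_posts(fetched_posts, last_seen):
    """Return only posts newer than the stored per-forum watermark.

    `fetched_posts` is a list of dicts with hypothetical keys "forum", "id"
    and "text". `last_seen` maps a forum name to the highest post id already
    processed and is updated in place.
    """
    new_posts = [
        post for post in fetched_posts
        if post["id"] > last_seen.get(post["forum"], 0)
    ]
    # Advance the watermark so that the same posts are skipped next time.
    for post in new_posts:
        forum = post["forum"]
        last_seen[forum] = max(last_seen.get(forum, 0), post["id"])
    return new_posts
```

A real collector would likely need a different watermark (for example a timestamp or a content hash) on forums whose post ids are not ordered.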
- the present invention may also include a novel pre-collection process for compiling the collection list for system 100 .
- the present invention may include any suitable method for compiling the collection list, including manual inspection.
- Segment analyzer 200 may comprise post filtering module 210 , anchor detection module 220 , basic segmentation unit 230 , density calculator 240 and segment optimizer 250 . Segment analyzer 200 may also comprise filter database 215 , anchor database 225 and terms database 235 , each of which may be referenced by the other elements of segment analyzer 200 .
- FIG. 3 illustrates a novel post segmentation process 260 that may be executed by segment analyzer 200 to derive optimally segmented user-generated personal experience event reports from the posts collected by post collector 50 .
- Post filtering module 210 may receive (step 262 ) posts from post collector 50 .
- Post filtering module 210 may filter (step 264 ) these posts according to terms found in filter database 215 .
- Filter database 215 may store a list of categorized relevant terms which module 210 may search for in each post. Depending on the configuration of system 100, at least one term from each category in a required combination of the categories must be found in a post for that post to pass through step 264.
- the categories may include, for example, product/service name, indication of personal reference, and indication of personal experience.
- the product/service name category may consist of names of product/services regarding which a user of system 100 may wish to search for personal experience event reports. It will be appreciated that other configurations for system 100 are included in the present invention.
- the terms in the product/services name category may include a list of automobile makes, manufacturers and nicknames, such as, for example: “Corvette”, “Chevrolet”, “Chevy”, and “Vette”.
- the category for indications of personal reference may include terms such as “I”, “my”, “me”, “mine”, “myself”, etc. that may indicate that the post refers to an actual personal experience.
- the category for personal experience may include terms such as, for example, “I used”, “I bought”, “I had”, etc. that may indicate that the poster had an actual personal experience, i.e. that the report was not based on hearsay or opinion.
- a post may have to contain at least one term from each of these categories in order to pass through step 264 .
- Drug name (i.e. product/service name)
- indication of personal reference
- indication of personal drug experience
- symptom indication of personal drug experience
- personal symptom experience may be precise medical terms, such as, for example, “headache”, or alternatively they may also include user descriptions such as “my head exploded”.
- Personal symptom experience terms may be indicative of the poster having a personal cause/reason for using the indicated drug, for example: “I suffered from”, “I have experienced”.
- post filtering module may be configured to require terms from only four categories, wherein a term from only one of the personal experience and personal symptom experience categories may be required.
- similar categories may be used to configure system 100 for non-pharmaceutical products and/or services. For example, if system 100 is configured for automobile research, the symptom category may be replaced by a “preference category” including terms such as “family car”, “sports car”, “road handling” or “seven seats”. Similarly, the personal symptom experience category may be replaced by a personal preference category including terms such as “I need a bigger car”, “I wanted a sports car” or “I value engine performance”.
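The category-based filtering of step 264 can be illustrated with a short sketch. The term lists below are hypothetical stand-ins for the contents of filter database 215, and the function names are assumptions made for this example only. The `either_of` argument models the four-category variant in which a term from only one of the two experience categories is required.

```python
import re

# Hypothetical category term lists; the real lists would live in filter database 215.
FILTER_TERMS = {
    "drug_name": ["drug a", "aspirin"],
    "personal_reference": ["i", "my", "me", "mine", "myself"],
    "personal_experience": ["i used", "i took", "i had"],
    "symptom": ["headache", "nausea"],
    "personal_symptom_experience": ["i suffered from", "i have experienced"],
}


def contains_term(text, term):
    """Return True if `term` occurs in `text` as a whole word or phrase."""
    pattern = r"\b" + re.escape(term) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


def categories_found(text, terms_by_category=FILTER_TERMS):
    """Return the set of categories with at least one matching term."""
    return {
        category
        for category, terms in terms_by_category.items()
        if any(contains_term(text, term) for term in terms)
    }


def passes_filter(text, required, either_of=()):
    """A post passes if every category in `required` is present and, when
    `either_of` is non-empty, at least one of those categories is present."""
    found = categories_found(text)
    if not set(required) <= found:
        return False
    return not either_of or bool(set(either_of) & found)
```

Simple whole-word matching is used here for brevity; user descriptions such as “my head exploded” would need richer term lists or matching rules.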
- Anchor detection module 220 may detect (step 266 ) segment anchors in posts that contain all of the required term categories. Module 220 may reference database 225 for lists of segment anchor terms to match to terms in the posts. Segment anchors may represent a pair of term categories that may together define the personal experience event reports of interest for system 100 .
- the segment anchors may be the drug name and symptom categories.
- the segment anchors may be the drug name and personal symptom experience categories.
- segment anchors for a pharmaceutical configuration may be terms from the drug name and symptom categories.
- Database 225 may be populated by a publicly available database of drugs and symptoms.
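A minimal sketch of anchor detection (step 266) follows, assuming the drug name and symptom categories are the two anchor categories. The anchor lists, the tokenizer and the function names are hypothetical and stand in for whatever anchor database 225 and module 220 would actually use.

```python
import re

# Hypothetical anchor lists standing in for anchor database 225.
ANCHORS = {
    "drug_name": ["drug a", "aspirin"],
    "symptom": ["headache", "nausea"],
}


def tokenize(text):
    """Split text into lower-case word tokens (letters, digits, apostrophes)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def find_anchors(tokens, anchors=ANCHORS):
    """Return (category, start, end) for each anchor occurrence; end is exclusive."""
    hits = []
    for category, terms in anchors.items():
        for term in terms:
            words = term.split()
            width = len(words)
            for i in range(len(tokens) - width + 1):
                if tokens[i:i + width] == words:
                    hits.append((category, i, i + width))
    return sorted(hits, key=lambda hit: hit[1])


def anchor_pairs(hits):
    """Pair every drug-name anchor with every symptom anchor in the post."""
    drugs = [hit for hit in hits if hit[0] == "drug_name"]
    symptoms = [hit for hit in hits if hit[0] == "symptom"]
    return [(drug, symptom) for drug in drugs for symptom in symptoms]
```

When a post contains several candidate pairs, a later step has to choose among them, for example by comparing densities as in the discussion of FIG. 4.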
- Basic segmentation unit 230 may then segment (step 268 ) the posts based on the anchors identified in step 266 to find the minimal text segments in the post that have at least one term from each of the categories required for the filter process in step 264 .
- Unit 230 may first search for the required terms between the identified anchors and may then incrementally search before and after the anchors, one word at a time, until at least one term from each of the relevant categories has been identified, thereby defining the basic segments.
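The expansion performed by basic segmentation unit 230 might look like the sketch below. It assumes, purely for illustration, that every token has already been labeled with at most one term category and that the search alternates between the word before and the word after the current span; the source does not specify the order. All names are hypothetical.

```python
def basic_segment(tokens, categories, anchor_a, anchor_b, required):
    """Return (start, end) token indices, end exclusive, of a basic segment.

    `categories[i]` is the term category of tokens[i], or None.
    `anchor_a` and `anchor_b` are the token indices of the two anchors.
    The span between the anchors is widened one word at a time, alternating
    before and after, until every category in `required` is covered.
    Returns None if the whole post is exhausted first.
    """
    start = min(anchor_a, anchor_b)
    end = max(anchor_a, anchor_b) + 1
    required = set(required)

    def covered():
        return required <= {c for c in categories[start:end] if c}

    grow_left = True
    while not covered():
        can_left = start > 0
        can_right = end < len(tokens)
        if not (can_left or can_right):
            return None
        if (grow_left and can_left) or not can_right:
            start -= 1
        else:
            end += 1
        grow_left = not grow_left
    return start, end
```

For the tokens `yesterday i took aspirin for a headache today`, with anchors at `aspirin` and `headache`, the span grows outward until the personal reference and personal experience terms are also inside it.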
- Density calculator 240 may reference terms database 235 to calculate (step 270 ) the density of relevant terms in each basic segment.
- the density may be defined as the sum of the weights of the relevant terms found in the basic segment, each weight being the one associated with that term in database 235, divided by the overall number of words in the basic segment.
- each term in database 235 may have a different defined weight that may reflect its value as a predictor of the likelihood that the post being analyzed may represent a user generated personal experience event report. Accordingly, the calculated density score may provide a measure of the amount of relevant information contained in the specified segment. It will be appreciated that any suitable method may be used to assign the weights. As will be described hereinbelow, in accordance with a preferred embodiment of the present invention, linear regressions may be run on a training set of data to derive these weights.
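As a rough illustration of the density calculation of step 270, the sketch below sums the weights of the relevant terms in a segment and divides by the segment's word count. The weights are invented for the example; in the described system they would come from terms database 235, and only single-word terms are handled here.

```python
def density(tokens, weights):
    """Weighted term density: sum of term weights divided by the word count.

    `weights` maps a term to its weight; tokens that are not in `weights`
    contribute nothing to the sum but still count as words.
    """
    if not tokens:
        return 0.0
    return sum(weights.get(token, 0.0) for token in tokens) / len(tokens)


# Hypothetical weights, not taken from the source.
EXAMPLE_WEIGHTS = {
    "i": 1.0, "took": 2.0, "aspirin": 3.0, "my": 1.0,
    "headache": 3.0, "heard": -2.0,
}
```

Because words without a weight still enlarge the denominator, padding a segment with irrelevant text lowers its density, which is what makes the measure usable for finding compact segments.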
- terms database 235 may also store other categories of terms that may also be used to assess the likelihood of a segment containing a valid user-generated personal experience event report.
- terms database 235 may also store terms relating to a “negative” category. Terms such as “heard of”, “likely”, “I've been told”, “did not” may typically impact negatively on the likelihood that a given report is a true personal experience, and may therefore be significant when assessing a given segment at the next step of the process.
- other categories may be added as well.
- each term in such a category may be weighted to reflect its value as a predictor of the likelihood that the post being analyzed may represent a user generated personal experience event report.
- Segment optimizer 250 may incrementally check each word before and after the segment to find (step 272) the next term from database 235. Density calculator 240 may then recalculate (step 274) the density as in step 270. If the result is that density has increased (step 276), segment optimizer 250 may again find (step 272) the next term. Steps 272 and 274 may be repeated until the density ceases to increase (step 276), at which point the final, presumably optimized, segment may be output by segment analyzer 200.
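One possible reading of the loop of steps 272 to 276 is sketched below. It is an interpretation rather than the claimed procedure: at each round it looks for the nearest weighted term on either side of the segment, keeps the expansion that yields the higher density, and stops as soon as no expansion raises the density. The function names and the single-word-term simplification are assumptions.

```python
def density(tokens, weights):
    """Sum of term weights divided by the number of words."""
    if not tokens:
        return 0.0
    return sum(weights.get(token, 0.0) for token in tokens) / len(tokens)


def nearest_term(tokens, weights, indices):
    """Return the first index in `indices` whose token is a weighted term."""
    for i in indices:
        if tokens[i] in weights:
            return i
    return None


def optimize_segment(tokens, weights, start, end):
    """Greedily widen [start, end) while the density keeps increasing.

    Returns (start, end, density) for the final segment.
    """
    best = density(tokens[start:end], weights)
    while True:
        candidates = []
        left = nearest_term(tokens, weights, range(start - 1, -1, -1))
        if left is not None:
            candidates.append((left, end))
        right = nearest_term(tokens, weights, range(end, len(tokens)))
        if right is not None:
            candidates.append((start, right + 1))
        scored = [(density(tokens[s:e], weights), s, e) for s, e in candidates]
        improved = [c for c in scored if c[0] > best]
        if not improved:
            return start, end, best
        best, start, end = max(improved)
```

Being greedy, this stops at the first local maximum, which matches the “presumably optimized” wording of the description rather than guaranteeing a global optimum.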
- FIG. 4 illustrates an exemplary post as analyzed by segment analyzer 200 .
- Terms 282 and 284 may represent anchor terms, “symptom” and “drug name” respectively.
- Term 281 may represent a personal experience term.
- Terms 288 may represent personal reference terms.
- Terms 289 may represent negative terms.
- Segment analyzer 200 may use density calculator 240 to compare the density of the two sets of anchors in order to define a basic segment 285.
- Segment analyzer 200 may use terms 282 A and 284 A to define basic segment 285 since they reflect a denser segment; they “enclose” personal experience term 281 , whereas terms 282 B and 284 B are much farther away from term 281 .
- segment analyzer 200 may optimize basic segment 285 by expanding it to include additional terms and recalculating density (steps 272 and 274). Accordingly, an exemplary optimal segment 290 may be defined by expanding basic segment 285 to include terms 287 and 288A as well. It will also be appreciated that the second and third sentences may contain several negative terms 289, which may decrease the likelihood that an optimal segment may be found in those sentences.
- FIG. 5 illustrates an exemplary factor weight table 305 , suitable for use with a pharmaceutical configuration of system 100 .
- Scoring engine 300 may use such a table to “score” the optimized segments received from segment analyzer 200 in order to assess the likelihood that they may contain relevant user-generated personal experience event reports.
- Each factor 310 may represent a possible situation that may occur in a segment, and may be weighted to reflect the effect of such a situation on the likelihood that a post may indeed be a relevant user-generated personal experience event report. It will be appreciated that any suitable method may be used to assign the weights. As will be described hereinbelow, in accordance with a preferred embodiment of the present invention, linear regressions may be run on a training set of data to derive these weights.
- high concept density (i.e. high density as calculated by density calculator 240) may likely indicate that a post may indeed be a relevant user-generated personal experience event report.
- the appearance of a second drug between the anchors may lessen this likelihood, and accordingly may be given a negative weight, for example: -5.
- the proximity of terms may also reflect on the likelihood that a post may indeed be a relevant user-generated personal experience event report. For example, the farther apart a drug or experience and an associated side effect term may be mentioned in the segment, the less likely that they represent a “true” personal experience event report for that drug. Accordingly, proximity factors may be assigned negative weights.
- the exemplary values in table 305 may be derived from statistical modeling of actual pharmaceutical related forum posts. However, the present invention may also include other feature-weight sets for both pharmaceutical and other configurations.
- FIG. 6 illustrates table 305 (now labeled 305 ′) with exemplary values added based on an exemplary post segment.
- scoring engine 300 may multiply each factor value by its associated weight, and then add the products for the final score. The score for these exemplary values would thus be computed as the sum of those products.
- System 100 may be configured to store all posts with a score above a certain threshold in personal experience database 110 .
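The weighted-sum scoring and the storage threshold can be sketched as follows. The factor names, weights and threshold are hypothetical and are not the values of table 305; only the -5 weight for a second drug between the anchors echoes the example given in the text.

```python
def score_segment(factor_values, factor_weights):
    """Multiply each factor value by its weight and add up the products."""
    return sum(
        factor_weights[name] * value for name, value in factor_values.items()
    )


def qualifies(score, threshold):
    """A segment is stored only if its score is above the threshold."""
    return score > threshold


# Hypothetical factor weights for illustration only.
FACTOR_WEIGHTS = {
    "concept_density": 4.0,
    "second_drug_between_anchors": -5.0,
    "anchor_distance": -0.5,
    "negation_present": -6.0,
}
```

With these made-up weights, a segment with density 2.25, anchors three words apart, no second drug and no negation would score 4.0 × 2.25 − 0.5 × 3 = 7.5.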
- FIGS. 7A and 7B show the scoring for two exemplary post segments referring to “Drug B”.
- FIG. 7A shows a score of +14.83.
- FIG. 7B shows a score of -14.46.
- the salient differences between the two examples may be that the example in FIG. 7A has an explicit “symptom experience” (i.e. “no sex drive”) and lacks a negating factor, whereas the example in FIG. 7B has a negating factor (“heard”) and lacks an explicit symptom experience (“can cause”, which may indicate a lack of actual experience).
- the post from FIG. 7A may be determined to qualify as a user generated personal experience event report, whereas the post from FIG. 7B may not.
- the threshold for qualification may be configurable.
- a forum website selection utility may be used to identify appropriate websites for collection by post collector 50 , thus reducing the “universe” of websites for post collection to a manageable number of relevant websites with non-commercial/SPAM authentic user generated personal experience event reports.
- FIG. 8 illustrates forum website selection utility 400 , constructed and operative in accordance with a preferred embodiment of the present invention.
- Utility 400 may comprise pre-collection post collector 450 , pattern recognizer 430 , training set scoring engine 440 and candidate scoring engine 460 .
- Utility 400 may communicate with Internet 10 via post collector 450 , which may be configured with functionality for collecting posts from Internet websites similar to that of post collector 50 .
- pre-collection post collector 450 may collect Internet posts from training and candidate websites as part of a process to generate website collection list 465.
- post collector 50 may collect posts from the websites in collection list 465 .
- Pre-collection post collector 450 may collect (step 510 ) posts from a training set of websites that may include “good” websites 405 which may be known to have user generated personal experience event reports.
- the training set may also include “bad” websites 410, which may be known to have content related to the search subject (i.e. pharmaceuticals, cars, etc., depending on the configuration of system 100) which may not qualify as user generated personal experience event reports.
- “Good” websites 405 may be defined by any suitable method.
- a generic search engine may be used to locate websites according to relevant keywords, and at least a subset of the website's content may be manually examined to determine whether or not the website includes user generated personal experience event reports.
- the posts collected by pre-collection post collector 450 may be filtered to contain only verified authentic user generated personal experience event reports.
- the relevant keywords may be provided by an outside source such as known relevant terms database 425 .
- database 425 may be a publicly available database of medical terms that may include comprehensive lists of drugs and known symptoms. Similar methods may also be used to define “bad” websites.
- Pattern recognizer 430 may detect (step 520) recurring patterns in the training set posts. It will be appreciated that any known, suitable methods for pattern detection/recognition may be used in the context of step 520. For example, such detection may include starting by searching for instances of terms from known relevant terms database 425.
- database 425 may contain examples of at least one (and preferably both) of the anchor categories for which system 100 may be configured.
- database 425 may contain a list of drugs and known symptoms. It will be appreciated that database 425 may provide the basis for anchor database 225 .
- Step 520 may also include detection of recurring terms that may not be found in database 425.
- indications of personal reference/experience terms such as those in filter database 215 may also be detected.
- Exemplary such terms may include phrases such as: “I took” or “I felt better”.
- filter database 215 may be at least in part populated based on some or all of the terms detected in step 520.
- some of the terms detected in step 520 may be “negative” in nature.
- terms such as “buy”, “sale”, “selling” may indicate an attempt to sell or market a product and that the post may therefore not be an authentic user generated personal experience event report.
- Such terms may typically be found in posts on bad websites 410 .
- step 520 may include detection of larger expressions as well.
- a “moving window” may be used to check for recurring combination expressions including one or more of the anchor terms from database 425 .
- pattern recognizer 430 may initially detect anchors “Drug A” (drug name) and “headache” (symptom). By incrementally employing a moving window to detect combination expressions around these anchors, pattern recognizer 430 may also detect larger expressions such as personal experience term “I took” in juxtaposition to anchor term “Drug A”, and a variant on the initial symptom term, “headache was gone”.
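The moving-window idea can be illustrated with the sketch below, which collects every short expression that contains an anchor and keeps those that recur across posts. The window widths, the whitespace tokenizer, the single-word anchors and the recurrence threshold are all assumptions chosen for the example.

```python
from collections import Counter


def window_expressions(tokens, anchor_index, max_width=3):
    """Yield every contiguous expression of 2..max_width tokens containing the anchor."""
    for width in range(2, max_width + 1):
        for start in range(anchor_index - width + 1, anchor_index + 1):
            end = start + width
            if start >= 0 and end <= len(tokens):
                yield " ".join(tokens[start:end])


def recurring_expressions(posts, anchors, min_count=2, max_width=3):
    """Count windowed expressions around anchors and keep the recurring ones."""
    counts = Counter()
    for post in posts:
        tokens = post.lower().split()
        for i, token in enumerate(tokens):
            if token in anchors:
                counts.update(window_expressions(tokens, i, max_width))
    return {expr: n for expr, n in counts.items() if n >= min_count}
```

On a handful of posts mentioning aspirin and headaches, this surfaces expressions such as “i took aspirin” and “headache was gone”, which could then be reviewed and assigned to term categories.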
- Pattern recognizer 430 may be configured to perform statistical analysis on the terms detected in step 520 to track their occurrences and determine their significance.
- utility 400 may be configured to facilitate inspection of the results of step 520 by a user of system 100 , and to enable the user to adjust the input data as necessary to achieve a truer result. Accordingly, step 520 may be repeated as necessary.
- the patterns detected by pattern recognizer 430 may be stored in detected patterns database 415 .
- Training set scoring engine 440 may score (step 530) the terms in detected patterns database 415 to produce weighted indicators of the likelihood that a given website may or may not contain user generated personal experience event reports. Such scoring may employ any suitable method. For example, engine 440 may run a linear regression on the terms in detected patterns database 415 vis-à-vis the training set of posts from “good” and “bad” websites to determine the weight of each term as an indicator of likelihood that a given website is either “good” or “bad”.
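To make the regression step concrete, the sketch below fits a least-squares linear model in which each feature records whether a detected term appears in a training post, and the label is +1 for posts from “good” websites and -1 for posts from “bad” ones. The encoding, the tiny training set and the use of plain gradient descent are assumptions for illustration; the source only states that a linear regression may be run.

```python
def term_features(post, terms):
    """Binary features: 1.0 if the term occurs in the post, else 0.0."""
    return [1.0 if term in post else 0.0 for term in terms]


def fit_linear_regression(rows, labels, epochs=2000, learning_rate=0.1):
    """Least-squares fit by batch gradient descent; returns (weights, bias)."""
    n_features = len(rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    n = len(rows)
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(rows, labels):
            error = bias + sum(w * xi for w, xi in zip(weights, x)) - y
            grad_b += error
            for k, xi in enumerate(x):
                grad_w[k] += error * xi
        bias -= learning_rate * grad_b / n
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
    return weights, bias


# Hypothetical training data: terms from the detected patterns and labeled posts.
TERMS = ["i took", "felt better", "buy", "sale"]
GOOD_POSTS = ["i took aspirin and felt better", "i took it last week", "felt better after a day"]
BAD_POSTS = ["buy aspirin now on sale", "buy cheap pills", "big sale today"]


def train_term_weights():
    """Fit one weight per term; positive weights point toward "good" websites."""
    rows = [term_features(p, TERMS) for p in GOOD_POSTS + BAD_POSTS]
    labels = [1.0] * len(GOOD_POSTS) + [-1.0] * len(BAD_POSTS)
    weights, _bias = fit_linear_regression(rows, labels)
    return dict(zip(TERMS, weights))
```

On this toy data the personal-experience phrases receive positive weights and the sales vocabulary negative ones, which is the kind of signed indicator the scoring steps rely on.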
- engine 440 may expand the scoring process to also include other indicators from ranking sources database 470 .
- Database 470 may represent rankings from external sources such as, for example, Google page ranks and/or Alexa ratings.
- Engine 440 may include the associated rankings for the page on which each post may be located as additional factors when running the linear regression on the terms in detected patterns database 415.
- engine 440 may expand the scoring process to also include additional factors that may be calculated or derived from the original posts.
- additional factors may include, for example, the query rank of the original query that identified the post as a candidate and meta keywords of the page.
- engine 440 may expand the scoring process to also include the number of images and/or links on the page. It will be appreciated that most user forums have relatively few images and links per page. Accordingly, a higher number of links or images per page may tend to indicate a “bad” website.
- engine 440 may expand the scoring process to also include statistical data from cumulative scoring.
- Such factors may include, for example, the ratio of posts to the number of discussions (aka “threads”); or the overall ranking of a given anchor and/or term in “good” and “bad” websites.
- the anchor term “Aspirin” may have an overall high ranking in “good” posts; statistically, personal experience event reports citing Aspirin may typically be genuine.
- the anchor term “Viagra” may typically be indicative of SPAM or commercial posts.
- utility 400 may be configured to facilitate inspection of the results of step 530 by a user of system 100 , and to enable the user to adjust the input data as necessary to achieve a truer result. Accordingly, step 530 may be repeated as necessary.
- the patterns scored by engine 440 may be stored in weighted indicators database 435. It will be appreciated that weighted indicators database 435 may therefore contain a superset (including calculated weights) of the terms in detected patterns database 415 and known relevant terms database 425. It will also be appreciated that database 435 may provide the basis for terms database 235.
- Pre-collection post collector 450 may collect (step 540) posts from candidate websites 420 on the Internet by formulating search queries based on positive term based indicators from weighted indicators database 435.
- Candidate scoring engine 460 may then score (step 550) each website 420 vis-à-vis all of the factors in weighted indicators database 435 to assess its likelihood to contain user generated personal experience event reports.
- System 100 may be configured with a threshold weighted score to determine whether or not a given website 420 may be considered likely to contain user generated personal experience event reports.
- Utility 400 may update (step 560) website collection list 465 to include websites 420 that exceed such a threshold. It will be appreciated that process 500 may be performed on a periodic basis to continually update list 465. Accordingly, utility 400 may also record websites 420 with weighted scores below the threshold to avoid examining them again in the future.
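Steps 550 and 560 could be sketched as follows. This is an illustration under stated assumptions, not the claimed implementation: the indicator names, weights, threshold value and URLs are invented, and the function names `score_website` and `update_collection_list` are hypothetical.

```python
def score_website(indicator_counts, weights):
    """Weighted sum of the indicators observed on one candidate website."""
    return sum(weights.get(name, 0.0) * count
               for name, count in indicator_counts.items())


def update_collection_list(candidates, weights, threshold,
                           collection_list, rejected):
    """Add websites scoring above the threshold; remember the rest.

    candidates maps a website URL to its observed indicator counts.
    Websites already accepted or rejected are not examined again.
    """
    for url, counts in candidates.items():
        if url in collection_list or url in rejected:
            continue
        if score_website(counts, weights) > threshold:
            collection_list.add(url)
        else:
            rejected.add(url)


# Hypothetical weights, as might be read from a weighted indicators database.
WEIGHTS = {"i took": 2.0, "headache": 1.0, "buy": -3.0, "links_per_page": -0.1}
collection, rejected = set(), set()
update_collection_list(
    {
        "https://forum.example.org": {"i took": 4, "headache": 3, "links_per_page": 10},
        "https://shop.example.org": {"buy": 5, "headache": 1, "links_per_page": 80},
    },
    WEIGHTS, threshold=5.0, collection_list=collection, rejected=rejected,
)
```

Keeping the rejected set alongside the collection list reflects the point above that websites below the threshold may be recorded to avoid re-examination on later runs.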
- Website collection list 465 may be used by post collector 50 in the embodiment of FIG. 1.
- Embodiments of the present invention may include apparatus for performing the operations herein.
- This apparatus may be specially constructed for the desired purposes, or it may comprise a general-purpose computer selectively activated or reconfigured by a computer program stored in the computer.
- Such a computer program may be stored in a computer readable storage medium, such as, but not limited to, any type of disk, including floppy disks, optical disks, magnetic-optical disks, read-only memories (ROMs), compact disc read-only memories (CD-ROMs), random access memories (RAMs), electrically programmable read-only memories (EPROMs), electrically erasable and programmable read only memories (EEPROMs), magnetic or optical cards, Flash memory, or any other type of media suitable for storing electronic instructions and capable of being coupled to a computer system bus.
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
A method implementable on a computing device for detecting in an Internet post a shortest relevant text product experience is disclosed. The method includes detecting a pair of anchors from two anchor categories, where the anchor categories are also term categories associated with a predefined search subject and represent two essential relevant components of experience reports. The method additionally includes defining a basic segment as a shortest section of text between the pair of anchors.
Description
- This application is a continuation application claiming benefit from U.S. patent application Ser. No. 13/253,090 filed Oct. 5, 2011 which is hereby incorporated in its entirety by reference.
- The present invention relates to Internet search engines generally and to customized search engines for user generated experience reports in particular.
- The Internet contains a plethora of reports that are at least somewhat related to consumer products and services. The sources for these reports are varied. For example, manufacturer/providers may provide information as part of their marketing efforts. Their competitors may provide conflicting information to promote competing products and services. Nominally disinterested parties provide independent reviews, although such reviews are often prejudiced by concerns not readily apparent to the reader. Such products and services are also often mentioned “by the way” as background for other subjects, making it difficult to weed out “true” reports from a multitude of “hits” received when using conventional Internet search engines.
- The Internet also contains “forum” sites where users can post opinions and discuss various issues of interest. Some of the user posts on such sites constitute “personal experience” reports wherein consumers discuss their actual personal experiences using products and services. A typical such personal experience would be something like: “I used product X and my digestion improved immediately.” In such manner, forum sites may provide valuable firsthand information from actual consumers of products and services.
- Unfortunately, personal experience event reports are typically posted in free text with only nominal constraints on form or content, rendering them unstructured and difficult to identify by non-manual processes. It may therefore be difficult to identify and collate personal experience event reports using conventional Internet search engines, even when such search engines are configured to search forum sites.
- There is provided, in accordance with an embodiment of the present invention, a method implementable in a computing device for detecting in an Internet post a shortest relevant text product experience. The method may include detecting a pair of anchors from two anchor categories, where the anchor categories are also term categories associated with a predefined search subject and represent two essential relevant components of experience reports. The method may additionally include defining a basic segment as a shortest section of text between the pair of anchors.
- In accordance with an embodiment of the present invention, the experience reports may include personal experience reports.
- In accordance with an embodiment of the present invention, the method may also include detecting when the shortest section of text does not include at least one term from each of a minimum number of term categories. The method may additionally include expanding the basic segment to extend beyond the shortest section of text to include at least one term from each of the minimum number of term categories.
- There is provided, in accordance with an embodiment of the present invention, a segment analyzer for detecting in an Internet post the shortest relevant text segment with user generated personal product experience. The segment analyzer may include an anchor detection module to access an anchor database to detect at least a pair of anchors in the Internet posts that pass through a post filtering module, where the pair of anchors represents one term from a name category associated with a pre-defined search subject, and one term from a second category associated with the pre-defined search subject. The segment analyzer may additionally include a basic segmentation unit to define basic post segments, where the basic post segments contain at least the pair of anchors and expressions from additional categories of terms located in a term database.
- In accordance with an embodiment of the present invention, the segment analyzer may additionally include a density calculator to calculate a cumulative density of the expressions in the basic post segments, where each of the expressions is weighted to indicate its contribution to the density. The segment analyzer may additionally include a segment optimizer to expand the basic post segments in accordance with a maximum cumulative density as calculated by the density calculator.
- In accordance with an embodiment of the present invention, the density calculator recalculates the density value for the expanded basic segment.
- In accordance with an embodiment of the present invention, the segment optimizer iteratively repeats the expanding and the recalculating until the recalculated density value is less than a previously calculated density value.
- The subject matter regarded as the invention is particularly pointed out and distinctly claimed in the concluding portion of the specification. The invention, however, both as to organization and method of operation, together with objects, features, and advantages thereof, may best be understood by reference to the following detailed description when read with the accompanying drawings in which:
- FIG. 1 is a block diagram of a novel user-generated personal experience retrieval system 100, designed and operative in accordance with a preferred embodiment of the present invention;
- FIG. 2 is a block diagram of the segment analyzer of the embodiment of FIG. 1;
- FIG. 3 is a block diagram of a novel process to be performed by the system of FIG. 1;
- FIG. 4 is an illustration of an exemplary Internet post to be analyzed and processed by the system of FIG. 1;
- FIGS. 5-7B are illustrations of exemplary scoring tables to be used during the process of FIG. 3;
- FIG. 8 is a schematic diagram of a novel forum website selection utility, constructed and operative in accordance with a preferred embodiment of the present invention; and
- FIG. 9 is a block diagram of a novel process to be performed by the system of FIG. 8.
- It will be appreciated that for simplicity and clarity of illustration, elements shown in the figures have not necessarily been drawn to scale. For example, the dimensions of some of the elements may be exaggerated relative to other elements for clarity. Further, where considered appropriate, reference numerals may be repeated among the figures to indicate corresponding or analogous elements.
- In the following detailed description, numerous specific details are set forth in order to provide a thorough understanding of the invention. However, it will be understood by those skilled in the art that the present invention may be practiced without these specific details. In other instances, well-known methods, procedures, and components have not been described in detail so as not to obscure the present invention.
- Applicants have realized that currently available Internet search engines are inefficient tools for searching Internet forums for user generated personal experience event reports that may be used to evaluate and compare products and services. An Internet user generated personal experience event report may be a statement written by users on an Internet platform (such as a message board), referring to their own experience with regard to a specific product or service. A specialized search process may be configured to identify such reports related to a specific field of products and/or services in order to filter out “false hits” and extraneous information that may typically be retrieved by a search engine.
- Reference is now made to FIG. 1 which illustrates a novel user-generated personal experience retrieval system 100, designed and operative in accordance with a preferred embodiment of the present invention. System 100 may comprise post collector 50 in communication with forums 20 on Internet 10. System 100 may also comprise segment analyzer 200, scoring engine 300 and user search interface 350.
- In accordance with a preferred embodiment of the present invention, system 100 may be configured to identify user-generated personal experience event reports that may be related to pharmaceutical products. It will be appreciated that a typical subject for which there may be demand for collating and analyzing user-generated personal experience event reports may be pharmaceuticals. For example, potential users of pharmaceuticals may understandably wish to study personal experience event reports prior to beginning a treatment. To illustrate such an embodiment, system 100 and its methods of operation may therefore be described hereinbelow in the context of a pharmaceutical based configuration. However, it will be appreciated that the present invention may be configured for any suitable subject for which personal experience event reports may be posted on the Internet, for example, automobiles, airline travel, banking services, food and beverages, etc.
- Post collector 50 may periodically collect posts from a “collection list” of chat forums 20 on Internet 10. The collected posts may be forwarded to segment analyzer 200 to identify segments of forum posts that may be likely to contain personal experience event reports regarding the subject for which system 100 may be configured. For example, segment analyzer 200 may identify post segments that may be likely to contain personal experience event reports regarding the use of pharmaceuticals.
- These segments may be forwarded to scoring engine 300 which may “score” the segments in terms of their likely relevance as personal reports. Scored segments may then be stored in personal experience database 110 along with addressing information, such as a uniform resource locator (URL) for the original post. Users may then use user search interface 350 to search database 110 for user-generated personal experience event reports regarding the products/services for which system 100 may be configured. For example, a user may search for event reports relating to “Drug A” in order to find out if anyone that had personally used Drug A had reported regarding its success and/or any side effects suffered when using it. The output of such a search may consist of a list of chat posts, sorted according to the score assigned by scoring engine 300. It will be appreciated that the present invention may include any suitable implementation for user search interface 350, such as, for example, a browser based utility for inputting search parameters and displaying links to related user generated personal experience event reports.
- The collection list used by post collector 50 may include chat forums 20 deemed to be relevant to the subject for which system 100 may be configured. For example, if system 100 is configured for personal reports on pharmaceutical products, the collection list may include a list of chat forums 20 on which it may be likely that users may post personal experience event reports relating to their use of pharmaceutical products. It will be appreciated that post collector 50 may be configured to include any suitable method such as known in the art for “scraping” forum posts from the collection list. It will similarly be appreciated that post collector 50 may be configured to perform such “scraping” on an incremental basis to avoid reprocessing older posts.
- As will be disclosed hereinbelow, the present invention may also include a novel pre-collection process for compiling the collection list for system 100. However, it will be appreciated that the present invention may include any suitable method for compiling the collection list, including manual inspection.
- Reference is now made to FIG. 2 which illustrates segment analyzer 200 in greater detail. Segment analyzer 200 may comprise post filtering module 210, anchor detection module 220, basic segmentation unit 230, density calculator 240 and segment optimizer 250. Segment analyzer 200 may also comprise filter database 215, anchor database 225 and terms database 235, each of which may be referenced by the other elements of segment analyzer 200.
- Reference is now also made to FIG. 3 which illustrates a novel post segmentation process 260 that may be executed by segment analyzer 200 to derive optimally segmented user-generated personal experience event reports from the posts collected by post collector 50.
- Post filtering module 210 may receive (step 262) posts from post collector 50. Post filtering module 210 may filter (step 264) these posts according to terms found in filter database 215. Filter database 215 may store a list of categorized relevant terms which module 210 may search for in each post. Depending on the configuration of system 100, at least one term from a combination of some of the categories must be found in a post for that post to pass through step 264. The categories may include, for example, product/service name, indication of personal reference, and indication of personal experience. The product/service name category may consist of names of product/services regarding which a user of system 100 may wish to search for personal experience event reports. It will be appreciated that other configurations for system 100 are included in the present invention. For example, if system 100 is configured for automobile research, the terms in the product/services name category may include a list of automobile makes, manufacturers and nicknames, such as, for example: “Corvette”, “Chevrolet”, “Chevy”, and “Vette”. The category for indications of personal reference may include terms such as “I”, “my”, “me”, “mine”, “myself”, etc. that may indicate that the post refers to an actual personal experience. The category for personal experience may include terms such as, for example, “I used”, “I bought”, “I had”, etc. that may indicate that the poster had an actual personal experience; that the report was not based on hearsay or opinion. In accordance with a preferred embodiment of the present invention, a post may have to contain at least one term from each of these categories in order to pass through step 264.
- It will be appreciated, however, that depending on the configuration of system 100 there may be other term categories in filter database 215. For example, if system 100 is configured for pharmaceuticals, the relevant terms may be divided into five categories: Drug name (i.e. product/service name), indication of personal reference, indication of personal drug experience, symptom, and personal symptom experience. Symptom terms may be precise medical terms, such as, for example, “headache”, or alternatively they may also include user descriptions such as “my head exploded”. Personal symptom experience terms may be indicative of the poster having a personal cause/reason for using the indicated drug, for example: “I suffered from”, “I have experienced”. In accordance with a preferred embodiment of the present invention, when system 100 may be configured for pharmaceuticals, terms from all five categories must be present in a post in order for it to pass through step 264. In accordance with an alternative preferred embodiment, post filtering module 210 may be configured to require terms from only four categories, wherein a term from only one of the personal experience and personal symptom experience categories may be required. It will be appreciated that similar categories may be used to configure system 100 for non-pharmaceutical products and/or services. For example, if system 100 is configured for automobile research, the symptom category may be replaced by a “preference category” including terms such as “family car”, “sports car”, “road handling” or “seven seats”. Similarly, the personal symptom experience category may be replaced by a personal preference category including terms such as “I need a bigger car”, “I wanted a sports car” or “I value engine performance”.
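The category filter of step 264 could be sketched as follows, using the automobile terms quoted above. This is a minimal illustration under assumptions: the dictionary layout and the function names `contains_term` and `passes_filter` are hypothetical, and a real filter database would hold far more terms.

```python
import re

# Illustrative term lists taken from the automobile example in the text.
FILTER_CATEGORIES = {
    "product_name": ["Corvette", "Chevrolet", "Chevy", "Vette"],
    "personal_reference": ["I", "my", "me", "mine", "myself"],
    "personal_experience": ["I used", "I bought", "I had"],
}


def contains_term(text, term):
    """Case-insensitive match of a term as a whole word or phrase."""
    pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


def passes_filter(post, categories=FILTER_CATEGORIES):
    """True if the post holds at least one term from every category."""
    return all(
        any(contains_term(post, term) for term in terms)
        for terms in categories.values()
    )
```

Matching on word boundaries keeps a short term such as “I” from matching inside other words; the four-of-five variant described above would need an additional rule that treats two categories as alternatives.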
- Anchor detection module 220 may detect (step 266) segment anchors in posts that contain all of the required term categories. Module 220 may reference database 225 for lists of segment anchor terms to match to terms in the posts. Segment anchors may represent a pair of term categories that may together define the personal experience event reports of interest for system 100. For example, in a pharmaceutical configuration, the segment anchors may be the drug name and symptom categories. Alternatively, the segment anchors may be the drug name and personal symptom experience categories. In accordance with a preferred embodiment of the present invention, segment anchors for a pharmaceutical configuration may be terms from the drug name and symptom categories. Database 225 may be populated by a publicly available database of drugs and symptoms.
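Anchor detection as described for step 266 could be sketched as follows. The anchor lists, the tokenizer and the function names are assumptions made for illustration; choosing the closest drug name and symptom pair is one plausible reading of “shortest section of text between the pair of anchors”.

```python
import re

# Hypothetical anchor lists standing in for an anchor database.
ANCHOR_TERMS = {
    "drug_name": ["Drug A", "Drug B"],
    "symptom": ["headache", "nausea"],
}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[\w']+", text.lower())


def find_anchors(tokens, anchor_terms=ANCHOR_TERMS):
    """Return (category, start, end) token spans of every anchor found."""
    hits = []
    for category, terms in anchor_terms.items():
        for term in terms:
            term_tokens = tokenize(term)
            n = len(term_tokens)
            for i in range(len(tokens) - n + 1):
                if tokens[i:i + n] == term_tokens:
                    hits.append((category, i, i + n))
    return sorted(hits, key=lambda hit: hit[1])


def closest_anchor_pair(hits):
    """Shortest token span covering one anchor from each of two categories."""
    best = None
    for a in hits:
        for b in hits:
            if a[0] >= b[0]:  # skip same-category and mirrored pairs
                continue
            start, end = min(a[1], b[1]), max(a[2], b[2])
            if best is None or end - start < best[1] - best[0]:
                best = (start, end)
    return best
```

The span is returned as half-open token indices, so `tokens[start:end]` yields the text from the first anchor through the second.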
- Basic segmentation unit 230 may then segment (step 268) the posts based on the anchors identified in step 266 to find the minimal text segments in the post that have at least one term from each of the categories required for the filter process in step 264. Unit 230 may first search for the required terms between the identified anchors and may then incrementally search before and after the anchors one word at a time until at least one of the terms from all of the relevant categories may be identified in order to define basic segments.
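The word-by-word expansion of step 268 could be sketched as follows. The sketch assumes the post is already tokenized into lowercase words and that the anchor span is given as half-open token indices; the function names `covered` and `basic_segment` and the category lists in the usage lines are hypothetical.

```python
def covered(tokens, start, end, categories):
    """True if tokens[start:end] holds a term from every category."""
    window = tokens[start:end]

    def has(term_tokens):
        n = len(term_tokens)
        return any(window[i:i + n] == term_tokens
                   for i in range(len(window) - n + 1))

    return all(any(has(term.lower().split()) for term in terms)
               for terms in categories.values())


def basic_segment(tokens, start, end, categories):
    """Grow the anchor span one word at a time until all categories appear.

    Returns the (start, end) span, or None if the whole post is not enough.
    """
    while not covered(tokens, start, end, categories):
        if start == 0 and end == len(tokens):
            return None
        if start > 0:
            start -= 1  # try one more word before the segment
            if covered(tokens, start, end, categories):
                break
        if end < len(tokens):
            end += 1  # then one more word after the segment
    return start, end


CATEGORIES = {
    "drug_name": ["drug a"],
    "symptom": ["headache"],
    "personal_reference": ["i", "my"],
    "personal_experience": ["i took"],
}
tokens = "this morning i took drug a and less than an hour later my headache was gone".split()
segment = basic_segment(tokens, 4, 14, CATEGORIES)
```

Because the sketch alternates between the two sides, the result can be one word longer than the strict minimum on one side; an exhaustive search over spans would avoid that at higher cost.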
- Density calculator 240 may reference terms database 235 to calculate (step 270) the density of relevant terms in each basic segment. The density may be defined as the ratio of the relevant terms each multiplied by an associated weight stored in database 235, divided by the overall number of words in the basic segment. It will be appreciated that each term in database 235 may have a different defined weight that may reflect its value as a predictor of the likelihood that the post being analyzed may represent a user generated personal experience event report. Accordingly, the calculated density score may provide a measure of the amount of relevant information contained in the specified segment. It will be appreciated that any suitable method may be used to assign the weights. As will be described hereinbelow, in accordance with a preferred embodiment of the present invention, linear regressions may be run on a training set of data to derive these weights.
- It will also be appreciated that some of the terms may have negative values. In addition to the terms in filter database 215, terms database 235 may also store other categories of terms that may also be used to assess the likelihood of a segment containing a valid user-generated personal experience event report. For example, terms database 235 may also store terms relating to a “negative” category. Terms such as “heard of”, “likely”, “I've been told”, “did not” may typically impact negatively on the likelihood that a given report is a true personal experience, and may therefore be significant when assessing a given segment at the next step of the process. Depending on the configuration of system 100, other categories may be added as well. For example, in an exemplary configuration for pharmaceuticals, there may be an “outcome” or “result” category that may include terms such as “got better”, “recovered” or “condition worsened”. As in the embodiments described hereinabove, each term in such a category may be weighted to reflect its value as a predictor of the likelihood that the post being analyzed may represent a user generated personal experience event report.
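The density of step 270 could be computed as in the following sketch: the weights of all term occurrences in the segment are summed and divided by the number of words. The weight values are invented for illustration and the function name `density` is hypothetical; in the specification the weights come from terms database 235.

```python
# Hypothetical term weights; negative-category terms carry negative weights.
TERM_WEIGHTS = {
    "i took": 3.0,
    "my": 1.0,
    "headache": 2.0,
    "drug a": 2.0,
    "heard of": -4.0,
}


def density(tokens, weights=TERM_WEIGHTS):
    """Sum of weighted term occurrences divided by the segment length."""
    if not tokens:
        return 0.0
    total = 0.0
    for term, weight in weights.items():
        term_tokens = term.split()
        n = len(term_tokens)
        occurrences = sum(1 for i in range(len(tokens) - n + 1)
                          if tokens[i:i + n] == term_tokens)
        total += weight * occurrences
    return total / len(tokens)


personal = density("i took drug a and my headache was gone".split())
hearsay = density("i heard of drug a".split())
```

With these assumed weights the first segment scores a positive density and the hearsay segment a negative one, which is the contrast the negative category is meant to capture.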
- Segment optimizer 250 may incrementally check each word before and after the segment to find (step 272) the next term from database 235. Density calculator 240 may then recalculate (step 274) the density as in step 270. If the result is that density has increased (step 276), segment optimizer 250 may again find (step 272) the next term. Steps 272-276 may be repeated by segment analyzer 200 until the recalculated density is less than the previously calculated density.
- Reference is now made to FIG. 4 which illustrates an exemplary post as analyzed by segment analyzer 200. Terms 282 and 284 may represent anchor terms, “symptom” and “drug name” respectively. Term 281 may represent a personal experience term, terms 288 may represent personal reference terms, and terms 289 may represent negative terms. It will be appreciated that there may be two sets of anchor terms 282 and 284. Segment analyzer 200 may use density calculator 240 to compare the density of the two sets in order to define a basic segment 285. Segment analyzer 200 may use the set of anchor terms that reflects the denser segment to define basic segment 285; the terms of that set “enclose” personal experience term 281, whereas the terms of the other set do not enclose term 281. As described hereinabove, segment analyzer 200 may optimize basic segment 285 by expanding it to include additional terms and recalculating density (steps 272 and 274). Accordingly, an exemplary optimal segment 290 may be defined by expanding basic segment 285 to include terms 287 and 288A as well. It will also be appreciated that the second and third sentences may contain several negative terms 289, which may decrease the likelihood that an optimal segment may be found in those sentences.
- Reference is now made to FIG. 5 which illustrates an exemplary factor weight table 305, suitable for use with a pharmaceutical configuration of system 100. Scoring engine 300 may use such a table to “score” the optimized segments received from segment analyzer 200 in order to assess the likelihood that they may contain relevant user-generated personal experience event reports. Each factor 310 may represent a possible situation that may occur in a segment, and may be weighted to reflect the effect of such a situation on the likelihood that a post may indeed be a relevant user-generated personal experience event report. It will be appreciated that any suitable method may be used to assign the weights. As will be described hereinbelow, in accordance with a preferred embodiment of the present invention, linear regressions may be run on a training set of data to derive these weights.
- For example, high concept density, i.e., high density as calculated by density calculator 240, may likely indicate that a post may indeed be a relevant user-generated personal experience event report. On the other hand, the appearance of a second drug between the anchors may lessen this likelihood, and accordingly may be given a negative weight, for example: −5. The proximity of terms may also reflect on the likelihood that a post may indeed be a relevant user-generated personal experience event report. For example, the farther apart a drug or experience and an associated side effect term may be mentioned in the segment, the less likely that they represent a “true” personal experience event report for that drug. Accordingly, proximity factors may be assigned negative weights. It will be appreciated that the exemplary values in table 305 may be derived from statistical modeling of actual pharmaceutical related forum posts. However, the present invention may also include other feature-weight sets for both pharmaceutical and other configurations.
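The expansion loop of steps 272 to 276, described above for segment optimizer 250, could be sketched as follows. It is a greedy illustration under assumptions: each iteration extends the segment to the nearest weighted term on whichever side gives the higher density, and stops when the density no longer increases. The weights, the function names and the sample post are invented.

```python
def term_spans(tokens, weights):
    """Locate every weighted term as a (start, end, weight) token span."""
    spans = []
    for term, weight in weights.items():
        term_tokens = term.split()
        n = len(term_tokens)
        for i in range(len(tokens) - n + 1):
            if tokens[i:i + n] == term_tokens:
                spans.append((i, i + n, weight))
    return spans


def span_density(spans, start, end):
    """Weighted terms fully inside [start, end) divided by its length."""
    inside = sum(w for s, e, w in spans if s >= start and e <= end)
    return inside / (end - start)


def optimize_segment(tokens, start, end, weights):
    """Expand the segment while each expansion raises its density."""
    spans = term_spans(tokens, weights)
    best = span_density(spans, start, end)
    while True:
        candidates = []
        left = [s for s in spans if s[1] <= start]
        right = [s for s in spans if s[0] >= end]
        if left:   # nearest term before the segment
            candidates.append((max(left, key=lambda s: s[1])[0], end))
        if right:  # nearest term after the segment
            candidates.append((start, min(right, key=lambda s: s[0])[1]))
        if not candidates:
            return start, end
        new_start, new_end = max(
            candidates, key=lambda c: span_density(spans, c[0], c[1]))
        new_density = span_density(spans, new_start, new_end)
        if new_density <= best:
            return start, end
        start, end, best = new_start, new_end, new_density


WEIGHTS = {"i took": 3.0, "drug a": 2.0, "headache": 2.0,
           "my": 1.0, "was gone": 3.0}
tokens = ("well yesterday i took drug a and my headache "
          "was gone so i am happy").split()
optimal = optimize_segment(tokens, 4, 9, WEIGHTS)
```

Starting from the basic segment “drug a and my headache”, the sketch grows the span to take in “i took” and “was gone” and then stops, since no further weighted term remains outside the segment.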
- FIG. 6, to which reference is now made, illustrates table 305 (now labeled 305′) with exemplary values added based on an exemplary post segment. In order to score the post, scoring engine 300 may multiply each factor value by its associated weight, and then add the products for the final score. The score for these exemplary values would thus be computed as:
- Score = 23*(−2) + 1*(−3) + 0*(−5) + 0*(−5) + 9*1 + 0.34*2 + 0*4 + 1*(−10) + 1*10 + 0*(−10) = −39.32
- A negative score may indicate that the likelihood of a relevant report may be low. System 100 may be configured to store all posts with a score above a certain threshold in personal experience database 110.
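The weighted sum described for FIG. 6 could be sketched as follows. The factor values and weights are copied in order from the score expression above; the function names and the threshold of zero are assumptions, since the specification leaves the threshold configurable.

```python
# Factor values and weights, in the order they appear in the score expression.
FACTOR_VALUES = [23, 1, 0, 0, 9, 0.34, 0, 1, 1, 0]
FACTOR_WEIGHTS = [-2, -3, -5, -5, 1, 2, 4, -10, 10, -10]


def segment_score(values, weights):
    """Multiply each factor value by its weight and add the products."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(value * weight for value, weight in zip(values, weights))


def qualifies(values, weights, threshold=0.0):
    """True if the segment scores above the (assumed) threshold."""
    return segment_score(values, weights) > threshold


score = segment_score(FACTOR_VALUES, FACTOR_WEIGHTS)
stored = qualifies(FACTOR_VALUES, FACTOR_WEIGHTS)
```

Keeping the values and weights as parallel lists mirrors the two columns of a factor weight table; a mapping keyed by factor name would serve equally well once the factor names are known.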
- FIGS. 7A and 7B, to which reference is now made, show the scoring for two exemplary post segments referring to “Drug B”. FIG. 7A shows a score of +14.83, whereas FIG. 7B shows a score of −14.46. The salient differences between the two examples may be that the example in FIG. 7A has an explicit symptom experience (i.e. “no sex drive”) and lacks a negating factor; whereas the example in FIG. 7B has a negating factor (“heard”) and lacks an explicit symptom experience (“can cause” which may indicate a lack of actual experience). Accordingly, the post from FIG. 7A may be determined to qualify as a user generated personal experience event report, whereas the post from FIG. 7B may not. It will be appreciated that the threshold for qualification may be configurable.
- It will be appreciated that it may not be possible to continuously perform comprehensive searches for user generated personal experience event reports from among all of the content available on the Internet. By necessity, the “collection list” referred to hereinabove may therefore represent only a small fraction of the websites on the Internet. In accordance with a preferred embodiment of the present invention, a forum website selection utility may be used to identify appropriate websites for collection by post collector 50, thus reducing the “universe” of websites for post collection to a manageable number of relevant websites with non-commercial/SPAM authentic user generated personal experience event reports. Reference is now made to FIG. 8 which illustrates forum website selection utility 400, constructed and operative in accordance with a preferred embodiment of the present invention.
- Utility 400 may comprise pre-collection post collector 450, pattern recognizer 430, training set scoring engine 440 and candidate scoring engine 460. Utility 400 may communicate with Internet 10 via post collector 450, which may be configured with functionality for collecting posts from Internet websites similar to that of post collector 50. As will be described hereinbelow, pre-collection post collector 450 may collect Internet posts from training and candidate websites as part of a process to generate website collection list 465, whereas post collector 50 may collect posts from the websites in collection list 465.
- Reference is also made to FIG. 9 which illustrates a novel website selection process 500 to be performed by utility 400 in accordance with a preferred embodiment of the present invention. Pre-collection post collector 450 may collect (step 510) posts from a training set of websites that may include “good” websites 405 which may be known to have user generated personal experience event reports. In accordance with an alternative preferred embodiment of the present invention, the training set may also include “bad” websites 410, which may be known to have content related to the search subject (i.e. pharmaceuticals, cars, etc. depending on the configuration of system 100) which may not qualify as user generated personal experience event reports.
- “Good” websites 405 may be defined by any suitable method. For example, a generic search engine may be used to locate websites according to relevant keywords, and at least a subset of the website's content may be manually examined to determine whether or not the website includes user generated personal experience event reports. In accordance with a preferred embodiment of the present invention, the posts collected by pre-collection post collector 450 may be filtered to contain only verified authentic user generated personal experience event reports. The relevant keywords may be provided by an outside source such as known relevant terms database 425. For example, if system 100 is configured for pharmaceuticals, database 425 may be a publicly available database of medical terms that may include comprehensive lists of drugs and known symptoms. Similar methods may also be used to define “bad” websites.
Pattern recognizer 430 may detect (step 520) recurring patterns in the training set posts. It will be appreciated that any known, suitable methods for pattern detection/recognition may be used in the context ofstep 430. For example, such detection may include starting by searching for instances of terms from knownrelevant terms database 425. In accordance with a preferred embodiment of the present invention,database 425 may contain examples of at least one (and preferably both) of the anchor categories for whichsystem 100 may be configured. For example,database 425 may contain a list of drugs and known symptoms. It will be appreciated thatdatabase 425 may provide the basis foranchor database 225. - Step 430 may also include detection of recurring terms that may not be found in
database 425. For example, indications of personal reference/experience terms such as those infilter database 215 may also be detected. Exemplary such terms may include phrases such as: “I took” or “I felt better”. In accordance with a preferred embodiment of the present invention,filter database 215 may be at least in part populated based on some or all of the terms detected instep 430. - It will be appreciated that some of the recurring terms detected by
step 430 may be “negative” in nature. For example, terms such as “buy”, “sale”, “selling” may indicate an attempt to sell or market a product and that the post may therefore not be an authentic user generated personal experience event report. Such terms may typically be found in posts onbad websites 410. - It will be appreciated that
step 520 may include detection of larger expressions as well. For example, a “moving window” may be used to check for recurring combination expressions including one or more of the anchor terms fromdatabase 425. For example, in the text: “this morning I took Drug A and less than an hour later my headache was gone,”pattern recognizer 430 may initially detect anchors “Drug A” (drug name) and “headache (symptom). By incrementally employing a moving window to detect combination expression around these anchors, pattern recognizer may also detect larger expressions such as personal experience term “I took” in juxtaposition to anchor term “Drug A”, and a variant on the initial symptom term, “headache was gone”.Pattern recognizer 430 may be configured do perform statistical analysis on the terms detected instep 520 to track their occurrences and determine their significance. - It will be appreciated that
utility 400 may be configured to facilitate inspection of the results of step 520 by a user of system 100, and to enable the user to adjust the input data as necessary to achieve a truer result. Accordingly, step 520 may be repeated as necessary. The patterns detected by pattern recognizer 430 may be stored in detected patterns database 415.
- Training set
scoring engine 440 may score (step 530) the terms in detected patterns database 415 to produce weighted indicators of the likelihood that a given website may or may not contain user generated personal experience event reports. Such scoring may employ any suitable method. For example, engine 440 may run a linear regression on the terms in detected patterns database 415 vis-à-vis the training set of posts from “good” and “bad” websites to determine the weight of each term as an indicator of likelihood that a given website is either “good” or “bad”.
- In accordance with a preferred embodiment of the present invention,
engine 440 may expand the scoring process to also include other indicators from ranking sources database 470. Database 470 may represent rankings from external sources such as, for example, Google page ranks and/or Alexa ratings. Engine 440 may include the associated rankings for the page on which each post may be located as additional factors when running the linear regression on the terms in detected patterns database 415.
- In accordance with a preferred embodiment of the present invention,
engine 440 may expand the scoring process to also include additional factors that may be calculated or derived from the original posts. Such additional factors may include, for example, the query rank of the original query that identified the post as a candidate and meta keywords of the page.
- In accordance with a preferred embodiment of the present invention,
engine 440 may expand the scoring process to also include the number of images and/or links on the page. It will be appreciated that most user forums have relatively few images and links per page. Accordingly, a higher number of links or images per page may tend to indicate a “bad” website.
- In accordance with a preferred embodiment of the present invention,
engine 440 may expand the scoring process to also include statistical data from cumulative scoring. Such factors may include, for example, the ratio of posts to the number of discussions (aka “threads”), or the overall ranking of a given anchor and/or term in “good” and “bad” websites. For example, the anchor term “Aspirin” may have an overall high ranking in “good” posts; statistically, personal experience event reports citing Aspirin may typically be genuine. However, the anchor term “Viagra” may typically be indicative of SPAM or commercial posts.
- It will be appreciated that
utility 400 may be configured to facilitate inspection of the results of step 530 by a user of system 100, and to enable the user to adjust the input data as necessary to achieve a truer result. Accordingly, step 530 may be repeated as necessary. The patterns scored by engine 440 may be stored in weighted indicators database 435. It will be appreciated that weighted indicators database 435 may therefore contain a superset (including calculated weights) of the terms in detected patterns database 415 and known relevant terms database 425. It will also be appreciated that database 435 may provide the basis for terms database 235.
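The term scoring of step 530 may be illustrated with a minimal sketch, offered only as one possible reading of the description. It assumes each term becomes a 0/1 presence feature, posts from “good” websites are labeled +1 and posts from “bad” websites are labeled -1, and the least-squares fit is obtained by plain gradient descent rather than any particular solver. The function names `term_features` and `fit_term_weights`, the sample posts, and the label values are hypothetical and do not come from the specification.

```python
# Illustrative sketch only: names, sample data, and the +1/-1 labels are assumptions.

def term_features(post, terms):
    """Return one 0/1 presence flag per term (simple substring match)."""
    text = post.lower()
    return [1.0 if term.lower() in text else 0.0 for term in terms]


def fit_term_weights(posts, labels, terms, lr=0.1, epochs=2000):
    """Least-squares linear regression fitted by batch gradient descent."""
    rows = [term_features(post, terms) for post in posts]
    weights = [0.0] * len(terms)
    n = len(rows)
    for _ in range(epochs):
        grad = [0.0] * len(terms)
        for x, y in zip(rows, labels):
            # Prediction error of the current weights on this post.
            error = sum(w * xi for w, xi in zip(weights, x)) - y
            for j, xi in enumerate(x):
                grad[j] += 2.0 * error * xi / n
        weights = [w - lr * g for w, g in zip(weights, grad)]
    return dict(zip(terms, weights))


TERMS = ["I took", "headache", "buy"]
POSTS = [
    "This morning I took Drug A and my headache was gone",  # from a "good" site
    "I felt better after I took aspirin",                   # from a "good" site
    "Buy Drug A now, big sale",                             # from a "bad" site
    "Selling aspirin cheap, buy today",                     # from a "bad" site
]
LABELS = [1.0, 1.0, -1.0, -1.0]  # +1 = "good", -1 = "bad" (assumed encoding)

weights = fit_term_weights(POSTS, LABELS, TERMS)
print(weights)
```

On this toy training set the personal experience term “I took” receives a positive weight and the commercial term “buy” a negative one. Further factors named in the description, such as external page rankings or the number of images and links per page, could be appended as additional feature columns in the same way.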
Pre-collection post collector 450 may collect (step 540) posts from candidate websites 420 on the Internet by formulating search queries based on positive term based indicators from weighted indicators database 435. Candidate scoring engine 460 may then score (step 550) each website 420 vis-à-vis all of the factors in weighted indicators database 435 to assess its likelihood to contain user generated personal experience event reports. System 100 may be configured with a threshold weighted score to determine whether or not a given website 420 may be considered likely to contain user generated personal experience event reports.
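Steps 540 and 550 may be sketched as follows, under stated assumptions. The specification does not say how per-post indicator weights are combined into a website score, so this sketch simply averages the summed weights of matched terms over a site's collected posts. The names `build_queries`, `score_site`, and `passes_threshold`, the example weights, the site names, and the threshold value are all hypothetical.

```python
# Illustrative sketch only: the averaging rule, names, and values are assumptions.

def build_queries(weights, top_n=2):
    """Pick the strongest positive indicators as search query terms (step 540)."""
    positive = [term for term, w in weights.items() if w > 0]
    positive.sort(key=lambda term: weights[term], reverse=True)
    return positive[:top_n]


def score_site(posts, weights):
    """Average, over a site's posts, the summed weights of matched terms (step 550)."""
    if not posts:
        return 0.0
    total = 0.0
    for post in posts:
        text = post.lower()
        total += sum(w for term, w in weights.items() if term.lower() in text)
    return total / len(posts)


def passes_threshold(posts, weights, threshold):
    """A site is kept only if its weighted score exceeds the threshold."""
    return score_site(posts, weights) > threshold


WEIGHTS = {"I took": 1.0, "I felt better": 0.8, "buy": -1.0, "sale": -0.7}
SITES = {
    "forum.example": ["I took Drug A and I felt better",
                      "I took aspirin for my headache"],
    "shop.example": ["Buy Drug A now, big sale",
                     "Sale ends today, buy aspirin"],
}
THRESHOLD = 0.5

queries = build_queries(WEIGHTS)
forum_score = score_site(SITES["forum.example"], WEIGHTS)
shop_score = score_site(SITES["shop.example"], WEIGHTS)
print(queries, forum_score, shop_score)
```

With these example values the forum-style site scores above the threshold and the shop-style site scores below it. Non-term factors from the weighted indicators, such as page rankings, would need their own feature extraction and are omitted here.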
Utility 400 may update (step 560) website collection list 465 to include websites 420 that exceed such a threshold. It will be appreciated that process 500 may be performed on a periodic basis to continually update list 465. Accordingly, utility 400 may also record websites 420 with weighted scores below the threshold to avoid examining them again in the future.
- It will be appreciated that
website collection list 465 may be used by post collector 50 in the embodiment of FIG. 1.
- Unless specifically stated otherwise, as apparent from the preceding discussions, it is appreciated that, throughout the specification, discussions utilizing terms such as “processing,” “computing,” “calculating,” “determining,” or the like, refer to the action and/or processes of a computer, computing system, or similar electronic computing device that manipulates and/or transforms data represented as physical, such as electronic, quantities within the computing system's registers and/or memories into other data similarly represented as physical quantities within the computing system's memories, registers or other such information storage, transmission or display devices.
- Embodiments of the present invention may include apparatus for performing the operations herein. This apparatus may be specially constructed for the desired purposes, or it may comprise a general-purpose computer selectively activated or reconfigured by a computer program stored in the computer. Such a computer program may be stored in a computer readable storage medium, such as, but not limited to, any type of disk, including floppy disks, optical disks, magnetic-optical disks, read-only memories (ROMs), compact disc read-only memories (CD-ROMs), random access memories (RAMs), electrically programmable read-only memories (EPROMs), electrically erasable and programmable read only memories (EEPROMs), magnetic or optical cards, Flash memory, or any other type of media suitable for storing electronic instructions and capable of being coupled to a computer system bus.
- The processes and displays presented herein are not inherently related to any particular computer or other apparatus. Various general-purpose systems may be used with programs in accordance with the teachings herein, or it may prove convenient to construct a more specialized apparatus to perform the desired method. The desired structure for a variety of these systems will appear from the description below. In addition, embodiments of the present invention are not described with reference to any particular programming language. It will be appreciated that a variety of programming languages may be used to implement the teachings of the invention as described herein.
- While certain features of the invention have been illustrated and described herein, many modifications, substitutions, changes, and equivalents will now occur to those of ordinary skill in the art. It is, therefore, to be understood that the appended claims are intended to cover all such modifications and changes as fall within the true spirit of the invention.
Claims (7)
1. A method implementable in a computing device for detecting in an Internet post a shortest relevant text product experience, the method comprising:
detecting a pair of anchors from two anchor categories, wherein said anchor categories are also term categories associated with a predefined search subject, and represent two essential relevant components of experience reports; and
defining a basic segment as a shortest section of text between said pair of anchors.
2. A method according to claim 1 wherein said experience reports comprise personal experience reports.
3. A method according to claim 1, also comprising:
detecting when said shortest section of text does not include at least one term from each of a minimum number of term categories; and
expanding said basic segment to extend beyond said shortest section of text to include at least one said term from each of said minimum number of term categories.
4. A segment analyzer for detecting in an Internet post the shortest relevant text segment with user generated personal product experience comprising:
an anchor detection module to access an anchor database to detect at least a pair of anchors in the Internet posts that pass through a post filtering module, wherein said pair of anchors represents one term from a name category associated with a pre-defined search subject, and one term from a second category associated with said pre-defined search subject; and
a basic segmentation unit to define basic post segments, wherein said basic post segments contain at least said pair of anchors and expressions from additional categories of terms located in a term database.
5. A segment analyzer according to claim 4 comprising:
a density calculator to calculate a cumulative density of said expressions in said basic post segments, wherein each of said expressions is weighted to indicate its contribution to said density; and
a segment optimizer to expand said basic post segments in accordance with a maximum cumulative density as calculated by said density calculator.
6. A segment analyzer according to claim 5 wherein said density calculator recalculates said density value for said expanded basic segment.
7. A segment analyzer according to claim 6 wherein said segment optimizer iteratively repeats said expanding and said recalculating until said recalculated density value is less than a previously calculated density value.
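For illustration only, the basic segment of claim 1 and the density-driven expansion of claims 5 to 7 might be sketched as below. This is one possible reading, not the claimed implementation. It assumes whitespace tokenization, single-word anchors and terms, density defined as the sum of term weights divided by the number of tokens in the segment, and expansion by one token at a time on whichever side gives the higher density. The names `basic_segment`, `density`, and `expand_segment`, the example text, and all weights are hypothetical.

```python
# Illustrative sketch only: tokenization, density formula, and names are assumptions.

def basic_segment(tokens, category_a, category_b):
    """Shortest span of tokens between one anchor from each of two categories."""
    pos_a = [i for i, tok in enumerate(tokens) if tok in category_a]
    pos_b = [i for i, tok in enumerate(tokens) if tok in category_b]
    if not pos_a or not pos_b:
        return None
    a, b = min(((i, j) for i in pos_a for j in pos_b),
               key=lambda pair: abs(pair[0] - pair[1]))
    return (min(a, b), max(a, b))


def density(tokens, start, end, weights):
    """Sum of term weights in tokens[start..end] divided by the span length."""
    span = tokens[start:end + 1]
    return sum(weights.get(tok, 0.0) for tok in span) / len(span)


def expand_segment(tokens, segment, weights):
    """Grow the segment one token at a time until the density would drop."""
    start, end = segment
    current = density(tokens, start, end, weights)
    while True:
        candidates = []
        if start > 0:
            candidates.append((start - 1, end))
        if end < len(tokens) - 1:
            candidates.append((start, end + 1))
        if not candidates:
            break
        best = max(candidates, key=lambda s: density(tokens, s[0], s[1], weights))
        best_density = density(tokens, best[0], best[1], weights)
        if best_density < current:
            break  # recalculated density is lower than the previous one: stop
        start, end = best
        current = best_density
    return (start, end)


DRUGS = {"aspirin"}
SYMPTOMS = {"headache"}
WEIGHTS = {"aspirin": 1.0, "headache": 1.0, "took": 1.0, "gone": 1.0}

tokens = "yesterday i took aspirin and my headache was gone quickly".split()
basic = basic_segment(tokens, DRUGS, SYMPTOMS)
expanded = expand_segment(tokens, basic, WEIGHTS)
print(" ".join(tokens[basic[0]:basic[1] + 1]))
print(" ".join(tokens[expanded[0]:expanded[1] + 1]))
```

In this example the basic segment spans “aspirin and my headache”, and one expansion step adds “took” because it raises the density. A single-token step cannot reach “gone” past the unweighted word “was”, which shows how the choice of step size and weights shapes the final segment.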
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US14/106,878 US20140108409A1 (en) | 2010-10-06 | 2013-12-16 | System and method for detecting personal experience event reports from user generated internet content |
Applications Claiming Priority (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US39022010P | 2010-10-06 | 2010-10-06 | |
US39021510P | 2010-10-06 | 2010-10-06 | |
US13/253,090 US8612455B2 (en) | 2010-10-06 | 2011-10-05 | System and method for detecting personal experience event reports from user generated internet content |
US14/106,878 US20140108409A1 (en) | 2010-10-06 | 2013-12-16 | System and method for detecting personal experience event reports from user generated internet content |
Related Parent Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US13/253,090 Continuation US8612455B2 (en) | 2010-10-06 | 2011-10-05 | System and method for detecting personal experience event reports from user generated internet content |
Publications (1)
Publication Number | Publication Date |
---|---|
US20140108409A1 (en) | 2014-04-17 |
Family
ID=45925937
Family Applications (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US13/253,090 Active 2031-12-30 US8612455B2 (en) | 2010-10-06 | 2011-10-05 | System and method for detecting personal experience event reports from user generated internet content |
US14/106,881 Abandoned US20140108430A1 (en) | 2010-10-06 | 2013-12-16 | System and method for detecting personal experience event reports from user generated internet content |
US14/106,878 Abandoned US20140108409A1 (en) | 2010-10-06 | 2013-12-16 | System and method for detecting personal experience event reports from user generated internet content |
US14/106,880 Abandoned US20140108429A1 (en) | 2010-10-06 | 2013-12-16 | System and method for detecting personal experience event reports from user generated internet content |
Country Status (1)
Country | Link |
---|---|
US (4) | US8612455B2 (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109254799A (en) * | 2018-08-29 | 2019-01-22 | 新华三技术有限公司 | The starting method, apparatus and communication equipment of bootstrap |
Families Citing this family (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP2888670A4 (en) | 2012-08-23 | 2015-07-01 | Ims Health Inc | Detecting drug adverse effects in social media and mobile applications |
US9076182B2 (en) | 2013-03-11 | 2015-07-07 | Yodlee, Inc. | Automated financial data aggregation |
US10037367B2 (en) | 2014-12-15 | 2018-07-31 | Microsoft Technology Licensing, Llc | Modeling actions, consequences and goal achievement from social media and other digital traces |
WO2016147276A1 (en) * | 2015-03-13 | 2016-09-22 | 株式会社Ubic | Data analysis system, data analysis method, and data analysis program |
US9971940B1 (en) * | 2015-08-10 | 2018-05-15 | Google Llc | Automatic learning of a video matching system |
US10771424B2 (en) * | 2017-04-10 | 2020-09-08 | Microsoft Technology Licensing, Llc | Usability and resource efficiency using comment relevance |
US12051023B2 (en) * | 2020-07-24 | 2024-07-30 | Content Square SAS | Benchmarking of user experience quality |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20080126297A1 (en) * | 2006-11-29 | 2008-05-29 | Red Hat, Inc. | Automatic index based query optimization |
US20090113293A1 (en) * | 2007-08-19 | 2009-04-30 | Multimodal Technologies, Inc. | Document editing using anchors |
US20090175532A1 (en) * | 2006-08-01 | 2009-07-09 | Konstantin Zuev | Method and System for Creating Flexible Structure Descriptions |
US20100049590A1 (en) * | 2008-04-03 | 2010-02-25 | Infosys Technologies Limited | Method and system for semantic analysis of unstructured data |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6012053A (en) * | 1997-06-23 | 2000-01-04 | Lycos, Inc. | Computer system with user-controlled relevance ranking of search results |
AU3689901A (en) * | 2000-02-10 | 2001-08-20 | Involve Technology Llc | System for creating and maintaining a database of information utilizing user opinions |
US6946715B2 (en) * | 2003-02-19 | 2005-09-20 | Micron Technology, Inc. | CMOS image sensor and method of fabrication |
US20080034058A1 (en) * | 2006-08-01 | 2008-02-07 | Marchex, Inc. | Method and system for populating resources using web feeds |
US8463789B1 (en) * | 2010-03-23 | 2013-06-11 | Firstrain, Inc. | Event detection |
Also Published As
Publication number | Publication date |
---|---|
US20120089616A1 (en) | 2012-04-12 |
US8612455B2 (en) | 2013-12-17 |
US20140108429A1 (en) | 2014-04-17 |
US20140108430A1 (en) | 2014-04-17 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
STCB | Information on status: application discontinuation |
Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION |