WO2017024553A1 - Information sentiment analysis method and system - Google Patents

Information sentiment analysis method and system

Info

Publication number
WO2017024553A1
Authority
WO
WIPO (PCT)
Prior art keywords
information
vocabulary
ambiguous
sentiment
module
Prior art date
Application number
PCT/CN2015/086751
Other languages
English (en)
French (fr)
Inventor
易峥
夏炜
Original Assignee
浙江核新同花顺网络信息股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 浙江核新同花顺网络信息股份有限公司
Priority to PCT/CN2015/086751 priority Critical patent/WO2017024553A1/zh
Priority to US15/752,184 priority patent/US10437871B2/en
Publication of WO2017024553A1 publication Critical patent/WO2017024553A1/zh
Priority to US16/550,479 priority patent/US10831808B2/en
Priority to US17/086,469 priority patent/US11481422B2/en
Priority to US17/936,374 priority patent/US11868386B2/en
Priority to US18/523,978 priority patent/US20240104127A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/75 Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/32 Digital ink
    • G06V30/36 Matching; Classification
    • G06V30/373 Matching; Classification using a special pattern or subpattern alphabet

Definitions

  • The invention belongs to the field of natural language processing and relates to information collection, information processing, and machine learning, in particular to a sentiment analysis method based on a language model.
  • One aspect of the present invention relates to an information sentiment analysis method based on ambiguity analysis, the method comprising using an ambiguity analysis model and a sentiment analysis model to perform ambiguity analysis and sentiment analysis on information.
  • Another aspect of the present invention relates to a method of training the ambiguity analysis model and the sentiment analysis model, including collecting information, constructing a lexicon, using the lexicon to perform ambiguity analysis and sentiment analysis, collecting corpus, training models, and the like.
  • Another aspect of the present invention relates to an information sentiment analysis system including an input and output module, an acquisition module, a processing module, and a database.
  • The technical solution disclosed in the present specification is capable of collecting information, generating an information base, selecting the non-ambiguous information from the information base, and performing sentiment analysis on the non-ambiguous information.
  • The technical solution disclosed in the present specification includes an ambiguity analysis model, which uses a certain algorithm to classify the collected information as ambiguous or non-ambiguous and generates a non-ambiguous information set.
  • The technical solution disclosed in the present specification further includes a sentiment analysis model that uses a certain algorithm to perform sentiment analysis on information. The information may be from the non-ambiguous information set or from the information base.
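  • The flow described above (collect information, keep the non-ambiguous items, then analyze their sentiment) can be sketched as follows. This is only an illustration: the two models are represented by hypothetical callables, and the toy rules for "apple", "rise", and "fall" are placeholders, not the algorithms of the specification.

```python
from typing import Callable, Iterable

# Hypothetical stand-ins for the two trained models; the specification does
# not fix their algorithms at this point.
AmbiguityModel = Callable[[str], bool]   # True if the item is ambiguous
SentimentModel = Callable[[str], str]    # "positive", "negative" or "neutral"


def analyze(information_base: Iterable[str],
            is_ambiguous: AmbiguityModel,
            sentiment_of: SentimentModel) -> list[tuple[str, str]]:
    """Drop ambiguous items, then label the remaining items with a sentiment."""
    non_ambiguous = [item for item in information_base if not is_ambiguous(item)]
    return [(item, sentiment_of(item)) for item in non_ambiguous]


# Toy models, for illustration only.
def toy_is_ambiguous(text: str) -> bool:
    # Treat "apple" as ambiguous unless the word "shares" is also present.
    lowered = text.lower()
    return "apple" in lowered and "shares" not in lowered


def toy_sentiment(text: str) -> str:
    lowered = text.lower()
    if "rise" in lowered:
        return "positive"
    if "fall" in lowered:
        return "negative"
    return "neutral"


results = analyze(
    ["Apple shares rise after earnings", "I ate an apple", "Apple shares fall"],
    toy_is_ambiguous,
    toy_sentiment,
)
```

  • Passing the models in as callables keeps the pipeline independent of how each model is trained, so either model can be replaced without touching the flow.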
  • The technical solution disclosed in the present specification further includes a method of training an ambiguity analysis model.
  • The method for training the ambiguity analysis model comprises: extracting information, scoring the information by using a certain scoring rule, generating a model training corpus according to the scoring result, and training the ambiguity analysis model by using the model training corpus.
  • The technical solution disclosed in the present specification further includes a method of training a sentiment analysis model.
  • The method for training a sentiment analysis model includes: extracting information, matching the information by using a matching rule, generating a model training corpus according to the matching result, and training the sentiment analysis model by using the model training corpus.
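  • The two corpus-generation steps above can be illustrated with a short sketch. The scoring rule (counting distinct related words), the threshold, and the matching rule (comparing positive and negative word hits) are assumptions chosen for illustration; the text does not specify them at this point, and model training itself is not shown.

```python
def score(text: str, related_words: set[str]) -> int:
    """Hypothetical scoring rule: number of distinct related words present."""
    return len(set(text.lower().split()) & related_words)


def build_ambiguity_corpus(texts, related_words, threshold=2):
    """Split texts into non-ambiguous and ambiguous training examples by score."""
    corpus = {"non_ambiguous": [], "ambiguous": []}
    for text in texts:
        key = "non_ambiguous" if score(text, related_words) >= threshold else "ambiguous"
        corpus[key].append(text)
    return corpus


def build_sentiment_corpus(texts, positive_words, negative_words):
    """Hypothetical matching rule: label a text by the emotion words it matches."""
    corpus = []
    for text in texts:
        tokens = set(text.lower().split())
        pos, neg = len(tokens & positive_words), len(tokens & negative_words)
        if pos != neg:  # texts without a clear match are left out of the corpus
            corpus.append((text, "positive" if pos > neg else "negative"))
    return corpus


texts = ["apple stock price surges", "apple pie recipe", "apple stock price plunges"]
ambiguity_corpus = build_ambiguity_corpus(texts, {"stock", "price", "shares"})
sentiment_corpus = build_sentiment_corpus(
    ambiguity_corpus["non_ambiguous"], {"surges", "gains"}, {"plunges", "losses"}
)
```

  • The resulting labeled examples would then be fed to whichever classifier is used for each model.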
  • Figure 1: Schematic diagram of the modules of an information sentiment classification system
  • Figure 2: Schematic diagram of the acquisition module
  • Figure 3: Schematic diagram of the processing module
  • Figure 4: Schematic diagram of the input and output module
  • Figure 5: Schematic diagram of the database
  • Figure 6: Schematic diagram of the system user interaction process
  • Figure 7: Schematic diagram of the flow of the information sentiment classification system
  • Figure 8: Schematic diagram of the model training process
  • Figure 9: Schematic diagram of the use scenario
  • Figure 10: Schematic diagram of an embodiment of the collection process
  • Figure 11: Schematic diagram of the flow of an embodiment of the system applied to the financial product field
  • Figure 12: Schematic diagram of the flow of the system applied to ambiguity analysis in the financial product field
  • Figure 13: Flow chart of an ambiguity analysis embodiment
  • Figure 14: Detailed flow chart of the ambiguity analysis embodiment
  • Figure 15: Flow chart of an embodiment of the ambiguity analysis model training
  • Figure 16: Schematic diagram of the flow of the system applied to sentiment analysis in the financial product field
  • Figure 17: Flow chart of a sentiment analysis embodiment
  • Figure 19: Flow chart of a sentiment analyzer training embodiment
  • Figure 20: Schematic diagram of an embodiment of a user interaction interface
  • The information processing method and system involved in the present specification can collect information, construct a lexicon, and use the lexicon to perform ambiguity analysis and sentiment analysis on the information.
  • The present specification is directed to an information sentiment analysis system including an input and output module, an acquisition module, a processing module, and a database.
  • Different embodiments of the present invention are applicable to a variety of fields including, but not limited to: investment in financial products and derivatives (including but not limited to stocks, bonds, gold, paper gold, silver, foreign exchange, precious metals, futures, and money funds); technology (including but not limited to mathematics, physics, chemistry and chemical engineering, biology and bioengineering, electronic engineering, communication systems, the Internet, and the Internet of Things); politics (including but not limited to politicians, political events, and countries); and news (by region, including but not limited to regional, domestic, and international news; by subject, including but not limited to political, science and technology, economic, life, and weather news).
  • the invention can be applied to different types of databases including, but not limited to, hierarchical databases, networked databases, and relational databases.
  • Figure 1 shows a schematic of a system that can be used for information sentiment analysis.
  • This system may include, but is not limited to, one or more acquisition modules 101, one or more processing modules 102, one or more input and output modules 103, one or more databases 104. Some or all of the above modules may be connected to the network 105. The above modules may be centralized or distributed, local or remote. In some embodiments these modules are independent; in some embodiments, some or all of the modules may also be integrated into one integral module.
  • the acquisition module 101 obtains the required information in various ways.
  • The information may be obtained directly (e.g., from the network 105) or indirectly (e.g., from the acquisition units of other modules).
  • The information may be obtained in a centralized way (for example, through a single channel) or in a distributed way (for example, through multiple channels).
  • The information may be obtained locally (e.g., from a local module or unit with storage functionality) or remotely (e.g., by crawling through a search engine).
  • The information may be obtained over a wired link (e.g., cable or optical fiber) or wirelessly (e.g., via radio or optical signals).
  • The information may be obtained manually or automatically.
  • The information may be obtained using existing algorithms or user-defined algorithms.
  • The information may also be obtained in a manner similar to any of the above, or by a combination of any of the above.
  • The source of the required information may be the network 105 (a metropolitan area network, wide area network, local area network, etc.), news, newspapers, or other media, or may be one or more of the processing module(s) 102, the input and output module(s) 103, the database(s) 104, and the like.
  • The acquisition module 101 may extract the required information from all or part of the information generated during intermediate processing by the processing module 102; it may collect the required information from words, phrases, sentences, uploaded pictures, audio, and video input by the user; and it can also extract the required information from the database 104.
  • The acquisition module 101 can also transmit all or part of the collected information to one or more of the processing module 102, the database 104, the input and output module 103, and the like.
  • The above-mentioned required information may include, but is not limited to, one or more of the specific name vocabulary of an industry, a vocabulary strongly associated with the specific name vocabulary, information including the vocabulary, and vocabulary containing emotional information.
  • The industry may include, but is not limited to, one or more of sports, entertainment, economics, politics, culture, and the like.
  • The above-mentioned specific name vocabulary may include, but is not limited to, one or more of the proper nouns, full names, abbreviations, codes, synonyms, acronyms, and the like of a specific industry.
  • The above-mentioned vocabulary strongly associated with a specific name vocabulary may include, but is not limited to, one or more of the following for a specific name in the given field: nouns, verbs, adjectives, short sentences, phrase collocations, industry vocabulary, synonyms, antonyms, common collocations, component nouns, derivative words, compound words, and the like.
  • The information including the above vocabulary may include, but is not limited to, one or more of dictionaries, news, research reports on a company, announcements, product manuals, and related web pages.
  • The categories of the above emotional vocabulary may include, but are not limited to, one or more of positive, negative, neutral, and the like.
  • The form of the information may include, but is not limited to, one or more of text, picture, audio, video, and the like.
  • The language of the required information may include, but is not limited to, one or more of Chinese, English, Japanese, Korean, French, German, and the like.
  • The above description of the required information is only a specific example and should not be considered the only feasible implementation. For those skilled in the art, after understanding the basic principles of the required information, various modifications and changes may be made to its content without departing from those principles, and such modifications and changes remain within the scope of the above description.
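  • One possible way to hold the kinds of required information listed above is sketched below. The class name, field names, and sample values (including the company name and code, which are made up) are illustrative assumptions and are not taken from the specification.

```python
from dataclasses import dataclass, field


@dataclass
class LexiconEntry:
    """One specific name and the vocabulary collected around it (illustrative)."""
    full_name: str
    abbreviations: set[str] = field(default_factory=set)
    codes: set[str] = field(default_factory=set)
    synonyms: set[str] = field(default_factory=set)
    related_words: set[str] = field(default_factory=set)  # strongly associated vocabulary

    def surface_forms(self) -> set[str]:
        """All strings that may refer to this name in a text."""
        return {self.full_name} | self.abbreviations | self.codes | self.synonyms


# Emotion vocabulary grouped by the three categories named in the text.
emotion_vocabulary = {
    "positive": {"surge", "profit"},
    "negative": {"plunge", "loss"},
    "neutral": {"announce"},
}

entry = LexiconEntry(
    full_name="Example Holdings Co., Ltd.",  # made-up company
    abbreviations={"Example Holdings"},
    codes={"000000"},                         # made-up code
    related_words={"shares", "dividend"},
)
```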
  • Processing module 102 can communicate bi-directionally with network 105.
  • the processing module 102 can communicate bidirectionally with the acquisition module 101.
  • Processing module 102 can communicate bi-directionally with database 104.
  • the processing module 102 can communicate bidirectionally with the input and output module 103.
  • The processing module 102 can collect the required information directly from the network 105, and can also receive the information transmitted by the acquisition module 101, including but not limited to one or more of a specific name vocabulary, a vocabulary strongly associated with a specific name vocabulary, information including the vocabulary, and vocabulary including emotional information.
  • Processing module 102 can also send information to network 105.
  • The information may include, but is not limited to, information that has been processed by the processing module 102 as well as information that has not been processed by the processing module 102.
  • the information processed by the processing module 102 may include, but is not limited to, information that is classified by applying a specific classification rule. After the processing module 102 completes the information processing, the processed information may be stored in the database 104 in accordance with a particular storage method. Similarly, the processing module 102 can also store unprocessed information transmitted by the acquisition module 101 or the network 105 into the database 104.
  • The storage method may include, but is not limited to, one or more of a sequential storage method, a linked storage method, an index storage method, and a hash storage method.
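  • For readers unfamiliar with the four storage methods named above, the following sketch stores the same three records in each way. It is a generic textbook illustration, not the storage layout of the database 104; the helper names and the toy hash function are assumptions.

```python
import bisect

records = ["bond", "futures", "stock"]

# Sequential storage: records sit next to each other and are read by position.
sequential = list(records)

# Linked storage: each node holds a record and a reference to the next node.
linked = None
for record in reversed(records):
    linked = (record, linked)


def linked_to_list(node):
    """Walk the linked nodes from head to tail."""
    items = []
    while node is not None:
        value, node = node
        items.append(value)
    return items


# Index storage: a separate sorted index maps each key to a record position.
index = sorted((record, pos) for pos, record in enumerate(sequential))
index_keys = [key for key, _ in index]


def lookup_by_index(key):
    i = bisect.bisect_left(index_keys, key)
    if i < len(index_keys) and index_keys[i] == key:
        return sequential[index[i][1]]
    return None


# Hash storage: the storage slot is computed from the key itself.
NUM_BUCKETS = 4
buckets = [[] for _ in range(NUM_BUCKETS)]


def bucket_of(key):
    return sum(key.encode("utf-8")) % NUM_BUCKETS  # simple deterministic hash


for record in records:
    buckets[bucket_of(record)].append(record)
```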
  • the unprocessed information may include, but is not limited to, one or more of unclassified words, phrases, sentences, paragraphs, and the like.
  • the processed information may include, but is not limited to, one or more of categorized words, phrases, sentences, paragraphs, and the like.
  • the processing module 102 can also send information to the input and output module 103.
  • the information may include, but is not limited to, processed information, unprocessed information, and the like.
  • the processing module 102 can also receive data or instructions sent by the input and output module 103 and make corresponding actions by parsing the received data or instructions.
  • the input and output module 103 can exchange system internal information with peripheral devices and receive external information.
  • the input/output module 103 can connect to the peripheral device through the network 105 or directly connect to the peripheral device.
  • the input and output module 103 can receive information input by the user.
  • the information entered by the user may come from the network 105, may be from a peripheral device, or may be from a third party in communication with the system.
  • the input and output module 103 can push the generated output result to the peripheral device for display to the user.
  • the peripheral device may include, but is not limited to, one or more of a mouse, a keyboard, a touch pad, a trackball, a voice recognition device, a graphic image recognition device, a display device, a mobile phone, a PC, a Macintosh, a tablet, and the like.
  • the form of user input may include, but is not limited to, one or more of numbers, characters, symbols, words, sounds, graphic images, videos, and the like.
  • the output manner may include, but is not limited to, classifying and outputting information that is classified by a specific classification rule.
  • the input and output module 103 can communicate or exchange information with the acquisition module 101.
  • the input and output module 103 can receive the information transmitted by the acquisition module 101.
  • the input and output module 103 can transmit the user input information received by the peripheral device to the acquisition module 101.
  • the input and output module 103 can output the information collected by the acquisition module 101, and can display the information to the user through the peripheral device.
  • The input and output module 103 can communicate or exchange information with the processing module 102.
  • The input and output module 103 can transmit the received information to the processing module 102 for processing.
  • The input and output module 103 can output the information received from the processing module 102 and display it to the user through the peripheral device.
  • The input and output module 103 can exchange information with the database 104.
  • The input and output module 103 can output the information received from the database 104 and display it to the user through the peripheral device.
  • The input and output module 103 can pass the received input information to the database 104.
  • the database 104 or other storage devices within the system have information storage capabilities.
  • the database 104 or other storage device within the system can digitize the information and store it in a storage device that utilizes electrical, magnetic or optical means.
  • the database 104 or other storage devices within the system are used to store various information such as programs and data.
  • The database 104 or other storage devices in the system may be devices that store information electrically, such as random access memory (RAM), read-only memory (ROM), USB flash drives, and the like.
  • The database 104 or other storage devices within the system may be devices that store information magnetically, such as hard disks, floppy disks, magnetic tapes, magnetic core memories, magnetic bubble memories, and the like.
  • the database 104 or other storage device within the system may be a device that optically stores information, such as a CD or DVD.
  • the database 104 or other storage device within the system may be a device that stores information using magneto-optical means, such as a magneto-optical disk or the like.
  • The access mode of the database 104 or other storage devices in the system may be random access, serial access, read-only, and the like.
  • The database 104 or other storage device within the system may be volatile (non-persistent) memory or non-volatile (permanent) memory.
  • the database 104 or other storage devices in the system may be local, remote, or on a cloud server.
  • the database 104 is capable of classifying, sorting, filtering, and the like of its internal information.
  • the database 104 or other storage devices within the system can communicate or exchange information with the acquisition module 101.
  • the database 104 or other storage devices in the system can receive the information collected by the collection module 101 and store it in the database 104 or other storage devices in the system. Based on the received instructions, the information in the database 104 or other storage devices in the system can be extracted and passed to the acquisition module 101.
  • The above instructions may come directly from the acquisition module 101, or from other modules, such as the input and output module 103 or the processing module 102.
  • The above instructions may also come from the database 104 or other storage devices in the system themselves, for example a timed instruction that directs the database 104 or other storage devices in the system to send information to the acquisition module 101.
  • The database 104 or other storage devices within the system can communicate or exchange information with the processing module 102, and can receive the information communicated by the processing module 102 and store it. Based on the received instructions, information in the database 104 or other storage devices within the system can be extracted and passed to the processing module 102.
  • The above instructions may come directly from the processing module 102, or from other modules, such as the input and output module 103 or the acquisition module 101.
  • The above instructions may also come from the database 104 or other storage devices in the system themselves, for example a timed instruction that directs the database 104 or other storage devices in the system to send information to the processing module 102.
  • The database 104 or other storage devices within the system can communicate and exchange information with the input and output module 103, and can receive information communicated by the input and output module 103 and store it in the database 104 or other storage devices within the system. Based on the received instructions, the information in the database 104 or other storage devices in the system can be extracted and passed to the input and output module 103.
  • The above instructions may come directly from the input and output module 103, or from other modules, such as the acquisition module 101 or the processing module 102.
  • The above instructions may also come from the database 104 or other storage devices in the system themselves, for example a timed instruction that directs the database 104 or other storage devices in the system to send information to the input and output module 103.
  • The connections between the modules in the system, between the modules and the peripheral devices, and between the system and the cloud server can be wired or wireless.
  • The wired connection may include, but is not limited to, one or more of a metal cable, an optical cable, or a hybrid metal-and-optical cable, such as a coaxial cable, communication cable, flexible cable, spiral cable, non-metallic sheathed cable, metal sheathed cable, multi-core cable, twisted-pair cable, ribbon cable, shielded cable, telecommunications cable, twin-stranded cable, or parallel twin-core conductor.
  • The wired connection medium may also be of other types, for example other transmission carriers for electrical or optical signals.
  • the wireless connection may include, but is not limited to, one or more of radio communication, free space optical communication, acoustic communication, and electromagnetic induction.
  • the radio communication includes but is not limited to the IEEE 802.11 series of standards, the IEEE 802.15 series of standards (such as Bluetooth technology and ZigBee technology, etc.), the first generation mobile communication technology, and the second generation mobile communication technology (such as FDMA).
  • the free-space optical communication may include, but is not limited to, one or more of visible light, infrared signals, and the like.
  • the acoustic communication may include, but is not limited to, one or more of sound waves, ultrasonic signals, and the like.
  • the electromagnetic induction includes, but is not limited to, near field communication technology and the like.
  • The examples described above are for convenience of description only; the wireless connection medium may be of other types, for example Z-Wave technology, Bluetooth Low Energy (BLE) technology, the 433 MHz communication band, other paid civilian radio bands, and military radio bands.
  • A single connection mode or a combination of several connection modes may be used in the system. When different connection modes are combined, a corresponding gateway device is needed to enable information exchange.
  • Individual modules can also be integrated to implement the functionality of more than one module on the same device or electronic component.
  • the peripheral device can also be integrated on the implementation device or electronic component of one or more modules, and the single or multiple modules can also be integrated on a single or multiple peripheral devices or electronic components.
  • The information transmission between modules may be direct or indirect, wired or wireless, sequential or simultaneous, and periodic or aperiodic.
  • The above description of the mode of information transmission between modules is merely a specific example and should not be considered the only feasible implementation. For those skilled in the art, after understanding the basic principles of information transmission between modules, various modifications and changes may be made to the mode of information transmission without departing from those principles, and such modifications and changes remain within the scope of the above description.
  • FIG. 2 shows a schematic diagram of the acquisition module 101.
  • the acquisition module 101 can include, but is not limited to, one or more acquisition units 201, one or more processing units 202, one or more storage units 203, and the like.
  • The above units may be centralized or distributed, local or remote. In some embodiments these units are independent; in other embodiments, some of the units may be integrated into a single unit.
  • The acquisition module 101 can collect information through the acquisition unit 201. All or part of the collected information may be stored in the storage unit 203 and may also be stored in the database 104. All or part of the collected information may be passed to the processing unit 202 for processing. The processing result can be stored in the storage unit 203.
  • the processing of the information may include, but is not limited to, extracting some of the key words in the information, and evaluating the value of the information (for example, the degree of association of the collected information with the information required by the user may be estimated).
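  • The two processing steps just mentioned (extracting key words and estimating how closely collected information relates to what the user requires) could look roughly like the sketch below. Frequency-based keyword extraction and Jaccard overlap are swapped-in, generic techniques chosen for illustration; the text does not say which methods the processing unit 202 uses, and the stop-word list is an assumption.

```python
import re
from collections import Counter

# Assumed stop-word list, for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "on", "for", "is"}


def extract_keywords(text: str, top_n: int = 5) -> list[str]:
    """Return the most frequent non-stop-word tokens (a simple heuristic)."""
    tokens = [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOP_WORDS]
    return [word for word, _ in Counter(tokens).most_common(top_n)]


def relevance(collected: str, required: str) -> float:
    """Jaccard overlap between the keyword sets of two texts, in [0, 1]."""
    a, b = set(extract_keywords(collected)), set(extract_keywords(required))
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```

  • For example, `relevance("gold price rises", "gold price")` compares the keyword sets {gold, price, rises} and {gold, price} and yields 2/3.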
  • The information processed by the processing unit 202 may be from the acquisition unit 201, may be from the storage unit 203, or may be from other modules or devices having storage functions (for example, the database 104) in the system.
  • The information in the storage unit 203 may be further delivered to the database 104 for storage, or may be passed to the processing module 102 for processing, or may be passed to the input and output module 103 for output.
  • The information transmission between different units and modules may be wired or wireless, direct or indirect, simultaneous or sequential, periodic or aperiodic, and so on.
  • the processing module 102 can include, but is not limited to, one or more ambiguity analysis modules 301, one or more sentiment analysis modules 306, and one or more storage modules 315. In some embodiments, these modules may be separate, and in other embodiments, some of the modules may also be integrated into one unitary unit.
  • the ambiguity analysis module 301 can obtain information, process the information, and generate ambiguous or non-ambiguous corpus for training the ambiguity analysis model 312.
  • The ambiguity analysis module 301 can include, but is not limited to, one or more acquisition units 302, one or more matching units 303, one or more processing units 304, one or more corpus collection units 305, and one or more ambiguity analysis models 312.
  • the acquisition unit 302 of the ambiguity analysis module 301 acquires the required information in various ways.
  • the acquisition unit 302 of the ambiguity analysis module 301 can also retrieve the required information directly from the network 105.
  • The manner of obtaining information may be centralized or distributed, local or remote, wired or wireless, manual or automatic, or a combination of several of these.
  • the acquisition unit 302 in the processing module 102 can collect information.
  • the information may be a keyword dictionary 502, an ambiguous list 504, a related word dictionary 503 (see FIG. 5), and contents stored in the information base 511, which have been constructed in the database 104.
  • the matching unit 303 of the ambiguity analysis module 301 can match the information in the information base 511.
  • The processing module 102 can send a keyword request and a dictionary request to the database 104. After receiving the requests, the database 104 sends the requested keyword dictionary 502, related word dictionary 503, and ambiguous list 504 to the processing module 102.
  • the matching unit 303 in the processing module 102 matches the keywords in accordance with a specific algorithm.
  • the particular algorithm may include, but is not limited to, one or more of a prefix search, a suffix search, a substring search, and the like.
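  • The three search modes named above can be sketched as follows. This is only an illustration using plain string operations; the function name `match_keywords` and the returned dictionary layout are assumptions, not part of the described system.

```python
from typing import Dict, List


def match_keywords(text: str, keywords: List[str]) -> Dict[str, List[str]]:
    """Report which keywords match the text under each search mode."""
    hits: Dict[str, List[str]] = {"prefix": [], "suffix": [], "substring": []}
    for kw in keywords:
        if text.startswith(kw):   # prefix search
            hits["prefix"].append(kw)
        if text.endswith(kw):     # suffix search
            hits["suffix"].append(kw)
        if kw in text:            # substring search
            hits["substring"].append(kw)
    return hits
```

  • For many keywords or long texts, a trie or an automaton-based matcher would likely be preferable to this linear scan.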
  • Processing unit 304 scores the matching results to quantify the degree of ambiguity of the information. This scoring result can be used as a preliminary criterion for measuring whether the statement is ambiguous in the subsequent ambiguity analysis process.
  • The factors involved in the scoring may include, but are not limited to, one or more of the length of the specific vocabulary, the length of the related vocabulary, the length of the overall message, the weights of different specific vocabulary items in the information, the weights of different related vocabulary items in the information, the number of related vocabulary items, and the number of specific vocabulary items.
  • the above description of the matching unit 303 and the processing unit 304 is merely a specific example and should not be considered as the only feasible implementation.
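  • One possible way to combine several of these scoring factors is sketched below. The formula is an assumption made for illustration, since the description does not fix one: each occurrence of a term contributes its weight times its length, and related words found alongside a specific name lower the ambiguity score. The name `ambiguity_score` is hypothetical, and the overall message length is left out to keep the sketch short.

```python
from typing import Dict


def ambiguity_score(message: str,
                    specific_weights: Dict[str, float],
                    related_weights: Dict[str, float]) -> float:
    """Return a score in [0, 1]; higher means more likely ambiguous.

    Assumed rule: weighted evidence = weight * term length * occurrence count.
    """
    specific = sum(w * len(t) * message.count(t)
                   for t, w in specific_weights.items())
    related = sum(w * len(t) * message.count(t)
                  for t, w in related_weights.items())
    if specific == 0:
        return 0.0  # no specific name present, so nothing to disambiguate
    support = related / (specific + related)
    return 1.0 - support
```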
  • the corpus collection unit 305 can be configured to collect a collection of features.
  • The set of features may include, but is not limited to, keywords, surrounding vocabulary, relative positional information, and elements formed from ambiguous or non-ambiguous sentences, and may be stored in the corpus collection unit 305.
  • the set of features can be used to train the ambiguity analysis model 312.
  • the ambiguous scoring results described above are a quantification of the degree of ambiguity of the information.
  • Several thresholds can be set for this score. These thresholds separate strongly ambiguous statements from obviously non-ambiguous statements, so that the information receives a preliminary ambiguous or non-ambiguous classification.
  • If the ambiguity scoring result cannot directly determine whether a vocabulary item or a piece of information is ambiguous, the vocabulary item or information may enter a further review step.
  • the review steps can include, but are not limited to, manual review, automatic model review, or a combination of the two.
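  • The threshold-based preliminary classification and the hand-off to review can be sketched as below. The two threshold values and the label strings are assumptions chosen for illustration; the description only states that several thresholds can be set.

```python
def triage(score: float, low: float = 0.2, high: float = 0.8) -> str:
    """Map an ambiguity score to a preliminary class.

    Scores at or above `high` are treated as strongly ambiguous, scores at
    or below `low` as obviously non-ambiguous; anything in between is sent
    to the review step (manual, model-based, or both).
    """
    if score >= high:
        return "ambiguous"
    if score <= low:
        return "non-ambiguous"
    return "review"
```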
  • the factors involved may include, but are not limited to, the length of a specific vocabulary, the vocabulary length of the related vocabulary, the length of the overall message, the weight of the different specific vocabulary in the information, the weight of the different related vocabulary in the information, and the correlation.
  • The classification result of the information may be used to train the model used in the review step. The classification algorithm of the model may include, but is not limited to, a decision tree, Rocchio, naive Bayes, a neural network, a support vector machine, linear least squares fit, K-nearest neighbor, a genetic algorithm, maximum entropy, and the like.
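  • As an illustration of one of the listed algorithms, the following is a minimal multinomial naive Bayes classifier with add-one smoothing over tokenized statements. It is a sketch, not the model of the described system; the class name, the token-list input format, and the label strings used in any example are assumptions.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Set, Tuple


class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self) -> None:
        self.class_counts: Counter = Counter()
        self.word_counts: Dict[str, Counter] = defaultdict(Counter)
        self.vocab: Set[str] = set()

    def fit(self, samples: List[Tuple[List[str], str]]) -> None:
        for tokens, label in samples:
            self.class_counts[label] += 1
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)

    def predict(self, tokens: List[str]) -> str:
        total = sum(self.class_counts.values())
        best_label, best_logp = "", -math.inf
        for label, n in self.class_counts.items():
            logp = math.log(n / total)  # class prior
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for tok in tokens:
                logp += math.log((self.word_counts[label][tok] + 1) / denom)
            if logp > best_logp:
                best_label, best_logp = label, logp
        return best_label
```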
  • the ambiguity analysis module 301 can include, but is not limited to, one or more ambiguity analysis models 312. After a certain period of training, the ambiguity analysis model 312 can determine whether the description of the specific name in the news is ambiguous. After the determination is completed, the non-ambiguous statement set is output.
  • the set of non-ambiguous statements may be stored, and the storage location may include, but is not limited to, one or more of the storage module 315, the database 104, or other devices having storage functions of the system.
  • the set of non-ambiguous statements can be delivered to other modules (eg, sentiment analysis module 306) for processing.
  • The ambiguity analysis model 312 can also perform ambiguity determinations with manual or machine assistance.
  • The sentiment analysis module 306 can include, but is not limited to, one or more acquisition units 307, one or more matching units 308, one or more processing units 309, one or more corpus collection units 310, and one or more sentiment analyzers 311.
  • the units may be centralized or distributed, local or remote. In some embodiments, the above units may be independent. In some embodiments, some of the units may also be integrated into one unitary unit.
  • the sentiment analysis module 306 can perform sentiment classification on the non-ambiguous information obtained by the ambiguity analysis module 301.
  • the sentiment category may include, but is not limited to, positive, negative, neutral, and the like.
  • The acquisition module 101 may construct one or more emotional vocabulary collocation libraries 507 (see FIG. 5), which contain emotional vocabulary collocations, by means of information collection and the like.
  • the sentiment vocabulary collocation library 507 is stored in the database 104.
  • the acquisition unit 307 in the sentiment analysis module 306 can collect information.
  • the collected information may include, but is not limited to, the emotional vocabulary matching library 507 in the database 104 and the content stored in the information base 511, and the like.
  • the matching unit 308 of the sentiment analysis module 306 matches the non-ambiguous information output by the ambiguity analysis module 301, and the matching method may include, but is not limited to, a regular expression or the like.
  • The processing unit 309 can calculate the accuracy of each collocation and treat a collocation whose accuracy is greater than a predetermined threshold as a strong emotional collocation (for example, "sharp increase" can be regarded as a strong emotional collocation).
  • The processing unit 309 can score a sentence that does not contain a strong emotional collocation and determine the emotion type of the sentence according to the scores of the corresponding emotion types.
  • the strong emotional collocation is stored in the corpus collection unit 310.
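  • The selection of strong emotional collocations can be sketched as follows, combining regular-expression matching with an accuracy threshold. The definition of accuracy used here (the share of matching labeled sentences that carry the expected emotion label) is an assumption, as are the function names and the default threshold.

```python
import re
from typing import List, Tuple


def collocation_accuracy(pattern: str, label: str,
                         labeled: List[Tuple[str, str]]) -> float:
    """Fraction of sentences matching `pattern` whose label equals `label`."""
    regex = re.compile(pattern)
    matched = [lab for sent, lab in labeled if regex.search(sent)]
    if not matched:
        return 0.0
    return sum(1 for lab in matched if lab == label) / len(matched)


def is_strong(pattern: str, label: str,
              labeled: List[Tuple[str, str]],
              threshold: float = 0.9) -> bool:
    """A collocation is 'strong' when its accuracy exceeds the threshold."""
    return collocation_accuracy(pattern, label, labeled) > threshold
```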
  • The function of the corpus collection unit 310 may include, but is not limited to, collecting a set of elements such as emotional collocations and emotional sentences.
  • Sentiment classification methods fall mainly into two categories: dictionary-based methods and machine-learning-based methods.
  • In the dictionary-based method, a dictionary in which the emotional polarity of each word is marked may be defined in advance, and the positive or negative emotional polarity of a sentence or an article may be measured by a certain calculation over the number, weights, and the like of its positive and negative emotional words.
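  • A minimal sketch of the dictionary-based method follows. The signed weights, the summation rule, and the function name `lexicon_sentiment` are assumptions; the description leaves the calculation method open.

```python
from typing import Dict, List


def lexicon_sentiment(tokens: List[str], lexicon: Dict[str, float]) -> str:
    """Sum signed word weights: positive words > 0, negative words < 0."""
    score = sum(lexicon.get(tok, 0.0) for tok in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

  • For example, with a lexicon of `{"profit": 1.0, "growth": 1.0, "loss": -1.0, "decline": -2.0}`, the tokens `["profit", "growth", "loss"]` sum to 1.0 and are classified as positive.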
  • The machine-learning-based method reduces sentiment classification to a text classification problem, so the classification methods commonly used in text classification (including but not limited to decision trees, Rocchio, naive Bayes, neural networks, and support vector machines) can be applied.
  • the dictionary and machine learning methods can be combined to emotionally classify sentences or articles.
  • The sentiment analysis module 306 can include, but is not limited to, one or more sentiment analyzers 311. After a certain period of training, the sentiment analyzer 311 can directly determine the emotion type of the non-ambiguous statements in the news. After the determination is completed, a set of sentences classified by emotion is obtained. This set may be stored, and the storage location may include, but is not limited to, one or more of the storage module 315, the database 104, or other devices of the system that have storage functions. The sentiment analyzer 311 can also perform sentiment analysis with manual or machine assistance.
  • FIG. 4 shows a schematic diagram of the input and output module 103.
  • The input and output module 103 may include, but is not limited to, one or more interface units 401, one or more identification units 402, one or more storage units 403, and one or more expansion units 404.
  • the above units may be centralized or distributed, local or remote. In some embodiments these units are independent. In some embodiments, some of the units may also be integrated into one unitary unit.
  • the interface unit 401 of the input output module 103 can be configured to receive input information as well as output system generated results.
  • the information can be passed to the acquisition module 101.
  • the information may be passed to the processing module 102 for processing including, but not limited to, ambiguity analysis or sentiment analysis.
  • the information can be stored.
  • The storage location may be one or more of the storage unit 403, the database 104, other systems having storage functions, and the like.
  • the output result may be information classified according to a certain rule, such as positive information, negative information, neutral information, and the like.
  • the output may be displayed to the user via the peripheral device.
  • the identification unit 402 can be configured to identify the emotion tags in the information that has been subjected to sentiment analysis, thereby guiding the interface unit 401 to classify and display the information according to the emotion tags.
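  • The grouping by emotion tag that guides the classified display can be sketched as below. The item layout (a dictionary with `text` and an optional `emotion` key) and the `untagged` fallback group are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, List


def group_by_emotion(items: List[Dict[str, str]]) -> Dict[str, List[str]]:
    """Group analyzed items by their emotion tag for classified display."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for item in items:
        # Items that carry no tag are kept in a separate group.
        groups[item.get("emotion", "untagged")].append(item["text"])
    return dict(groups)
```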
  • the storage unit 403 can be configured to store information, and the stored information can be from the interface unit 401 and the identification unit 402.
  • the stored information may be from one or more of other modules, such as acquisition module 101, processing module 102, database 104, and the like.
  • The expansion unit 404 of the input and output module 103 can be configured to provide a function expansion mechanism that helps the system extend its functions according to the needs of the user.
  • the extended functions may include, but are not limited to, one or more of a subscription function, an information sharing function, an intelligent learning, an update function, and the like.
  • the extension unit 404 can store the keyword information input by the user, the user-defined information push period, the information push method, the information sharing object, the information sharing content, the system update period, and the like into the user database 513 in the database 104 (see Figure 5).
  • the expansion unit 404 of the input and output module 103 of the system can be configured to provide a subscription function.
  • the user can choose to subscribe to the information containing the specific keyword, and the extension unit 404 can push the sentiment analysis information to the user in various ways according to the user subscription.
  • In addition to pushing information to the user, the extension unit 404 may also recommend users with similar interests, recommend comments on the information, provide a rating indicating whether the information is helpful, and the like.
  • the manner in which the extension unit 404 pushes may include, but is not limited to, mobile client software, email, SMS, RSS portal, online single-user aggregator, search engine, browser, instant messaging software, social networking, and the like.
  • the extension unit 404 push period may be set by the system, or may be user-defined, may be periodic or irregular, and may be real-time or delayed.
  • the periodic push cycle may include, but is not limited to, one or more of several hours, days, weeks, months, several quarters, years, and the like.
  • The irregular push period may include, but is not limited to, one or more of workdays, holidays, or the morning, midday, and evening in different countries.
  • the form of information content pushed by the extension unit 404 may include, but is not limited to, one or more of text, voice, picture, animation, video, and the like.
  • The information content pushed by the extension unit 404 may include, but is not limited to, one or more of updates to information the user has browsed, information the user follows, information recommended by the system according to the user's records, and the popularity of information of the same type.
  • The above description of the extension unit 404 is merely a specific example and should not be considered the only feasible implementation. For those skilled in the art, after understanding the basic principle of the extension unit 404, various modifications and changes in form and detail may be made to the specific manner and steps of implementing the extension unit 404 and its functions without departing from this principle, but such modifications and changes are still within the scope of the above description.
  • the expansion unit 404 of the input and output module 103 of the system can be configured to provide intelligent learning functionality.
  • the expansion unit 404 can intelligently learn, analyze, and memorize the usage habits of the user, including but not limited to common fields, searching for high frequency keywords, and more concerned emotional categories.
  • For example, the extension unit 404 can memorize, either automatically or according to the user's annotations, a subsidiary of a multinational company that the user frequently clicks on, and preferentially display information related to that subsidiary after the user inputs the company name.
  • The extension unit 404 can learn which emotion categories or domains of information the user is interested in during different time periods, and work with the push function to push that information during the corresponding time period.
  • The above description of the extension unit 404 and the functions it implements is merely a specific example and should not be considered the only possible implementation. For those skilled in the art, after understanding the basic principles of the extension unit 404 and the functions it implements, various modifications and changes in form and detail may be made to the specific manner and steps of implementing them without departing from this principle, but such modifications and changes are still within the scope of the above description.
  • the expansion unit 404 of the input and output module 103 of the system can be configured to provide an information sharing function.
  • Information sharing is a way for users to share interesting information with friends in various ways.
  • Information sharing is a way of publishing information in which users can share information to a designated place, choose who can see the information, and so on.
  • The content of the information sharing may be a single piece of information or multiple pieces of information, and may be part of the selected content or the entire content of a page. What is shared may be the information content, comments on the information, ratings of how helpful the information is, and so on.
  • Information sharing methods may include, but are not limited to, one or more of SMS, MMS, email, QQ, MSN, WeChat, Weibo, Douban, Twitter, Facebook, Instagram, Renren, instant messaging software tools, or the like.
  • the information sharing receiving object may include, but is not limited to, one or more of a single friend, a plurality of friends, a circle of friends, a public social circle, a forum, other users, and the like.
  • the content format of the information sharing may include, but is not limited to, one or more of text, pictures, voice, animation, video, web link, and the like.
  • Figure 5 shows a schematic diagram of the units included or used in database 104.
  • Database 104 includes, but is not limited to, one or more keyword lexicons 501, one or more sentiment lexicons 505, one or more information bases 511, one or more corpora 508, one or more semantic knowledge bases 512, one or more user databases 513, and the like.
  • The keyword lexicon 501 can include, but is not limited to, one or more keyword dictionaries 502, one or more related word dictionaries 503, one or more ambiguous lists 504, and the like. The above description of the dictionaries is for convenience of explanation and is not limiting.
  • The keyword dictionary 502 can be configured to store vocabulary including, but not limited to, specific name vocabulary.
  • The above specific name vocabulary includes, but is not limited to, proper nouns, full names, abbreviations, codes, synonyms, acronyms, and the like in a specific field.
  • the specific name vocabulary in the keyword dictionary 502 may be from the acquisition module 101 or may be from the processing module 102.
  • The related word dictionary 503 can be configured to store vocabulary including, but not limited to, vocabulary related to the specific name vocabulary.
  • The related vocabulary may include, but is not limited to, proper nouns, nouns, verbs, adjectives, phrase collocations, short sentences, industry vocabulary, synonyms, antonyms, common collocations, component nouns, derivations, compound words, and the like that are related to the specific name vocabulary mentioned above.
  • The ambiguous list 504 can be configured to store specific name vocabulary that may be ambiguous. The list may be compiled in ways including, but not limited to, manually, by a model, or by a combination of both.
  • Emotional thesaurus 505 can include, but is not limited to, one or more emotional vocabulary libraries 506 and one or more emotional vocabulary matching libraries 507, and the like.
  • The emotional vocabulary library 506 can be configured to store content including, but not limited to, emotional vocabulary.
  • The emotional vocabulary refers to vocabulary containing emotional information. Examples include good, excellent, increase, growth, profit, up, make up, earn, daily limit, soaring, decrease, sharp decrease, decline, loss, down limit, and reduction.
  • the emotional vocabulary may include, but is not limited to, nouns, verbs, adjectives, and the like that express emotions.
  • Sources of information in the sentiment vocabulary 506 may include, but are not limited to, Internet open source dictionaries, professional dictionaries, and the like.
  • The emotional vocabulary collocation library 507 can be configured to store content including, but not limited to, emotional vocabulary collocations.
  • the emotional vocabulary collocation may include, but is not limited to, a phrase collocation, a short sentence, a synonym, an antonym, a common collocation, a component noun, a derivative word, a compound word, and the like, which are matched with the emotional vocabulary in the emotional vocabulary 506.
  • the sources of information in the vocabulary collocation library 507 may include, but are not limited to, Internet open source dictionaries, professional dictionaries, news, research reports about companies, announcements, product manuals, and related websites.
  • the emotional vocabulary matching library 507 can be a fixed vocabulary or can be continuously updated and expanded.
  • The expansion method of the emotional vocabulary collocation library 507 includes, but is not limited to, a pointwise mutual information (PMI) algorithm or the like.
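  • The PMI idea can be sketched as below: a candidate word that co-occurs with a known emotional word more often than chance would predict receives a high score and may be added to the library. The sentence-level co-occurrence window and the function name `pmi` are assumptions; the description does not specify how the counts are taken.

```python
import math
from typing import List


def pmi(word_a: str, word_b: str, sentences: List[List[str]]) -> float:
    """PMI = log2( P(a, b) / (P(a) * P(b)) ), estimated per sentence."""
    n = len(sentences)
    count_a = sum(1 for s in sentences if word_a in s)
    count_b = sum(1 for s in sentences if word_b in s)
    count_ab = sum(1 for s in sentences if word_a in s and word_b in s)
    if count_ab == 0:
        return float("-inf")  # the two words never co-occur
    return math.log2((count_ab * n) / (count_a * count_b))
```

  • A candidate could then be kept when its PMI with a seed word exceeds a chosen cutoff; that cutoff would have to be tuned on real data.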
  • the corpus 508 can include, but is not limited to, an ambiguous corpus 509, an emotional corpus 510, and the like.
  • The ambiguous corpus 509 can be configured to store content including, but not limited to, ambiguity corpus material.
  • the ambiguous corpus may include, but is not limited to, vocabulary, phrase collocation, statements, etc. that have been ambiguous/non-ambiguously labeled.
  • The sentiment corpus 510 can be configured to store content including, but not limited to, emotional corpus material.
  • the sentiment corpus may include, but is not limited to, a vocabulary, a phrase collocation, a sentence, and the like that have been subjected to an emotional category annotation.
  • the source of the corpus in the ambiguous corpus 509 may include, but is not limited to, the corpus collection unit 305 in the ambiguity analysis module 301, and the source of the corpus in the sentiment corpus 510 may include, but is not limited to, the corpus collection unit 310 of the sentiment analysis module 306.
  • Sources of ambiguous corpus 509 and sentiment corpus 510 may include, but are not limited to, Internet open source dictionaries, professional dictionaries, news, research reports about companies, announcements, product manuals, and related websites.
  • The information base 511 can be configured to store content including, but not limited to, information containing keywords.
  • the information in the information base 511 may have been subjected to ambiguity analysis or sentiment analysis, or may not be subjected to ambiguity analysis or sentiment analysis.
  • the source of the information may be the acquisition module 101.
  • The semantic knowledge base 512 can be configured to store content including, but not limited to, concept-based vocabulary, phrases, sentences, and paragraphs. By retrieving the semantic knowledge base 512, the genre types of vocabulary, phrases, sentences, and paragraphs can be identified.
  • the semantic knowledge base 512 is particularly capable of recognizing phrases, sentences, paragraphs, and the like that do not contain emotional vocabulary.
  • The user database 513 can be configured to store content including, but not limited to, information related to the user.
  • The user-related information may include, but is not limited to, the user's personal information, the user's historical search information, the user's custom settings, and the like.
  • the personal information of the user may include, but is not limited to, a login account of the user, a login password, a period of the user login system, time information, and the like.
  • the history search information of the user may include, but is not limited to, a history search keyword of the user, a search information result obtained according to the search keyword of the user, and the like.
  • the user's customized setting information may include, but is not limited to, one or more of a user's setting of subscription information, a user's setting for information sharing, a user's setting for intelligent learning, a user's setting for system update, and the like.
  • The user's subscription settings may include, but are not limited to, one or more of the keywords of the information the user wants to subscribe to, the information push period set by the user, the push format, the push location, and the like.
  • the user's settings for information sharing may include, but are not limited to, information sharing objects, information sharing formats, periods of information sharing, and the like.
  • the setting of the user for intelligent learning may include, but is not limited to, an intelligent learning period or the like.
  • the settings of the user for system updates may include, but are not limited to, an update period or the like.
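  • The user-related records described above could be modeled roughly as follows. All field names, types, and default values are assumptions made for illustration; the description only lists the kinds of information stored, and credential storage is deliberately left out of the sketch.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SubscriptionSettings:
    keywords: List[str] = field(default_factory=list)  # subscribed keywords
    push_period_hours: int = 24                         # user-set push period
    push_format: str = "text"                           # e.g. text, voice, video
    push_channel: str = "email"                         # where pushes are sent

    def matches(self, text: str) -> bool:
        """True if the text contains any subscribed keyword."""
        return any(kw in text for kw in self.keywords)


@dataclass
class UserRecord:
    account: str
    search_history: List[str] = field(default_factory=list)
    subscription: SubscriptionSettings = field(default_factory=SubscriptionSettings)
```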
  • the system may include a user interaction interface.
  • the user interaction interface can receive user input either directly or through a peripheral device, or present information of one or more sentiment categories to the user.
  • the user input accepted by the user interaction interface can be stored in the storage unit 403 and then transmitted to other modules, such as the acquisition module 101, the processing module 102, and the database 104; and can also be directly transmitted to the other modules.
  • the information output by the user interaction interface may be from the storage unit 403.
  • the information output by the user interaction interface may be directly from the identification unit 402, or other modules, such as the acquisition module 101, the processing module 102, the database 104, and the like.
  • The user interaction interface can be a graphical user interface, a direct manipulation interface, a web-based user interface (WUI), a touchscreen, a command-line interface, a touch user interface, a hardware interface, an attentive user interface, a batch interface, a conversational interface agent, a crossing-based interface, a gesture interface, an intelligent user interface, a motion tracking interface, a multi-screen interface, a no-command user interface, an object-oriented user interface, a reflexive user interface, a search interface, a tangible user interface, a task-focused interface, a text-based user interface, a voice user interface, a natural-language interface, a zero-input interface, a zooming user interface, and the like.
  • The user interaction interface can display information. Information of different emotion categories can be displayed on one page or on different pages.
  • the display form can include, but is not limited to, text, pictures, audio, video, animation, broadcast, and the like.
  • statements representing sentiment categories may be presented in one or more highlighted forms, such as textual highlighting using one or more colors different from the body of the message.
  • the colors may include, but are not limited to, red, blue, yellow, pink, orange, green, purple, and the like.
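  • For a web-based interface, color highlighting of sentiment statements might look like the sketch below. The HTML output format, the color assigned to each emotion category, and the function name `highlight` are all assumptions; the description does not tie categories to particular colors.

```python
import html

# Hypothetical mapping from emotion category to highlight color.
COLORS = {"positive": "red", "negative": "green", "neutral": "blue"}


def highlight(sentence: str, emotion: str) -> str:
    """Wrap a sentence in a colored span; unknown categories stay plain."""
    escaped = html.escape(sentence)
    color = COLORS.get(emotion)
    if color is None:
        return escaped
    return f'<span style="color:{color}">{escaped}</span>'
```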
  • a statement representing an emotional category may take one or more fonts different from the body of the information.
  • The fonts may include, but are not limited to, Song, Carcass, Italic, Bold, Times New Roman, Calibri, and the like.
  • A statement representing an emotional category may take one or more character sizes different from the body of the information.
  • the dimensions may include, but are not limited to, No.
  • the underline may include, but is not limited to, a straight line, a broken line, or the like.
  • the informational highlighting in the form of a picture may employ one or more differently shaped frames including, but not limited to, circles, squares, rectangles, diamonds, ovals, and the like.
  • Information highlighting in the form of pictures can use frames of one or more colors.
  • the color of the frame may include, but is not limited to, red, blue, yellow, pink, orange, green, purple, and the like.
  • Information highlighting in the form of audio and broadcast uses one or more volume levels.
  • the user interaction interface can present the user with emotionally analyzed information in one or more fields.
  • Such fields may include, but are not limited to, investments in finance and its derivatives (including but not limited to stocks, bonds, gold, paper gold, silver, foreign exchange, precious metals, futures, and money funds), technology (including but not limited to mathematics, physics, chemistry and chemical engineering, biology and bioengineering, electronic engineering, communication systems, the Internet, and the Internet of Things), politics (including but not limited to politicians, political events, and countries), and news (by region, including but not limited to regional, domestic, and international news; by subject, including but not limited to political, technology, economic, lifestyle, and weather news).
  • the user can add a field of interest in the user interaction interface as a quick view mode, thereby quickly viewing the sentiment analysis information of one or more areas of interest.
  • The user interaction interface can provide the user with a favorites folder, in which the user can save one or more kinds of information for later use.
  • The saved information can take the form of a network link, text, picture, audio, video, animation, or broadcast, or any combination of several of these. The forms within a combination may repeat regularly or irregularly.
  • the user interaction interface can adopt the default user interface or a custom interface.
  • The user can design the user interface according to personal habits and preferences, including but not limited to setting the color, size, layout, and style of the interface.
  • The user interaction interface can present the emotional category of the information to the user, including but not limited to the emotional category of the overall information, of one sub-category of information, or of multiple sub-categories of information. It can present the sentiment analysis trend of the information, including but not limited to the emotional category trend of the overall information, of one sub-category of information, or of multiple sub-categories of information. It can also display pushed subscription information by sending the user a reminder, whose form may include, but is not limited to, text, sound, images, video, vibration, dynamic pop-ups, and the like.
  • the shape of the popup may include, but is not limited to, a circle, a square, a rectangle, a diamond, an ellipse, and the like.
  • Prompted by the reminder, the user can select which subscription information to view according to its positive or negative sentiment analysis.
  • The system can further include an update module that can update the vocabularies and information base in the database 104, and/or can update the relevant algorithm parameters of the ambiguity analysis model 312 and the sentiment analyzer 311.
  • the update module can obtain the required information in various ways.
  • The manner of obtaining information may be centralized or distributed, local or remote, wired or wireless, manual or automatic, or a combination of several of these.
  • the above description of the manner in which information is obtained is merely a specific example and should not be considered as the only feasible implementation.
  • Various modifications and changes in the form and details of the specific ways and steps of obtaining information may be made without departing from this principle, but such modifications and changes are still within the scope of the above description.
  • the required information content may include, but is not limited to, a specific name vocabulary, a related vocabulary of a specific name vocabulary, information including the vocabulary, algorithm parameters for ambiguity analysis or sentiment analysis, and the like.
  • the above-mentioned specific name vocabulary may include, but is not limited to, a proper noun, full name, abbreviation, code, synonym, acronym, and the like in a specific field.
  • the related vocabulary of the above-mentioned specific name vocabulary may include, but is not limited to, proper nouns, nouns, verbs, adjectives, phrase collocations, short sentences, industry vocabulary, synonyms, antonyms, common collocations, component nouns, derivations, compound words, etc. related to the specific name vocabulary mentioned above.
  • Information including the above vocabulary may include, but is not limited to, dictionaries, news, research reports about companies, announcements, product manuals, and related website pages.
  • Algorithm parameters for ambiguity analysis or sentiment analysis may include, but are not limited to, those of decision trees, Rocchio, Naïve Bayes, neural networks, support vector machines, linear least squares fitting, K-nearest neighbors, genetic algorithms, maximum entropy, and the like.
  • the update module can add to the database 104 using the collected information described above to obtain the updated database 104.
  • the update module can train the algorithm model with information updated in the database 104.
  • the update module can directly update the algorithm model using algorithm parameters collected for ambiguity analysis or sentiment analysis.
  • the update cycle can be regular or irregular.
  • the update module is updated periodically, either system-defined or user-defined. Periodic updates may include, but are not limited to, hours, days, weeks, months, quarters, years, and the like.
  • the information update module can also perform irregular updates, which can be system-defined or user-defined. Irregular updates may include, but are not limited to, updates on weekdays, on holidays, or in the morning, at midday, or in the evening in different countries.
  • Information sources for the update module may include, but are not limited to, dictionaries, news media, research reports, announcements, product brochures, Weibo, WeChat, social networking sites, forums, publishers, and related website pages.
  • the updated content can be existing content or new content.
  • the system can periodically check news media such as financial websites. If new information such as a stock name is included and new content related to the stock name appears, the update module updates the new content. If the stock name changes, the information update module can update it. If the stock name has other alternative names, the information update module can update them.
  • the above description of the update module, the update cycle, and the update content is only a specific example and should not be regarded as the only feasible implementation. Obviously, those skilled in the art, after understanding the basic principles of the update module, the update cycle, and the update content, may make various modifications and changes to them without departing from these principles, but such modifications and changes remain within the scope of the above description.
  • the updating of the algorithm model in the ambiguity analysis module 301 or the sentiment analysis module 306 by the update module may be directly updated, may be updated with updated information, or may be updated after a certain amount of update information is accumulated.
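  • As a minimal sketch of the last option above (updating only after a certain amount of update information has accumulated), the following Python snippet buffers new samples and triggers retraining once a threshold is reached. The class name `BufferedUpdater`, the `retrain` callback, and the threshold value are illustrative assumptions and are not taken from the described system.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class BufferedUpdater:
    """Hypothetical helper: retrains a model once enough samples accumulate."""
    retrain: Callable[[List[str]], None]  # assumed callback that retrains the model
    threshold: int = 3                    # assumed batch size, not from the source
    pending: List[str] = field(default_factory=list)

    def add(self, sample: str) -> bool:
        """Queue one sample; return True if this call triggered a retrain."""
        self.pending.append(sample)
        if len(self.pending) < self.threshold:
            return False
        self.retrain(list(self.pending))  # pass a copy so clearing is safe
        self.pending.clear()
        return True


# Record each batch instead of training a real model.
trained_batches: List[List[str]] = []
updater = BufferedUpdater(retrain=trained_batches.append, threshold=2)
results = [updater.add(s) for s in ["news A", "news B", "news C"]]
```

  • Setting the threshold to 1 would correspond to the direct-update option; a larger value trades freshness for fewer retraining runs.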
  • the update of the ambiguity analysis model 312 in the ambiguity analysis module 301 may be manually audited or automatically audited by the system, or a combination of the two.
  • the above ambiguity analysis model 312 may include, but is not limited to, a decision tree, Rocchio, Naïve Bayes, neural network, support vector machine, linear least squares fit, K-nearest neighbor, genetic algorithm, maximum entropy, and the like.
  • the system periodically checks news media such as financial websites, and if existing information, such as a stock name, appears together with important information related to that stock name, the information update module can update the information base 511.
  • the ambiguity analysis module 301 can determine the ambiguity of the information.
  • the ambiguous information can enter the ambiguous collocation extraction step, in which ambiguous collocations are extracted and manually verified as to whether they are strong ambiguous collocations. The strong ambiguous collocations will enter the information update module and be used to update the ambiguity analysis model 312.
  • the above description of updating the ambiguity analysis module 301 and the ambiguity analysis model 312 is merely a specific example and should not be considered the only feasible implementation. Obviously, those skilled in the art, after understanding the basic principles of updating the ambiguity analysis module 301 and the ambiguity analysis model 312, may make various modifications and changes to that update without departing from these principles, but such modifications and changes remain within the scope of the above description.
  • the update of the sentiment analyzer 311 in the sentiment analysis module 306 may be manually audited or automatically audited by the system, or a combination of the two.
  • the sentiment analyzer 311 described above may include, but is not limited to, a decision tree, Rocchio, Naïve Bayes, a neural network, a support vector machine, a linear least squares fit, a K-nearest neighbor, a genetic algorithm, a maximum entropy, and the like.
  • the system regularly reviews news media such as financial websites and further updates the emotional vocabulary matching set with the positive and negative collocations obtained from the positive and negative sentiment analysis process; the updated matching set will enter the information update module to update the model of the sentiment analysis module.
  • FIG. 6 shows a flow diagram of system user interaction. It should be noted that the flow in the following description covers only some embodiments of the present invention, and those skilled in the art can apply the present invention to other similar scenarios according to these descriptions without any creative labor.
  • the system first obtains user input (step 601). This step can be completed by the input and output module 103. Input methods include, but are not limited to, keyboard input, pointing device input (such as pointing stick input, mouse input, trackpad input, trackball input), voice recognition device input, graphic image recognition device input, etc.; input content includes but is not limited to numbers, characters, symbols, text, sounds, graphic images, videos, etc.
  • the system can store the user input (step 604).
  • the system can store user input in the storage unit 403 of the input and output module 103, and may also store user input in a storage unit of another module (such as the storage module 315 of the processing module, the database 104, etc.). In some embodiments, storage is required. In other embodiments, storage is optional or not required.
  • the storage entered by the user may be permanent or temporary; it may be all stored or partially stored.
  • the system can use the stored user input to learn user habits, perform intelligent learning, prompt candidate words, and the like. After obtaining the user input, the system retrieves information based on the user input (step 602) and then generates an output based on the retrieved information (step 603). The system can also generate an output directly from the information input by the user (step 603).
  • the step 603 can be completed by the input and output module 103.
  • the system can also display the generated output to the user through a peripheral device, or not display it.
  • in some embodiments, the presentation is mandatory; in other embodiments, the presentation is optional or not required.
  • the system may retrieve the information in the database 104 according to the user input, or retrieve the information of other module storage units (such as the storage module 315 of the processing module) according to the user input, or retrieve the information through the network 105 according to the user input.
  • the above information can be stored (step 604). It may be stored in the storage unit 403 of the input/output module 103, or may be stored in a storage unit of another module (such as the storage module 315 of the processing module, the database 104, etc.). Storage can be permanent or temporary.
  • the system can generate an output (step 603).
  • the step 603 can be completed by the input and output module 103.
  • the system can also present the generated output to the user via the peripheral device.
  • the presentation can be real-time or delayed.
  • the presentation can be regular or irregular.
  • the user input may include periodic instructions (such as instructions to subscribe to certain information); the system can recognize such instructions and, according to them, push or present information that meets the user's input criteria to the user, either periodically or irregularly.
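  • The subscription behavior described above can be sketched as a simple matching step between stored subscriptions and a newly classified article. The `Subscription` structure, the field names, and the placeholder names `StockA` and `StockB` are assumptions made for illustration; the source does not specify how subscriptions are represented.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List


@dataclass(frozen=True)
class Subscription:
    """Hypothetical record of one user's standing instruction."""
    user: str
    object_name: str            # e.g. a stock name the user follows
    sentiments: FrozenSet[str]  # sentiment categories the user wants pushed


def users_to_notify(subscriptions: List[Subscription],
                    article: Dict[str, str]) -> List[str]:
    """Return users whose subscription matches the article's object and sentiment."""
    return [
        s.user
        for s in subscriptions
        if s.object_name == article["object"]
        and article["sentiment"] in s.sentiments
    ]


subscriptions = [
    Subscription("alice", "StockA", frozenset({"negative"})),
    Subscription("bob", "StockA", frozenset({"positive", "negative"})),
    Subscription("carol", "StockB", frozenset({"positive"})),
]
article = {"object": "StockA", "sentiment": "negative", "title": "StockA shares fall"}
recipients = users_to_notify(subscriptions, article)
```

  • Whether the matched items are pushed immediately or batched on a schedule is a separate choice, in line with the periodic or irregular presentation described above.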
  • Figure 7 shows a system flow chart of an information sentiment classification method.
  • the system first collects information (step 701).
  • the steps can be completed by the acquisition module 101.
  • the above information includes, but is not limited to, dictionaries, news, research reports about companies, announcements, product manuals, and related website pages.
  • the industries to which the above information belongs include, but are not limited to, sports, entertainment, economics, politics, culture, etc.
  • the form of the above information includes, but is not limited to, text, picture, audio, video, and the like.
  • the languages used in the above information include, but are not limited to, Chinese, English, Japanese, Korean, French, German, and the like.
  • the source of the above information may be the network 105 or a module such as the database 104.
  • the system can analyze whether the information is ambiguous and obtain an ambiguity analysis result (step 702).
  • the step 702 can be completed by the ambiguity analysis module 301 in the processing module 102.
  • the above information may be all collected information, or may be part of the collected information.
  • the above ambiguity analysis can be performed manually, automatically by the ambiguity analysis model, or by a combination of the two.
  • the above ambiguity analysis models include, but are not limited to, decision trees, Rocchio, Naïve Bayes, neural networks, support vector machines, linear least squares fits, K-nearest neighbors, genetic algorithms, maximum entropy, and the like.
  • the system may analyze the sentiment category of the ambiguity analysis result obtained in step 702 to obtain information including the sentiment category (step 703).
  • the step 703 can be completed by the sentiment analysis module 306 in the processing module 102.
  • when performing sentiment analysis, the analysis may be performed only on non-ambiguous information, or it may also be performed on ambiguous information.
  • the system may also omit step 702 and directly perform sentiment analysis without performing ambiguity analysis (steps 701 and 703 are performed).
  • the above sentiment analysis can be performed manually, automatically by the sentiment analysis model, or by a combination of the two.
  • the above sentiment analysis models include, but are not limited to, decision trees, Rocchio, Naïve Bayes, neural networks, support vector machines, linear least squares fits, K-nearest neighbors, genetic algorithms, maximum entropy, and the like.
  • the sentiment analysis described above can divide information into categories including, but not limited to, positive information, negative information, and neutral information.
  • the order of ambiguity analysis and sentiment classification may be reversed; that is, the collected information is first classified by sentiment, and the sentiment-classified information is then analyzed for ambiguity (executing steps 701, 703, 702).
  • the intermediate processing results and final processing results of each of the above steps may be stored according to a specific storage method (step 704).
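  • The three orderings described above (702 then 703, 703 alone, and 703 then 702) can be illustrated with a small configurable pipeline. This is only a sketch: the two step functions use toy rules (an `ambiguous` flag and the words "rise" and "fall") that stand in for the real ambiguity analysis model and sentiment analyzer, and all names are hypothetical.

```python
from typing import Callable, Dict, List, Sequence

Doc = Dict[str, object]


def ambiguity_step(docs: List[Doc]) -> List[Doc]:
    """Stand-in for step 702: keep documents not flagged as ambiguous."""
    return [d for d in docs if not d.get("ambiguous", False)]


def sentiment_step(docs: List[Doc]) -> List[Doc]:
    """Stand-in for step 703: attach a sentiment label using a toy keyword rule."""
    labelled = []
    for d in docs:
        text = str(d["text"])
        if "rise" in text:
            label = "positive"
        elif "fall" in text:
            label = "negative"
        else:
            label = "neutral"
        labelled.append({**d, "sentiment": label})
    return labelled


STEPS: Dict[int, Callable[[List[Doc]], List[Doc]]] = {
    702: ambiguity_step,
    703: sentiment_step,
}


def run_pipeline(docs: List[Doc], order: Sequence[int] = (702, 703)) -> List[Doc]:
    """Apply the numbered steps in the requested order to the collected docs (step 701)."""
    for step in order:
        docs = STEPS[step](docs)
    return docs


collected = [
    {"text": "shares rise", "ambiguous": False},
    {"text": "apple pie recipe", "ambiguous": True},
    {"text": "shares fall"},
]
default_result = run_pipeline(collected)              # 701, 702, 703
sentiment_only = run_pipeline(collected, (703,))      # 701, 703
reversed_result = run_pipeline(collected, (703, 702)) # 701, 703, 702
```

  • With these toy rules the default and reversed orders return the same two labelled documents, while skipping step 702 also labels the ambiguous one.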
  • the storage method includes, but is not limited to, a sequential storage method, a link storage method, an index storage method, a hash storage method, and the like.
  • the stored location may be the storage module 315, may be the storage unit 203, or may be the database 104 or the like.
  • FIG. 8 shows a flow chart of the system training model.
  • the system collects information through a module having an acquisition function (step 801).
  • the module having the collection function may be the collection module 101, the acquisition unit 302 in the ambiguity analysis module 301, the acquisition unit 307 of the sentiment analysis module 306, and the like.
  • the source of the above information may be the storage module 315, the database 104, or the network 105.
  • the above information includes, but is not limited to, specific name vocabulary of an industry, vocabulary strongly associated with the specific name vocabulary, information including such vocabulary, and vocabulary containing emotional information.
  • the above industries include, but are not limited to, sports, entertainment, economics, politics, culture, and the like.
  • the above specific name vocabulary includes but is not limited to specific nouns, full names, abbreviations, codes, synonyms, and acronyms in a specific field.
  • the above-mentioned vocabulary strongly related to the specific name vocabulary includes, but is not limited to, nouns, verbs, adjectives, short sentences, phrase collocations, industry vocabulary, synonyms, antonyms, common collocations, component nouns, derivations, compound words, etc. related to the above-mentioned specific name vocabulary.
  • Information that includes the above vocabulary includes, but is not limited to, dictionaries, news, research reports about companies, announcements, product manuals, and related website pages.
  • the categories of the above emotional vocabulary include, but are not limited to, positive, negative, neutral, and the like.
  • the form of information includes, but is not limited to, text, picture, audio, video, and the like.
  • the above information usage languages include, but are not limited to, Chinese, English, Japanese, Korean, French, German, and the like.
  • the system builds the thesaurus and repository in step 802.
  • Step 802 can be completed by the processing module 102.
  • the above thesaurus includes but is not limited to the keyword lexicon 501 and the sentiment lexicon 505.
  • the keyword lexicon 501 includes, but is not limited to, a keyword dictionary 502 composed of specific name vocabulary, a related word dictionary 503 composed of one or more words related to the specific name vocabulary, and an ambiguous list 504 obtained by auditing one or more keyword dictionaries 502.
  • the above sentiment lexicon 505 includes, but is not limited to, one or more emotional vocabulary libraries 506 and one or more emotional vocabulary matching libraries 507.
  • the information in the above information base contains a specific name vocabulary in the keyword dictionary 502.
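  • One possible in-memory layout for the lexicons and information base built in step 802 is sketched below. The reference numerals in the comments follow the description, but the class names, field types, and the substring-based filter are assumptions chosen for brevity.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class KeywordLexicon:  # keyword lexicon 501
    keywords: Set[str] = field(default_factory=set)             # keyword dictionary 502
    related: Dict[str, Set[str]] = field(default_factory=dict)  # related word dictionary 503
    ambiguous: Set[str] = field(default_factory=set)            # ambiguous list 504


@dataclass
class SentimentLexicon:  # sentiment lexicon 505
    # emotional vocabulary library 506: word -> category
    words: Dict[str, str] = field(default_factory=dict)
    # emotional vocabulary matching library 507: (word, collocate) -> category
    collocations: Dict[Tuple[str, str], str] = field(default_factory=dict)


def build_information_base(texts: List[str], lexicon: KeywordLexicon) -> List[str]:
    """Keep only texts that mention at least one keyword from dictionary 502."""
    return [t for t in texts if any(k in t for k in lexicon.keywords)]


lexicon = KeywordLexicon(
    keywords={"StockA", "600000"},  # placeholder name and code
    related={"StockA": {"dividend", "earnings"}},
    ambiguous={"StockA"},
)
texts = [
    "StockA releases quarterly earnings",
    "Weather forecast for the weekend",
    "Code 600000 rises in early trading",
]
information_base = build_information_base(texts, lexicon)
```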
  • the system may collect the corpus through the corpus collection unit 305 of the ambiguity analysis module 301 and the corpus collection unit 310 of the sentiment analysis module 306 (step 803), which may be completed by the processing module 102.
  • the manner in which the corpus is collected includes, but is not limited to, a process of matching and scoring the collected information.
  • the collected corpus can be used to train the models (step 804), which include but are not limited to the ambiguity analysis model 312 and the sentiment analyzer 311. The ambiguity analysis model 312 includes but is not limited to decision trees, Rocchio, Naïve Bayes, neural networks, support vector machines, linear least squares fits, K-nearest neighbors, genetic algorithms, maximum entropy, etc.
  • the sentiment analyzer 311 includes, but is not limited to, a decision tree, Rocchio, Naïve Bayes, a neural network, a support vector machine, a linear least squares fit, a K-nearest neighbor, a genetic algorithm, a maximum entropy, and the like.
  • the system may directly manually review the collected information as ambiguity corpus or sentiment corpus (step 801, step 803), and may also directly use the collected information, after manual review, to train the model (step 801, step 804) without going through steps 802 and 803.
  • the intermediate processing result or the final processing result for each of the above steps may be stored (step 805).
  • the storage method includes, but is not limited to, a sequential storage method, a link storage method, an index storage method, a hash storage method, and the like.
  • the stored location may be the storage module 315, may be the storage unit 203, or may be the database 104 or the like.
  • Figure 9 is a schematic diagram of the usage scenario.
  • 902 is an information sentiment classification system that communicates with user 901 via network 903.
  • the information sentiment classification system 902 can be a server or a server group, and the distribution manner can be centralized or distributed.
  • the network 903 can be wired or wireless; it can be a local area network or a wide area network.
  • the user 901 types an object name, such as a stock name, a futures name, a bond name, etc., through the input and output module 103 (see FIG. 1 for details).
  • the object name is transmitted to the information sentiment classification system 902 via the network 903 and parsed by the information sentiment classification system 902.
  • the object name is identified by the information sentiment classification system 902.
  • the system's processing module 102 (see FIG. 1) will begin searching the database 104 (see FIG. 1 for details) to obtain a collection of articles containing the object name.
  • each article in the collection may express a different emotion type toward the object name, and the processing module 102 of the system classifies the articles in the collection according to emotion type, for example: positive articles and the positive index of each positive article, negative news and the negative index of each negative article, neutral news, etc. After the classification is completed, the classified collection of articles is transmitted to the input and output module 103 for presentation to the user 901.
  • the user 901 types in an object name such as a stock name, a futures name, a bond name, and the like.
  • the operation of typing the object name can be completed by the input and output module 103 (see FIG. 1 for details).
  • the object name is transmitted to the information sentiment classification system 902 via the network 903 and parsed by the information sentiment classification system 902.
  • the object name is identified by the information sentiment classification system 902.
  • the system collects information including the user input; the collection can be completed by the collection module 101, and the information is sent to the processing module 102 (see FIG. 1) for ambiguity analysis to select the non-ambiguous information, on which sentiment analysis is then performed.
  • the system can also first determine whether the user input contains ambiguous information; if it does not, sentiment analysis can be performed directly. The sentiment classification yields, for example, positive articles and the positive index of each positive article, negative news and the negative index of each negative article, neutral news, etc. After the classification is completed, the classified collection of articles is transmitted to the input and output module 103 for presentation to the user 901.
  • the user 901 types two object names, such as stock name, futures name, bond name, etc., through the input and output module 103 (see FIG. 1).
  • the information sentiment classification system 902 parses and identifies the object name, and then returns an emotionally categorized collection of articles containing the object name.
  • the set of articles will be presented to the user 901 via the input and output module 103.
  • the user 901 can also compare the two object names, for example: the number of articles each object name has for the same sentiment type, the number of positive articles for the two object names within one week, the number of positive articles for the two object names within one month, and the number of negative articles for the two object names within one year. With the help of the above data, the user 901 can make an effective decision.
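  • The comparison of two object names over a time window can be sketched as a filtered count over already-classified articles. The article records, field names, dates, and the placeholder names `StockA` and `StockB` below are made-up sample data for illustration only.

```python
from datetime import date
from typing import Dict, List

# Hypothetical classified articles; in the described system these would come
# from the processing module after sentiment classification.
ARTICLES: List[Dict[str, object]] = [
    {"object": "StockA", "sentiment": "positive", "date": date(2015, 8, 10)},
    {"object": "StockA", "sentiment": "positive", "date": date(2015, 8, 12)},
    {"object": "StockA", "sentiment": "negative", "date": date(2015, 8, 11)},
    {"object": "StockB", "sentiment": "positive", "date": date(2015, 8, 12)},
    {"object": "StockB", "sentiment": "positive", "date": date(2015, 7, 1)},
]


def count_articles(articles, object_name: str, sentiment: str,
                   start: date, end: date) -> int:
    """Count articles for one object name and sentiment within [start, end]."""
    return sum(
        1
        for a in articles
        if a["object"] == object_name
        and a["sentiment"] == sentiment
        and start <= a["date"] <= end
    )


def compare(articles, names: List[str], sentiment: str,
            start: date, end: date) -> Dict[str, int]:
    """Return the per-object article counts side by side."""
    return {n: count_articles(articles, n, sentiment, start, end) for n in names}


one_week = compare(ARTICLES, ["StockA", "StockB"], "positive",
                   date(2015, 8, 7), date(2015, 8, 13))
```

  • Widening the date range to a month or a year gives the other comparisons mentioned above without changing the function.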
  • Step 1001 is to collect information, and the step may be completed by the collection unit 201 of the acquisition module 101.
  • the source of information may be local, such as information stored in storage unit 203 of acquisition module 101, or information stored in database 104; or may be from network 105, such as an open Internet or a local area network.
  • Information includes, but is not limited to, existing dictionaries, news, research reports, announcements, product brochures, and related websites.
  • the information collected by the collection unit 201 can be directly stored in the storage unit 203 of the acquisition module 101, and can also be stored in the information base 511 of the database 104 (step 1007).
  • the information collected by the acquisition unit 201 can also be handed over to the processing unit 202 for processing.
  • the particular vocabulary is extracted from the information, which may be accomplished by processing unit 202.
  • the emotional vocabulary is extracted from the information, which may be performed by processing unit 202.
  • the emotional vocabulary collocation is extracted from the information, which may be performed by processing unit 202.
  • the specific vocabulary includes keywords, including but not limited to specific nouns, full names, abbreviations, codes, synonyms, and acronyms in a specific field, and strongly related words associated with the keywords, including but not limited to proper nouns, nouns, verbs, adjectives, phrase collocations, short sentences, industry vocabulary, synonyms, antonyms, common collocations, component nouns, derivations, compound words, etc. related to the above keywords.
  • the extraction may be performed simultaneously; it may also be performed in steps.
  • Algorithms used for extraction include, but are not limited to, PMI algorithms, log likelihood ratio algorithms, and the like.
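  • As an illustration of the PMI option named above, the following sketch computes pointwise mutual information, PMI(x, y) = log2(p(x, y) / (p(x) p(y))), for word pairs that co-occur in the same sentence. Treating a sentence as the co-occurrence window and estimating probabilities as sentence frequencies are assumptions of this sketch; the source does not specify these details.

```python
import math
from collections import Counter
from itertools import combinations
from typing import Dict, List, Tuple


def pmi_scores(sentences: List[List[str]]) -> Dict[Tuple[str, str], float]:
    """Return PMI for every word pair that co-occurs in at least one sentence.

    Probabilities are estimated as the fraction of sentences containing the
    word (or both words of the pair).
    """
    n = len(sentences)
    word_counts: Counter = Counter()
    pair_counts: Counter = Counter()
    for tokens in sentences:
        unique = sorted(set(tokens))  # count each word once per sentence
        word_counts.update(unique)
        pair_counts.update(combinations(unique, 2))
    return {
        (x, y): math.log2((c / n) / ((word_counts[x] / n) * (word_counts[y] / n)))
        for (x, y), c in pair_counts.items()
    }


sentences = [
    ["profit", "surge"],
    ["profit", "surge"],
    ["loss", "plunge"],
    ["market", "open"],
]
scores = pmi_scores(sentences)
```

  • Pairs with high PMI co-occur more often than chance would suggest, which is why such a score can serve as a filter for candidate related words or collocations.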
  • the above extraction steps may be performed simultaneously, or may be performed in steps, and may be combined in any possible order.
  • the steps described herein can be implemented in any suitable order, or concurrently, where appropriate.
  • the specific vocabulary may be extracted first (step 1005), and then the emotional vocabulary (step 1002) and the emotional vocabulary collocation (step 1003) are extracted; step 1002 and step 1003 may be performed simultaneously or in sequence. If performed in sequence, step 1002 may be performed before step 1003, or step 1003 before step 1002.
  • individual steps may be eliminated from any one method without departing from the spirit and scope of the subject matter of the collection process described herein. Aspects of any of the examples described above can be combined with aspects of any of the other examples described to form further examples without losing the effect sought.
  • the specific vocabulary extracted by the processing unit 202 can be stored in the keyword lexicon 501 of the database 104 (step 1006), and the emotional vocabulary and the emotional vocabulary collocations can be stored in the sentiment lexicon 505 of the database 104 (step 1004).
  • the steps described herein for the acquisition process may be performed in any suitable order, or concurrently, where appropriate.
  • Figure 11 is an embodiment of the system applied to the field of stock news.
  • the system collects daily news, Internet open dictionaries, professional dictionaries, etc. (step 1101, step 1102), and constructs a financial product lexicon, a financial-product-related lexicon, and an emotional vocabulary lexicon (step 1103, step 1104, step 1108). Step 1101, step 1103, step 1104, and step 1108 can be completed by the acquisition module 101.
  • the system can also store the collected information.
  • the stored location may be the database 104 or other units or modules having storage functions (eg, storage unit 203, etc.).
  • the system obtains an ambiguous list at step 1111. Thereafter, the system performs ambiguity analysis on the collected related stock news (step 1106).
  • Step 1106 can be completed by the ambiguity analysis module 301 in the processing module 102, and the non-ambiguous stock information is filtered into the sentiment analysis module 306 of the processing module 102 for emotion classification.
  • the ambiguity analysis of the news website information in the processing module 102 may be automatically completed by the system, or may be completed by manual review (step 1110), or may be completed by combining the two.
  • the emotional vocabulary lexicon will be retrieved (step 1108), the non-ambiguous stock information will be analyzed for sentiment using the emotional vocabulary (step 1107), and the sentiment category of the stock news will be marked.
  • Step 1108 and step 1107 can be performed by sentiment analysis module 306 in processing module 102.
  • the judgment of the non-ambiguous stock information by the sentiment analysis module 306 of the processing module 102 may be automatically completed by the system, or may be completed by manual review (step 1110), or may be completed by combining the two.
  • Stock news that is tagged with an emotional category will be generated and presented to the user based on their emotional tag classification.
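  • A lexicon-based version of step 1107 (analyzing non-ambiguous stock news with an emotional vocabulary and grouping it by tag) might look like the sketch below. The word lists, the whitespace tokenizer, and the count-difference rule are illustrative assumptions; the described system may equally use any of the trained models listed earlier.

```python
from typing import Dict, List

# Hypothetical emotional vocabulary; a real lexicon would be built in step 1108.
POSITIVE = {"surge", "profit", "upgrade"}
NEGATIVE = {"plunge", "loss", "downgrade"}


def classify(text: str) -> str:
    """Label text by comparing counts of positive and negative lexicon words."""
    tokens = text.lower().split()
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def group_by_sentiment(news: List[str]) -> Dict[str, List[str]]:
    """Tag each news item and group items by tag for presentation."""
    groups: Dict[str, List[str]] = {"positive": [], "negative": [], "neutral": []}
    for item in news:
        groups[classify(item)].append(item)
    return groups


news = [
    "StockA profit surge",
    "StockB posts loss",
    "StockC holds annual meeting",
]
grouped = group_by_sentiment(news)
```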
  • the collection module 101 can also collect stock vocabulary and stock-related vocabulary by periodically collecting daily news, thereby expanding the stock lexicon and the stock-related lexicon.
  • the collection module 101 can also collect sentences containing stock information from the daily news and use them to train and update the algorithm models of the ambiguity analysis module 301 and the sentiment analysis module 306 of the processing module 102. The training update can be performed under the supervision of manual audit, automatically by the system, or by combining the two.
  • Figure 12 shows an embodiment of ambiguity analysis.
  • the collection unit 302 collects stock name vocabulary, stock strong related vocabulary, ambiguous stock name vocabulary, and information such as news from news websites (step 1201, step 1202, and step 1203). The source of the information may be the network 105 or the storage module 315, or the database 104 may be searched directly, and the like.
  • the system obtains an ambiguous list at step 1217.
  • Step 1217 can be completed by the ambiguity analysis module 301.
  • the matching unit 303 and the processing unit 304 of the ambiguity analysis module 301 score the stock news according to the stock name vocabulary, the stock strong related vocabulary, and the ambiguous stock name vocabulary (step 1204); according to the scoring result, the news can be divided into non-ambiguous news and strongly ambiguous news.
  • the non-ambiguous news can be passed directly to the sentiment analysis module 306 for processing (step 1212); from the strongly ambiguous news, the corpus collection unit 305 of the ambiguity analysis module 301 can extract ambiguous collocations, that is, combinations of ambiguous words and related vocabulary (step 1213, step 1214), which are then manually audited (step 1215) to obtain strong ambiguous collocations (step 1216). The strong ambiguous collocations can be used to train the ambiguity analysis model 312 (step 1211), and can also be used to determine directly whether information is ambiguous: information containing a strong ambiguous collocation is ambiguous information.
  • from the non-ambiguous news, the strongly ambiguous news, and other news obtained from the scoring result, the corpus collection unit 305 may collect sentences containing the stock (step 1208, step 1209); after manual review, each sentence is marked as ambiguous or non-ambiguous (step 1210) and is then used to train the ambiguity analysis model 312 (step 1211).
  • the above ambiguity analysis model includes, but is not limited to, a maximum entropy model.
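  • The scoring in step 1204 is not specified in detail, so the following is only one plausible sketch: a news item is scored by how many strongly related words accompany the stock name, and the label depends on that score and on whether the name is in the ambiguous list. The threshold of 2, the label names, and the substring matching are all assumptions of this example.

```python
from typing import Set, Tuple


def score_news(text: str, stock_name: str, related_words: Set[str],
               ambiguous_names: Set[str], threshold: int = 2) -> Tuple[int, str]:
    """Return (score, label), where score counts related words found in text.

    Assumed rule: names outside the ambiguous list are never ambiguous; for
    listed names, enough related words resolve the ambiguity, none at all
    marks the item strongly ambiguous, and anything in between is "other".
    """
    score = sum(1 for w in related_words if w in text)
    if stock_name not in ambiguous_names or score >= threshold:
        label = "non-ambiguous"
    elif score == 0:
        label = "strongly ambiguous"
    else:
        label = "other"
    return score, label


related = {"shares", "dividend", "earnings"}
ambiguous = {"Orchard"}  # hypothetical stock name that is also an ordinary word

a = score_news("Orchard shares rise after earnings", "Orchard", related, ambiguous)
b = score_news("Orchard harvest begins this week", "Orchard", related, ambiguous)
c = score_news("Orchard shares", "Orchard", related, ambiguous)
d = score_news("StockA announces new plant", "StockA", related, ambiguous)
```

  • Items labelled "other" correspond to the remaining news from which sentences are collected for manual marking and model training in the flow above.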
  • Figure 13 shows an embodiment of ambiguity analysis. It should be noted that the flow in the following description covers only some embodiments of the present invention, and those skilled in the art can apply the present invention to other similar scenarios according to these descriptions without any creative labor.
  • information is first acquired (step 1301).
  • the information may be acquired by the acquisition module 101, by other units or modules having information collection functions (for example, the collection unit 302 in the ambiguity analysis module 301, etc.), or from a storage module (such as the database 104, a storage unit of another module, etc.).
  • the above information includes but is not limited to dictionaries, news, research reports, announcements, product brochures, and related website pages.
  • the industries to which the above information belongs include, but are not limited to, sports, entertainment, economics, politics, culture, etc.
  • the form of the above information includes, but is not limited to, text, picture, audio, video, and the like.
  • the languages used in the above information include, but are not limited to, Chinese, English, Japanese, Korean, French, German, and the like.
  • the above information may be directly from the network 105, or may be extraction of information in the information base 511 in the database 104, and the like.
  • the information may be analyzed by the ambiguity analysis model 312 after acquisition (step 1302).
  • the above ambiguity analysis model includes but is not limited to a decision tree, Rocchio, Naïve Bayes, neural network, support vector machine, linear least squares fit, K-nearest neighbor, genetic algorithm, maximum entropy, and the like.
  • the information analyzed by ambiguity can be labeled as including but not limited to ambiguous information or non-ambiguous information (step 1303).
  • the system can also manually mark the acquired information directly (step 1301, step 1303) without the analysis of the ambiguity analysis model.
  • the intermediate processing result and the final processing result in the above flow can be stored (step 1304).
  • the storage method includes, but is not limited to, a sequential storage method, a link storage method, an index storage method, a hash storage method, and the like.
  • the stored location may be the storage module 315, may be the storage unit 203, or may be the database 104 or the like.
  • Figure 14 shows another embodiment of ambiguity analysis: an ambiguity analysis process with manual supervision. It should be noted that the flow described below covers only some embodiments of the present invention, and those skilled in the art can apply the present invention to other similar scenarios according to these descriptions without any creative effort.
  • The system extracts the keyword lexicon and the information base from the database 104 (step 1401, step 1402); steps 1401 and 1402 may be completed by the acquisition unit 302.
  • The above keyword lexicon includes, but is not limited to, one or more keyword dictionaries 502, one or more related word dictionaries 503, and one or more ambiguous lists 504.
  • The keyword dictionary 502 is a dictionary composed of specific name vocabulary, including but not limited to the proper nouns, full names, abbreviations, codes, synonyms, and acronyms of a specific field.
  • The related word dictionary 503 may be a dictionary composed of words related to the specific name vocabulary. Such words may include, but are not limited to, industry words, executive names, main product names, nouns, verbs, adjectives, phrase collocations, short sentences, domain-specific industry vocabulary, synonyms, antonyms, common collocations, component nouns, derivatives, compound words, and the like, or any combination of the above. The ambiguous list may be obtained from the keyword dictionary by manual review. The information base may be information containing specific name vocabulary.
  • The industry to which the above specific name vocabulary belongs may include, but is not limited to, sports, entertainment, economics, politics, culture, and the like.
  • The information contained in the above information base may include, but is not limited to, dictionaries, news, research reports about companies, announcements, product manuals, related web pages, or the like, or any combination of the above.
  • Step 1403 matches the keyword lexicon against the information base. The matching method includes, but is not limited to, regular expressions and double-array dictionary matching. Step 1403 can be completed by the matching unit 303.
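  • As an illustration of the regular-expression option in step 1403, the following is a minimal sketch rather than the patented implementation. The lexicon entries, the sample text, and the function names `build_matcher` and `match_counts` are hypothetical and chosen only for this example.

```python
import re

def build_matcher(terms):
    """Compile one alternation pattern; longer terms go first so they win over their prefixes."""
    ordered = sorted(set(terms), key=len, reverse=True)
    return re.compile("|".join(re.escape(t) for t in ordered))

def match_counts(matcher, text):
    """Return {term: number of occurrences} for every lexicon term found in the text."""
    counts = {}
    for m in matcher.finditer(text):
        counts[m.group(0)] = counts.get(m.group(0), 0) + 1
    return counts

# Hypothetical lexicon entries and news text, for illustration only.
lexicon = ["Apple", "Apple Inc", "AAPL"]
news = "Apple Inc shares rose; analysts say AAPL may climb. Apple pie sales fell."
counts = match_counts(build_matcher(lexicon), news)
print(counts)
```

  • The per-term counts produced this way are the kind of matching result that a later scoring step could consume. A double-array dictionary would serve the same purpose for very large lexicons.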
  • The system processes and analyzes the matching result to obtain an analysis result, Score (step 1404).
  • Step 1404 can be completed by the processing unit 304.
  • The Score can be calculated using the following formula, where:
  • stock represents a specific name vocabulary involved in the information;
  • i represents the i-th name vocabulary, strongly related word, or ambiguous name vocabulary of stock;
  • weight represents the weight of the name vocabulary, strongly related word, or ambiguous name vocabulary;
  • count represents the number of occurrences of the word i;
  • doc_len represents the text length of the information.
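  • The formula itself is not reproduced in this text. The sketch below shows one form that is consistent with the symbols listed above: a weighted occurrence count normalized by document length. This form is an assumption, as are the function name `ambiguity_score`, the sample weights, and the use of negative weights for ambiguous senses.

```python
def ambiguity_score(counts, weights, doc_len):
    """Assumed form: sum_i(weight_i * count_i) / doc_len."""
    if doc_len <= 0:
        raise ValueError("doc_len must be positive")
    total = sum(weights.get(term, 0.0) * n for term, n in counts.items())
    return total / doc_len

# Hypothetical weights: name and strongly related words count for the
# intended sense, an ambiguous-sense word counts against it.
weights = {"AAPL": 2.0, "iPhone": 1.0, "apple pie": -1.5}
counts = {"AAPL": 2, "iPhone": 3, "apple pie": 1}
score = ambiguity_score(counts, weights, doc_len=100)
print(score)
```

  • Dividing by doc_len keeps long documents from scoring high merely because they contain more words.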
  • α and β are set as threshold values (step 1405).
  • These two thresholds can be fixed, or adjusted to suit the specific situation. For example, users can customize the two thresholds to adjust the sensitivity of the system. When the amount of information collected is very large, the user can increase the system sensitivity by increasing β or decreasing α to ensure the accuracy of the ambiguity determination. Conversely, when the amount of information collected is very small, the user can reduce the sensitivity of the system by reducing β or increasing α to ensure the completeness of the information.
  • If the analysis result obtained in step 1404 is greater than or equal to β (step 1405), the information is marked as non-ambiguous information (step 1409). If the analysis result is less than or equal to α (step 1406), the information is marked as ambiguous information (step 1408). If the analysis result is between α and β, the information may be marked as ambiguous or non-ambiguous information by manual review or model review (step 1408, step 1409).
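  • The three-way decision of steps 1405 to 1409 can be sketched as follows, assuming α is the lower threshold and β the upper one. The function name, the label strings, and the sample threshold values are illustrative, not taken from the patent.

```python
def label_by_score(score, alpha, beta):
    """Label information from its Score using a lower (alpha) and an upper (beta) threshold."""
    if alpha >= beta:
        raise ValueError("expected alpha < beta")
    if score >= beta:
        return "non-ambiguous"
    if score <= alpha:
        return "ambiguous"
    return "needs-review"  # left to manual review or model review

# Hypothetical thresholds for illustration.
for s in (0.9, 0.5, 0.1):
    print(s, label_by_score(s, alpha=0.2, beta=0.8))
```

  • Widening the gap between the two thresholds sends more items to review, and narrowing it labels more items automatically.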
  • The above models include, but are not limited to, decision trees, Rocchio, Naïve Bayes, neural networks, support vector machines, linear least squares fits, K-nearest neighbors, genetic algorithms, and maximum entropy.
  • The marking may be done manually, automatically by the system, or by a combination of the two.
  • In step 1403, all or part of the keyword lexicon can be matched against the news containing the stock name.
  • For example, the related word dictionary alone can be matched against the news, or the related word dictionary and the ambiguous list can be combined and matched against the news.
  • Some steps in the process, such as step 1401 and step 1402, may be performed sequentially or synchronously.
  • Some steps in the process are not necessary. For example, whether a news item is ambiguous can be determined directly by manual review, skipping the intermediate steps.
  • Figure 15 shows an embodiment of training the ambiguity analysis model. It should be noted that the flow described below covers only some embodiments of the present invention, and those skilled in the art can apply the present invention to other similar scenarios according to these descriptions without any creative effort.
  • The system extracts the keyword lexicon and the information base from the database 104 (step 1501, step 1502). Steps 1501 and 1502 may be performed by the acquisition unit 302.
  • The above keyword lexicon includes, but is not limited to, one or more keyword dictionaries 502, one or more related word dictionaries 503, and one or more ambiguous lists 504.
  • The keyword dictionary 502 may be a dictionary composed of specific name vocabulary.
  • The specific name vocabulary may include, but is not limited to, the proper nouns, full names, abbreviations, codes, synonyms, and acronyms of a specific field, or any combination of the above.
  • The related word dictionary 503 may be a dictionary composed of words related to the specific name vocabulary.
  • Words related to the specific name vocabulary may include, but are not limited to, industry vocabulary, executive names, main product names, nouns, verbs, adjectives, phrase collocations, short sentences, industry-specific vocabulary, synonyms, antonyms, common collocations, component nouns, derivatives, and compound words.
  • The ambiguous list can be obtained from the keyword dictionary by manual review.
  • The information base may be information containing specific name vocabulary.
  • Step 1503 matches the keyword lexicon against the information base.
  • The matching method includes, but is not limited to, regular expressions and double-array dictionary matching; the matching may be completed by the matching unit 303.
  • The system analyzes the matching result and obtains the analysis result, Score (step 1504).
  • Step 1504 can be completed by the processing unit 304.
  • The Score can be calculated by the following formula, where:
  • news represents a given piece of information;
  • stock represents a specific name vocabulary involved in the news;
  • i represents the i-th name vocabulary, strongly related word, or ambiguous name vocabulary of stock;
  • weight represents the weight of the name vocabulary, strongly related word, or ambiguous name vocabulary;
  • count represents the number of occurrences of the word i;
  • doc_len represents the text length of the information.
  • α and β are set as threshold values (step 1505).
  • The set of statements with a score greater than or equal to β is marked as the set of non-ambiguous statements, and the set of statements with a score less than or equal to α is marked as the set of ambiguous statements.
  • These two thresholds can be fixed or adjusted to suit the specific situation. For example, users can customize the two thresholds to adjust the sensitivity of the system. When the amount of information collected is very large, the user can increase the system sensitivity by increasing β or decreasing α to ensure the accuracy of the ambiguity determination. Conversely, when the amount of information collected is very small, the user can reduce the sensitivity of the system by reducing β or increasing α to ensure the completeness of the information.
  • If the analysis result obtained in step 1504 is greater than β (step 1505), the information is marked as non-ambiguous information (step 1509).
  • The marking may be done manually, automatically by the system, or by a combination of the two.
  • The system then collects the corpus. Step 1510 can be performed by the corpus collection unit 305.
  • The collected corpus may be the entire non-ambiguous information, a sentence containing specific name vocabulary extracted from the information, or some non-ambiguous collocations.
  • If the analysis result obtained in step 1504 is less than α (step 1506), the information is marked as ambiguous information (step 1508). The marking may be done manually, automatically by the system, or by a combination of the two.
  • The corpus collection unit 305 can then collect the corpus (step 1510).
  • The collected corpus may be the entire ambiguous information, a sentence containing specific name vocabulary extracted from the information, or some ambiguous collocations.
  • Information whose analysis result falls between α and β may be marked as ambiguous or non-ambiguous information by manual review (step 1507, step 1508, step 1509).
  • The marking may be done manually, automatically by the system, or by a combination of the two.
  • The system then collects the corpus. Step 1510 can be completed by the corpus collection unit 305.
  • The collected corpus may be the entire ambiguous information, a sentence containing specific name vocabulary extracted from the information, or some ambiguous and non-ambiguous collocations.
  • In step 1503, all or part of the keyword lexicon can be matched against the news containing the stock name.
  • For example, the related word dictionary alone can be matched against the news, or the related word dictionary and the ambiguous list can be combined and matched against the news.
  • Some steps in the process may be performed sequentially or synchronously; for example, steps 1501 and 1502 may be performed simultaneously or one after the other. In addition, some steps in the process are not necessary. For example, whether a news item is ambiguous can be determined directly by manual review, skipping the intermediate steps.
  • Each sentence is segmented into words, yielding a set of elements: the specific name vocabulary, the surrounding vocabulary, and their relative position information.
  • These elements are assembled into a feature set in a specified format, and the ambiguity analysis model Model is trained on it (step 1511):
  • Within the ambiguity analysis module, the trained Model can automatically determine whether a stock name in a news item is ambiguous.
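  • The feature set described for step 1511 (the name itself, the surrounding words, and their relative positions) could be assembled along the following lines. This is a sketch under stated assumptions: whitespace splitting stands in for a real word segmenter, and the window size, the feature key format, and the name `context_features` are hypothetical.

```python
def context_features(tokens, target, window=2):
    """For each occurrence of target, collect nearby words keyed by their relative position."""
    features = []
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        feats = {"name": target}
        for offset in range(-window, window + 1):
            j = i + offset
            if offset != 0 and 0 <= j < len(tokens):
                feats[f"w[{offset:+d}]"] = tokens[j]
        features.append(feats)
    return features

# Whitespace split is a stand-in for segmentation; Chinese text would need a segmenter.
tokens = "shares of Apple rose sharply today".split()
feats = context_features(tokens, "Apple")
print(feats)
```

  • Dictionaries of this shape can be passed to any of the classifiers named above after a suitable encoding step.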
  • Figure 16 shows an embodiment of the sentiment analysis module.
  • The system collects an emotional seed vocabulary (step 1601).
  • The emotional seed vocabulary may include, but is not limited to, positive emotional words such as good, excellent, increase, growth, profit, make up, earn, daily limit, and soar, and negative emotional words such as decrease, sharp decrease, compensatory decline, decline, loss, down limit, and reduction.
  • The system collects stock-related news by visiting various financial websites (step 1602).
  • Emotional vocabulary collocations are established and continually expanded (step 1603).
  • The expansion of the emotional vocabulary collocations can be accomplished by regularly accessing major financial websites and extracting stock-related news.
  • After the expansion of the emotional vocabulary and the emotional vocabulary collocations is completed, the system obtains the emotional vocabulary collocation set (step 1604).
  • The system screens the stock-related news, manually or automatically, keeping the relevant sentences and removing the ambiguous ones to obtain a non-ambiguous stock sentence set (step 1605).
  • The system matches the non-ambiguous stock sentence set against the emotional vocabulary collocation set to identify the sentiment type of the sentences in the set.
  • A set of positive and negative sentences is obtained (step 1606).
  • The set of positive and negative sentences can be manually reviewed.
  • The manually reviewed sentences are marked with one of three sentiment types: positive, negative, or neutral (step 1607).
  • The sentences marked as neutral by manual review are sent to the sentiment analyzer for sentiment category recognition training (step 1608).
  • The algorithms that the sentiment analyzer can employ include, but are not limited to, the maximum entropy model, support vector machines, and naive Bayes. After the sentiment analyzer completes training, it can determine the sentiment type of the sentences marked as neutral by manual review (step 1611). Sentences marked as positive or negative by manual review are sent to the positive and negative collocation scoring engine for further emotion type recognition (step 1609).
  • The positive and negative collocation scoring engine quantifies the degree of matching between the non-ambiguous stock sentence set and the emotional vocabulary collocation set, and gives a corresponding score according to the quantized result.
  • A high score indicates that the stock sentence or stock sentence set contains one or more strong emotional vocabulary collocations, and the emotion type of the sentence or set may be directly determined to be positive or negative (step 1610).
  • A low score indicates that the stock sentence or stock sentence set does not contain a strong emotional vocabulary collocation, so a sentence with a low score is sent to the sentiment analyzer for emotion type determination (step 1611).
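  • The routing between direct determination (step 1610) and the sentiment analyzer (step 1611) can be sketched as below. The collocations, their signed strengths, the threshold, and the use of simple substring matching are all assumptions made for illustration; the patent does not specify how the scoring engine quantifies the match.

```python
def collocation_score(sentence, collocations):
    """Sum the signed strengths of every collocation that occurs in the sentence."""
    return sum(strength for phrase, strength in collocations.items() if phrase in sentence)

def route(sentence, collocations, threshold=1.0):
    """Decide directly when the score is strong; otherwise defer to the sentiment analyzer."""
    score = collocation_score(sentence, collocations)
    if score >= threshold:
        return "positive"
    if score <= -threshold:
        return "negative"
    return "analyzer"

# Hypothetical collocations with signed strengths (positive > 0, negative < 0).
collocations = {"profit soared": 2.0, "hit the down limit": -2.0, "slightly higher": 0.3}
print(route("Quarterly profit soared on strong sales", collocations))
print(route("Shares closed slightly higher", collocations))
```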
  • Figure 17 shows an example of determining the type of emotion.
  • The system obtains information in step 1701. This may be non-ambiguous or ambiguous information that has undergone ambiguity analysis, information that has been emotionally classified but not yet marked with an emotion type, or initial information that has not been processed at all.
  • After being acquired, the information may be stored, for example, in the database 104 (step 1704).
  • Information that has been emotionally classified but not yet marked is directly tagged with its sentiment category (step 1703).
  • The non-ambiguous information and the ambiguous information are sent to the sentiment analyzer for emotion classification.
  • The sentiment analyzer stores the information in the database 104.
  • The algorithms that the sentiment analyzer can employ include, but are not limited to, the maximum entropy model, support vector machines, and naive Bayes.
  • The sentiment analyzer first determines whether the ambiguous information and the non-ambiguous information contain a strong emotional collocation. If a strong emotional collocation is included, the emotion type of the information can be determined directly (step 1702), and the information is then marked with the corresponding emotional category according to the result of the determination (step 1703).
  • For information that does not include a strong emotional collocation, the scoring engine in the sentiment analyzer scores the types of emotion contained in the information and ultimately determines the sentiment type of the information from the scoring result. After the emotion type determination is completed, the emotion type of the information is marked (step 1703), and after marking, the information is stored in the database 104.
  • Figure 18 depicts an embodiment of a sentiment classification method.
  • The system obtains a non-ambiguous statement set G (step 1801).
  • Step 1801 can be accomplished by the acquisition unit 307 of the sentiment analysis module 306 by accessing the storage module 315.
  • The system acquires the emotional vocabulary collocation set Ω (step 1802).
  • Step 1802 can be accomplished by the processing unit 309 of the sentiment analysis module 306 by accessing the emotional vocabulary collocation library 507 in the database 104.
  • The system matches the acquired non-ambiguous statement set G against the emotional vocabulary collocation set Ω (step 1803).
  • Step 1803 can be accomplished by the matching unit 308 of the sentiment analysis module 306.
  • Step 1803 is a logical judgment.
  • The system matches the set H against the strong positive and negative emotional vocabulary collocation set F (step 1807). The set F includes, but is not limited to, collocations whose manually verified sentiment matching accuracy is greater than a certain threshold (e.g., an accuracy above 90%).
  • Step 1807 can be completed by the matching unit 308.
  • Step 1808 makes a logical determination on the matching result, dividing the set H into sentences that contain strong positive and negative emotional collocations (step 1809) and sentences that do not (step 1810).
  • The sentiment analyzer 311 of the sentiment analysis module 306 performs sentiment classification on the sentences that do not contain strong positive and negative emotional collocations (step 1811).
  • The algorithms that the sentiment analyzer 311 can employ include, but are not limited to, the maximum entropy model, support vector machines, naive Bayes, and decision trees.
  • The system obtains the set M' of all sentences with positive or negative emotions (step 1812).
  • Step 1813 determines whether all sentences in M' belong to a single emotion type. If they do, the system marks the news as the corresponding positive or negative emotion type (step 1815).
  • Step 1815 can be completed by the processing unit 309. If the sentences in M' contain two or more emotion types, the processing unit 309 of the sentiment analysis module 306 compares the positive and negative sentiment category scores in M' according to a certain algorithm (step 1814), and then marks M' as the emotional category with the higher score (step 1815).
  • The algorithm needs to satisfy the following conditions. First, the positive or negative degree of a strong collocation can be defined manually, and this degree is one element of the score. Second, the distance between the strong collocation and the stock is another factor considered in the score.
  • The score is smaller than the score of any strong rule.
  • The positive and negative collocation scores in the headline are higher than those elsewhere (such as in the news body).
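  • One way to combine the stated factors (a manually defined degree, the distance to the stock name, and a higher weight for the headline) when comparing positive and negative scores in step 1814 is sketched below. The inverse-distance weighting, the headline boost of 2.0, the "neutral" result on a tie, and all names are assumptions for illustration; the patent does not give the formula.

```python
from dataclasses import dataclass

@dataclass
class Hit:
    polarity: int    # +1 for a positive collocation, -1 for a negative one
    degree: float    # manually assigned strength of the collocation
    distance: int    # number of tokens between the collocation and the stock name
    in_title: bool   # True if the collocation occurs in the headline

def hit_score(hit, title_boost=2.0):
    """Closer collocations count more; headline hits are boosted."""
    base = hit.degree / (1 + hit.distance)
    return base * (title_boost if hit.in_title else 1.0)

def overall_sentiment(hits):
    """Compare summed positive and negative scores and return the larger side."""
    pos = sum(hit_score(h) for h in hits if h.polarity > 0)
    neg = sum(hit_score(h) for h in hits if h.polarity < 0)
    if pos == neg:
        return "neutral"  # tie handling is an assumption
    return "positive" if pos > neg else "negative"

hits = [Hit(+1, 1.0, 0, True), Hit(-1, 3.0, 5, False)]
print(overall_sentiment(hits))
```

  • In this example the weaker positive collocation wins because it sits in the headline next to the stock name, while the stronger negative one is five tokens away in the body.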
  • Step 1818 can be performed by sentiment analysis module 306.
  • The semantic knowledge base 512 can identify sentences, phrases, or paragraphs in natural language that contain no emotional vocabulary but still express emotion. For example: "Today my husband and I applied for a divorce, and he wanted to take the custody of the child from me." This sentence contains no emotional vocabulary, so its emotional category would not be recognized by ordinary sentiment analysis methods. By retrieving the semantic knowledge base 512, however, the sentiment category of this sentence can be identified.
  • After completing the second emotion determination through the semantic knowledge base 512, the system marks the sentence as the corresponding sentiment type (step 1815). After completing the sentiment classification, the system can display the emotion type of a news item as a whole. For a news item that involves several financial products of the same or different kinds, it can also display the emotion type of that same news item with respect to one or more of those products.
  • Figure 19 depicts an embodiment of a training sentiment analyzer.
  • The system collects and constructs an emotional vocabulary library, which can be completed by the acquisition module 101. Its sources include, but are not limited to, literature (books, newspapers, periodicals, patent documents, dissertations, official documents, etc.), academic reports, market reports, news, reviews, web dictionaries, existing dictionaries in the field, and companies' research reports, announcements, product brochures, and related websites. Information can be obtained in a centralized or distributed manner, locally or remotely, over wired or wireless connections, manually or automatically, or by a combination of several of these.
  • The system further collects information to expand the emotional vocabulary dictionary and the emotional vocabulary collocations. This further collection can be completed by the collecting unit 201 of the acquisition module 101. The sources include, but are not limited to, literature (books, newspapers, journals, patent documents, dissertations, official documents, etc.), academic reports, market reports, news, reviews, online dictionaries, existing dictionaries in the field, and companies' research reports, announcements, product manuals, and related websites. Information can be obtained in a centralized or distributed manner, locally or remotely, over wired or wireless connections, manually or automatically, or by a combination of several of these.
  • The algorithms used include, but are not limited to, the PMI algorithm, the log-likelihood ratio, the chi-square test, cosine similarity, the Dice coefficient, and F1-like measures.
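  • As an illustration of the first algorithm in that list, pointwise mutual information (PMI) between two words can be estimated from sentence-level co-occurrence, and candidate words with high PMI against known emotional words become candidates for expanding the collocation set. The sketch below uses the standard definition PMI(x, y) = log2(p(x, y) / (p(x) p(y))). The toy corpus and the function name `pmi_table` are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

def pmi_table(sentences):
    """Compute PMI for every word pair that co-occurs in at least one sentence."""
    n = len(sentences)
    word_df = Counter()   # number of sentences containing each word
    pair_df = Counter()   # number of sentences containing each word pair
    for tokens in sentences:
        vocab = sorted(set(tokens))
        word_df.update(vocab)
        pair_df.update(combinations(vocab, 2))
    return {
        pair: math.log2((c / n) / ((word_df[pair[0]] / n) * (word_df[pair[1]] / n)))
        for pair, c in pair_df.items()
    }

# Toy segmented corpus, for illustration only.
sentences = [
    ["profit", "soar"],
    ["profit", "soar"],
    ["loss", "widen"],
    ["profit", "flat"],
]
pmi = pmi_table(sentences)
print(pmi[("loss", "widen")], pmi[("profit", "soar")])
```

  • Pairs are stored with their words in alphabetical order, so a lookup uses ("loss", "widen") rather than ("widen", "loss").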
  • The emotional vocabulary collocation set Ω is obtained (step 1901), and the non-ambiguous statement set is obtained (step 1902). It should be noted that the emotional vocabulary collocation set Ω may be acquired step by step as described in this embodiment, or in a single step.
  • The sentiment analysis module 306 matches the emotional vocabulary collocation set Ω against the non-ambiguous statement set (step 1903), and the matched sentences are recorded as the emotional sentence set H (step 1904).
  • The matching can be manual or automatic, and the algorithms that can be used include, but are not limited to, regular expressions.
  • The emotional sentence set H is manually reviewed, and each sentence in the set is marked with a positive, negative, or neutral emotion category (step 1905). After the review is completed, the manually classified sentences are stored in the corpus collection unit 310 (step 1909).
  • For each emotional vocabulary collocation, the system automatically counts how many of the sentences it matched in the emotional sentence set H fall into each of the three emotion categories (positive, negative, and neutral), and from these counts obtains the sentiment classification accuracy rate R of the collocation (step 1906).
  • The sentiment classification accuracy rate can be calculated by the following formula:
  • Likewise, the negative and neutral sentiment classification accuracy rates of the emotional vocabulary collocation are R2 and R3, respectively.
  • The accuracy rates R of the collocation for the three sentiment categories are compared with a preset threshold (set to 90% in this embodiment) (step 1907). If the accuracy rate for some emotion category is greater than 90%, the emotional vocabulary collocation is determined to be a strong emotional collocation. For example, if the positive sentiment classification accuracy rate of a collocation in the emotional sentence set H satisfies R1 > 90%, the collocation is directly determined to be a strong positive emotional vocabulary collocation. All strong emotional vocabulary collocations are collected to obtain the strong emotional vocabulary collocation set F (step 1908).
  • The strong emotional vocabulary collocation set F is stored in the corpus collection unit 310 (step 1909).
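  • Steps 1906 to 1908 can be sketched as follows, reading the accuracy rate of a collocation for a category as the share of its matched sentences that manual review placed in that category. That reading, the 0.9 threshold taken from this embodiment, and the sample data and function names are assumptions for illustration.

```python
from collections import Counter

CATEGORIES = ("positive", "negative", "neutral")

def accuracy_rates(labels):
    """Share of each category among the reviewed sentences matched by one collocation."""
    counts = Counter(labels)
    total = len(labels)
    return {cat: counts[cat] / total for cat in CATEGORIES}

def strong_collocations(reviewed, threshold=0.9):
    """Map each collocation whose best category rate exceeds the threshold to that category."""
    strong = {}
    for collocation, labels in reviewed.items():
        if not labels:
            continue  # no matched sentences, so no rate can be computed
        rates = accuracy_rates(labels)
        best = max(rates, key=rates.get)
        if rates[best] > threshold:
            strong[collocation] = best
    return strong

# Hypothetical manual review results for two collocations.
reviewed = {
    "profit soared": ["positive"] * 19 + ["neutral"],
    "shares moved": ["positive"] * 4 + ["negative"] * 3 + ["neutral"] * 3,
}
F = strong_collocations(reviewed)
print(F)
```

  • Here "profit soared" reaches a positive rate of 0.95 and enters F, while "shares moved" has no category above 0.9 and is left to the sentiment analyzer.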
  • The strong emotional vocabulary collocation set F is defined as follows:
  • The corpus collection can be real-time or periodic.
  • The sentences in the set H that have been marked with the positive, negative, and neutral sentiment categories (step 1905) can also be used as a corpus to train the sentiment analyzer (step 1910).
  • The algorithm model Model' that the sentiment analyzer can adopt is a supervised learning algorithm, including but not limited to the Maximum Entropy Model, Naive Bayes, Support Vector Machine, Non-negative Matrix Tri-factorization, Genetic Algorithm, and K-Nearest Neighbor.
  • Features adopted in the supervised model include, but are not limited to: the number of occurrences of a word, its part of speech, the relative position of words, dependency features between words, and abstract features of words (such as word vectors obtained by unsupervised learning).
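  • To make the training step concrete, the following is a small multinomial Naive Bayes classifier with add-one smoothing that uses only word occurrence counts as features. It illustrates one of the listed algorithms and is not the patent's Model'. The class name, the two-category toy corpus, and the choice of features are assumptions.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Multinomial Naive Bayes over word counts with Laplace (add-one) smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, tokens):
        n_docs = sum(self.class_counts.values())
        best_label, best_logp = None, -math.inf
        for label, n in self.class_counts.items():
            total = sum(self.word_counts[label].values())
            logp = math.log(n / n_docs)  # class prior
            for w in tokens:
                # Smoothed likelihood, so unseen words do not zero out the class.
                logp += math.log((self.word_counts[label][w] + 1) / (total + len(self.vocab)))
            if logp > best_logp:
                best_label, best_logp = label, logp
        return best_label

# Toy labeled corpus standing in for the reviewed emotional sentence set.
docs = [["profit", "soar"], ["profit", "rise"], ["loss", "widen"], ["loss", "fall"]]
labels = ["positive", "positive", "negative", "negative"]
model = NaiveBayes().fit(docs, labels)
print(model.predict(["profit", "jump"]))
```

  • Richer features such as part of speech or relative position would be added as extra tokens or as a separate feature vector, depending on the classifier chosen.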
  • the sentiment analyzer algorithm model Model' can be expressed as:
  • Figure 20 depicts an embodiment of a classification display.
  • Figure 20 depicts a user interaction interface for the classified display, which can be shown on a peripheral device including, but not limited to, a mobile device, a mobile phone, a laptop, a tablet, a wearable device, smart home appliances, smart vehicles, and smart instruments and equipment.
  • The classified display is shown on the graphical interface, and the information related to the user's search keyword is listed in order by positive, negative, and neutral emotion category.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computing Systems (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

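An information sentiment analysis method based on ambiguity analysis, comprising performing ambiguity analysis (702) and sentiment analysis (703) on information using an ambiguity analysis model and a sentiment analysis model. Another aspect relates to a method of training the ambiguity analysis model and the sentiment analysis model, comprising collecting information (801), constructing a lexicon and an information base (802), performing ambiguity analysis and sentiment analysis on the information using the lexicon, collecting a corpus (803), training the models (804), and so on. Also disclosed is an information sentiment analysis system comprising an input/output module (103), an acquisition module (101), a processing module (102), and a database (104).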

Description

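An Information Sentiment Analysis Method and System
Technical Field
The present invention belongs to the field of natural language processing and relates to information collection, information processing, and machine learning, and in particular to a sentiment analysis method based on a language model.
Background
As the Internet becomes ever more widespread, people are increasingly accustomed to using it to obtain information. However, because the Internet's coverage keeps expanding and the amount of information keeps growing, the results returned when people search for a particular kind of information are often mixed. The same word can take on different meanings in different collocations, while a searcher often wants only the information related to one particular meaning of a word. People therefore want search results that have been filtered for ambiguity with respect to a specific meaning. At the same time, people often want to quickly obtain classification results that reflect the sentiment orientation of some information, to help them make judgments quickly or understand that information.
Summary of the Invention
One aspect of the present invention relates to an information sentiment analysis method based on ambiguity analysis, which includes performing ambiguity analysis and sentiment analysis on information using an ambiguity analysis model and a sentiment analysis model. Another aspect of the present invention relates to a method of training the ambiguity analysis model and the sentiment analysis model, including collecting information, constructing a lexicon, performing ambiguity analysis and sentiment analysis on the information using the lexicon, collecting a corpus, training the models, and so on. Yet another aspect of the present invention relates to an information sentiment analysis system including an input/output module, an acquisition module, a processing module, and a database.
In some embodiments, the technical solution disclosed in this specification can collect information, generate an information base, screen out the non-ambiguous information in the information base, and perform sentiment analysis on the non-ambiguous information.
In some embodiments, the technical solution disclosed in this specification includes an ambiguity analysis model that can apply a certain algorithm to the collected information to distinguish ambiguous from non-ambiguous information and generate a non-ambiguous information set. In some embodiments, the technical solution further includes a sentiment analysis model that can apply a certain algorithm to perform sentiment analysis on information. The information may come from the non-ambiguous information set or from the information base.
In some embodiments, the technical solution disclosed in this specification further includes a method of training the ambiguity analysis model. The method includes: extracting information, scoring the information using certain scoring rules, generating a model training corpus according to the scoring results, and training the ambiguity analysis model with the model training corpus.
In some embodiments, the technical solution disclosed in this specification further includes a method of training the sentiment analysis model. The method includes: extracting information, matching the information using certain matching rules, generating a model training corpus according to the matching results, and training the sentiment analysis model with the model training corpus.
Description of the Drawings
To explain the technical solutions of the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and those of ordinary skill in the art can apply the present invention to other similar scenarios according to these drawings without any creative effort. Unless it is obvious from the context or otherwise stated, the same reference numerals in the drawings denote the same structures and operations.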
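Figure 1: Schematic diagram of the modules of an information sentiment classification system;
Figure 2: Schematic diagram of the acquisition module;
Figure 3: Schematic diagram of the processing module;
Figure 4: Schematic diagram of the input/output module;
Figure 5: Schematic diagram of the database;
Figure 6: Schematic diagram of the system and user interaction flow;
Figure 7: Schematic diagram of the flow of the information sentiment classification system;
Figure 8: Schematic diagram of the model training flow;
Figure 9: Schematic diagram of a usage scenario;
Figure 10: Schematic diagram of an embodiment of the acquisition flow;
Figure 11: Schematic flow diagram of a system embodiment applied to the field of financial products;
Figure 12: Schematic flow diagram of an ambiguity analysis embodiment applied to the field of financial products;
Figure 13: Flowchart of an ambiguity analysis embodiment;
Figure 14: Detailed flowchart of an ambiguity analysis embodiment;
Figure 15: Flowchart of an embodiment of training the ambiguity analysis model;
Figure 16: Schematic flow diagram of a sentiment analysis embodiment applied to the field of financial products;
Figure 17: Flowchart of a sentiment analysis embodiment;
Figure 18: Detailed flowchart of a sentiment analysis embodiment;
Figure 19: Flowchart of an embodiment of training the sentiment analyzer;
Figure 20: Schematic diagram of an embodiment of the user interaction interface.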
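Contents of the Invention
As used in this specification and the claims, unless the context clearly indicates otherwise, words such as "a", "an", "one", and/or "the" do not refer specifically to the singular and may also include the plural. In general, the terms "include" and "comprise" indicate only that the explicitly identified steps and elements are included; these steps and elements do not constitute an exclusive list, and a method or device may also contain other steps or elements.
The information processing method and system described in this specification can collect information, construct a lexicon, and use the lexicon to perform ambiguity analysis and sentiment analysis on the information. In some embodiments, this specification relates to an information sentiment analysis system including an input/output module, an acquisition module, a processing module, and a database.
Different embodiments of the present invention are applicable to a variety of fields, including but not limited to: investment in finance and its derivatives (including but not limited to stocks, bonds, gold, paper gold, silver, foreign exchange, precious metals, futures, and money market funds), technology (including but not limited to mathematics, physics, chemistry and chemical engineering, biology and bioengineering, electronic engineering, communication systems, the Internet, and the Internet of Things), politics (including but not limited to political figures, political events, and countries), and news (by region, including but not limited to regional, domestic, and international news; by subject, including but not limited to political, technology, economic, lifestyle, and weather news). The above description of the applicable fields is only a specific example and should not be regarded as the only feasible implementation. Obviously, for those skilled in the art, after understanding the basic principle of an information sentiment analysis method and system based on ambiguity analysis, various modifications and changes in form and detail may be made to the fields in which the method and system are applied without departing from this principle, but these modifications and changes remain within the scope described above.
The present invention is applicable to different types of databases, including but not limited to hierarchical databases, network databases, and relational databases. Obviously, for those skilled in the art, after understanding the basic principle of an information sentiment analysis method and system based on ambiguity analysis, various modifications and changes in form and detail may be made to the fields in which the method and system are applied without departing from this principle, but these modifications and changes remain within the scope described above.
In some embodiments, the technical solution disclosed in this specification can collect information, generate an information base, screen out the non-ambiguous information in the information base, and perform sentiment analysis on the non-ambiguous information.
In some embodiments, the technical solution disclosed in this specification includes an ambiguity analysis model that can apply a certain algorithm to the collected information to distinguish ambiguous from non-ambiguous information and generate a non-ambiguous information set. In some embodiments, the technical solution further includes a sentiment analysis model that can apply a certain algorithm to perform sentiment analysis on information. The information may come from the non-ambiguous information set or from the information base.
In some embodiments, the technical solution disclosed in this specification further includes a method of training the ambiguity analysis model. The method includes: extracting information, scoring the information using certain scoring rules, generating a model training corpus according to the scoring results, and training the ambiguity analysis model with the model training corpus.
In some embodiments, the technical solution disclosed in this specification further includes a method of training the sentiment analysis model. The method includes: extracting information, matching the information using certain matching rules, generating a model training corpus according to the matching results, and training the sentiment analysis model with the model training corpus.
To explain the technical solutions of the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and those of ordinary skill in the art can apply the present invention to other similar scenarios according to these drawings without any creative effort. Unless it is obvious from the context or otherwise stated, the same reference numerals in the drawings denote the same structures and operations.
Figure 1 is a schematic diagram of a system that can be used for information sentiment analysis. The system may include, but is not limited to, one or more acquisition modules 101, one or more processing modules 102, one or more input/output modules 103, and one or more databases 104. Some or all of these modules may be connected to the network 105. The modules may be centralized or distributed, and local or remote. In some embodiments the modules are independent; in other embodiments, some or all of the modules may be integrated into a single module that works as a whole.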
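The acquisition module 101 obtains the required information in various ways. Information may be obtained directly (for example, directly from the network 105) or indirectly (for example, through the acquisition units of other modules). It may be obtained in a centralized manner (for example, through a single channel) or in a distributed manner (for example, through multiple channels). It may be obtained locally (for example, from a local module or unit with a storage function) or remotely (for example, by crawling through a search engine). It may be obtained over a wired connection (for example, through electrical or optical cable) or a wireless connection (for example, through radio or optical signals). It may be obtained manually or automatically. It may be obtained with existing algorithms or with user-defined algorithms. It may also be obtained by methods similar to any of the above, or by any combination of them. The source of the required information may be the network 105 (a metropolitan area network, wide area network, local area network, etc.), news, newspapers, or media, or one or more of the processing module 102 (one or more), the input/output module 103 (one or more), the database 104 (one or more), and the like. For example, the acquisition module 101 may extract the required information from all or part of the information produced during intermediate processing in the processing module 102; it may collect the required information from the words, phrases, sentences, uploaded pictures, audio, video, and other information entered by the user; and it may extract the required information from the database 104. The acquisition module 101 may also send all or part of the collected information to one or more of the processing module 102, the database 104, the input/output module 103, and the like. The required information may include, but is not limited to, one or more of: industry-specific name vocabulary, vocabulary strongly related to the specific name vocabulary, information containing such vocabulary, and vocabulary containing emotional information. The industries may include, but are not limited to, one or more of sports, entertainment, economics, politics, and culture. The specific name vocabulary may include, but is not limited to, one or more of the proper nouns, full names, abbreviations, codes, synonyms, and acronyms of a specific industry. The vocabulary strongly related to the specific name vocabulary may include, but is not limited to, one or more of the nouns, verbs, adjectives, short sentences, phrase collocations, field-specific industry vocabulary, synonyms, antonyms, common collocations, component nouns, derivatives, and compound words related to the specific name vocabulary. The information containing the above vocabulary may include, but is not limited to, one or more of dictionaries, news, research reports about companies, announcements, product manuals, and related web pages. The categories of emotional vocabulary may include, but are not limited to, one or more of positive, negative, and neutral. The form of the information may include, but is not limited to, one or more of text, pictures, audio, and video. The languages used in the required information may include, but are not limited to, one or more of Chinese, English, Japanese, Korean, French, and German. The above description of the required information is only a specific example and should not be regarded as the only feasible implementation. Obviously, for those skilled in the art, after understanding the basic principle of the required information, various modifications and changes may be made to its content without departing from this principle, but these modifications and changes remain within the scope described above.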
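The processing module 102 can communicate bidirectionally with the network 105, with the acquisition module 101, with the database 104, and with the input/output module 103. The processing module 102 may collect the required information directly from the network 105, or may receive information transmitted by the acquisition module 101. This information includes, but is not limited to, one or more of specific name vocabulary, vocabulary strongly related to the specific name vocabulary, information containing such vocabulary, and vocabulary containing emotional information. The processing module 102 may also send information to the network 105. This information may include, but is not limited to, information that has been processed by the processing module 102 and information that has not. The information processed by the processing module 102 may include, but is not limited to, information that has been classified by applying specific classification rules. After completing the information processing, the processing module 102 may store the processed information in the database 104 according to a specific storage method. Likewise, the processing module 102 may store unprocessed information transmitted from the acquisition module 101 or the network 105 in the database 104. The storage method may include, but is not limited to, one or more of sequential storage, linked storage, index storage, and hash storage. The unprocessed information may include, but is not limited to, one or more of unclassified words, phrases, sentences, and paragraphs. The processed information may include, but is not limited to, one or more of classified words, phrases, sentences, and paragraphs. The processing module 102 may also send information to the input/output module 103; this information may include, but is not limited to, processed and unprocessed information. The processing module 102 may also receive data or instructions sent by the input/output module 103, and act accordingly after parsing the received data or instructions.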
输入输出模块103可以将系统内部信息与外周设备进行交换并接收外部信息。输入输出模块103可以通过网络105连接外周设备,或者直接连接外周设备。输入输出模块103可以接收用户输入的信息。所述用户输入的信息可以来自网络105,也可以来自外周设备,也可以来自与系统相通信的第三方。输入输出模块103可以将生成的输出结果推送给外周设备,可用于展示给用户。所述外周设备可以包含但不限于鼠标、键盘、触控板、轨迹球、语音识别设备、图型图像识别设备、显示设备、移动电话、PC、Macintosh、平板电脑等中的一种或多种。用户输入的形式可以包括但不限于数字、字符、符号、文字、声音、图形图像、视频等中的一种或多种。输出方式可以包括但不限于将通过特定的分类规则完成分类的信息进行分类输出。输入输出模块103能够与采集模块101传递或交换信息。输入输出模块103可以接收采集模块101传递的信息。输入输出模块103可以将通过外周设备接收到的用户输入信息传递给采集模块101。输入输出模块103可以将采集模块101采集到的信息进行输出,可以将信息通过外周设备展示给用户。输入输出模块103能够与处理模块102传递或交换信息。输入输出模块103可以将接收到的信息传输给处理模块102进行处理。输入输出模块103可以将接收到的处理模块102传递的信息进行输出,可以将信息通过外周设备展示给用户。输入输出模块103能够与数据库104传递信息。输入输出模块103可以将接收到的数据库104传递的信息进行输出,可以将信息通过外周设备展示给用户。输入输出模块103可以将接收到的输入信息传递给数据库104。
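The database 104 or other storage devices in the system provide information storage. They can digitize information and store it on storage devices that use electrical, magnetic, optical or similar means, and they hold all kinds of information such as programs and data. The database 104 or other storage devices may store information electrically, for example various memories, random access memory (RAM) and read-only memory (ROM); magnetically, for example hard disks, floppy disks, magnetic tape, magnetic core memory, bubble memory and USB flash drives; optically, for example CDs or DVDs; or magneto-optically, for example magneto-optical disks. Their access mode may be random access, serial access, read-only, and so on. They may be volatile or non-volatile memory, and may be local, remote, or on a cloud server. The database 104 can classify, sort, filter and otherwise process the information it holds. The database 104 or other storage devices can transfer or exchange information with the collection module 101: they can receive and store the information collected by the collection module 101, and, on receiving an instruction, information they hold can be extracted and passed to the collection module 101. The instruction may come directly from the collection module 101 or from another module, such as the input/output module 103 or the processing module 102. The instruction may also come from the database 104 or other storage devices themselves, for example a timer that directs them to send information to the collection module 101. The database 104 or other storage devices can likewise transfer or exchange information with the processing module 102, receiving and storing the information it passes to them. On receiving an instruction, information they hold can be extracted and passed to the processing module 102. The instruction may come directly from the processing module 102 or from another module, such as the input/output module 103 or the collection module 101, or from the database 104 or other storage devices themselves, for example a timer that directs them to send information to the processing module 102. The database 104 or other storage devices can also transfer and exchange information with the input/output module 103, receiving and storing the information it passes to them. On receiving an instruction, information they hold can be extracted and passed to the input/output module 103. The instruction may come directly from the input/output module 103 or from another module, such as the collection module 101 or the processing module 102, or from the database 104 or other storage devices themselves, for example a timer that directs them to send information to the input/output module 103.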
系统中各个模块之间,模块和外周设备之间的连接,以及系统与云服务器之间的连接都可以通过有线连接或无线连接。其中有线连接可以包括但不限于使用金属电缆、光学电缆或者金属和光学的混合电缆等中的一种或多种,例如:同轴电缆、通信电缆、软性电缆、螺旋电缆、非金属护皮电缆、金属护皮电缆、多芯电缆、双绞线电缆、带状电缆、屏蔽电缆、电信电缆、双股电缆、平行双芯导线、和双绞线等。以上描述的例子仅作为方便说明之用,有线连接的媒介还可以是其它类型,例如,其它电信号或光信号等的传输载体。无线连接可以包括但不限于无线电通信、自由空间光通信、声通信、和电磁感应等中的一种或多种。其中所述无线电通信包括但不限于IEEE802.11系列标准、IEEE802.15系列标准(例如蓝牙技术和紫蜂(ZigBee)技术等)、第一代移动通信技术、第二代移动通信技术(例如FDMA、TDMA、SDMA、CDMA、和SSMA等)、通用分组无线服务技术、第三代移动通信技术(例如CDMA2000、WCDMA、TD-SCDMA、和WiMAX等)、第四代移动通信技术(例如TD-LTE和FDD-LTE等)、卫星通信(例如GPS技术等)和其它运行在ISM频段(例如2.4GHz等)的技术等。所述自由空间光通信可以包括但不限于可见光、红外线讯号等中的一种或多种。所述声通信可以包括但不限于声波、超声波讯号等中的一种或多种。所述电磁感应包括但不限于近场通信技术等。以上描述的例子仅作为方便说明之用,无线连接的媒介还可以是其它类型,例如,Z-wave技术、蓝牙低功耗(BLE)技术、433MHz通信协议频段、其它收费的民用无线电频段和军用无线 电频段等。
系统中各个模块之间,模块和外周设备之间的连接,以及系统与存储设备或云服务器之间的连接并不局限于以上所列举的技术。上述的连接方式在该系统中可以单一使用,也可以多种连接方式结合使用,在不同连接方式结合使用的过程中,需要配合相应的网关设备达到信息交互。各个模块也可以集成在一起,通过同一个设备或电子元件上实现一个以上模块的功能。外周设备也可以集成在一个或多个模块的实施设备或电子元件上,而单个或多个模块亦可以集成在单个或多个外周设备或电子元件上。另外,模块间信息传输的方式可以是直接的也可以是间接的、可以是有线的也可以是无线的,可以是顺序进行的也可以是同时进行的,可以是周期的也可以是非周期的等。以上对模块间信息传输方式的描述仅仅是具体的示例,不应被视为是唯一可行的实施方案。显然,对于本领域的专业人员来说,在了解模块间信息传输方式的基本原理后,可能在不背离这一原理的情况下,对所需要的信息的内容进行各种修正和改变,但是这些修正和改变仍在以上描述的范围之内。
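FIG. 2 is a schematic diagram of the collection module 101. The collection module 101 may include but is not limited to one or more collection units 201, one or more processing units 202, and one or more storage units 203. These units may be centralized or distributed, local or remote. In some embodiments these units are independent; in other embodiments, some of the units may be integrated into a single unit that works as a whole.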
采集模块101可以通过采集单元201采集信息。采集到的全部或部分信息可以存储到存储单元203中,还可以存储到数据库104中。所述采集到的全部或部分信息可以传递给处理单元202进行处理。处理结果可以存储到存储单元203中。对于所述信息的处理可以包含但不限于提取信息中的一些关键词汇,对信息的价值进行评估(例如,可以估计采集到的信息与用户所需要的信息的关联程度)等。处理单元202处理的信息可以是来自于采集单元201,也可以是来自于存储单元203,还可以来自于其他模块或系统内具有存储功能的设备(例如,数据库104等)。 存储单元203中的信息可以进一步交付给数据库104进行存储,也可以传递给处理模块102进行处理,还可以传递到输入输出模块103进行输出。不同单元模块之间信息传递的方式可以是有线的也可以是无线的,可以是直接的也可以是间接的,可以是同时进行的也可以是顺序进行的,可以是周期的也可以是非周期的等。
图3展示的是处理模块102的示意图。处理模块102可以包含但不限于一个或多个歧义分析模块301、一个或多个情感分析模块306以及一个或多个存储模块315。在一些实施例中,这些模块可以是独立的,在其他实施例中,部分模块也可以整合为一个整体单元共同作用。
歧义分析模块301可以获取信息,对信息进行处理,生成用于训练歧义分析模型312的歧义或非歧义语料。歧义分析模块301可以包含但不限于一个或多个采集单元302、一个或多个匹配单元303、一个或多个处理单元304、一个或多个语料采集单元305以及一个或多个歧义分析模型312。歧义分析模块301的采集单元302以各种方式获取所需要的信息。歧义分析模块301的采集单元302也可直接从网络105获取需要的信息。获取信息的方式可以是集中式的也可以是分布式的、可以是本地的也可以是远程的、可以是有线的也可以是无线的,可以是人工的也可以是自动的、也可以是上述多种方式相结合的。需要注意的是,以上对获取信息的方式的描述仅仅是具体的示例,不应被视为是唯一可行的实施方案。显然,对于本领域的专业人员来说,在了解获取信息的基本原理后,可能在不背离这一原理的情况下,对获取信息的具体方式与步骤进行形式和细节上的各种修正和改变,但是这些修正和改变仍在以上描述的范围之内。
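The collection unit 302 in the processing module 102 can collect information. The information may be the keyword dictionary 502, the ambiguity list 504 and the related-word dictionary 503 already built in the database 104 (see FIG. 5), as well as the content stored in the information base 511. Based on the collected information, the matching unit 303 of the ambiguity analysis module 301 can match the information in the information base 511. The processing module 102 may send a keyword request and a dictionary request to the database 104. On receiving the requests, the database 104 sends the requested keyword dictionary 502, related-word dictionary 503 and ambiguity list 504 to the processing module 102. The matching unit 303 in the processing module 102 matches the keywords according to a specific algorithm, which may include but is not limited to one or more of prefix search, suffix search and substring search. The processing unit 304 scores the matching result in order to quantify how ambiguous the information is. This score can serve as a preliminary criterion for deciding whether a sentence is ambiguous in the subsequent ambiguity analysis. Factors involved in the score may include but are not limited to one or more of: the length of the specific words, the length of the related words, the length of the whole message, the weights of different specific words in the information, the weights of different related words in the information, and the numbers of related words and specific words. The above description of the matching unit 303 and the processing unit 304 is merely a specific example and should not be regarded as the only feasible implementation. The corpus collection unit 305 may be configured to collect element sets. An element set may include but is not limited to elements formed from keywords, surrounding words, relative position information, and ambiguous or unambiguous sentences; the element set may be stored in the corpus collection unit 305. In some embodiments, the element set may be used to train the ambiguity analysis model 312. Obviously, a person skilled in the art who understands the basic principles of the matching unit and the processing unit may make various modifications and changes to them without departing from these principles, and such modifications and changes remain within the scope described above.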
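The three matching strategies named above (prefix search, suffix search, substring search) can be illustrated with a minimal sketch. The function name `match_keywords`, the whitespace tokenization and the sample text are assumptions made for illustration; the specification does not prescribe an implementation.

```python
def match_keywords(text, keywords, mode="substring"):
    """Count keyword hits in `text` under one of three matching modes.

    Hypothetical helper: `prefix` and `suffix` compare against
    whitespace-separated tokens, `substring` scans the raw text.
    """
    tokens = text.split()
    hits = {}
    for kw in keywords:
        if mode == "prefix":
            n = sum(tok.startswith(kw) for tok in tokens)
        elif mode == "suffix":
            n = sum(tok.endswith(kw) for tok in tokens)
        elif mode == "substring":
            n = text.count(kw)
        else:
            raise ValueError(f"unknown mode: {mode}")
        if n:
            hits[kw] = n
    return hits


sample = "apple shares rose while pineapple prices fell"
print(match_keywords(sample, ["apple", "rose"], mode="prefix"))
print(match_keywords(sample, ["apple", "rose"], mode="substring"))
```

The substring mode also counts "apple" inside "pineapple", which is one way a name word can produce an ambiguous match and why a later scoring step is useful.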
上文所述歧义打分结果是对信息歧义程度的量化。在一些实施例中,可以对这个分数设定几个阈值。这些阈值可以初步划分出强歧义的语句和明显非歧义的语句,从而对待分类信息初步进行歧义和非歧义分类。在一些实施例中,当用歧义打分结果无法直接判断某词汇或信息是否为歧义语句时,该词汇或信息可以进入一个进一步审核步骤。审核步骤可以包括但不限于人工审核、模型自动审核或二者结合的方式。在审核步骤中,涉及到的因素可以包括但不限于特定词汇长度、相关词汇的词汇长度、整体消息的长度、不同特定词汇在信息中所占权重、不同相关词汇在信息中所占权重、相关词汇与特定词汇的数量等,从而得到信息的歧义非歧义分类结果。
在一些实施例中,信息的分类结果可以用于对审核步骤中使用的模型进行训练,其中模型的分类算法可以包括但不限于决策树、Rocchio、朴素贝叶斯、神经网络、支持向量机、线性最小平方拟合、K-近邻、遗 传算法、最大熵等。以上对歧义分析模块301的描述仅仅是具体的示例,不应被视为是唯一可行的实施方案。显然,对于本领域的专业人员来说,在了解歧义分析的基本原理后,可能在不背离这一原理的情况下,对所需要的信息的内容进行各种修正和改变,但是这些修正和改变仍在以上描述的范围之内。
歧义分析模块301可以包含但不限于一个或多个歧义分析模型312。经过一定时间的训练,歧义分析模型312可以判定新闻中关于具体名称的描述是否有歧义。判定完成之后,输出非歧义语句集合。所述非歧义语句集合可进行存储,存储位置可以包含但不限于存储模块315、数据库104或系统其它具有存储功能的设备等中的一种或多种。所述非歧义语句集合可以交付给其它模块(例如情感分析模块306)进行处理。歧义分析模型312也可以在人工或机器的辅助下完成歧义判断。
所述情感分析模块306可以包含但不限于一个或多个采集单元307、一个或多个匹配单元308、一个或多个处理单元309、一个或多个语料采集单元310以及一个或多个情感分析器311。所述单元可以是集中式的也可以是分布式的、可以是本地的也可以是远程的。在一些实施例中,上述单元可以是独立的。在一些实施例中,部分单元也可以整合为一个整体单元共同作用。情感分析模块306可以对歧义分析模块301所得到的非歧义信息进行情感分类。所述情感类别可以包括但不限于正面、负面、中性等。在一些实施例中,采集模块101可以通过信息采集等方法,构建包含情感词汇搭配的一个或多个情感词汇搭配库507(见图5)。所述情感词汇搭配库507被存储在数据库104中。情感分析模块306中的采集单元307可以采集信息。所述采集到的信息可以包含但不限于数据库104中的情感词汇搭配库507以及信息库511中存储的内容等。根据所述信息,情感分析模块306的匹配单元308对歧义分析模块301输出的非歧义信息进行匹配,匹配方法可以包含但不限于正则表达式等。处理单元309可以计算搭配的准确率,并将准确率大于预定阈值的搭配判定为强情感搭配(例如,急剧增长可以视为强情感搭配)。处理单元309可以给没有包含强情感搭配的句子打分,并根据对应情感类型的分 数判断所述句子的情感类型。所述强情感搭配被存入语料采集单元310中。语料采集单元310的功能可以包含但不限于采集有情感搭配、无情感搭配以及情感句子等要素集合。
情感分类方法主要分为两类:基于词典和基于机器学习。基于词典的方法中,可以事先定义一个标注了词的情感极性的词典,句子或者文章的正负面情感极性通过在其中出现的正面或者负面情感词汇的数量、权值等预设属性特征,以一定的计算方法进行衡量。基于机器学习的方法可以把情感分类的问题归类为文本分类的问题,可以采用在文本分类中常用的分类方法(包括但不限于决策树、Rocchio、朴素贝叶斯、神经网络、支持向量机、线性最小平方拟合、K-近邻、遗传算法、最大熵等),通过对标注好情感极性的文本的训练学习,得到分类器,对新的文本进行情感分类。在一些实施例中,词典与机器学习的方法可以相结合对句子或文章进行情感分类。
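A minimal sketch of the lexicon-based approach described above: polarity is decided from the weighted counts of positive and negative words found in the text. The word lists are a small subset of the sentiment words this specification mentions; the weights, the substring counting and the tie rule are assumptions for illustration only.

```python
# Hypothetical polarity lexicons with assumed weights.
POSITIVE = {"增长": 1.0, "赚": 1.0, "涨停": 2.0}
NEGATIVE = {"亏损": 1.0, "下降": 1.0, "跌停": 2.0}


def lexicon_polarity(text):
    """Return 'positive', 'negative' or 'neutral' from weighted word counts."""
    pos = sum(weight * text.count(word) for word, weight in POSITIVE.items())
    neg = sum(weight * text.count(word) for word, weight in NEGATIVE.items())
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"  # no sentiment words, or an exact tie


print(lexicon_polarity("公司净利润增长,股价涨停"))
print(lexicon_polarity("公司发布公告"))
```

A machine-learning classifier would replace the fixed weights with parameters learned from labelled text; the two approaches can be combined, as the paragraph above notes.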
情感分析模块306可以包含但不限于一个或多个情感分析器311。经过一定时间的训练,情感分析器311可以直接判定新闻中非歧义语句的情感类型。判定完成之后,得到经过情感分类的语句集合。所述经过情感分类的语句集合可进行存储,存储位置可以包含但不限于存储模块315、数据库104或系统其它具有存储功能的设备等中的一种或多种。情感分析器311也可以在人工或机器的辅助下完成情感分析。
图4展示的是输入输出模块103的示意图。输入输出模块103可以包括但不限于一个或多个接口单元401、一个或多个识别单元402、一个或多个存储单元403以及一个或多个扩展单元404。上述单元可以是集中式的也可以是分布式的、可以是本地的也可以是远程的。在一些实施例中这些单元是独立的。在一些实施例中,部分单元也可以整合为一个整体单元共同作用。
输入输出模块103的接口单元401可以配置为接收输入信息以及输出系统生成的结果。所述信息可以传递给采集模块101。所述信息可以传递给处理模块102进行包含但不限于歧义分析或情感分析等处理。所述信息可以进行存储。存储位置可以是存储单元403、数据库104以及 系统其它具有存储功能的设备等中的一种或多种。所述输出结果可以是按照一定规则分好类的信息,如正面信息、负面信息、中性信息等。所述输出结果可以通过外周设备显示给用户。
识别单元402可以配置为识别已进行情感分析的信息中的情感标签,进而指导接口单元401依据情感标签对信息进行分类展示。
存储单元403可以配置为对信息进行存储,存储的信息可以是来自于接口单元401、识别单元402。存储的信息可以是来自于其他模块,如采集模块101、处理模块102、数据库104等中的一种或多种。
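The extension unit 404 of the input/output module 103 may be configured to provide, according to the user's needs, a mechanism for functional extension that helps the system extend its functions. The extended functions may include but are not limited to one or more of a subscription function, an information sharing function, and intelligent learning and update functions. The extension unit 404 can store information such as the keywords entered by the user, the user-defined push period, the push method, the recipients and content of shared information, and the system update period in the user database 513 in the database 104 (see FIG. 5).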
基于本发明的一些实施例,系统的输入输出模块103的扩展单元404可配置为提供订阅功能。用户可以选择订阅包含特定关键词的信息,扩展单元404可以根据用户订阅,通过各种方式将经过情感分析的信息推送给用户。扩展单元404包括但不限于为用户提供推送信息,也可以推荐关注兴趣相似的用户,还可以推荐信息的评论,并且提供信息有无帮助的评分等。扩展单元404推送的方式可以包含但不限于移动客户端软件、电子邮件、短信、RSS门户网站、在线单用户聚合器、搜索引擎、浏览器、即时通讯软件、社交网络等。扩展单元404推送周期可以是系统设定的,也可以是用户自定义的,可以是定期的也可以是不定期的,可以是实时的也可以是延时的。定期推送周期可以包括但不限于几个小时、几天、几周、几个月、几个季度、几年等中的一种或多种。不定期推送周期可以包括但不限于不同国家的工作日、节假日或者早、中、晚等中的一种或多种。扩展单元404推送的信息内容形式可以包括但不限于文字、语音、图片、动画、视频等中的一种或多种。扩展单元404推送的信息内容可以包括但不限于用户已浏览的信息内容更新,可以是用 户关注的信息,也可以是系统根据用户记录推荐的信息,还可以是同类信息关注的热度情况等中的一种或多种。以上对扩展单元404的描述仅仅是具体的示例,不应被视为是唯一可行的实施方案。显然,对于本领域的专业人员来说,在了解扩展单元404的基本原理后,可能在不背离这一原理的情况下,对实施扩展单元404的具体方式与步骤、以及扩展单元404所能实现的功能进行形式和细节上的各种修正和改变,但是这些修正和改变仍在以上描述的范围之内。
基于本发明的一些实施例,系统的输入输出模块103的扩展单元404可以配置为提供智能学习功能。扩展单元404可以智能学习、分析并记忆用户的使用习惯,包括但不限于常用领域、检索高频关键词、较关注的情感类别等。例如,在一些实施例中,扩展单元404可以自动记忆,或根据用户标注记忆,用户常点击的某跨国公司的子公司,在用户输入该公司名称后优先展示该子公司相关信息。再例如,在一些实施例中,扩展单元404可以学习用户在不同时段所关注的不同情感类别或领域的信息,与扩展单元404配合在特定时段进行信息推送。以上对扩展单元404及其所实现功能的描述仅仅是具体的示例,不应被视为是唯一可行的实施方案。显然,对于本领域的专业人员来说,在了解扩展单元404及其所实现功能的基本原理后,可能在不背离这一原理的情况下,对实施扩展单元404及其所实现功能的具体方式与步骤进行形式和细节上的各种修正和改变,但是这些修正和改变仍在以上描述的范围之内。
基于本发明的一些实施例,系统的输入输出模块103的扩展单元404可以配置为提供信息分享功能。信息分享是用户通过各种方式把感兴趣的信息分享给朋友。信息分享是用户可使用的发布信息方式,分享到指定的地方,选择哪些人可以看到该信息等。信息分享的内容可以是单条信息也可以是多条信息,可以是部分选取内容的信息也可以是页面整体内容的信息,可以是信息内容分享也可以是信息评论分享,可以是信息的关注度分享也可以是信息的帮助评分分享等。信息分享的方式可以包括但不限于短信、彩信、电子邮件、QQ、MSN、微信、微博、豆瓣、Twitter、Facebook、Instagram、人人、即时通讯软件工具等中的一种或 多种。信息分享接收对象可以包括但不限于单个朋友、多个朋友、朋友圈、公共社交圈、论坛、其他用户等中的一个或多个。信息分享的内容格式可以包括但不限于文字、图片、语音、动画、视频、网页链接等中的一种或多种。以上对信息分享模块及其所实现功能的描述仅仅是具体的示例,不应被视为是唯一可行的实施方案。显然,对于本领域的专业人员来说,在了解信息分享模块及其所实现功能的基本原理后,可能在不背离这一原理的情况下,对实施信息分享模块及其所实现功能的具体方式与步骤进行形式和细节上的各种修正和改变,但是这些修正和改变仍在以上描述的范围之内。
图5展示的是数据库104中所包含或用到的单元的示意图。数据库104包括但不限于一个或多个关键词词库501、一个或多个情感词词库505、一个或多个信息库511、一个或多个语料库508、一个或多个语义知识库512、一个或多个用户数据库513等。关键词词库501可以包括但不限于一个或多个关键词词典502、一个或多个相关词词典503、一个或多个歧义列表504等。以上对于词典的描述只是为了方便说明,并不具有限定作用。关键词词典502可以配置为存储包括但不限于特定名称词汇。上述特定名称词汇包括但不限于特定领域的专有名词、全称、简称、代码、同义词、缩略词等。关键词词典502中的特定名称词汇可以来自于采集模块101,也可以来自于处理模块102。相关词词典503可以配置为存储包括但不限于特定名称词汇的相关词汇。所述相关词汇可以包括但不限于与上述特定名称词汇有关的专有名词、名词、动词、形容词、短语搭配、短句、该领域特定词汇的行业词汇、近义词、反义词、常见搭配词、组成部分名词、派生词、复合词等。歧义列表504可以配置为存储包括但不限于经过人工、模型或两者相结合方式审核后可能具有歧义的特定名称词汇。情感词词库505可以包括但不限于一个或多个情感词汇库506以及一个或多个情感词汇搭配库507等。情感词汇库506可以配置为存储包含但不限于情感词汇。所述情感词汇指包含情感信息的词汇。如,佳、优、增、好、增长、盈、涨、补涨、赚、涨停、飙升盈利、减少、降、锐减、补跌、下降、亏损、赔、亏、跌停、减持、 降低等词汇。所述情感词汇可以包括但不限于表达情感的名词、动词、形容词等。情感词汇库506中信息的来源可以包括但不限于互联网开源词典、专业词典等。情感词汇搭配库507可以配置为存储包含但不限于情感词汇搭配。所述情感词汇搭配可以包括但不限于与情感词汇库506中情感词汇进行搭配的短语搭配、短句、近义词、反义词、常见搭配词、组成部分名词、派生词、复合词等。情感词汇搭配库507中信息的来源可以包括但不限于互联网开源词典、专业词典、新闻、有关公司的研究报告、公告、产品手册及相关网站等资讯。
情感词汇搭配库507可以是固定的词库,也可以是不断更新扩充的。情感词汇搭配库507的扩充方法包含但不限于PMI算法等。语料库508可以包括但不限于歧义语料库509以及情感语料库510等。歧义语料库509可以配置为存储包含但不限于歧义语料。所述歧义语料可以包含但不限于已进行歧义/非歧义标注的词汇、短语搭配、语句等。情感语料库510可以配置为存储包含但不限于情感语料。所述情感语料可以包含但不限于已进行情感类别标注的词汇、短语搭配、语句等。
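The PMI (pointwise mutual information) algorithm mentioned above as a way to expand the collocation library can be sketched as follows. Probabilities are estimated from sentence-level co-occurrence, which is an assumption; the specification does not state how the counts are taken. The function name and the toy sentences are likewise hypothetical.

```python
import math


def pmi(sentences, x, y):
    """PMI(x, y) = log2( p(x, y) / (p(x) * p(y)) ), estimated per sentence."""
    n = len(sentences)
    count_x = sum(x in s for s in sentences)
    count_y = sum(y in s for s in sentences)
    count_xy = sum(x in s and y in s for s in sentences)
    if count_xy == 0:
        return float("-inf")  # the pair never co-occurs
    return math.log2((count_xy / n) / ((count_x / n) * (count_y / n)))


corpus = ["营收 急剧 增长", "利润 急剧 增长", "成本 急剧 下降", "公司 发布 公告"]
print(pmi(corpus, "急剧", "增长"))
print(pmi(corpus, "公告", "增长"))
```

Pairs with a high PMI co-occur more often than chance would predict, so they are candidates for new sentiment word collocations, subject to whatever review the system applies.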
歧义语料库509中的语料的来源可以包含但不限于歧义分析模块301中的语料采集单元305,情感语料库510中的语料的来源可以包含但不限于情感分析模块306的语料采集单元310。歧义语料库509及情感语料库510的来源可以包括但不限于互联网开源词典、专业词典、新闻、有关公司的研究报告、公告、产品手册及相关网站等资讯等。
信息库511可以配置为存储包含但不限于包含关键词的信息。信息库511中的信息可以是已经过歧义分析或情感分析的,也可以是未经过歧义分析或情感分析的。所述信息的来源可以是采集模块101。
语义知识库512可以配置为存储包含但不限于基于概念的词汇、短语、句子以及段落等。通过检索语义知识库512,词汇、短语、句子及段落的情感类型可以被识别出来。语义知识库512特别是能识别不包含情感词汇的短语、句子、段落等。
用户数据库513可以配置为存储包含但不限于与用户相关的信息。所述与用户相关的信息可以包含但不限于用户的个人信息、用户的历史 检索信息、用户的自定义设置信息等。所述用户的个人信息可以包含但不限于用户的登录账号、登录密码,用户登录系统的周期、时间的信息等。所述用户的历史检索信息可以包含但不限于用户的历史检索关键词以及根据用户的检索关键词得到的检索信息结果等。所述用户的自定义设置信息可以包含但不限于用户对于订阅信息的设置、用户对于信息分享的设置、用户对于智能学习、用户对于系统更新的设置等中的一种或多种。所述用户对于订阅信息的设置可以包含但不限于用户需要订阅的信息的关键词、用户设置的信息推送周期、推送格式、推送位置等中的一种或多种。所述用户对于信息分享的设置可以包含但不限于信息分享对象、信息分享格式、信息分享的周期等。所述用户对于智能学习的设置可以包含但不限于智能学习周期等。所述用户对于系统更新的设置可以包含但不限于更新周期等。
以上对于数据库的描述仅仅是具体的示例,不应被视为是唯一可行的实施方案。显然,对于本领域的专业人员来说,在了解数据库的基本原理后,可能在不背离这一原理的情况下,数据库进行形式和细节上的各种修正和改变,但是这些修正和改变仍在以上描述的范围之内。
在本发明的一些实施例中,系统可以包含用户交互界面。用户交互界面能够直接或通过外周设备接收用户输入,或将一种或多种情感类别的信息展示给用户。用户交互界面所接受的用户输入,能够存储在存储单元403中,再传递给其他模块,如采集模块101、处理模块102、数据库104;也能够直接传递给上述其他模块。用户交互界面所输出的信息,可以来自于存储单元403。用户交互界面所输出的信息可以直接来自于识别单元402,或者其他模块,如采集模块101、处理模块102、数据库104等。用户交互界面可以是图形用户交互界面(Graphical user interface),也可以是直接操作界面(Direct manipulation interface)、基于网络的用户界面(Web-based user interfaces or web user interface(WUI))、触摸屏(Touchscreen)、命令行界面(Command line interface)、触摸用户界面(Touch user interface)、硬件接口(Hardware interface)、注视用户界面(Attentive user interface)、成批接口(Batch interface)、 会话接口代理(Conversational Interface Agent)、基于交叉的接口(Crossing-based interface)、手势接口(Gesture interface)、智能用户界面(Intelligent user interface)、运动跟踪接口(Motion tracking interface)、多屏幕接口(Multi-screen interface)、无命令行用户界面(Non-command user interface)、面向对象的用户界面(Object-oriented user interface)、反射性的用户界面(Reflexive user interface)、检索界面(Search interface)、有形用户界面(Tangible User Interface)、基于任务的界面(Task-Focused Interface)、基于文本的用户界面(Text-based user interface)、语音用户界面(Voice user interface)、自然语言界面(Natural-language interface)、零输入接口(Zero-Input interface)、缩放用户界面(Zooming user interface)等。用户交互界面能够对信息进行分类展示,不同情感类别的信息可以显示在一个页面,也可以在不同页面显示,显示形式可以包括但不限于文字、图片、音频、录像、动画、广播等。不同显示形式下,表示情感类别的语句可以采用一种或多种高亮的形式进行展示,如,文字形式的信息高亮采用一种或多种不同于信息主体文字的颜色。所述颜色可以包括但不限于红色、蓝色、黄色、粉色、橙色、绿色、紫色等。表示情感类别的语句可以采用一种或多种不同于信息主体文字的字体。所述字体可以包括但不限于宋体、仿宋、楷体、斜体、黑体、Times New Roman、Calibri等。表示情感类别的语句可以采用一种或多种不同于信息主体文字的字符尺寸。所述尺寸可以包括但不限于二号、三号、四号、小四、五号、小五等。表示情感类别的语句可以采用下划线。所述下划线可以包括但不限于直线、虚线等。图片形式的信息高亮形式可以采用一种或多种不同形状的框架包括但不限于圆形、方形、矩形、菱形、椭圆形等。图片形式的信息高亮形式可以采用一种或多种颜色的框架。所述框架的颜色可以包括但不限于红色、蓝色、黄色、粉色、橙色、绿色、紫色等。音频、广播形式的信息高亮形式采用一种或多种音量。
用户交互界面可以向用户展示一种领域或多种领域经情感分析后的信息。所述领域可以包括但不限于金融及其衍生物投资(包括但不限 于股票、债券、黄金、纸黄金、白银、外汇、贵金属、期货、货币基金等)、科技(包括但不限于数学、物理、化学及化学工程、生物及生物工程、电子工程、通信系统、互联网、物联网等)、政治(包括但不限于政治人物、政治事件、国家)、新闻(从区域而言,包括但不限于地区新闻、国内新闻、国际新闻;从新闻主体而言,包括但不限于政治新闻、科技新闻、经济新闻、生活新闻、气象新闻等)等。此外,用户可以在用户交互界面添加关注的领域作为快捷查看方式,进而快速查看关注的一种或多种领域的情感分析后的信息。用户交互界面可以为用户提供收藏夹,用户可以将一种或多种信息置于收藏夹内,方便下一次的使用,收藏信息的形式可以是网络链接、文字、图片、音频、录像、动画、广播,也可以是任意几种的组合。组合的形式可以是按规律重复的、也可以是无规律分布的。用户交互界面可以采用默认的用户界面,也可以采用自定义界面,用户可以按照自己的习惯、喜好设计用户界面,包括但不限于设定界面的颜色、界面的尺寸、界面的布局、界面的风格等。
以上对用户交互界面的描述仅仅是具体的示例,不应被视为是唯一可行的实施方案。显然,对于本领域的专业人员来说,在了解用户交互界面的基本原理后,可能在不背离这一原理的情况下,对实施用户交互界面的具体方式与步骤进行形式和细节上的各种修正和改变,但是这些修正和改变仍在以上描述的范围之内。例如,用户交互界面向用户展示信息的情感类别情况包括但不限于整体信息情感类别情况,一种子类信息情感类别情况或多种子类信息情感类别情况;向用户展示信息的情感分析走势,包括但不限于整体的信息的情感类别走势、一种子类信息的情感类别走势、多种子类信息的情感类别走势;向用户展示推送的订阅的信息:向用户发出提醒,提醒形式可以包括但不限于文字、声音、图像、视频、震动、动态弹出框等。弹出框的形状可以包含但不限于圆形、方形、矩形、菱形、椭圆形等。用户依照提醒选择需要查看的根据正负面情感分析的订阅信息。
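In some embodiments, the system may further include an update module that can update the lexicons and the information base in the database 104, and/or update the algorithm parameters of the ambiguity analysis model 312 and the sentiment analyzer 311. The update module can obtain the required information in various ways. The way of obtaining information may be centralized or distributed, local or remote, wired or wireless, manual or automatic, or a combination of several of these. The above description of the ways of obtaining information is merely a specific example and should not be regarded as the only feasible implementation. Obviously, a person skilled in the art who understands the basic principles of obtaining information may make various modifications and changes in form and detail to the specific ways and steps of obtaining information without departing from these principles, and such modifications and changes remain within the scope described above.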
所需要的信息内容可以包含但不限于由特定名称词汇、特定名称词汇的相关词汇、包含这些词汇的信息、用于歧义分析或情感分析的算法参数等。上述特定名称词汇可以包括但不限于特定领域的专有名词、全称、简称、代码、同义词、缩略词等。上述特定名称词汇的相关词汇可以包括但不限于与上述特定名称词汇有关的专有名词、名词、动词、形容词、短语搭配、短句、该领域特定词汇的行业词汇、近义词、反义词、常见搭配词、组成部分名词、派生词、复合词等。包含上述词汇的信息可以包括但不限于词典、新闻、有关公司的研究报告、公告、产品手册、以及相关网站网页等。用于歧义分析或情感分析的算法参数可以包括但不限于决策树、Rocchio、朴素贝叶斯、神经网络、支持向量机、线性最小平方拟合、K-近邻、遗传算法、最大熵等分类算法等。
在一些实施例中,更新模块可以利用采集到的上述信息添加至数据库104获得更新过的数据库104。更新模块可以利用更新过数据库104中的信息训练算法模型。此外,更新模块可以利用采集到用于歧义分析或情感分析的算法参数直接更新算法模型。以上对更新模块及采集单元、更新模块的描述仅仅是具体的示例,不应被视为是唯一可行的实施方案。显然,对于本领域的专业人员来说,在了解更新模块及采集单元、更新模块基本原理后,可能在不背离这一原理的情况下,对更新模块及采集单元、更新模块的内容进行各种修正和改变,但是这些修正和改变仍在以上描述的范围之内。
更新的周期可以定期的或不定期的。更新模块进行定期更新,可以是系统设定的也可以是用户自定义的。定期更新的周期可以包括但不限于几个小时、几天、几周、几个月、几个季度、几年等。信息更新模块进行不定期更新,可以是系统设定的也可以是用户自定义的。不定期更新可以包括但不限于在不同国家的工作日、节假日或者早、中、晚等进行更新。更新模块的信息来源可以包括但不限于词典、新闻媒体、有关公司的研究报告、公告、产品手册、微博、微信、社交网站、论坛、出版商以及相关网站网页等。更新的内容可以是已有的内容,也可以是新的内容。例如,在一些实施例中,系统可以定期查看财经网站等新闻媒体。若包含已有信息如股票名称等,同时出现了新的与该股票名称相关的内容,更新模块针对新内容进行更新。若该股票名称发生变更,信息更新模块可以进行更新。若该股票名称有其他可替代名称,信息更新模块可以进行更新。以上对更新模块及更新周期、更新内容的描述仅仅是具体的示例,不应被视为是唯一可行的实施方案。显然,对于本领域的专业人员来说,在了解更新模块及更新周期、更新内容基本原理后,可能在不背离这一原理的情况下,对更新模块及更新周期、更新内容进行各种修正和改变,但是这些修正和改变仍在以上描述的范围之内。
更新模块对歧义分析模块301或情感分析模块306中的算法模型进行更新可以是直接更新的,也可以是随更新的信息而更新的,也可以是累积一定量的更新信息后更新的。
歧义分析模块301中的歧义分析模型312的更新可以是经过人工审核的也可以是系统自动审核的,也可以是二者结合的。上述的歧义分析模型312可以包括但不限于决策树、Rocchio、朴素贝叶斯、神经网络、支持向量机、线性最小平方拟合、K-近邻、遗传算法、最大熵等。例如,系统定期查看财经网站等新闻媒体,若包含已有信息,如股票名称等,同时出现了与该股票名称相关的重要信息,信息更新模块可以对信息库511进行更新。歧义分析模块301可以对该信息进行歧义判断,若为歧义信息,该歧义信息可以进入歧义搭配提取步骤,从中提取出歧义搭配,并人工审核该信息是否确为强歧义搭配,审核通过后,该搭配将进入信 息更新模块,用来更新歧义分析模型312。
以上对更新歧义分析模块301和歧义分析模型312的描述仅仅是具体的示例,不应被视为是唯一可行的实施方案。显然,对于本领域的专业人员来说,在了解更新歧义分析模块301和歧义分析模型312的基本原理后,可能在不背离这一原理的情况下,对更新歧义分析模块301和歧义分析模型312进行各种修正和改变,但是这些修正和改变仍在以上描述的范围之内。
情感分析模块306中的情感分析器311的更新可以是经过人工审核的也可以是系统自动审核的,也可以是二者结合的。上述情感分析器311可以包含但不限于决策树、Rocchio、朴素贝叶斯、神经网络、支持向量机、线性最小平方拟合、K-近邻、遗传算法、最大熵等。例如,系统定期查看财经网站等新闻媒体,通过正负面情感分析流程后获得的强正负面搭配,进一步更新情感词汇搭配集合,该更新后的搭配集合将进入信息更新模块,用来更新情感分析模块的模型。
以上对更新情感分析模块306和情感分析器311的描述仅仅是具体的示例,不应被视为是唯一可行的实施方案。显然,对于本领域的专业人员来说,在了解更新情感分析模块306和情感分析器311的基本原理后,可能在不背离这一原理的情况下,对更新情感分析模块306和情感分析器311进行各种修正和改变,但是这些修正和改变仍在以上描述的范围之内。
图6展示的是系统用户交互的一个流程示意图。需要说明的是,下面描述中的流程仅仅是本发明的一些实施例,对于本领域的普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些描述将本发明应用于其它类似情景。系统首先获取用户输入(步骤601)。所述步骤可由输入输出模块103完成。其中输入方式包括但不限于键盘输入、定点装置输入(如指点杆输入、鼠标输入、触控板输入、轨迹球输入)、语音识别设备输入、图形图像识别设备输入等;输入形式包括但不限于数字、字符、符号、文字、声音、图形图像、视频等。系统可以将用户输入存储(步骤604)。系统可以将用户输入存储在输入输出模块103 的存储单元403,也可以将用户输入存储在其他模块的存储单元(如处理模块的存储模块315、数据库104等)。在一些实施例中,存储是必须的。在另一些实施例中,存储是可选的或者不必须的。用户输入的存储可以是永久的,也可以是暂时的;可以是全部存储,也可以是部分存储。在某些实施例中,系统可以利用存储的用户输入获取用户习惯,进行智能学习,提示候选词等。获取用户输入后,系统将根据用户输入检索信息(步骤602),然后根据检索到的信息生成输出结果(步骤603)。系统还可以将用户输入的信息直接生成输出结果(步骤603)。所述步骤603可以通过输入输出模块103完成。系统还可以将生成的输出结果通过外周设备展示给用户,也可以不展示。在一些实施例中,展示是必须的;在另一些实施例中,展示是可选的或者不必须的。系统可以根据用户输入检索数据库104中的信息,也可以根据用户输入检索其他模块存储单元的信息(如处理模块的存储模块315等),也可以根据用户输入通过网络105检索信息。上述信息可以被存储(步骤604)。可以存储在输入输出模块103的存储单元403,也可以存储在其他模块的存储单元(如处理模块的存储模块315、数据库104等)。存储可以是永久的,也可以是暂时的。可以是全部存储,也可以是部分存储。在一些实施例中,存储是必须的。在一些实施例中,存储是可选的或者不必须的。所存储的信息可以进一步分析,如进行歧义分析或情感分析,也可以不进行分析。在获取相关信息后,系统可以生成输出结果(步骤603)。所述步骤603可由输入输出模块103完成。系统还可以将生成的输出结果通过外周设备展示给用户。展示可以是实时的,也可以是延时的。展示可以是定期的,也可以是不定期的。在一些实施例中,用户输入中包含周期性指令(如订阅某种信息的指令),系统可以识别这些指令,并根据用户指令定时或不定时将符合用户输入条件的信息推送或展示给用户。
以上对系统用户交互流程的描述仅仅是具体的示例,不应被视为是唯一可行的实施方案。显然,对于本领域的专业人员来说,在了解输入输出流程的基本原理后,可能在不背离这一原理的情况下,对所需要的 信息的内容进行各种修正和改变,但是这些修正和改变仍在以上描述的范围之内。
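FIG. 7 is a flowchart of an information sentiment classification method. The system first collects information (step 701). This step can be performed by the collection module 101. The information includes but is not limited to dictionaries, news, research reports on companies, announcements, product manuals, and web pages of related websites. The industries to which the information belongs include but are not limited to sports, entertainment, economy, politics and culture. The information may take forms including but not limited to text, images, audio and video. The languages of the information include but are not limited to Chinese, English, Japanese, Korean, French and German. The source of the information may be the network 105, or a module such as the database 104. Based on the collected information, the system can analyze whether the information is ambiguous and obtain an ambiguity analysis result (step 702). Step 702 can be performed by the ambiguity analysis module 301 in the processing module 102. The information analyzed may be all or part of the collected information. The ambiguity analysis may be performed manually, automatically by an ambiguity analysis model, or by a combination of the two. The ambiguity analysis model includes but is not limited to decision trees, Rocchio, naive Bayes, neural networks, support vector machines, linear least-squares fit, K-nearest neighbors, genetic algorithms and maximum entropy. The system can then analyze the sentiment category of the result of step 702 and obtain information labelled with a sentiment category (step 703). Step 703 can be performed by the sentiment analysis module 306 in the processing module 102. In some embodiments, sentiment analysis may be applied only to unambiguous information, or to ambiguous information as well. The system may also omit step 702 and pass the collected information directly to sentiment analysis without ambiguity analysis (performing steps 701 and 703). The sentiment analysis may be performed manually, automatically by a sentiment analysis model, or by a combination of the two. The sentiment analysis model includes but is not limited to decision trees, Rocchio, naive Bayes, neural networks, support vector machines, linear least-squares fit, K-nearest neighbors, genetic algorithms and maximum entropy. The sentiment analysis can divide information into categories including but not limited to positive, negative and neutral information. In addition, in some embodiments the order of ambiguity analysis and sentiment classification can be swapped: the collected information is first classified by sentiment, and the classified information is then analyzed for ambiguity (performing steps 701, 703 and 702). The intermediate and final results of each step can be stored according to a specific storage method (step 704). The storage method includes but is not limited to sequential storage, linked storage, indexed storage and hashed storage. The storage location may be the storage module 315, the storage unit 203, or the database 104.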
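The flow of FIG. 7 (collect, analyze ambiguity, analyze sentiment, store) can be sketched as a small pipeline. The function `run_pipeline`, the record layout and the two callables passed in are hypothetical stand-ins for the modules described above, and skipping sentiment analysis for ambiguous items is only one of the variants the text allows.

```python
def run_pipeline(items, is_ambiguous, classify_sentiment, skip_ambiguity=False):
    """Process collected items (step 701) and return the stored records (step 704).

    is_ambiguous and classify_sentiment are caller-supplied callables that
    stand in for step 702 and step 703 respectively.
    """
    store = []
    for text in items:
        record = {"text": text}
        if not skip_ambiguity:
            record["ambiguous"] = is_ambiguous(text)      # step 702
            if record["ambiguous"]:
                store.append(record)                      # kept, but not classified
                continue
        record["sentiment"] = classify_sentiment(text)    # step 703
        store.append(record)
    return store


records = run_pipeline(
    ["stock A rises", "fruit market report"],
    is_ambiguous=lambda t: "fruit" in t,
    classify_sentiment=lambda t: "positive" if "rises" in t else "neutral",
)
print(records)
```

Passing `skip_ambiguity=True` corresponds to the variant that performs only steps 701 and 703.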
以上对基于歧义分析的信息情感分类方法的描述仅仅是具体的示例,不应被视为是唯一可行的实施方案。显然,对于本领域的专业人员来说,在了解该基于歧义分析的信息情感分类方法的基本原理后,可能在不背离这一原理的情况下,对所需要的信息的内容进行各种修正和改变,但是这些修正和改变仍在以上描述的范围之内。
图8展示的是系统训练模型的流程图。系统通过具有采集功能的模块采集信息(步骤801)。其中,上述具有采集功能的模块可以是采集模块101,也可以是歧义分析模块301中的采集单元302,还可以是情感分析模块306的采集单元307等。上述信息的来源可以是存储模块315,也可以是数据库104,还可以是网络105。上述信息包括但不限于行业特定名称词汇、与特定名称词汇强相关的词汇、包含上述词汇的信息以及包含情感信息的词汇等。上述行业包含但不限于体育、娱乐、经济、政治、文化等。上述特定名称词汇包括但不限于特定领域的专有名词、全称、简称、代码、同义词、缩略词。上述与特定名称词汇强相关的词汇包括但不限于与上述特定名称词汇有关的名词、动词、形容词、短句、短语搭配、该领域特定词汇的行业词汇、近义词、反义词、常见搭配词、组成部分名词、派生词、复合词等。包含上述词汇的信息包括但不限于词典、新闻、有关公司的研究报告、公告、产品手册、以及相关网站网页。上述情感词汇的类别包括但不限于正面、负面、中性等。信息的形式包括但不限于文字的、图片的、音频的、视频的等。上述信息使用语言包括但不限于中文、英文、日文、韩文、法文、德文等。
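In step 802 the system builds the lexicons and the information base. Step 802 can be performed by the processing module 102. The lexicons include but are not limited to the keyword lexicon 501 and the sentiment word lexicon 505. The keyword lexicon 501 includes but is not limited to a keyword dictionary 502 composed of specific name words, one or more related-word dictionaries 503 composed of words related to the specific name words, and one or more ambiguity lists 504 obtained by reviewing the keyword dictionary 502. The sentiment word lexicon 505 includes but is not limited to one or more sentiment word libraries 506 and one or more sentiment word collocation libraries 507. The information in the information base contains the specific name words in the keyword dictionary 502. Based on the result of step 802, the system can collect corpora through the corpus collection unit 305 of the ambiguity analysis module 301 and the corpus collection unit 310 of the sentiment analysis module 306 (step 803); step 803 can be performed by the processing module 102. Ways of collecting corpora include but are not limited to processing the collected information by matching and scoring. The collected corpora can be used to train models (step 804). The models include but are not limited to the ambiguity analysis model 312 and the sentiment analyzer 311, each of which may use, without limitation, decision trees, Rocchio, naive Bayes, neural networks, support vector machines, linear least-squares fit, K-nearest neighbors, genetic algorithms and maximum entropy. In addition, the system may have collected information manually reviewed and used directly as ambiguity or sentiment corpora (steps 801 and 803), or manually reviewed and used directly to train the models (steps 801 and 804), without going through steps 802 and 803. The intermediate or final results of each step can be stored (step 805). The storage method includes but is not limited to sequential storage, linked storage, indexed storage and hashed storage. The storage location may be the storage module 315, the storage unit 203, or the database 104.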
以上对系统训练模型的流程图描述仅仅是具体的示例,不应被视为是唯一可行的实施方案。显然,对于本领域的专业人员来说,在了解该基于系统训练模型的流程图的基本原理后,可能在不背离这一原理的情况下,对所需要的信息的内容进行各种修正和改变,但是这些修正和改变仍在以上描述的范围之内。
实施例
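FIG. 9 is a schematic diagram of a usage scenario. 902 is the information sentiment classification system, which communicates with the user 901 through the network 903. The information sentiment classification system 902 may be a single server or a group of servers, and may be deployed in a centralized or distributed manner. The network 903 may be wired or wireless, and may be a local area network or a wide area network.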
在本发明的一种使用示例中,用户901通过输入输出模块103(详见图1)键入对象名称,如:股票名称、期货名称、债券名称等。所述对象名称经由网络903被传输至信息情感分类系统902,并被信息情感分类系统902解析。所述对象名称经信息情感分类系统902解析后被识别。识别完成后,系统的处理模块102(详见图1)将开始搜索数据库104(详见图1),从而获取包含对象名称的文章集合。所述文章集合中的每篇文章针对所述对象名称有不同的情感类型,系统的处理模块102将依照所述情感类型将所述文章集合中的文章进行分类,如:正面文章以及每一篇正面文章的正面指数、负面新闻以及每一篇负面文章的负面指数、中性新闻等。完成分类后,经过分类的文章集合被传输给输入输出模块103,向用户901展示。
在本发明的另一种使用示例中,用户901键入对象名称,如:股票名称、期货名称、债券名称等。所述键入对象名称的操作可由输入输出模块103完成(详见图1)。所述对象名称经由网络903被传输至信息情感分类系统902,并被信息情感分类系统902解析。所述对象名称经信息情感分类系统902解析后被识别。识别完成后,系统采集包含用户输入的信息,所述采集包含用户输入的信息可由采集模块101完成,交由处理模块102(详见图1)进行歧义分析,筛选出非歧义信息,进行情感分析。系统也可以首先判断用户输入是否包含歧义信息,若不包含歧义信息,可直接进行情感分析。情感分类如:正面文章以及每一篇正面文章的正面指数、负面新闻以及每一篇负面文章的负面指数、中性新闻等。完成分类后,经过分类的文章集合被传输给输入输出模块103,向用户901展示。
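In another usage example of the present invention, the user 901 enters two object names, such as stock names, futures names or bond names, through the input/output module 103 (see FIG. 1). The information sentiment classification system 902 parses and recognizes the object names and then returns a sentiment-classified set of articles containing them. The article set is presented to the user 901 through the input/output module 103. Besides the sentiment type of each article, the user 901 can also obtain, for example, the number of articles each of the two object names has for a given sentiment type, a comparison of the numbers of positive articles for the two object names within one week, a comparison of the numbers of positive articles within one month, and a comparison of the numbers of negative articles for the two object names within one year. With the help of these data, the user 901 can make effective decisions.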
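The comparison described in this usage example can be sketched as a count of classified articles per object name, per sentiment type, within a time window. The article record layout (`names`, `sentiment`, `date`), the function name and the sample data are assumptions for illustration.

```python
from collections import Counter
from datetime import date


def sentiment_counts(articles, name, since):
    """Count articles that mention `name` on or after `since`, grouped by sentiment."""
    return Counter(
        a["sentiment"]
        for a in articles
        if name in a["names"] and a["date"] >= since
    )


articles = [
    {"names": {"A"}, "sentiment": "positive", "date": date(2015, 8, 10)},
    {"names": {"A", "B"}, "sentiment": "negative", "date": date(2015, 8, 11)},
    {"names": {"B"}, "sentiment": "positive", "date": date(2015, 8, 12)},
    {"names": {"B"}, "sentiment": "positive", "date": date(2015, 7, 1)},
]
week_start = date(2015, 8, 6)
for name in ("A", "B"):
    print(name, dict(sentiment_counts(articles, name, week_start)))
```

Changing `since` to a date one month or one year back gives the other comparisons listed in the paragraph above.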
以上的描述仅仅是本发明的具体实施例,不应被视为是唯一的实施例。显然,对于本领域的专业人员来说,在了解本发明内容和原理后,都可能在不背离本发明原理、结构的情况下,进行形式和细节上的各种修正和改变,但是这些修正和改变仍在本发明的权利要求保护范围之内。比如:用户可以输入两个以上对象名称,返回结果将包含多个对象名称的数据比对。
图10展示的是系统采集流程的一个实施例的示意图。需要说明的是,下面描述中的流程仅仅是本发明的一些实施例,对于本领域的普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些描述将本发明应用于其它类似情景。步骤1001为采集信息,所述步骤可由采集模块101的采集单元201完成。信息来源可以是本地的,例如存储在采集模块101的存储单元203中的信息,或存储在数据库104中的信息;也可以是来自网络105的,例如开放互联网或者局域网。信息内容包括但不限于现有词典、新闻、有关公司的研究报告、公告、产品手册及相关网站等资讯。采集单元201采集到的信息可以直接存储入采集模块101的存储单元203,也能够存储入数据库104的信息库511中(步骤1007)。采集单元201采集到的信息也可以交给处理单元202处理。在步骤1005中,特定词汇被从信息中提取出来,所述步骤可由处理单元202完成。在步骤1002中,情感词汇被从信息中提取出来,所述步骤可由处理单元202完成。在步骤1003中,情感词汇搭配被从信息中提取出来,所述步骤可由处理单元202完成。所述特定词汇包括关键词,包括但不限于特定领域的专有名词、全称、简称、代码、同义词、缩略词;以及与关键词相关的强相关词,包括但不限于与上述关键词有关的专有名词、名词、动词、形容词、短语搭配、短句、该领域特定词汇的行业词汇、近义词、反义词、常见搭配词、组成部分名词、派生词、复合词等。上 述提取可以是同时进行的;也可以是分步进行的。提取所采用的算法包括但不限于PMI算法、对数似然比算法等。上述提取步骤可以是同时进行的,也可以是分步进行的,可以以任意可能的顺序进行组合。此处对所描述的步骤可以在适当的情况下以任何合适的顺序,或同时实现。例如,在一个实施例中,可以首先提取特定词汇(步骤1005),之后提取情感词汇(步骤1002),提取情感词汇搭配(步骤1003);步骤1002与步骤1003可以是同时进行的,也可以是先后进行的;可以先进行步骤1002,再进行步骤1003,也可以先进行步骤1003再进行步骤1002。此外,在不偏离此处所描述的采集流程的主题的精神和范围的情况下,可以从任何一个方法中删除各单独的步骤。上文所描述的任何示例的各方面可以与所描述的其他示例中的任何示例的各方面相结合,以构成进一步的示例,而不会丢失寻求的效果。显然,对于本领域的专业人员来说,在了解采集模块的基本原理后,可能在不背离这一原理的情况下,对采集流程进行各种修正和改变,但是这些修正和改变仍在以上描述的范围之内。
处理单元202提取所得的特定词汇可以存入数据库104的关键词词库501(步骤1006),情感词汇与情感词汇搭配可以存入数据库104的情感词词库505(步骤1004)。此处对采集流程所描述的步骤可以在适当的情况下以任何合适的顺序,或同时实现。另外,在不偏离此处所描述的采集流程的主题的精神和范围的情况下,可以从任何一个方法中删除各单独的步骤。上文所描述的任何示例的各方面可以与所描述的其他示例中的任何示例的各方面相结合,以构成进一步的示例,而不会丢失寻求的效果。显然,对于本领域的专业人员来说,在了解采集模块的基本原理后,可能在不背离这一原理的情况下,对采集流程进行各种修正和改变,但是这些修正和改变仍在以上描述的范围之内。
图11是系统应用于股票新闻领域的一个实施例。系统采集日常新闻以及互联网开放词典、专业词典等(步骤1101,步骤1102),构建金融产品词汇源、金融产品相关词源以及情感词汇词库(步骤1103,步骤1104,步骤1108),步骤1101、步骤1103、步骤1104、步骤1108 可由采集模块101完成。系统还可将采集到的信息进行存储。存储的位置可以是数据库104,也可以是其他具有存储功能的单元或模块(例如存储单元203等)。接着,系统在步骤1111获取歧义列表。之后,系统对采集到的相关股票新闻进行歧义分析(步骤1106),步骤1106可由处理模块102中的歧义分析模块301完成,筛选出非歧义股票信息进入处理模块102的情感分析模块306进行情感分类(步骤1107)。其中,处理模块102中对新闻网站信息的歧义分析,可以是由系统自动完成的,也可以是由人工审核完成的(步骤1110),也可以是两者相结合完成的。系统得到非歧义股票信息后,将调取情感词汇词库(步骤1108),用情感词汇库对非歧义股票信息进行情感分析(步骤1107),对股票新闻的情感类别进行标记。步骤1108和步骤1107可由处理模块102中的情感分析模块306完成。处理模块情感分析模块对非歧义股票信息的判断,可以是系统自动完成的,也可以是由人工审核完成的(步骤1110),也可以是两者相结合完成的。标记了情感类别的股票新闻将被生成,并根据其情感标签分类展示给用户。
同时,采集模块101还可以通过定期采集日常新闻,从中提取股票词汇以及股票相关词汇,扩充股票词汇源以及股票相关词源。采集模块101还可以从日常新闻中采集包含股票信息的语句,对处理模块歧义分析模块301以及处理模块情感分析模块306的算法模型进行训练更新,这种训练更新可以是在人工审核监督下进行的,也可以是系统自发完成的,也可以是两者相结合完成的。
以上将系统应用在股票新闻领域的描述仅仅是本发明的具体实施例,不应被视为是唯一的实施例。显然,对于本领域的专业人员来说,在了解本发明内容和原理后,都可能在不背离本发明原理、结构的情况下,将系统应用于其他领域,或将系统在股票新闻领域应用的形式和细节进行的各种修正和改变,但是这些修正和改变仍在本发明的权利要求保护范围之内。
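FIG. 12 shows an embodiment of ambiguity analysis. In this embodiment, the collection unit 302 collects information such as stock name words, words strongly related to stocks, ambiguous stock name words, and news from news websites (steps 1201, 1202 and 1203). The source of the information may be the network 105, the storage module 315, or a direct query of the database 104. In step 1217 the system obtains the ambiguity list; step 1217 can be performed by the ambiguity analysis module 301. The matching unit 303 and the processing unit 304 of the ambiguity analysis module 301 score the stock news according to the stock name words, strongly related words and ambiguous stock name words (step 1204). Based on the score, news can be divided into unambiguous news, strongly ambiguous news and other news (steps 1205, 1207 and 1206). Unambiguous news can go directly to the sentiment analysis module 306 for processing (step 1212). For strongly ambiguous news, the corpus collection unit 305 of the ambiguity analysis module 301 can extract the ambiguous collocations it contains, that is, combinations of an ambiguous word and related words (steps 1213 and 1214), which after manual review (step 1215) yield strongly ambiguous collocations (step 1216). Strongly ambiguous collocations can be used to train the ambiguity analysis model 312 (step 1211) or to decide directly whether information is ambiguous: information containing a strongly ambiguous collocation is ambiguous information. From the unambiguous, strongly ambiguous and other news obtained by scoring, the corpus collection unit 305 can collect sentences containing stocks (steps 1208 and 1209). After manual review these sentences are labelled as ambiguous or unambiguous (step 1210) and are then used to train the ambiguity analysis model 312 (step 1211). The ambiguity analysis model includes but is not limited to a maximum entropy model.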
以上对歧义分析的描述仅仅是具体的示例,不应被视为是唯一可行的实施方案。显然,对于本领域的专业人员来说,在了解歧义分析的基本原理后,可能在不背离这一原理的情况下,对实施歧义分析的具体方式与步骤进行形式和细节上的各种修正和改变,但是这些修正和改变仍在以上描述的范围之内。
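FIG. 13 shows an embodiment of ambiguity analysis. It should be noted that the flow described below covers only some embodiments of the present invention; a person of ordinary skill in the art can, without creative effort, apply the invention to other similar scenarios on the basis of this description. The ambiguity analysis begins by obtaining information (step 1301). The information may be obtained through the collection module 101, through another unit or module with an information collection function (for example, the collection unit 302 in the ambiguity analysis module 301), or from a storage module (such as the database 104 or the storage unit of another module). The information includes but is not limited to dictionaries, news, research reports on companies, announcements, product manuals, and web pages of related websites. The industries to which the information belongs include but are not limited to sports, entertainment, economy, politics and culture. The information may take forms including but not limited to text, images, audio and video. The languages of the information include but are not limited to Chinese, English, Japanese, Korean, French and German. The information may come directly from the network 105, or be extracted from the information base 511 in the database 104.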
信息经过采集之后可通过歧义分析模型312进行分析(步骤1302)。其中,上述歧义分析模型包含但不限于决策树、Rocchio、朴素贝叶斯、神经网络、支持向量机、线性最小平方拟合、K-近邻、遗传算法、最大熵等。通过歧义分析的信息就可以标注为包含但不限于歧义信息或非歧义信息(步骤1303)。另外,系统也可以对获取的信息直接进行人工标注(步骤1301,步骤1303),而无需经过歧义分析模型的分析。以上流程中的中间处理结果和最终处理结果可进行储存(步骤1304)。其中,上述存储方法包含但不限于顺序存储方法、链接存储方法、索引存储方法以及散列存储方法等。储存的位置可以是存储模块315、可以是存储单元203,也可以是数据库104等。
以上对歧义分析的描述仅仅是具体的示例,不应被视为是唯一可行的实施方案。显然,对于本领域的专业人员来说,在了解歧义分析的基本原理后,可能在不背离这一原理的情况下,对实施歧义分析的具体方式与步骤进行形式和细节上的各种修正和改变,但是这些修正和改变仍在以上描述的范围之内。
图14展示的是歧义分析的另一个实施例,即有人工监督的情况下的歧义分析流程实施例。需要说明的是,下面描述中的流程仅仅是本发明的一些实施例,对于本领域的普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些描述将本发明应用于其它类似情景。在进行歧义处理时,系统提取数据库104中的关键词词库和信息库(步骤1401、步骤1402),步骤1401和步骤1402可由采集单元302完成。上述关键词词库包含但不限于一个或多个关键词词典502,一个或多个相关词词典503和一个或多个歧义列表504。上述关键词词典502是由特 定名称词汇组成的词典,上述特定名称词汇包括但不限于特定领域的专有名词、全称、简称、代码、同义词、缩略词。上述相关词词典503可以是由与特定名称词汇相关的词汇组成的词典,其中与特定名称词汇相关的词汇可以包含但不限于,例如,行业词汇、高管姓名、主营产品名称、名词、动词、形容词、短语搭配、短句、领域特定词汇的行业词汇、近义词、反义词、常见搭配词、组成部分名词、派生词、复合词等类似的词汇,或者上述词汇的任意组合;上述歧义列表可以是由人工审核由关键词词典得到的;上述的信息库可以是包含特定名称词汇的信息。上述特定名称词汇所属行业可以包含但不限于,例如,体育、娱乐、经济、政治、文化等。包含上述信息库中的信息可以包括但不限于,例如,词典、新闻、有关公司的研究报告、公告、产品手册、以及相关网站网页等类似的信息,或者上述信息的任意组合。
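Step 1403 matches the keyword lexicon against the information base; matching methods include but are not limited to regular expressions and double-array dictionary matching. Step 1403 can be performed by the matching unit 303. In step 1404, the system processes and analyzes the matching result to obtain the analysis result Score. Step 1404 can be performed by the processing unit 304. In some embodiments, Score can be computed with the following formula:
$$\mathrm{Score}(news, stock) = \sum_{i} \frac{\pm\,(weight_i \times count_i)}{doc\_len} \tag{001}$$
where news denotes a given piece of information, stock denotes a specific name word involved in the information, i indexes the i-th name word, strongly related word or ambiguous name word of stock, weight_i is the weight of that name word, strongly related word or ambiguous name word, count_i is the number of times word i occurs, and doc_len is the text length of the information.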
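A minimal sketch of formula (001). The specification leaves the choice of sign (±) for each term open; the example assumes +1 for name words and strongly related words and -1 for ambiguous name words, counts occurrences by substring search, and measures doc_len in characters. All three choices, the function name and the sample data are assumptions.

```python
def ambiguity_score(text, weighted_terms):
    """Compute sum_i sign_i * weight_i * count_i / doc_len for one document.

    weighted_terms: iterable of (term, weight, sign) with sign in {+1, -1}.
    """
    doc_len = len(text)
    if doc_len == 0:
        return 0.0
    total = sum(sign * weight * text.count(term)
                for term, weight, sign in weighted_terms)
    return total / doc_len


news = "苹果发布新款手机,苹果股价上涨"
terms = [
    ("苹果", 1.0, +1),   # name word
    ("手机", 2.0, +1),   # strongly related word (assumed weight)
    ("水果", 3.0, -1),   # word suggesting the ambiguous sense (assumed)
]
print(ambiguity_score(news, terms))
```

Dividing by doc_len keeps long articles from scoring higher merely because they repeat the name more often.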
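However, the information may match only the specific name word while none of its strongly related words appear. In that case a fixed score is assigned according to whether the specific name word appears in the ambiguity list (that is, whether it is ambiguous):
if the matched specific name word is ambiguous, Score(news, stock) = α;
if the matched specific name word is unambiguous, Score(news, stock) = β.
At the same time, α and β are used as thresholds (step 1405). The two thresholds may be fixed, or adjusted to the situation. For example, the user can customize them to tune the sensitivity of the system. When the amount of collected information is very large, the user can raise the sensitivity by increasing β or decreasing α, to safeguard the accuracy of the ambiguity decision. Conversely, when the amount of collected information is very small, the user can lower the sensitivity by decreasing β or increasing α, to safeguard the completeness of the information.
If the result of step 1404 is greater than or equal to β (step 1405), the information is marked as unambiguous (step 1409); if the result is less than or equal to α (step 1406), the information is marked as ambiguous (step 1408); if the result lies between α and β, the information can be marked as ambiguous or unambiguous through manual review or model review (steps 1408 and 1409). The model includes but is not limited to decision trees, Rocchio, naive Bayes, neural networks, support vector machines, linear least-squares fit, K-nearest neighbors, genetic algorithms and maximum entropy. The marking may be manual, automatic, or a combination of the two.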
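The two-threshold decision above can be sketched as a three-way triage. The function name, the label strings and the numeric thresholds are assumptions; the check that α is below β is implied by the text, which treats scores between the two as undecided.

```python
def triage(score, alpha, beta):
    """Map an ambiguity score to a decision using thresholds alpha < beta."""
    if alpha >= beta:
        raise ValueError("alpha must be smaller than beta")
    if score >= beta:
        return "unambiguous"      # step 1409
    if score <= alpha:
        return "ambiguous"        # step 1408
    return "needs_review"         # manual or model review


for s in (0.30, 0.02, 0.10):
    print(s, triage(s, alpha=0.05, beta=0.20))
```

Raising β or lowering α widens the review band, which matches the text's notion of a more sensitive system that sends more items to review.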
在步骤1403中,可以用关键词词库中全部或部分信息与包含股票名称的新闻进行匹配,例如可以只采用相关词词典与新闻进行匹配,还可以将相关词词典与歧义列表组合起来与新闻进行匹配。另外,该流程中的有些步骤可以是顺序进行的,也可以是同步进行的,如步骤1401和步骤1402。另外,该流程中的有些步骤也不是必须的,例如对于一个新闻可以直接进行人工审核是否歧义而跳过其它中间环节。
以上对人工监督下的歧义分析流程实施例的描述仅仅是具体的示例,不应被视为是唯一可行的实施方案。显然,对于本领域的专业人员来说,在了解人工监督下的歧义分析的基本原理后,可能在不背离这一原理的情况下,对实施歧义分析的具体方式与步骤进行形式和细节上的各种修正和改变,但是这些修正和改变仍在以上描述的范围之内。
图15展示的是训练歧义分析模型的一个实施例。需要说明的是,下面描述中的流程仅仅是本发明的一些实施例,对于本领域的普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些描述将本发明应用于其它类似情景。在进行歧义处理时,系统提取数据库104中的 关键词词库和信息库(步骤1501、步骤1502)。步骤1501和步骤1502可由采集单元302完成。上述关键词词库包含但不限于一个或多个关键词词典502,一个或多个相关词词典503和一个或多个歧义列表504。上述关键词词典502可以由特定名称词汇组成的词典。上述特定名称词汇可以包括但不限于,例如,特定领域的专有名词、全称、简称、代码、同义词、缩略词等类似名称词汇,或者上述名称词汇的任意组合。上述相关词词典503可以由与特定名称词汇相关的词汇组成的词典。与特定名称词汇相关的词汇可以包含但不限于,例如,行业词汇、高管姓名、主营产品名称、名词、动词、形容词、短语搭配、短句、领域特定词汇的行业词汇、近义词、反义词、常见搭配词、组成部分名词、派生词、复合词等。上述歧义列表可以是由人工审核由关键词词典得到的。上述的信息库可以包含特定名称词汇的信息。上述特定名称词汇所属行业可以包含但不限于,例如,体育、娱乐、经济、政治、文化等。包含上述信息库中的信息可以包括但不限于词典、新闻、有关公司的研究报告、公告、产品手册、以及相关网站网页等类似信息,或者上述信息的任意组合。步骤1503将关键词词库与信息库进行匹配,匹配方法包含但不限于正则表达式、双数组词典匹配等,所述匹配可由匹配单元303完成。系统对匹配结果进行分析处理,得到分析结果Score。步骤1504可由处理单元304完成。在一些实施例中,score可以由下面的公式计算,
$$\mathrm{Score}(news, stock) = \sum_{i} \frac{\pm\,(weight_i \times count_i)}{doc\_len} \tag{002}$$
其中,news表示某个信息,stock表示新闻中涉及的某一个特定名称词汇,i表示stock的第i个名称词汇、强相关词或歧义名称词汇,weight表示该名称词汇、强相关词汇或歧义名称词汇的权重,count表示词i出现的次数,doc_len表示所述信息的文本长度。
然而,存在以下可能的情况,信息中仅能匹配特定名称词汇,未出现特定名称词汇强相关词。此时,根据特定名称词汇是否出现在歧义列表(即是否有歧义)给出固定分值:
匹配的特定名称词汇有歧义,Score(news,stock)=α;
匹配的特定名称词汇非歧义,Score(news,stock)=β;
同时,将α、β设为阈值(步骤1505)。得分大于或等于β语句集合将被标记为非歧义语句集合,得分小于或等于α的语句集合将被标记为歧义语句集合。这两个阈值可以是固定的,也可以根据具体的情况做出一定的调整。比如,用户可以自定义这两个阈值以调整系统的敏感度。在收集信息的量非常大的情况下,用户可以通过增大β或者减小α来提高系统敏感度以确保歧义判定的准确率。相反地,在收集信息的量非常小的情况下,用户可以通过减小β或者增大α来降低系统的敏感度以确保信息的完备性。
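Collecting the model training corpus:
(1) If the result of step 1504 is greater than or equal to β (step 1505), the information is marked as unambiguous (step 1509). The marking may be manual, automatic, or a combination of the two. In step 1510 the system collects the corpus; step 1510 can be performed by the corpus collection unit 305. The collected corpus may be the whole unambiguous information, sentences extracted from it that contain the specific name word, or some unambiguous collocations in it.
(2) If the result of step 1504 is less than or equal to α (step 1506), the information is marked as ambiguous (step 1508). The marking may be manual, automatic, or a combination of the two. The corpus collection unit 305 can collect the corpus (step 1510). The collected corpus may be the whole ambiguous information, sentences extracted from it that contain the specific name word, or some ambiguous collocations in it.
(3) If the result of step 1504 lies between α and β, the information can be marked as ambiguous or unambiguous after manual review (steps 1507, 1508 and 1509). The marking may be manual, automatic, or a combination of the two. In step 1510 the system collects the corpus; step 1510 can be performed by the corpus collection unit 305. The collected corpus may be the whole ambiguous or unambiguous information, sentences extracted from it that contain the specific name word, or some ambiguous and unambiguous collocations in it.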
在步骤1503中,可以用关键词词库中全部或部分信息与包含股票名称的新闻进行匹配,例如可以只采用相关词词典与新闻进行匹配,还可以将相关词词典与歧义列表组合起来与新闻进行匹配。另外,该流程中的有些步骤可以是顺序进行的,也可以是同步进行的。如步骤1501和步骤1502,可以是同时进行的,也可以是顺序进行的。另外,该流程中的有些步骤也不是必须的,例如对于一个新闻可以直接进行人工审核是否歧义而跳过其它中间环节。
根据上述(1),(2),(3)所获得的已标注为歧义、非歧义两个类别的信息中包含特定名称词汇的句子,将每个句子进行分词,从而获得一组由特定名称词汇、周围词汇以及相对位置信息形成的要素。将这些要素按照指定的格式形成要素集合,据此训练歧义分析模型Model(步骤1511):
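[Formula image PCTCN2015086751-appb-000001: definition of the ambiguity analysis model Model trained on the feature set; not reproduced in this text]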
此歧义分析模型Model可以在歧义分析模块中自动判断某新闻中关于某股票名称的歧义性。
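The features described above (the specific name word, its surrounding words and their relative positions) can be sketched as follows. The feature string format `w[offset]=token`, the window size and the pre-segmented sample sentences are assumptions; the specification only says the elements are arranged "in a specified format".

```python
def context_features(tokens, keyword, window=2):
    """Return position-tagged context words around each occurrence of keyword.

    tokens is an already word-segmented sentence (segmentation not shown).
    """
    features = []
    for idx, tok in enumerate(tokens):
        if tok != keyword:
            continue
        for offset in range(-window, window + 1):
            j = idx + offset
            if offset == 0 or j < 0 or j >= len(tokens):
                continue
            features.append(f"w[{offset:+d}]={tokens[j]}")
    return features


print(context_features(["小米", "发布", "新款", "手机"], "小米"))
print(context_features(["今天", "煮", "小米", "粥"], "小米"))
```

The two sentences use the same name word in a company sense and in a food sense; the context features differ, which is the signal a classifier such as a maximum entropy model could learn from.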
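The above description of training the ambiguity analysis model is merely a specific example and should not be regarded as the only feasible implementation. Obviously, a person skilled in the art who understands the basic principles of training the ambiguity analysis model may make various modifications and changes in form and detail to the specific ways and steps of training it without departing from these principles, and such modifications and changes remain within the scope described above.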
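FIG. 16 shows an embodiment of the sentiment analysis module. The system collects sentiment seed words (step 1601). The sentiment seed words may include but are not limited to positive and negative sentiment words, for example 佳, 优, 增, 好, 增长, 盈, 涨, 补涨, 赚, 涨停, 飙升, 盈利, 减少, 降, 锐减, 补跌, 下降, 亏损, 赔, 亏, 跌停, 减持 and 降低. At the same time, the system visits financial websites and gathers stock-related news (step 1602). By processing the sentiment seed words together with the stock-related news, the system builds sentiment word collocations and keeps expanding them (step 1603). The expansion can be carried out by periodically visiting major financial websites and extracting stock-related news. Once the sentiment words and collocations have been expanded, the system obtains the sentiment word collocation set (step 1604). Separately, the system reviews the stock-related news manually or automatically and filters out sentences of low relevance and ambiguous sentences, obtaining the set of unambiguous stock sentences (step 1605). The system matches the set of unambiguous stock sentences against the sentiment word collocation set in order to identify the sentiment type of the sentences. The matching yields a set of positive and negative sentences (step 1606), which can be manually reviewed. After manual review the sentences are labelled with one of three sentiment types: positive, negative or neutral (step 1607). The sentences labelled neutral by manual review are sent to the sentiment analyzer to train sentiment category recognition (step 1608). Algorithms the sentiment analyzer may use include but are not limited to maximum entropy models, support vector machines and naive Bayes. Once trained, the sentiment analyzer can classify the sentences that manual review labelled neutral (step 1611). The sentences labelled positive or negative by manual review are sent to the positive/negative collocation scoring engine for further sentiment type recognition (step 1609). The scoring engine quantifies how well the set of unambiguous stock sentences matches the sentiment word collocation set and assigns a score accordingly. A high score indicates that the stock sentence or sentence set contains one or more strong sentiment word collocations, so its sentiment type can be determined directly as positive or negative (step 1610). A low score indicates that it contains no strong sentiment word collocation, so low-scoring sentences are sent to the sentiment analyzer for sentiment type determination (step 1611).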
以上的描述仅仅是本发明的具体实施例,不应被视为是唯一的实施例。显然,对于本领域的专业人员来说,在了解本发明内容和原理后,都可能在不背离本发明原理、结构的情况下,进行形式和细节上的各种修正和改变,但是这些修正和改变仍在本发明的权利要求保护范围之内。
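FIG. 17 shows an example of determining the sentiment type. In step 1701 the system obtains information, which may be unambiguous or ambiguous information that has passed ambiguity analysis, information that has been sentiment-classified but not yet labelled with a sentiment type, or raw information that has not been processed at all. After being obtained, the information can be stored, for example in the database 104 (step 1704). Information that has been sentiment-classified but not labelled is labelled with its sentiment category directly (step 1703). Unambiguous and ambiguous information is sent to the sentiment analyzer for sentiment classification. On receiving the information, the sentiment analyzer stores it in the database 104. Algorithms the sentiment analyzer may use include but are not limited to maximum entropy models, support vector machines and naive Bayes. The sentiment analyzer first checks whether the information contains a strong sentiment collocation; if it does, the sentiment type of the information can be determined directly (step 1702) and the information is labelled with the corresponding sentiment category (step 1703). For information without a strong sentiment collocation, the scoring engine in the sentiment analyzer scores the sentiment types the information contains, and the sentiment type is determined from the scores. Once the sentiment type has been determined, it is labelled on the information (step 1703), and the labelled information is stored in the database 104.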
需要注意的是,上述示例只是为了便于理解发明,不应被视为是本发明唯一的实施例。显然,对于本领域的专业人员来说,在了解本发明内容和原理后,都可能在不背离本发明原理、结构的情况下,进行形式和细节上的各种修正和改变,但是这些修正和改变仍在本发明的权利要求保护范围之内。
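FIG. 18 describes an embodiment of the sentiment classification method. The system obtains the set G of unambiguous sentences (step 1801). Step 1801 can be performed by the collection unit 307 of the sentiment analysis module 306 by accessing the storage module 315. The system also obtains the sentiment word collocation set Ω (step 1802). Step 1802 can be performed by the processing unit 309 of the sentiment analysis module 306 by accessing the sentiment word collocation library 507 in the database 104. The system matches G against Ω (step 1803), which can be performed by the matching unit 308 of the sentiment analysis module 306. Step 1803 is a logical decision: sentences in G that match Ω form the set H of sentences containing sentiment collocations (step 1806); the remaining sentences form the set H' of sentences without sentiment collocations (step 1805). The system matches H against the strong positive/negative sentiment word collocation set F (step 1807). F includes but is not limited to collocations whose sentiment matching accuracy, as established by manual review, exceeds a specific threshold (for example, above 90%). Step 1807 can be performed by the matching unit 308. Step 1808 makes a logical decision on the matching result, dividing H into sentences that contain strong positive/negative collocations (step 1809) and sentences that do not (step 1810). The sentiment analyzer 311 of the sentiment analysis module 306 classifies the sentiment of the sentences without strong collocations (step 1811); algorithms it may use include but are not limited to maximum entropy models, support vector machines, naive Bayes and decision trees. After classification, the system has the set M' of all sentences with positive or negative sentiment (step 1812). Step 1813 checks whether all sentences in M' share a single sentiment; if they do, the system labels the news with that sentiment type (step 1815). Step 1815 can be performed by the processing unit 309. If the sentences in M' carry two or more sentiments, the processing unit 309 of the sentiment analysis module 306 compares the positive and negative scores in M' according to a scoring algorithm (step 1814) and labels M' with the higher-scoring sentiment category (step 1815). The algorithm must satisfy the following conditions. First, the degree of positivity or negativity of a strong collocation can be defined manually and is one element of the score. Second, the distance between the strong collocation and the stock is another factor in the score. Third, if the final polarity is determined by the model, its score must be lower than the score of any strong rule. Fourth, a collocation appearing in the title scores higher than one appearing elsewhere (for example in the body of the news). After the sentences in H have been classified, the set is labelled with its sentiment type (step 1815) and the classified news is presented to the user. Sentences in H' are labelled as neutral news (step 1817) and presented to the user as neutral news. It should be noted that after sentences are labelled neutral, the system can also query the semantic knowledge base 512 to make a second sentiment judgment on them (step 1818). Step 1818 can be performed by the sentiment analysis module 306. The semantic knowledge base 512 can recognize sentences, phrases or paragraphs in natural language that contain no sentiment words but still express sentiment. For example: "Today my husband and I filed for divorce, and he wants to take custody of the child from me." This sentence contains no sentiment words, so ordinary sentiment analysis cannot identify its sentiment category, but by querying the semantic knowledge base 512 its category can be identified. After the second judgment, the system labels the sentence with the corresponding sentiment type (step 1815). After sentiment classification, the system can present the sentiment type of a news item as a whole, and can also present the sentiment type that a single news item expresses toward one or more of the several financial products, of the same or different kinds, that it mentions.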
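The four conditions on the scoring algorithm of step 1814 can be sketched as below. The specification states the conditions but gives no formula, so everything concrete here is an assumption: the score `strength / (1 + distance)`, the title multiplier, the distance cap, the constant model score, summing evidence per polarity, and returning "neutral" on a tie.

```python
MODEL_SCORE = 0.01   # assumed score for a model-based decision (condition 3)
TITLE_BONUS = 2.0    # assumed multiplier for hits in the title (condition 4)
MAX_DISTANCE = 50    # assumed cap; keeps every rule score above MODEL_SCORE


def collocation_score(strength, distance, in_title):
    """Score one strong-collocation hit.

    strength: manually defined polarity degree, >= 1 (condition 1).
    distance: tokens between the collocation and the stock name (condition 2).
    """
    if strength < 1:
        raise ValueError("strength must be >= 1")
    capped = min(max(distance, 0), MAX_DISTANCE)
    base = strength / (1 + capped)
    return base * (TITLE_BONUS if in_title else 1.0)


def label_news(evidence):
    """Sum scores per polarity and return the higher-scoring label."""
    totals = {"positive": 0.0, "negative": 0.0}
    for ev in evidence:
        if ev["source"] == "model":
            totals[ev["polarity"]] += MODEL_SCORE
        else:
            totals[ev["polarity"]] += collocation_score(
                ev["strength"], ev["distance"], ev["in_title"])
    if totals["positive"] > totals["negative"]:
        return "positive"
    if totals["negative"] > totals["positive"]:
        return "negative"
    return "neutral"  # tie handling is an assumption


print(label_news([
    {"polarity": "positive", "source": "rule",
     "strength": 2, "distance": 3, "in_title": True},
    {"polarity": "negative", "source": "model"},
]))
```

With the distance cap, the smallest possible rule score is 1/51, which stays above MODEL_SCORE, so a single model decision can never outweigh a single strong rule.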
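The steps of the methods described herein may be carried out in any suitable order, or simultaneously, where appropriate. In addition, individual steps may be deleted from any of the methods without departing from the spirit and scope of the subject matter described herein. Aspects of any of the examples described above may be combined with aspects of any of the other examples described to form further examples without losing the effect sought.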
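The score comparison of step 1814 in Figure 18 is only constrained by four conditions, and the text does not give a concrete formula. The sketch below is one hypothetical scoring function that satisfies them; the constants `MODEL_SCORE_CAP`, `TITLE_BONUS`, and `MAX_DISTANCE`, the `Evidence` record, and the clamping ranges are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import List

MODEL_SCORE_CAP = 0.1  # assumed ceiling for model-based evidence (condition 3)
TITLE_BONUS = 2.0      # assumed multiplier for collocations in the title (condition 4)
MAX_DISTANCE = 50      # assumed distance (in tokens) beyond which proximity stops decaying


@dataclass
class Evidence:
    label: str                # "positive" or "negative"
    source: str               # "strong_rule" or "model"
    degree: float = 1.0       # manually assigned polarity degree (condition 1)
    distance: int = 0         # tokens between the collocation and the stock mention (condition 2)
    in_title: bool = False    # whether the collocation occurs in the title
    confidence: float = 1.0   # model confidence, used only when source == "model"


def score(e: Evidence) -> float:
    """Score one piece of evidence; strong rules always outscore the model."""
    if e.source == "model":
        return MODEL_SCORE_CAP * min(max(e.confidence, 0.0), 1.0)
    degree = min(max(e.degree, 0.5), 1.0)  # clamp so a strong rule scores at least 0.25
    proximity = 1.0 - 0.5 * min(max(e.distance, 0), MAX_DISTANCE) / MAX_DISTANCE
    base = degree * proximity
    return base * (TITLE_BONUS if e.in_title else 1.0)


def label_article(evidence: List[Evidence]) -> str:
    """Sum scores per polarity and return the higher-scoring label."""
    if not evidence:
        return "neutral"
    totals = {"positive": 0.0, "negative": 0.0}
    for e in evidence:
        totals[e.label] += score(e)
    # Ties resolve to "positive" here only because of dictionary order.
    return max(totals, key=totals.get)
```

With these constants the weakest strong rule scores 0.25, which stays above the model ceiling of 0.1, and a collocation in the title scores twice as much as the same collocation in the body.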
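Figure 19 depicts an embodiment of training the sentiment analyzer. First, the system collects and builds a seed sentiment vocabulary dictionary; this collection and construction may be performed by the collection module 101. Sources include, but are not limited to, literature (books, newspapers, journals, patent documents, dissertations, official documents, and so on), academic reports, market reports, news, commentary, online dictionaries, existing dictionaries in the field, research reports on relevant companies, announcements, product manuals, and related websites. Information may be obtained in a centralized or distributed manner, locally or remotely, over wired or wireless connections, manually or automatically, or by a combination of these.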
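On the basis of the seed sentiment vocabulary dictionary, the system collects further information to expand the sentiment vocabulary dictionary and the sentiment vocabulary collocations; this further collection may be performed by the collection unit 201 of the collection module 101. Sources include, but are not limited to, literature (books, newspapers, journals, patent documents, dissertations, official documents, and so on), academic reports, market reports, news, commentary, online dictionaries, existing dictionaries in the field, research reports on relevant companies, announcements, product manuals, and related websites. Information may be obtained in a centralized or distributed manner, locally or remotely, over wired or wireless connections, manually or automatically, or by a combination of these. The algorithms used include, but are not limited to, the PMI (pointwise mutual information) algorithm, the log-likelihood ratio, the chi-square test, cosine similarity, the Dice coefficient, and F1-measure-like metrics.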
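As an illustration of how PMI, one of the algorithms listed above, could be used to expand a seed dictionary, the following sketch computes sentence-level PMI between candidate words and seed words. It is a minimal example under assumptions: sentences are pre-tokenized into word sets, and the function names and the threshold of 0.3 are hypothetical.

```python
import math
from typing import List, Set


def pmi(word_a: str, word_b: str, sentences: List[Set[str]]) -> float:
    """Pointwise mutual information of two words over sentence co-occurrence."""
    n = len(sentences)
    count_a = sum(1 for s in sentences if word_a in s)
    count_b = sum(1 for s in sentences if word_b in s)
    count_ab = sum(1 for s in sentences if word_a in s and word_b in s)
    if n == 0 or count_ab == 0:
        return float("-inf")  # the words never co-occur
    # log2( P(a, b) / (P(a) * P(b)) ) with probabilities estimated by counts
    return math.log2((count_ab * n) / (count_a * count_b))


def expand_lexicon(
    seeds: Set[str], sentences: List[Set[str]], threshold: float = 0.3
) -> Set[str]:
    """Return non-seed words whose PMI with any seed reaches the threshold."""
    vocabulary: Set[str] = set().union(*sentences)
    return {
        word
        for word in vocabulary - seeds
        if any(pmi(word, seed, sentences) >= threshold for seed in seeds)
    }
```

A word that tends to appear in the same sentences as a seed sentiment word receives a positive PMI and becomes a candidate for the expanded dictionary; in practice such candidates would still go through the manual review described in this embodiment.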
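Through information collection, the set Ω of sentiment vocabulary collocations is obtained (step 1901), and non-ambiguous sentences are obtained (step 1902). It should be noted that the set Ω may be obtained in stages, as described in this embodiment, or in a single step.
The sentiment analysis module 306 matches the set Ω against the set of non-ambiguous sentences (step 1903); the sentences obtained by matching are recorded as the sentiment sentence set H (step 1904). Matching may be manual or automatic, and the algorithms that may be used include, but are not limited to, regular expressions.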
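The regular-expression matching of step 1903 might look like the sketch below. It assumes, purely for illustration, that a collocation is a pair of words that must occur in order within a bounded gap; the pair representation, the `max_gap` parameter, and the English example sentences are hypothetical and not taken from the source.

```python
import re
from typing import Dict, List, Tuple

Collocation = Tuple[str, str]


def compile_collocation(pair: Collocation, max_gap: int = 20) -> re.Pattern:
    """Build a pattern: first word, at most max_gap characters, second word."""
    first, second = pair
    return re.compile(
        re.escape(first) + ".{0,%d}?" % max_gap + re.escape(second),
        flags=re.IGNORECASE,
    )


def match_sentences(
    collocations: List[Collocation], sentences: List[str]
) -> Tuple[Dict[str, List[Collocation]], List[str]]:
    """Split sentences into matched ones (with their collocations) and unmatched ones."""
    patterns = {pair: compile_collocation(pair) for pair in collocations}
    matched: Dict[str, List[Collocation]] = {}
    unmatched: List[str] = []
    for sentence in sentences:
        hits = [pair for pair, pattern in patterns.items() if pattern.search(sentence)]
        if hits:
            matched[sentence] = hits
        else:
            unmatched.append(sentence)
    return matched, unmatched
```

The matched dictionary plays the role of the sentiment sentence set H, and it keeps the collocations that matched each sentence so that per-collocation statistics can be computed later.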
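The sentiment sentence set H is reviewed manually, and each sentence in the set is labeled with one of three sentiment categories: positive, negative, or neutral (step 1905). After the review, the manually classified sentence set is stored in the corpus collection unit 310 (step 1909). For each sentiment vocabulary collocation, the system automatically counts how many of the sentences it matched in H fall into the positive, negative, and neutral categories, and from these counts obtains the sentiment classification accuracy R of the collocation (step 1906). In some embodiments, the sentiment classification accuracy can be calculated by the following formula:
[Corrected under Rule 26, 24.09.2016]
[Equation image WO-DOC-MATHS-1: formula for the positive sentiment classification accuracy R1 of the collocation; not reproduced in this text]
The negative and neutral sentiment classification accuracies R2 and R3 of the collocation are calculated analogously.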
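Because the formula itself is an image that is not reproduced here, the sketch below assumes one plausible reading of the surrounding description: for a given collocation, the accuracy for a category is the number of matched sentences manually labeled with that category divided by the total number of sentences the collocation matched. The function name and the handling of an empty input are assumptions.

```python
from collections import Counter
from typing import Dict, List

CATEGORIES = ("positive", "negative", "neutral")


def classification_accuracy(labels: List[str]) -> Dict[str, float]:
    """Share of each manual label among the sentences matched by one collocation.

    Assumed reading: R1, R2, R3 are the positive, negative, and neutral shares.
    """
    total = len(labels)
    if total == 0:
        return {category: 0.0 for category in CATEGORIES}
    counts = Counter(labels)
    return {category: counts[category] / total for category in CATEGORIES}
```

For example, a collocation that matched 20 sentences, 19 of them labeled positive and 1 labeled negative, would have a positive accuracy of 0.95 under this reading.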
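The accuracies R of the collocation for the three sentiment categories are compared with a preset threshold (set to 90% in this embodiment) (step 1907). If the accuracy for any one sentiment category is greater than 90%, the collocation is determined to be a strong sentiment collocation. For example, if the positive sentiment classification accuracy R1 of a collocation in the sentiment sentence set H exceeds 90%, the collocation is directly determined to be a strong positive sentiment collocation. Collecting all strong sentiment collocations yields the strong sentiment collocation set F (step 1908). The set F is stored in the corpus collection unit 310 (step 1909). The set F is defined as follows:
[Equation image PCTCN2015086751-appb-000003: definition of the strong sentiment collocation set F; not reproduced in this text]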
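The threshold test of steps 1907 and 1908 can be sketched as follows. Since the formal definition of F is an image that is not reproduced here, the code follows the prose only: a collocation joins F when its accuracy for some category is strictly greater than the threshold. The input layout and the function name are hypothetical.

```python
from typing import Dict


def build_strong_set(
    accuracy_by_collocation: Dict[str, Dict[str, float]], threshold: float = 0.9
) -> Dict[str, str]:
    """Map each strong collocation to the category whose accuracy exceeds the threshold."""
    strong: Dict[str, str] = {}
    for collocation, accuracies in accuracy_by_collocation.items():
        category, best = max(accuracies.items(), key=lambda item: item[1])
        if best > threshold:  # strictly greater, as in "greater than 90%"
            strong[collocation] = category
    return strong
```

With a threshold of 0.9 or higher, at most one category can pass for a given collocation, so taking the maximum loses nothing.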
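After the corpus collection unit has finished collecting the corpus, the set of elements in the corpus collection unit is used to train the sentiment analyzer. Corpus collection may be real-time or periodic.
The sentences of set H that were labeled positive, negative, or neutral (step 1905) can also serve as a corpus for training the sentiment analyzer (step 1910). The algorithm model Model' that the sentiment analyzer may use is a supervised learning algorithm, including but not limited to maximum entropy, naive Bayes, support vector machines, non-negative matrix tri-factorization, genetic algorithms, and k-nearest neighbors. Features used in the supervised model include, but are not limited to, word occurrence counts, parts of speech, relative word positions, dependency features between words, and abstract word features (such as word vectors obtained by unsupervised learning). The sentiment analyzer algorithm model Model' can be expressed as:
[Corrected under Rule 26, 24.09.2016]
[Equation image WO-DOC-MATHS-2: expression for the model Model'; not reproduced in this text]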
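To make the training step concrete, here is a small multinomial naive Bayes classifier, one of the supervised algorithms named above, trained on labeled sentences with word occurrence counts as the only feature. It is a sketch under assumptions rather than the Model' of the source: whitespace tokenization, Laplace smoothing with `alpha`, and the class name `NaiveBayesSentiment` are illustrative choices.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Set, Tuple


class NaiveBayesSentiment:
    """Multinomial naive Bayes over word counts with Laplace smoothing."""

    def __init__(self, alpha: float = 1.0) -> None:
        self.alpha = alpha
        self.class_counts: Counter = Counter()
        self.word_counts: Dict[str, Counter] = defaultdict(Counter)
        self.vocabulary: Set[str] = set()

    def fit(self, corpus: List[Tuple[str, str]]) -> "NaiveBayesSentiment":
        """Train on (sentence, label) pairs, e.g. the manually labeled set H."""
        for sentence, label in corpus:
            tokens = sentence.lower().split()
            self.class_counts[label] += 1
            self.word_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        return self

    def predict(self, sentence: str) -> str:
        """Return the label with the highest log posterior for the sentence."""
        tokens = sentence.lower().split()
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = "", float("-inf")
        for label, doc_count in self.class_counts.items():
            log_prob = math.log(doc_count / total_docs)  # class prior
            denom = sum(self.word_counts[label].values()) + self.alpha * vocab_size
            for token in tokens:
                log_prob += math.log((self.word_counts[label][token] + self.alpha) / denom)
            if log_prob > best_score:
                best_label, best_score = label, log_prob
        return best_label
```

The other listed features, such as part of speech or word vectors, would replace or extend the token counts used here.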
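Figure 20 depicts an embodiment of classified presentation. It shows a user interface for classified presentation, which can be displayed on a peripheral device. Peripheral devices include, but are not limited to, mobile devices, mobile phones, laptop computers, tablet computers, wearable devices, smart home appliances, smart vehicles, and smart instruments. In this embodiment, the classified presentation is shown in a graphical interface, which lists the information related to the user's search keyword in turn under the three sentiment categories: positive, negative, and neutral.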
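The above description is merely a specific embodiment of the classified presentation module of the invention and should not be regarded as the only embodiment. Clearly, a person skilled in the art who understands the content and principles of the invention may make various modifications and changes in form and detail without departing from the principles and structure of the invention; such modifications and changes remain within the scope of protection of the claims of the invention.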
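The grouping behind the classified presentation of Figure 20 can be sketched as a simple filter and bucket step. The record layout (dictionaries with `title` and `sentiment` keys), the substring keyword match, and the function name are assumptions made for illustration; the source does not specify them.

```python
from typing import Dict, List

DISPLAY_ORDER = ("positive", "negative", "neutral")


def group_for_display(
    records: List[Dict[str, str]], keyword: str
) -> Dict[str, List[str]]:
    """Bucket the titles that mention the keyword by their sentiment label."""
    groups: Dict[str, List[str]] = {category: [] for category in DISPLAY_ORDER}
    needle = keyword.lower()
    for record in records:
        if needle in record["title"].lower() and record["sentiment"] in groups:
            groups[record["sentiment"]].append(record["title"])
    return groups
```

Because Python dictionaries preserve insertion order, iterating over the result yields the positive, negative, and neutral lists in the order in which the interface lists them.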

Claims (25)

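  1. A system, comprising:
    a computer-readable storage medium storing executable modules, the executable modules including:
    a collection module capable of collecting information and building a first lexicon, a second lexicon, and at least one information base;
    a processing module capable of performing ambiguity analysis on information, performing sentiment analysis on the information after ambiguity analysis, and collecting corpora;
    a database capable of storing the lexicons and the information base;
    a processor capable of executing the executable modules.
  2. The system of claim 1, further comprising an update module capable of expanding the first lexicon, the second lexicon, and the at least one information base.
  3. The system of claim 1, wherein the collection module is further capable of reviewing the ambiguity of the words in the first lexicon and building an ambiguity list.
  4. The system of claim 1, wherein the processing module comprises an ambiguity analysis module configured to perform ambiguity analysis on information.
  5. The system of claim 4, wherein the ambiguity analysis module comprises a matching unit and a processing unit.
  6. The system of claim 1, wherein the processing module comprises an ambiguity analysis model.
  7. The system of claim 6, wherein the ambiguity analysis model comprises a model trained with a maximum entropy algorithm.
  8. The system of claim 1, wherein the processing module comprises a sentiment analysis module configured to perform sentiment analysis on information.
  9. The system of claim 8, wherein the sentiment analysis module comprises a matching unit and a processing unit.
  10. The system of claim 1, wherein the processing module further comprises a sentiment analyzer.
  11. The system of claim 10, wherein the sentiment analyzer comprises one of a model trained with a maximum entropy algorithm and an SVM model.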
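  12. A method, comprising:
    obtaining user input;
    querying a database according to the user input, and extracting non-ambiguous information that contains the user input and has been labeled with a sentiment category;
    classifying the labeled non-ambiguous information by sentiment category.
  13. The method of claim 12, wherein the non-ambiguous information is labeled by the following steps:
    retrieving a first lexicon and an information base;
    matching and scoring the information in the information base using the first lexicon;
    identifying a set of ambiguous information and a set of non-ambiguous information according to the scores.
  14. The method of claim 12, wherein the non-ambiguous information can be labeled by an ambiguity analysis model.
  15. The method of claim 14, wherein the ambiguity analysis model comprises a model trained with a maximum entropy algorithm.
  16. The method of claim 12, wherein the sentiment category is labeled by the following steps:
    retrieving a second lexicon and an information base;
    matching the information in the information base using the second lexicon to obtain a set of sentiment sentences;
    reviewing and labeling the sentiment category of the sentences in the set of sentiment sentences.
  17. The method of claim 12, wherein the sentiment category can be labeled by a sentiment analyzer.
  18. The method of claim 17, wherein the sentiment analyzer comprises one of a model trained with a maximum entropy algorithm and an SVM model.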
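  19. A method, comprising:
    collecting information, and building and/or expanding a first lexicon and an information base;
    matching and scoring the information in the information base using the first lexicon;
    identifying a set of ambiguous information and a set of non-ambiguous information according to the scores;
    collecting a corpus from the set of ambiguous information and the set of non-ambiguous information;
    training an ambiguity analysis model using the corpus.
  20. The method of claim 19, wherein the ambiguity analysis model comprises a model trained with a maximum entropy algorithm.
  21. The method of claim 19, wherein the first lexicon further comprises an ambiguity list.
  22. The method of claim 19, wherein the method of training the information ambiguity analysis model further comprises manually reviewing whether information is ambiguous.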
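  23. A method, comprising:
    collecting information, and building and/or expanding a second lexicon and an information base;
    matching the information in the information base using the second lexicon to obtain a set of sentiment sentences;
    reviewing and labeling the sentiment category of the sentences in the set of sentiment sentences;
    calculating the accuracy of each sentiment collocation in the set of sentiment sentences;
    collecting a corpus from the set of sentiment sentences;
    training a sentiment analysis model using the corpus.
  24. The method of claim 23, wherein the sentiment analyzer comprises one of a model trained with a maximum entropy algorithm and an SVM model.
  25. The method of claim 23, further comprising manually reviewing the sentiment category of information.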
PCT/CN2015/086751 2015-08-12 2015-08-12 Method and system for sentiment analysis of information WO2017024553A1 (zh)

Priority Applications (6)

Application Number Priority Date Filing Date Title
PCT/CN2015/086751 WO2017024553A1 (zh) 2015-08-12 2015-08-12 Method and system for sentiment analysis of information
US15/752,184 US10437871B2 (en) 2015-08-12 2015-08-12 Method and system for sentiment analysis of information
US16/550,479 US10831808B2 (en) 2015-08-12 2019-08-26 Method and system for sentiment analysis of information
US17/086,469 US11481422B2 (en) 2015-08-12 2020-11-02 Method and system for sentiment analysis of information
US17/936,374 US11868386B2 (en) 2015-08-12 2022-09-28 Method and system for sentiment analysis of information
US18/523,978 US20240104127A1 (en) 2015-08-12 2023-11-30 Method and system for sentiment analysis of information

Related Child Applications (2)

Application Number Title Priority Date Filing Date
US15/752,184 A-371-Of-International US10437871B2 (en) 2015-08-12 2015-08-12 Method and system for sentiment analysis of information
US16/550,479 Continuation US10831808B2 (en) 2015-08-12 2019-08-26 Method and system for sentiment analysis of information

Publications (1)

Publication Number Publication Date
WO2017024553A1 true WO2017024553A1 (zh) 2017-02-16

Family

ID=57982927

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2015/086751 WO2017024553A1 (zh) 2015-08-12 2015-08-12 Method and system for sentiment analysis of information

Country Status (2)

Country Link
US (5) US10437871B2 (zh)
WO (1) WO2017024553A1 (zh)

Also Published As

Publication number Publication date
US20210049197A1 (en) 2021-02-18
US20230020599A1 (en) 2023-01-19
US10831808B2 (en) 2020-11-10
US20180239815A1 (en) 2018-08-23
US10437871B2 (en) 2019-10-08
US11481422B2 (en) 2022-10-25
US11868386B2 (en) 2024-01-09
US20240104127A1 (en) 2024-03-28
US20190377748A1 (en) 2019-12-12

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15900751

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 15752184

Country of ref document: US

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 15900751

Country of ref document: EP

Kind code of ref document: A1