WO2019196208A1 - Text sentiment analysis method, readable storage medium, terminal device and apparatus - Google Patents

Text sentiment analysis method, readable storage medium, terminal device and apparatus

Info

Publication number
WO2019196208A1
WO2019196208A1 (PCT/CN2018/093344, CN2018093344W)
Authority
WO
WIPO (PCT)
Prior art keywords
text
vector
input
training
emotion
Prior art date
Application number
PCT/CN2018/093344
Other languages
English (en)
French (fr)
Inventor
张依
汪伟
肖京
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2019196208A1

Links

Images

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/20 - Natural language analysis
    • G06F 40/205 - Parsing
    • G06F 40/211 - Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G06F 40/216 - Parsing using statistical methods
    • G06F 40/237 - Lexical tools
    • G06F 40/247 - Thesauruses; Synonyms
    • G06F 40/279 - Recognition of textual entities
    • G06F 40/284 - Lexical analysis, e.g. tokenisation or collocates
    • G06F 40/30 - Semantic analysis

Definitions

  • the present application belongs to the field of computer technology, and in particular, to a text sentiment analysis method, a computer readable storage medium, a terminal device and a device.
  • Text sentiment analysis refers to the technique of dividing text into two or more emotion types, such as positive or negative, according to the meaning and emotional information expressed by the text.
  • The current text sentiment analysis method mainly counts the number of adjectives representing different emotions in the text and performs a quantitative analysis on them. This method achieves a high accuracy rate when analyzing sentence texts that contain only a single emotional subject, but when performing sentiment analysis on sentence texts containing multiple emotional subjects, it can hardly reflect their complex emotions. For example, the sentence text "Company A's sales performance greatly exceeds that of Company B" contains two emotional subjects, "Company A" and "Company B": for the subject "Company A" the sentence expresses a positive emotion type, while for the subject "Company B" it expresses a negative emotion type. The analysis result obtained by the current text sentiment analysis method, however, is independent of the emotional subject, and only a single emotion type that does not distinguish between emotional subjects can be obtained.
  • the embodiments of the present application provide a text sentiment analysis method, a computer readable storage medium, a terminal device, and a device, to solve the problem that the current text sentiment analysis method is difficult to reflect the complex emotions of multiple emotional subjects.
  • A first aspect of the embodiments of the present application provides a text sentiment analysis method, which may include: performing word segmentation processing on the sentence text to be analyzed to obtain the word segments constituting the sentence text; looking up the column vector of each word segment in a preset word vector database, and composing the column vectors of the word segments into an input matrix, where each column of the input matrix corresponds to one column vector and the word vector database is a database that records the correspondence between words and column vectors; selecting from the sentence text a word segment corresponding to a preset analysis object as the emotional subject of the text sentiment analysis; and inputting the input matrix and an input vector into a preset text sentiment analysis neural network model to obtain the emotion type of the emotional subject in the sentence text, the input vector being the column vector of the emotional subject.
  • A second aspect of the embodiments of the present application provides a computer readable storage medium storing computer readable instructions that, when executed by a processor, implement the steps of the text sentiment analysis method described above.
  • A third aspect of the embodiments of the present application provides a text sentiment analysis terminal device including a memory, a processor, and computer readable instructions stored in the memory and executable on the processor, wherein the processor, when executing the computer readable instructions, implements the steps of the text sentiment analysis method described above.
  • a fourth aspect of the embodiments of the present application provides a text sentiment analysis apparatus, which may include a module for implementing the steps of the text sentiment analysis method described above.
  • Compared with the prior art, the embodiments of the present application have the following beneficial effect: in addition to considering the sentence text as a whole, the column vector of the emotional subject is treated as a separate input, and what is obtained through the processing of the neural network model is the emotion type of that emotional subject in the sentence text; that is, the choice of emotional subject is a deciding condition that affects the final emotion type. In this way, when sentiment analysis is performed on a sentence text containing multiple emotional subjects, the emotion type corresponding to each chosen emotional subject can be obtained, which well reflects the complex emotions of the multiple subjects.
  • FIG. 1 is a flow chart of an embodiment of a text sentiment analysis method according to an embodiment of the present application
  • FIG. 2 is a schematic flowchart of searching for a column vector of a current participle in a word vector database according to an embodiment of the present application
  • FIG. 3 is a schematic flow chart of a data processing process of a text sentiment analysis neural network model in an embodiment of the present application
  • FIG. 4 is a schematic flow chart of a training process of a text sentiment analysis neural network model in an embodiment of the present application
  • FIG. 5 is a structural diagram of an embodiment of a text sentiment analysis apparatus according to an embodiment of the present application.
  • FIG. 6 is a schematic block diagram of a text sentiment analysis terminal device according to an embodiment of the present application.
  • an embodiment of a text sentiment analysis method in an embodiment of the present application may include:
  • step S101 the sentence text to be analyzed is subjected to word segmentation processing, and each word segment constituting the sentence text is obtained.
  • Word segmentation processing refers to dividing a sentence text into individual words, that is, each of the word segments.
  • In this embodiment, the sentence text can be segmented according to a general dictionary to ensure that the separated words are normal words; if a word is not in the dictionary, single characters are separated out.
  • When both the forward and backward directions can form a word, for example "要求神", the division is made according to statistical word frequency: if "要求" has the higher word frequency, the split "要求/神" is made; if "求神" has the higher word frequency, the split "要/求神" is made.
  • After each word segment is obtained, if binary combination words (bigrams) are considered, adjacent words can be combined pairwise, adding bigrams such as "庆祝大会", "大会顺利" and "顺利闭幕". Preferably, these bigrams can be further screened by word frequency.
  • Specifically, a screening frequency threshold is set in advance, and the occurrence frequency of each bigram is obtained: if the frequency of a bigram is greater than or equal to the threshold, the bigram is retained; if it is less than the threshold, the bigram is removed and treated as two independent unigrams. For example, if the frequency threshold is set to 5, all bigrams occurring fewer than 5 times are eliminated.
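The bigram screening step can be sketched as follows; this is a minimal illustration, and the function name, corpus, and threshold usage are invented for the example rather than taken from the patent.

```python
from collections import Counter

def filter_bigrams(corpus_segments, freq_threshold=5):
    # Count every adjacent pair of word segments across the corpus,
    # then keep only the bigrams whose count reaches the threshold;
    # all other pairs are treated as two independent unigrams.
    counts = Counter()
    for segments in corpus_segments:
        for a, b in zip(segments, segments[1:]):
            counts[(a, b)] += 1
    return {bigram for bigram, n in counts.items() if n >= freq_threshold}

# "庆祝/大会/顺利/闭幕" appears 5 times, so its bigrams are retained;
# the bigram ("大会", "闭幕") appears only once and is eliminated.
corpus = [["庆祝", "大会", "顺利", "闭幕"]] * 5 + [["大会", "闭幕"]]
kept = filter_bigrams(corpus, freq_threshold=5)
```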
  • Step S102 searching for a column vector of each of the word segments in a preset word vector database.
  • the word vector database is a database that records the correspondence between words and column vectors.
  • the column vector may be a corresponding word vector obtained by training words according to the word2vec model. That is, the probability of occurrence of the word is expressed according to the context information of the word.
  • The training of word vectors still follows the idea of word2vec: each word is first expressed in one-hot form, then a word2vec model is trained with these vectors, using n-1 words to predict the nth word; the intermediate result obtained by the neural network model after prediction is taken as the word vector.
  • Specifically, suppose the one-hot vector of "庆祝" ("celebrate") is set to [1, 0, 0, 0, ..., 0], the one-hot vector of "大会" ("conference") is [0, 1, 0, 0, ..., 0], and the one-hot vector of "顺利" ("smoothly") is [0, 0, 1, 0, ..., 0], and they are used to predict the vector of "闭幕" ("closing"), [0, 0, 0, 1, ..., 0]. Through training, the model generates the coefficient matrix W of the hidden layer; the product of each word's one-hot vector with the coefficient matrix is that word's word vector, and the final form is a multidimensional vector similar to "庆祝 [-0.28, 0.34, -0.02, ..., 0.92]".
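The relationship between a one-hot vector, the hidden-layer coefficient matrix W, and the resulting word vector can be checked with a small sketch; the vocabulary and the random matrix values here are arbitrary stand-ins, not trained word2vec weights.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["庆祝", "大会", "顺利", "闭幕"]
V, D = len(vocab), 5                 # vocabulary size, word-vector dimension
W = rng.standard_normal((V, D))      # hidden-layer coefficient matrix

def one_hot(word):
    vec = np.zeros(V)
    vec[vocab.index(word)] = 1.0
    return vec

# The product of a word's one-hot vector with W simply selects the
# corresponding row of W, which is that word's multidimensional word vector.
word_vec = one_hot("庆祝") @ W
```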
  • the word vector database may be a K-level tree-sliced storage structure, and step S102 may include the steps shown in FIG. 2:
  • step S1021 the current participle is hashed using a plurality of mutually independent hash functions.
  • the current participle is any one of the participles.
  • the current participles may be hashed using K independent hash functions according to the following formula:
  • HashKey k = HASH k(BasicWord)
  • where BasicWord is the current participle, HASH k is the hash function with sequence number k
  • HashKey k is the resulting hash value with sequence number k, 1 ≤ k ≤ K
  • K is an integer greater than 1.
  • Step S1022 Calculate the serial number of the storage fragment of each level to which the current word segment belongs.
  • Specifically, the sequence number of the kth-level storage fragment to which the current word segment belongs may be calculated according to the following formula:
  • MaxHashKey k is the maximum value of the hash function HASH k
  • FragNum k is the number of storage fragments of the kth subtree
  • Ceil is the round-up (ceiling) function
  • Floor is the round-down (floor) function
  • WordRoute is the array that records the storage path; WordRoute[k-1] is the sequence number of the kth-level slice to which the current participle belongs, i.e. the kth element of WordRoute.
  • Step S1023 searching for the column vector of the current participle under the recorded storage path.
  • Specifically, the column vector of the current participle is searched for under the storage path recorded by the array WordRoute. For example, if WordRoute = [1, 2, 1, 3, 5], the storage path is: storage slice 1 of the level-1 subtree -> storage slice 2 of the level-2 subtree -> storage slice 1 of the level-3 subtree -> storage slice 3 of the level-4 subtree -> storage slice 5 of the level-5 subtree, and the column vector of the current participle is looked up under this storage path.
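The K-level sharded lookup route can be sketched as below. The exact shard-numbering formula appears in the original only as an image, so the proportional scaling used here, as well as the salted-hash construction, the shard counts, and the function names, are assumptions for illustration only.

```python
import hashlib

K = 5                          # number of tree levels
FRAG_NUM = [4, 4, 8, 8, 16]    # assumed number of storage shards at each level
MAX_HASH_KEY = 2**32 - 1       # maximum value each hash function can return

def hash_k(word, k):
    # k-th of K mutually independent hash functions, built here by
    # salting SHA-256 with the sequence number k (an illustrative choice).
    digest = hashlib.sha256(f"{k}:{word}".encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")

def word_route(word):
    # Scale each level's hash value into that level's shard range to get
    # WordRoute[k-1], the 1-based shard number at level k; the resulting
    # array plays the role of WordRoute in the text (e.g. [1, 2, 1, 3, 5]).
    route = []
    for k in range(1, K + 1):
        key = hash_k(word, k)
        route.append(key * FRAG_NUM[k - 1] // (MAX_HASH_KEY + 1) + 1)
    return route

route = word_route("庆祝")
```

Because the route is derived only from the word itself, any node can recompute the same path when storing or looking up that word's column vector.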
  • Step S103 the column vectors of the respective participles are composed into an input matrix.
  • Each column of the input matrix corresponds to one column vector; that is, the column vector of the first word segment forms the first column of the input matrix, the column vector of the second word segment forms the second column, ..., and the column vector of the Nth word segment forms the Nth column, where N is the number of word segments.
  • Step S104 Select a participle corresponding to the preset analysis object from the sentence text as the emotional subject of the text sentiment analysis.
  • For example, for the sentence text "Company A's sales performance greatly exceeds Company B", there are two emotional subjects to choose from, "Company A" and "Company B". If the emotion type of "Company A" in the sentence text is to be analyzed, i.e. the analysis object is "Company A", then "Company A" is selected as the emotional subject of the text sentiment analysis; if the emotion type of "Company B" is to be analyzed, i.e. the analysis object is "Company B", then "Company B" is selected as the emotional subject.
  • Step S105 inputting the input matrix and the input vector into a preset text sentiment analysis neural network model, and obtaining an emotion type of the emotional subject in the sentence text.
  • the input vector is a column vector of the emotional subject.
  • the data processing process of the text sentiment analysis neural network model may include the steps shown in FIG. 3:
  • Step S1051 calculating a coupling vector between the input matrix and the input vector.
  • the coupling vector between the input matrix and the input vector may be calculated according to the following formula:
  • CoupVec = (CoupFactor 1 , CoupFactor 2 , ..., CoupFactor n , ..., CoupFactor N ) T ,
  • where 1 ≤ n ≤ N and N is the number of columns of the input matrix
  • T is a transpose symbol
  • WordVec n is the nth column of the input matrix
  • MainVec is the input vector
  • WeightMatrix and WeightMatrix' are preset weight matrices.
  • CoupVec is the coupling vector.
  • Step S1052 calculating a composite vector of the sentence text.
  • Specifically, the composite vector of the sentence text can be calculated according to the following formula:
  • CompVec = WordMatrix * CoupVec,
  • where CompVec is the composite vector, WordMatrix is the input matrix,
  • and WordMatrix = (WordVec 1 , WordVec 2 , ..., WordVec n , ..., WordVec N ).
  • step S1053 the probability values of the respective emotion types are respectively calculated.
  • the probability values of the respective emotion types can be separately calculated according to the following formula:
  • where 1 ≤ m ≤ M and M is the number of emotion types
  • WeightMatrix m is a preset weight matrix corresponding to the mth emotion type
  • Prob m is the probability value of the mth emotion type.
  • The specific emotion type classification can be set according to the actual situation; for example, the types may be divided into positive and negative, or into positive, negative and neutral, or into even more types.
  • step S1054 the emotion type with the highest probability value is determined as the emotion type of the emotion subject in the sentence text.
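Steps S1051 to S1054 can be sketched end to end. The per-factor coupling formula and the probability formula appear in the original only as images, so the bilinear attention score and the softmax used below are assumptions; the names WordMatrix, MainVec, CoupVec, CompVec and the shapes mirror the text, while the random values are stand-ins for trained parameters.

```python
import numpy as np

rng = np.random.default_rng(1)
D, N, M = 8, 6, 2            # vector dimension, word segments, emotion types

WordMatrix = rng.standard_normal((D, N))    # input matrix: one column per word segment
MainVec = rng.standard_normal(D)            # input vector: the emotional subject's column vector
WeightMatrix = rng.standard_normal((D, D))  # preset weight matrix (assumed bilinear form)
ClassWeights = rng.standard_normal((M, D))  # one preset weight vector per emotion type

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Step S1051: coupling vector CoupVec = (CoupFactor_1, ..., CoupFactor_N)^T,
# with each CoupFactor_n assumed to score the subject against word n.
scores = np.array([MainVec @ WeightMatrix @ WordMatrix[:, n] for n in range(N)])
CoupVec = softmax(scores)

# Step S1052: composite vector CompVec = WordMatrix * CoupVec.
CompVec = WordMatrix @ CoupVec

# Steps S1053/S1054: probability per emotion type, then take the argmax.
Prob = softmax(ClassWeights @ CompVec)
emotion_type = int(np.argmax(Prob))
```

Swapping MainVec for a different subject's column vector changes CoupVec and hence the predicted emotion type, which is how one sentence can yield different results per subject.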
  • the training process of the text sentiment analysis neural network model may include the steps as shown in FIG. 4:
  • step S401 a preset number of training samples are selected.
  • Each sample includes an input matrix, an input vector, and an expected output emotion type.
  • Preferably, the training samples may be selected in pairs in the form of training sample pairs, each training sample pair including two training samples: the two samples in the same pair have the same input matrix, namely the matrix composed of the column vectors of the word segments of the same sentence text; their input vectors are different, being the column vectors of two different emotional subjects of the same sentence text; and their expected output emotion types are different, one being a positive emotion type and the other a negative emotion type.
  • Step S402 each of the training samples is separately input into the text sentiment analysis neural network model for processing.
  • The specific processing procedure is similar to step S105; for details, refer to the description of step S105, which is not repeated here.
  • Step S403 calculating a global error of the current training.
  • the global error of the current training can be calculated according to the following formula:
  • where CalcProb l,m is the probability value of the mth emotion type in the lth training sample
  • ExpProb l,m is the expected probability value of the mth emotion type in the lth training sample
  • ExpSeq is the sequence number of the expected output emotion type of the lth training sample, 1 ≤ l ≤ L, L is the number of training samples, 1 ≤ m ≤ M, M is the number of emotion types, ln is the natural logarithm function, LOSS l is the training error of the lth training sample, and LOSS is the global error.
  • Step S404 determining whether the global error is less than a preset error threshold.
  • If the global error is greater than or equal to the error threshold, step S405 is performed; if the global error is less than the error threshold, step S406 is performed.
  • Step S405 adjusting parameters of the text sentiment analysis neural network model.
  • The specifically adjusted parameters may include the above parameters such as WeightMatrix, WeightMatrix' and WeightMatrix m. After the parameter adjustment is completed, the process returns to step S402 until the global error is less than the error threshold.
  • Step S406: the training is ended.
  • When the global error is less than the error threshold, it indicates that the text sentiment analysis neural network model has reached the expected analysis precision; at this point its training process can be ended, and it can be used for actual text sentiment analysis.
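The global-error check of steps S403/S404 can be illustrated as follows. The loss formula is an image in the original, so the cross-entropy form with a one-hot expected distribution ExpProb is an assumption, and the probability values below are hard-coded stand-ins for model outputs rather than real predictions.

```python
import numpy as np

def global_error(calc_probs, exp_seqs):
    # Assumed per-sample error LOSS_l = -ln(CalcProb_{l, ExpSeq_l}),
    # i.e. cross-entropy against a one-hot expected distribution;
    # the global error LOSS sums over all L training samples.
    return sum(-np.log(probs[seq]) for probs, seq in zip(calc_probs, exp_seqs))

# One training-sample pair: same sentence, two emotional subjects,
# opposite expected emotion types (index 0 = positive, 1 = negative).
calc_probs = [np.array([0.9, 0.1]),   # model output with subject "Company A"
              np.array([0.2, 0.8])]   # model output with subject "Company B"
exp_seqs = [0, 1]

loss = global_error(calc_probs, exp_seqs)
error_threshold = 0.5
converged = loss < error_threshold    # True: end training (S406); False: adjust parameters (S405)
```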
  • In summary, in addition to considering the sentence text as a whole, the embodiment of the present application also takes the column vector of the emotional subject as a separate input; what is obtained through the processing of the neural network model is the emotion type of that emotional subject in the sentence text, so the choice of emotional subject serves as a deciding condition that affects the final emotion type. In this way, when sentiment analysis is performed on a sentence text containing multiple emotional subjects, the emotion type corresponding to each chosen emotional subject can be obtained, which well reflects the complex emotions of the multiple subjects.
  • FIG. 5 is a structural diagram of an embodiment of a text sentiment analysis apparatus provided by an embodiment of the present application.
  • a text sentiment analysis apparatus may include:
  • a text word-cutting module 501 configured to perform word-cutting processing on the sentence text to be analyzed, to obtain each word segment constituting the text of the sentence;
  • a column vector searching module 502 configured to separately search a column vector of each of the word segments in a preset word vector database, where the word vector database is a database for recording a correspondence between words and column vectors;
  • An input matrix component module 503 configured to form a column vector of each of the word segments into an input matrix, wherein each column of the input matrix corresponds to one column vector;
  • the emotion subject selection module 504 is configured to select a participle corresponding to the preset analysis object from the sentence text as the emotional subject of the text sentiment analysis;
  • a text sentiment analysis module 505 configured to input the input matrix and the input vector into a preset text sentiment analysis neural network model, to obtain an emotion type of the emotional subject in the sentence text, where the input vector is The column vector of the emotional subject.
  • the text sentiment analysis module may include:
  • a coupling vector calculation unit for calculating a coupling vector between the input matrix and the input vector according to the following formula:
  • CoupVec = (CoupFactor 1 , CoupFactor 2 , ..., CoupFactor n , ..., CoupFactor N ) T ,
  • where 1 ≤ n ≤ N and N is the number of columns of the input matrix
  • T is a transpose symbol
  • WordVec n is the nth column of the input matrix
  • MainVec is the input vector
  • WeightMatrix and WeightMatrix' are preset weight matrices.
  • CoupVec is the coupling vector
  • a composite vector calculation unit for calculating a composite vector of the sentence text according to the following formula:
  • CompVec = WordMatrix * CoupVec,
  • where CompVec is the composite vector, WordMatrix is the input matrix,
  • and WordMatrix = (WordVec 1 , WordVec 2 , ..., WordVec n , ..., WordVec N );
  • the emotion type probability value calculation unit is configured to separately calculate probability values of each emotion type according to the following formula:
  • where 1 ≤ m ≤ M and M is the number of emotion types
  • WeightMatrix m is a preset weight matrix corresponding to the mth emotion type
  • Prob m is a probability value of the mth emotion type
  • An emotion type determining unit is configured to determine an emotion type having a maximum probability value as an emotion type of the emotion subject in the sentence text.
  • the text sentiment analysis apparatus may further include:
  • a training sample selection module for selecting a preset number of training samples, each sample comprising an input matrix, an input vector, and an expected output emotion type
  • the global error calculation module is configured to input each of the training samples into the text sentiment analysis neural network model for processing, and calculate a global error of the current training according to the following formula:
  • where CalcProb l,m is the probability value of the mth emotion type in the lth training sample
  • ExpProb l,m is the expected probability value of the mth emotion type in the lth training sample
  • ExpSeq is the sequence number of the expected output emotion type of the lth training sample, 1 ≤ l ≤ L, L is the number of training samples, 1 ≤ m ≤ M, M is the number of emotion types, ln is the natural logarithm function, LOSS l is the training error of the lth training sample, and LOSS is the global error;
  • a parameter adjustment module configured to adjust a parameter of the text sentiment analysis neural network model if the global error is greater than or equal to a preset error threshold
  • training sample selection module may include:
  • a first selecting unit configured to select training samples in pairs in the form of training sample pairs, where each training sample pair includes two training samples: the two training samples in the same pair have the same input matrix, namely the matrix composed of the column vectors of the word segments of the same sentence text; their input vectors are different, being the column vectors of two different emotional subjects of the same sentence text; and their expected output emotion types are different, one being a positive emotion type and the other a negative emotion type.
  • the column vector lookup module may include:
  • a hash operation unit configured to hash the current participle by using K mutually independent hash functions according to the following formula, wherein the current participle is any one of the participles:
  • HashKey k = HASH k(BasicWord)
  • where BasicWord is the current participle, HASH k is the hash function with sequence number k
  • HashKey k is the resulting hash value with sequence number k, 1 ≤ k ≤ K
  • K is an integer greater than 1.
  • a storage slice number calculation unit configured to calculate, according to the following formula, a sequence number of the kth-level storage slice to which the current word segment belongs:
  • MaxHashKey k is the maximum value of the hash function HASH k
  • FragNum k is the number of storage fragments of the kth subtree
  • Ceil is the round-up (ceiling) function
  • Floor is the round-down (floor) function
  • WordRoute is the array that records the storage path; WordRoute[k-1] is the sequence number of the kth-level slice to which the current participle belongs, i.e. the kth element of WordRoute;
  • a column vector search unit for finding a column vector of the current participle under a storage path recorded by the array WordRoute.
  • FIG. 6 is a schematic block diagram of a text sentiment analysis terminal device provided by an embodiment of the present application.
  • The text sentiment analysis terminal device 6 may include a processor 60, a memory 61, and computer readable instructions 62 stored in the memory 61 and executable on the processor 60, for example computer readable instructions for executing the text sentiment analysis method described above.
  • The processor 60, when executing the computer readable instructions 62, implements the steps of the various text sentiment analysis method embodiments described above.
  • If the functional units in the various embodiments of the present application are implemented in the form of software functional units and sold or used as separate products, they may be stored in a computer readable storage medium. Based on this understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium.
  • The software product includes a number of computer readable instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the methods described in the various embodiments of the present application.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present application belongs to the field of computer technology, and in particular relates to a text sentiment analysis method, a computer readable storage medium, a terminal device and an apparatus. The method performs word segmentation processing on the sentence text to be analyzed to obtain the word segments constituting the sentence text; looks up the column vector of each word segment in a preset word vector database and composes the column vectors of the word segments into an input matrix, where each column of the input matrix corresponds to one column vector and the word vector database is a database recording the correspondence between words and column vectors; selects from the sentence text a word segment corresponding to a preset analysis object as the emotional subject of the text sentiment analysis; and inputs the input matrix and an input vector into a preset text sentiment analysis neural network model to obtain the emotion type of the emotional subject in the sentence text, the input vector being the column vector of the emotional subject.

Description

Text sentiment analysis method, readable storage medium, terminal device and apparatus
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on April 9, 2018, with application number 201810309676.7 and invention title "A text sentiment analysis method, computer readable storage medium and terminal device", the entire contents of which are incorporated herein by reference.
技术领域
本申请属于计算机技术领域,尤其涉及一种文本情感分析方法、计算机可读存储介质、终端设备及装置。
背景技术
文本情感分析是指根据文本所表达的含义和情感信息将文本分为正面或负面的两种或多种情感类型的技术。目前的文本情感分析方法主要是统计文本中代表不同情感的形容词的数量,并对此进行一个定量分析,这种方法对只包含单一情感主体的语句文本进行情感分析时准确率较高,但在对包含多个情感主体的语句文本进行情感分析时,则难以反映多个情感主体的复杂情感,例如,某一语句文本为“A公司销售业绩大幅超越B公司”,其中,共包含了两个情感主体,分别为“A公司”和“B公司”,对于情感主体“A公司”而言,该语句文本应为正面情感类型,但是对于情感主体“B公司”而言,该语句文本却为负面情感类型,而目前的文本情感分析方法所得到的分析结果是与情感主体无关的,只能得到一个唯一的不区分情感主体的情感类型。
技术问题
有鉴于此,本申请实施例提供了一种文本情感分析方法、计算机可读存储介质、终端设备及装置,以解决目前的文本情感分析方法难以反映多个情感主体的复杂情感的问题。
技术解决方案
本申请实施例的第一方面提供了一种文本情感分析方法,可以包括:
对待分析的语句文本进行切词处理,得到构成所述语句文本的各个分词;
在预设的词向量数据库中分别查找各个所述分词的列向量,并将各个所述分词的列向量组成输入矩阵,其中,所述输入矩阵的每一列均对应一个列向量,所述词向量数据库为记录词语与列向量之间的对应关系的数据库;
从所述语句文本中选取一个与预设的分析对象对应的分词作为文本情感分析的情感主体;
将所述输入矩阵和输入向量输入到预设的文本情感分析神经网络模型中,得到所述情感主体在所述语句文本中的情感类型,所述输入向量为所述情感主体的列向量。
本申请实施例的第二方面提供了一种计算机可读存储介质,所述计算机可读存储介质存储有计算机可读指令,所述计算机可读指令被处理器执行时实现上述文本情感分析方法的步骤。
本申请实施例的第三方面提供了一种文本情感分析终端设备,包括存储器、处理器以及存储在所述存储器中并可在所述处理器上运行的计算机可读指令,所述处理器执行所述计算机可读指令时实现上述文本情感分析方法的步骤。
本申请实施例的第四方面提供了一种文本情感分析装置,可以包括用于实现上述文本情感分析方法的步骤的模块。
有益效果
本申请实施例与现有技术相比存在的有益效果是:本申请实施例中除了考虑整体的语句文本外,还将情感主体的列向量作为了一个单独的输入,通过神经网络模型的处理,得到的是所述情感主体在所述语句文本中的情感类型,也即将情感主体的选择作为了影响最终的情感类型的一个决定条件,这样,在对包含多个情感主体的语句文本进行情感分析时,通过对不同的情感主体的选择,可以得到与之对应的情感类型,极好地反映出多个情感主体的复杂情感。
附图说明
图1为本申请实施例中一种文本情感分析方法的一个实施例流程图;
图2为本申请实施例中在词向量数据库中查找当前分词的列向量的示意流程图;
图3为本申请实施例中文本情感分析神经网络模型的数据处理过程的示意流程图;
图4为本申请实施例中文本情感分析神经网络模型的训练过程的示意流程图;
图5为本申请实施例中一种文本情感分析装置的一个实施例结构图;
图6为本申请实施例中一种文本情感分析终端设备的示意框图。
本发明的实施方式
请参阅图1,本申请实施例中一种文本情感分析方法的一个实施例可以包括:
步骤S101,对待分析的语句文本进行切词处理,得到构成所述语句文本的各个分词。
切词处理是指将一个语句文本切分成一个一个单独的词,也即各个所述分词,在本实施例中,可以根据通用词典对语句文本进行切分,保证分出的词语都是正常词汇,如词语不在词典内则分出单字。当前后方向都可以成词时,例如“要求神”,会根据统计词频的大小划分,如“要求”词频高则分出“要求/神”,如“求神”词频高则分出“要/求神”。
在拆分出每个分词后,如考虑二元组合词的话,则可将邻近的单词两两组合,增 加“庆祝大会”,“大会顺利”,“顺利闭幕”等二元组合词语。优选地,还可以再根据词频对这些二元组合词进行筛选。具体地,预先设置一个筛选的频率阈值,获取各个二元组合词出现的频率,若某个二元组合词出现的频率大于或等于该频率阈值,则保留该二元组合词,若某个二元组合词出现的频率小于该频率阈值,则剔除掉该二元组合词,也即将其视为两个独立的一元词。若我们设定的频率阈值为5,则剔除所有出现次数在5以下的二元组合词。
步骤S102,在预设的词向量数据库中分别查找各个所述分词的列向量。
所述词向量数据库为记录词语与列向量之间的对应关系的数据库。所述列向量可以是根据word2vec模型训练词语所得到对应的词向量。即根据词语的上下文信息来表示该词出现的概率。词向量的训练依然按照word2vec的思想,先将每个词表示成一个0-1向量(one-hot)形式,再用词向量进行word2vec模型训练,用n-1个词来预测第n个词,神经网络模型预测后得到的中间过程作为词向量。具体地,如“庆祝”的one-hot向量假设定为[1,0,0,0,……,0],“大会”的one-hot向量为[0,1,0,0,……,0],“顺利”的one-hot向量为[0,0,1,0,……,0],预测“闭幕”的向量[0,0,0,1,……,0],模型经过训练会生成隐藏层的系数矩阵W,每个词的one-hot向量和系数矩阵的乘积为该词的词向量,最后的形式将是类似于“庆祝[-0.28,0.34,-0.02,…...,0.92]”这样的一个多维向量。
在本实施例中,所述词向量数据库可以为K级树状分片存储结构,则步骤S102可以包括如图2所示的步骤:
步骤S1021,使用多个相互独立的哈希函数对当前分词进行哈希运算。
所述当前分词为任意一个所述分词。
具体地,可以根据下式分别使用K个相互独立的哈希函数对当前分词进行哈希运算:
HashKey k=HASH k(BasicWord)
其中,BasicWord为所述当前分词,HASH k为序号为k的哈希函数,HashKey k为运算得到的序号为k的哈希值,1≤k≤K,K为大于1的整数。
步骤S1022,计算所述当前分词所属的各级存储分片的序号。
具体地,可以根据下式计算所述当前分词所属的第k级存储分片的序号:
Figure PCTCN2018093344-appb-000001
其中,MaxHashKey k为哈希函数HASH k的最大取值,FragNum k为第k级子树的存储分片的数目,Ceil为向上取整函数,Floor为向下取整函数,WordRoute为记录存储路径的数组,WordRoute[k-1]为所述当前分词所属的第k级分片的序号,且为WordRoute 的第k个元素。
步骤S1023,在记录的存储路径下查找所述当前分词的列向量。
具体地,即在数组WordRoute所记录的存储路径下查找所述当前分词的列向量。例如,若数组WordRoute=[1,2,1,3,5],则存储路径为:第1级子树序号为1的存储分片—>第2级子树序号为2的存储分片—>第3级子树序号为1的存储分片—>第4级子树序号为3的存储分片—>第5级子树序号为5的存储分片,在该存储路径下查找所述当前分词的列向量。
步骤S103,将各个所述分词的列向量组成输入矩阵。
其中,所述输入矩阵的每一列均对应一个列向量,即第一个分词的列向量组成所示输入矩阵的第一列,第二个分词的列向量组成所示输入矩阵的第二列,……,第N个分词的列向量组成所示输入矩阵的第N列。N为所述分词的数目。
步骤S104,从所述语句文本中选取一个与预设的分析对象对应的分词作为文本情感分析的情感主体。
例如,某一语句文本为“A公司销售业绩大幅超越B公司”,其中,共有两个情感主体可供选择,分别为“A公司”和“B公司”,若当前想要分析“A公司”在所述语句文本中的情感类型,即所述分析对象为“A公司”,则选取“A公司”作为文本情感分析的情感主体,若当前想要分析“B公司”在所述语句文本中的情感类型,即所述分析对象为“B公司”,则选取“B公司”作为文本情感分析的情感主体。
步骤S105,将所述输入矩阵和输入向量输入到预设的文本情感分析神经网络模型中,得到所述情感主体在所述语句文本中的情感类型。
所述输入向量为所述情感主体的列向量。
所述文本情感分析神经网络模型的数据处理过程可以包括如图3所示的步骤:
步骤S1051,计算所述输入矩阵和所述输入向量之间的耦合向量。
具体地,可以根据下式计算所述输入矩阵和所述输入向量之间的耦合向量:
CoupVec=(CoupFactor 1,CoupFactor 2,......,CoupFactor n,......,CoupFactor N) T
其中,1≤n≤N,N为所述输入矩阵的列数,T为转置符号,
Figure PCTCN2018093344-appb-000002
WordVec n为所述输入矩阵的第n列,MainVec为所述输入向量,WeightMatrix、WeightMatrix′均为预设的权值矩阵,
Figure PCTCN2018093344-appb-000003
CoupVec为所述耦合向量。
步骤S1052,计算所述语句文本的复合向量。
具体地,可以根据下式计算所述语句文本的复合向量:
CompVec=WordMatrix*CoupVec,
其中,CompVec为所述复合向量,WordMatrix为所述输入矩阵,
且WordMatrix=(WordVec 1,WordVec 2,......,WordVec n,......,WordVec N)。
步骤S1053,分别计算各个情感类型的概率值。
具体地,可以根据下式分别计算各个情感类型的概率值:
Figure PCTCN2018093344-appb-000004
其中,1≤m≤M,M为情感类型的个数,WeightMatrix m为预设的与第m个情感类型对应的权值矩阵,Prob m为第m个情感类型的概率值。
具体的情感类型分类可以根据实际情况设置,例如可以将其分为正面情感类型和负面情感类型两类,也可以将其分为正面情感类型、负面情感类型和中性情感类型三类,还可以将其分为更多的类型。
步骤S1054,将概率值最大的情感类型确定为所述情感主体在所述语句文本中的情感类型。
优选地,所述文本情感分析神经网络模型的训练过程可以包括如图4所示的步骤:
步骤S401,选取预设数目的训练样本。
每个样本包括一个输入矩阵、一个输入向量和一个预期输出情感类型。
优选地,可以以训练样本对的形式成对选取训练样本,每个训练样本对包括两个训练样本,同一训练样本对中的两个训练样本的输入矩阵相同,为同一语句文本的各个分词的列向量所组成的矩阵,同一训练样本对中的两个训练样本的输入向量不同,分别为同一语句文本的两个不同情感主体的列向量,同一训练样本对中的两个训练样本的预期输出情感类型不同,一个为正面情感类型,另一个为负面情感类型。
步骤S402,将各个所述训练样本分别输入到所述文本情感分析神经网络模型中进行处理。
具体的处理过程与步骤S105类似,具体可参照步骤S105中的说明,在此不再赘述。
步骤S403,计算本轮训练的全局误差。
具体的,可以根据下式计算本轮训练的全局误差:
Figure PCTCN2018093344-appb-000005
其中,CalcProb l,m为第m个情感类型在第l个训练样本中的概率值,ExpProb l,m为 第m个情感类型在第l个训练样本中的预期概率值,
Figure PCTCN2018093344-appb-000006
ExpSeq为第l个训练样本的预期输出情感类型的序号,1≤l≤L,L为所述训练样本的数目,1≤m≤M,M为情感类型的个数,ln为自然对数函数,LOSS l为第l个训练样本的训练误差,LOSS为所述全局误差。
步骤S404,判断所述全局误差是否小于预设的误差阈值。
若所述全局误差大于或等于所述误差阈值,则执行步骤S405,若所述全局误差小于所述误差阈值,则执行步骤S406。
步骤S405,对所述文本情感分析神经网络模型的参数进行调整。
具体调整的参数可以包括上述的WeightMatrix、WeightMatrix′、WeightMatrix m等参数。在完成参数调整后,返回执行步骤S402,直至所述全局误差小于所述误差阈值为止。
步骤S405,结束训练。
当所述全局误差小于所述误差阈值时,即说明所述文本情感分析神经网络模型已经达到了预期的分析精度,此时可结束对其的训练过程,使用其进行实际的文本情感分析。
综上所述,本申请实施例中除了考虑整体的语句文本外,还将情感主体的列向量作为了一个单独的输入,通过神经网络模型的处理,得到的是所述情感主体在所述语句文本中的情感类型,也即将情感主体的选择作为了影响最终的情感类型的一个决定条件,这样,在对包含多个情感主体的语句文本进行情感分析时,通过对不同的情感主体的选择,可以得到与之对应的情感类型,极好地反映出多个情感主体的复杂情感。
Corresponding to the text emotion analysis method described in the embodiments above, FIG. 5 shows a structural diagram of an embodiment of a text emotion analysis apparatus provided by an embodiment of the present application.
In this embodiment, the text emotion analysis apparatus may include:
a text word-segmentation module 501, configured to perform word segmentation on the sentence text to be analyzed to obtain the word segments that constitute the sentence text;
a column-vector lookup module 502, configured to look up the column vector of each word segment in a preset word-vector database, the word-vector database being a database recording correspondences between words and column vectors;
an input-matrix composition module 503, configured to compose the column vectors of the word segments into an input matrix, each column of which corresponds to one column vector;
an emotion-subject selection module 504, configured to select from the sentence text a word segment corresponding to a preset analysis object as the emotion subject of the text emotion analysis;
a text emotion analysis module 505, configured to input the input matrix and an input vector into a preset text emotion analysis neural network model to obtain the emotion type of the emotion subject in the sentence text, the input vector being the column vector of the emotion subject.
Further, the text emotion analysis module may include:
a coupling-vector computation unit, configured to compute the coupling vector between the input matrix and the input vector as:
CoupVec = (CoupFactor_1, CoupFactor_2, ..., CoupFactor_n, ..., CoupFactor_N)^T
where 1 ≤ n ≤ N, N is the number of columns of the input matrix, T denotes transposition,
[Equation image: Figure PCTCN2018093344-appb-000007]
WordVec_n is the n-th column of the input matrix, MainVec is the input vector, WeightMatrix and WeightMatrix′ are both preset weight matrices,
[Equation image: Figure PCTCN2018093344-appb-000008]
and CoupVec is the coupling vector;
a composite-vector computation unit, configured to compute the composite vector of the sentence text as:
CompVec = WordMatrix * CoupVec,
where CompVec is the composite vector, WordMatrix is the input matrix,
and WordMatrix = (WordVec_1, WordVec_2, ..., WordVec_n, ..., WordVec_N);
an emotion-type probability computation unit, configured to compute the probability value of each emotion type as:
[Equation image: Figure PCTCN2018093344-appb-000009]
where 1 ≤ m ≤ M, M is the number of emotion types, WeightMatrix_m is the preset weight matrix corresponding to the m-th emotion type, and Prob_m is the probability value of the m-th emotion type;
an emotion-type determination unit, configured to determine the emotion type with the largest probability value as the emotion type of the emotion subject in the sentence text.
Further, the text emotion analysis apparatus may also include:
a training-sample selection module, configured to select a preset number of training samples, each sample including an input matrix, an input vector, and an expected output emotion type;
a global-error computation module, configured to input each training sample into the text emotion analysis neural network model for processing and to compute the global error of the current training round as:
[Equation image: Figure PCTCN2018093344-appb-000010]
where CalcProb_{l,m} is the probability value of the m-th emotion type for the l-th training sample, ExpProb_{l,m} is the expected probability value of the m-th emotion type for the l-th training sample,
[Equation image: Figure PCTCN2018093344-appb-000011]
ExpSeq is the index of the expected output emotion type of the l-th training sample, 1 ≤ l ≤ L, L is the number of training samples, 1 ≤ m ≤ M, M is the number of emotion types, ln is the natural logarithm, LOSS_l is the training error of the l-th training sample, and LOSS is the global error;
a parameter-adjustment module, configured to adjust the parameters of the text emotion analysis neural network model if the global error is greater than or equal to a preset error threshold;
a training-termination module, configured to end the training if the global error is smaller than the error threshold.
Further, the training-sample selection module may include:
a first selection unit, configured to select training samples in pairs, each training-sample pair including two training samples, wherein the two samples of one pair have the same input matrix, namely the matrix composed of the column vectors of the word segments of one sentence text; different input vectors, namely the column vectors of two different emotion subjects of that sentence text; and different expected output emotion types, one positive and the other negative.
Further, the column-vector lookup module may include:
a hash computation unit, configured to hash the current word segment, the current word segment being any one of the word segments, with K mutually independent hash functions as:
HashKey_k = HASH_k(BasicWord)
where BasicWord is the current word segment, HASH_k is the hash function with index k, HashKey_k is the computed hash value with index k, 1 ≤ k ≤ K, and K is an integer greater than 1;
a storage-shard index computation unit, configured to compute the index of the level-k storage shard to which the current word segment belongs as:
[Equation image: Figure PCTCN2018093344-appb-000012]
where MaxHashKey_k is the maximum value of the hash function HASH_k, FragNum_k is the number of storage shards of the level-k subtree, Ceil is the ceiling function, Floor is the floor function, WordRoute is the array recording the storage path, and WordRoute[k-1], the k-th element of WordRoute, is the index of the level-k shard to which the current word segment belongs;
a column-vector lookup unit, configured to look up the column vector of the current word segment under the storage path recorded in the array WordRoute.
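The lookup path can be sketched as follows. The patent names neither the hash functions HASH_k nor the exact shard-index expression (the latter is an unreproduced equation image), so this sketch derives K independent hash values by salting SHA-256 with the level index and evenly partitions each hash range over FragNum_k shards; both choices are assumptions.

```python
import hashlib

def shard_route(word, k_levels, frag_nums, max_hash=2**32 - 1):
    """Sketch of the K-level sharded lookup path (WordRoute).

    word      : the current word segment (BasicWord)
    k_levels  : K, the number of tree levels
    frag_nums : FragNum_k, shards per level, one entry per level
    max_hash  : assumed MaxHashKey_k, shared by every level here

    Hash-function choice and shard-index formula are assumptions;
    the patent's expression is an unreproduced equation image.
    """
    route = []
    for k in range(1, k_levels + 1):
        # HASH_k: SHA-256 salted with the level index k (assumption).
        digest = hashlib.sha256(f"{k}:{word}".encode("utf-8")).digest()
        hash_key = int.from_bytes(digest[:4], "big")      # HashKey_k
        # WordRoute[k-1]: evenly partition the hash range over the
        # FragNum_k shards of the level-k subtree.
        route.append(hash_key * frag_nums[k - 1] // (max_hash + 1))
    return route
```

The resulting array plays the role of WordRoute: each element selects one shard at its level, and the column vector is then looked up under that path.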
FIG. 6 shows a schematic block diagram of a text emotion analysis terminal device provided by an embodiment of the present application.
In this embodiment, the text emotion analysis terminal device 6 may include a processor 60, a memory 61, and computer-readable instructions 62 stored in the memory 61 and executable on the processor 60, for example computer-readable instructions for executing the text emotion analysis method above. When the processor 60 executes the computer-readable instructions 62, the steps of the text emotion analysis method embodiments above are implemented.
If the functional units in the embodiments of the present application are implemented as software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. On this understanding, the technical solution of the present application, in essence the part contributing over the prior art, or the whole or part of the solution, may be embodied as a software product stored in a storage medium and including computer-readable instructions that cause a computer device (a personal computer, a server, a network device, etc.) to execute all or some of the steps of the methods described in the embodiments of the present application.

Claims (20)

  1. A text emotion analysis method, comprising:
    performing word segmentation on a sentence text to be analyzed to obtain the word segments that constitute the sentence text;
    looking up the column vector of each of the word segments in a preset word-vector database and composing the column vectors of the word segments into an input matrix, wherein each column of the input matrix corresponds to one column vector, and the word-vector database is a database recording correspondences between words and column vectors;
    selecting, from the sentence text, a word segment corresponding to a preset analysis object as the emotion subject of the text emotion analysis;
    inputting the input matrix and an input vector into a preset text emotion analysis neural network model to obtain the emotion type of the emotion subject in the sentence text, the input vector being the column vector of the emotion subject.
  2. The text emotion analysis method according to claim 1, wherein the data-processing procedure of the text emotion analysis neural network model comprises:
    computing the coupling vector between the input matrix and the input vector as:
    CoupVec = (CoupFactor_1, CoupFactor_2, ..., CoupFactor_n, ..., CoupFactor_N)^T
    where 1 ≤ n ≤ N, N is the number of columns of the input matrix, T denotes transposition,
    [Equation image: Figure PCTCN2018093344-appb-100001]
    WordVec_n is the n-th column of the input matrix, MainVec is the input vector, WeightMatrix and WeightMatrix′ are both preset weight matrices,
    [Equation image: Figure PCTCN2018093344-appb-100002]
    and CoupVec is the coupling vector;
    computing the composite vector of the sentence text as:
    CompVec = WordMatrix * CoupVec,
    where CompVec is the composite vector, WordMatrix is the input matrix,
    and WordMatrix = (WordVec_1, WordVec_2, ..., WordVec_n, ..., WordVec_N);
    computing the probability value of each emotion type as:
    [Equation image: Figure PCTCN2018093344-appb-100003]
    where 1 ≤ m ≤ M, M is the number of emotion types, WeightMatrix_m is the preset weight matrix corresponding to the m-th emotion type, and Prob_m is the probability value of the m-th emotion type;
    determining the emotion type with the largest probability value as the emotion type of the emotion subject in the sentence text.
  3. The text emotion analysis method according to claim 1, wherein the training procedure of the text emotion analysis neural network model comprises:
    selecting a preset number of training samples, each sample including an input matrix, an input vector, and an expected output emotion type;
    inputting each of the training samples into the text emotion analysis neural network model for processing, and computing the global error of the current training round as:
    [Equation image: Figure PCTCN2018093344-appb-100004]
    where CalcProb_{l,m} is the probability value of the m-th emotion type for the l-th training sample, ExpProb_{l,m} is the expected probability value of the m-th emotion type for the l-th training sample,
    [Equation image: Figure PCTCN2018093344-appb-100005]
    ExpSeq is the index of the expected output emotion type of the l-th training sample, 1 ≤ l ≤ L, L is the number of training samples, 1 ≤ m ≤ M, M is the number of emotion types, ln is the natural logarithm, LOSS_l is the training error of the l-th training sample, and LOSS is the global error;
    if the global error is greater than or equal to a preset error threshold, adjusting the parameters of the text emotion analysis neural network model and returning to the step of inputting each of the training samples into the text emotion analysis neural network model for processing, until the global error is smaller than the error threshold;
    if the global error is smaller than the error threshold, ending the training.
  4. The text emotion analysis method according to claim 3, wherein selecting the preset number of training samples comprises:
    selecting training samples in pairs, each training-sample pair including two training samples, wherein the two training samples of one pair have the same input matrix, namely the matrix composed of the column vectors of the word segments of one sentence text; different input vectors, namely the column vectors of two different emotion subjects of that sentence text; and different expected output emotion types, one being positive and the other negative.
  5. The text emotion analysis method according to any one of claims 1 to 4, wherein the word-vector database is a K-level tree-shaped sharded storage structure, and looking up the column vector of each of the word segments in the preset word-vector database comprises:
    hashing the current word segment, the current word segment being any one of the word segments, with K mutually independent hash functions as:
    HashKey_k = HASH_k(BasicWord)
    where BasicWord is the current word segment, HASH_k is the hash function with index k, HashKey_k is the computed hash value with index k, 1 ≤ k ≤ K, and K is an integer greater than 1;
    computing the index of the level-k storage shard to which the current word segment belongs as:
    [Equation image: Figure PCTCN2018093344-appb-100006]
    where MaxHashKey_k is the maximum value of the hash function HASH_k, FragNum_k is the number of storage shards of the level-k subtree, Ceil is the ceiling function, Floor is the floor function, WordRoute is the array recording the storage path, and WordRoute[k-1], the k-th element of WordRoute, is the index of the level-k shard to which the current word segment belongs;
    looking up the column vector of the current word segment under the storage path recorded in the array WordRoute.
  6. A computer-readable storage medium storing computer-readable instructions, wherein the computer-readable instructions, when executed by a processor, implement the following steps:
    performing word segmentation on a sentence text to be analyzed to obtain the word segments that constitute the sentence text;
    looking up the column vector of each of the word segments in a preset word-vector database and composing the column vectors of the word segments into an input matrix, wherein each column of the input matrix corresponds to one column vector, and the word-vector database is a database recording correspondences between words and column vectors;
    selecting, from the sentence text, a word segment corresponding to a preset analysis object as the emotion subject of the text emotion analysis;
    inputting the input matrix and an input vector into a preset text emotion analysis neural network model to obtain the emotion type of the emotion subject in the sentence text, the input vector being the column vector of the emotion subject.
  7. The computer-readable storage medium according to claim 6, wherein the data-processing procedure of the text emotion analysis neural network model comprises:
    computing the coupling vector between the input matrix and the input vector as:
    CoupVec = (CoupFactor_1, CoupFactor_2, ..., CoupFactor_n, ..., CoupFactor_N)^T
    where 1 ≤ n ≤ N, N is the number of columns of the input matrix, T denotes transposition,
    [Equation image: Figure PCTCN2018093344-appb-100007]
    WordVec_n is the n-th column of the input matrix, MainVec is the input vector, WeightMatrix and WeightMatrix′ are both preset weight matrices,
    [Equation image: Figure PCTCN2018093344-appb-100008]
    and CoupVec is the coupling vector;
    computing the composite vector of the sentence text as:
    CompVec = WordMatrix * CoupVec,
    where CompVec is the composite vector, WordMatrix is the input matrix,
    and WordMatrix = (WordVec_1, WordVec_2, ..., WordVec_n, ..., WordVec_N);
    computing the probability value of each emotion type as:
    [Equation image: Figure PCTCN2018093344-appb-100009]
    where 1 ≤ m ≤ M, M is the number of emotion types, WeightMatrix_m is the preset weight matrix corresponding to the m-th emotion type, and Prob_m is the probability value of the m-th emotion type;
    determining the emotion type with the largest probability value as the emotion type of the emotion subject in the sentence text.
  8. The computer-readable storage medium according to claim 6, wherein the training procedure of the text emotion analysis neural network model comprises:
    selecting a preset number of training samples, each sample including an input matrix, an input vector, and an expected output emotion type;
    inputting each of the training samples into the text emotion analysis neural network model for processing, and computing the global error of the current training round as:
    [Equation image: Figure PCTCN2018093344-appb-100010]
    where CalcProb_{l,m} is the probability value of the m-th emotion type for the l-th training sample, ExpProb_{l,m} is the expected probability value of the m-th emotion type for the l-th training sample,
    [Equation image: Figure PCTCN2018093344-appb-100011]
    ExpSeq is the index of the expected output emotion type of the l-th training sample, 1 ≤ l ≤ L, L is the number of training samples, 1 ≤ m ≤ M, M is the number of emotion types, ln is the natural logarithm, LOSS_l is the training error of the l-th training sample, and LOSS is the global error;
    if the global error is greater than or equal to a preset error threshold, adjusting the parameters of the text emotion analysis neural network model and returning to the step of inputting each of the training samples into the text emotion analysis neural network model for processing, until the global error is smaller than the error threshold;
    if the global error is smaller than the error threshold, ending the training.
  9. The computer-readable storage medium according to claim 8, wherein selecting the preset number of training samples comprises:
    selecting training samples in pairs, each training-sample pair including two training samples, wherein the two training samples of one pair have the same input matrix, namely the matrix composed of the column vectors of the word segments of one sentence text; different input vectors, namely the column vectors of two different emotion subjects of that sentence text; and different expected output emotion types, one being positive and the other negative.
  10. The computer-readable storage medium according to any one of claims 6 to 9, wherein the word-vector database is a K-level tree-shaped sharded storage structure, and looking up the column vector of each of the word segments in the preset word-vector database comprises:
    hashing the current word segment, the current word segment being any one of the word segments, with K mutually independent hash functions as:
    HashKey_k = HASH_k(BasicWord)
    where BasicWord is the current word segment, HASH_k is the hash function with index k, HashKey_k is the computed hash value with index k, 1 ≤ k ≤ K, and K is an integer greater than 1;
    computing the index of the level-k storage shard to which the current word segment belongs as:
    [Equation image: Figure PCTCN2018093344-appb-100012]
    where MaxHashKey_k is the maximum value of the hash function HASH_k, FragNum_k is the number of storage shards of the level-k subtree, Ceil is the ceiling function, Floor is the floor function, WordRoute is the array recording the storage path, and WordRoute[k-1], the k-th element of WordRoute, is the index of the level-k shard to which the current word segment belongs;
    looking up the column vector of the current word segment under the storage path recorded in the array WordRoute.
  11. A text emotion analysis terminal device, comprising a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, wherein the processor, when executing the computer-readable instructions, implements the following steps:
    performing word segmentation on a sentence text to be analyzed to obtain the word segments that constitute the sentence text;
    looking up the column vector of each of the word segments in a preset word-vector database and composing the column vectors of the word segments into an input matrix, wherein each column of the input matrix corresponds to one column vector, and the word-vector database is a database recording correspondences between words and column vectors;
    selecting, from the sentence text, a word segment corresponding to a preset analysis object as the emotion subject of the text emotion analysis;
    inputting the input matrix and an input vector into a preset text emotion analysis neural network model to obtain the emotion type of the emotion subject in the sentence text, the input vector being the column vector of the emotion subject.
  12. The text emotion analysis terminal device according to claim 11, wherein the data-processing procedure of the text emotion analysis neural network model comprises:
    computing the coupling vector between the input matrix and the input vector as:
    CoupVec = (CoupFactor_1, CoupFactor_2, ..., CoupFactor_n, ..., CoupFactor_N)^T
    where 1 ≤ n ≤ N, N is the number of columns of the input matrix, T denotes transposition,
    [Equation image: Figure PCTCN2018093344-appb-100013]
    WordVec_n is the n-th column of the input matrix, MainVec is the input vector, WeightMatrix and WeightMatrix′ are both preset weight matrices,
    [Equation image: Figure PCTCN2018093344-appb-100014]
    and CoupVec is the coupling vector;
    computing the composite vector of the sentence text as:
    CompVec = WordMatrix * CoupVec,
    where CompVec is the composite vector, WordMatrix is the input matrix,
    and WordMatrix = (WordVec_1, WordVec_2, ..., WordVec_n, ..., WordVec_N);
    computing the probability value of each emotion type as:
    [Equation image: Figure PCTCN2018093344-appb-100015]
    where 1 ≤ m ≤ M, M is the number of emotion types, WeightMatrix_m is the preset weight matrix corresponding to the m-th emotion type, and Prob_m is the probability value of the m-th emotion type;
    determining the emotion type with the largest probability value as the emotion type of the emotion subject in the sentence text.
  13. The text emotion analysis terminal device according to claim 11, wherein the training procedure of the text emotion analysis neural network model comprises:
    selecting a preset number of training samples, each sample including an input matrix, an input vector, and an expected output emotion type;
    inputting each of the training samples into the text emotion analysis neural network model for processing, and computing the global error of the current training round as:
    [Equation image: Figure PCTCN2018093344-appb-100016]
    where CalcProb_{l,m} is the probability value of the m-th emotion type for the l-th training sample, ExpProb_{l,m} is the expected probability value of the m-th emotion type for the l-th training sample,
    [Equation image: Figure PCTCN2018093344-appb-100017]
    ExpSeq is the index of the expected output emotion type of the l-th training sample, 1 ≤ l ≤ L, L is the number of training samples, 1 ≤ m ≤ M, M is the number of emotion types, ln is the natural logarithm, LOSS_l is the training error of the l-th training sample, and LOSS is the global error;
    if the global error is greater than or equal to a preset error threshold, adjusting the parameters of the text emotion analysis neural network model and returning to the step of inputting each of the training samples into the text emotion analysis neural network model for processing, until the global error is smaller than the error threshold;
    if the global error is smaller than the error threshold, ending the training.
  14. The text emotion analysis terminal device according to claim 13, wherein selecting the preset number of training samples comprises:
    selecting training samples in pairs, each training-sample pair including two training samples, wherein the two training samples of one pair have the same input matrix, namely the matrix composed of the column vectors of the word segments of one sentence text; different input vectors, namely the column vectors of two different emotion subjects of that sentence text; and different expected output emotion types, one being positive and the other negative.
  15. The text emotion analysis terminal device according to any one of claims 11 to 14, wherein the word-vector database is a K-level tree-shaped sharded storage structure, and looking up the column vector of each of the word segments in the preset word-vector database comprises:
    hashing the current word segment, the current word segment being any one of the word segments, with K mutually independent hash functions as:
    HashKey_k = HASH_k(BasicWord)
    where BasicWord is the current word segment, HASH_k is the hash function with index k, HashKey_k is the computed hash value with index k, 1 ≤ k ≤ K, and K is an integer greater than 1;
    computing the index of the level-k storage shard to which the current word segment belongs as:
    [Equation image: Figure PCTCN2018093344-appb-100018]
    where MaxHashKey_k is the maximum value of the hash function HASH_k, FragNum_k is the number of storage shards of the level-k subtree, Ceil is the ceiling function, Floor is the floor function, WordRoute is the array recording the storage path, and WordRoute[k-1], the k-th element of WordRoute, is the index of the level-k shard to which the current word segment belongs;
    looking up the column vector of the current word segment under the storage path recorded in the array WordRoute.
  16. A text emotion analysis apparatus, comprising:
    a text word-segmentation module, configured to perform word segmentation on the sentence text to be analyzed to obtain the word segments that constitute the sentence text;
    a column-vector lookup module, configured to look up the column vector of each of the word segments in a preset word-vector database, the word-vector database being a database recording correspondences between words and column vectors;
    an input-matrix composition module, configured to compose the column vectors of the word segments into an input matrix, wherein each column of the input matrix corresponds to one column vector;
    an emotion-subject selection module, configured to select from the sentence text a word segment corresponding to a preset analysis object as the emotion subject of the text emotion analysis;
    a text emotion analysis module, configured to input the input matrix and an input vector into a preset text emotion analysis neural network model to obtain the emotion type of the emotion subject in the sentence text, the input vector being the column vector of the emotion subject.
  17. The text emotion analysis apparatus according to claim 16, wherein the text emotion analysis module comprises:
    a coupling-vector computation unit, configured to compute the coupling vector between the input matrix and the input vector as:
    CoupVec = (CoupFactor_1, CoupFactor_2, ..., CoupFactor_n, ..., CoupFactor_N)^T
    where 1 ≤ n ≤ N, N is the number of columns of the input matrix, T denotes transposition,
    [Equation image: Figure PCTCN2018093344-appb-100019]
    WordVec_n is the n-th column of the input matrix, MainVec is the input vector, WeightMatrix and WeightMatrix′ are both preset weight matrices,
    [Equation image: Figure PCTCN2018093344-appb-100020]
    and CoupVec is the coupling vector;
    a composite-vector computation unit, configured to compute the composite vector of the sentence text as:
    CompVec = WordMatrix * CoupVec,
    where CompVec is the composite vector, WordMatrix is the input matrix,
    and WordMatrix = (WordVec_1, WordVec_2, ..., WordVec_n, ..., WordVec_N);
    an emotion-type probability computation unit, configured to compute the probability value of each emotion type as:
    [Equation image: Figure PCTCN2018093344-appb-100021]
    where 1 ≤ m ≤ M, M is the number of emotion types, WeightMatrix_m is the preset weight matrix corresponding to the m-th emotion type, and Prob_m is the probability value of the m-th emotion type;
    an emotion-type determination unit, configured to determine the emotion type with the largest probability value as the emotion type of the emotion subject in the sentence text.
  18. The text emotion analysis apparatus according to claim 16, further comprising:
    a training-sample selection module, configured to select a preset number of training samples, each sample including an input matrix, an input vector, and an expected output emotion type;
    a global-error computation module, configured to input each of the training samples into the text emotion analysis neural network model for processing and to compute the global error of the current training round as:
    [Equation image: Figure PCTCN2018093344-appb-100022]
    where CalcProb_{l,m} is the probability value of the m-th emotion type for the l-th training sample, ExpProb_{l,m} is the expected probability value of the m-th emotion type for the l-th training sample,
    [Equation image: Figure PCTCN2018093344-appb-100023]
    ExpSeq is the index of the expected output emotion type of the l-th training sample, 1 ≤ l ≤ L, L is the number of training samples, 1 ≤ m ≤ M, M is the number of emotion types, ln is the natural logarithm, LOSS_l is the training error of the l-th training sample, and LOSS is the global error;
    a parameter-adjustment module, configured to adjust the parameters of the text emotion analysis neural network model if the global error is greater than or equal to a preset error threshold;
    a training-termination module, configured to end the training if the global error is smaller than the error threshold.
  19. The text emotion analysis apparatus according to claim 18, wherein the training-sample selection module comprises:
    a first selection unit, configured to select training samples in pairs, each training-sample pair including two training samples, wherein the two training samples of one pair have the same input matrix, namely the matrix composed of the column vectors of the word segments of one sentence text; different input vectors, namely the column vectors of two different emotion subjects of that sentence text; and different expected output emotion types, one being positive and the other negative.
  20. The text emotion analysis apparatus according to any one of claims 16 to 19, wherein the column-vector lookup module comprises:
    a hash computation unit, configured to hash the current word segment, the current word segment being any one of the word segments, with K mutually independent hash functions as:
    HashKey_k = HASH_k(BasicWord)
    where BasicWord is the current word segment, HASH_k is the hash function with index k, HashKey_k is the computed hash value with index k, 1 ≤ k ≤ K, and K is an integer greater than 1;
    a storage-shard index computation unit, configured to compute the index of the level-k storage shard to which the current word segment belongs as:
    [Equation image: Figure PCTCN2018093344-appb-100024]
    where MaxHashKey_k is the maximum value of the hash function HASH_k, FragNum_k is the number of storage shards of the level-k subtree, Ceil is the ceiling function, Floor is the floor function, WordRoute is the array recording the storage path, and WordRoute[k-1], the k-th element of WordRoute, is the index of the level-k shard to which the current word segment belongs;
    a column-vector lookup unit, configured to look up the column vector of the current word segment under the storage path recorded in the array WordRoute.
PCT/CN2018/093344 2018-04-09 2018-06-28 Text emotion analysis method, readable storage medium, terminal device and apparatus WO2019196208A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810309676.7A CN108733644B (zh) 2018-04-09 2018-04-09 Text emotion analysis method, computer-readable storage medium, and terminal device
CN201810309676.7 2018-04-09

Publications (1)

Publication Number Publication Date
WO2019196208A1 true WO2019196208A1 (zh) 2019-10-17

Family

ID=63941208

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/093344 WO2019196208A1 (zh) 2018-04-09 2018-06-28 Text emotion analysis method, readable storage medium, terminal device and apparatus

Country Status (2)

Country Link
CN (1) CN108733644B (zh)
WO (1) WO2019196208A1 (zh)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110390002A (zh) * 2019-06-18 2019-10-29 深圳壹账通智能科技有限公司 Call resource allocation method and apparatus, computer-readable storage medium, and server
CN112445898B (zh) * 2019-08-16 2024-06-14 阿里巴巴集团控股有限公司 Dialogue emotion analysis method and apparatus, storage medium, and processor
CN110717022A (zh) * 2019-09-18 2020-01-21 平安科技(深圳)有限公司 Robot dialogue generation method and apparatus, readable storage medium, and robot
EP3839763A1 (en) 2019-12-16 2021-06-23 Tata Consultancy Services Limited System and method to quantify subject-specific sentiment
CN111191438B (zh) * 2019-12-30 2023-03-21 北京百分点科技集团股份有限公司 Emotion analysis method, apparatus, and electronic device
CN112214576B (zh) * 2020-09-10 2024-02-06 深圳价值在线信息科技股份有限公司 Public opinion analysis method and apparatus, terminal device, and computer-readable storage medium
CN112818681B (zh) * 2020-12-31 2023-11-10 北京知因智慧科技有限公司 Text emotion analysis method and system, and electronic device

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106469145A (zh) * 2016-09-30 2017-03-01 中科鼎富(北京)科技发展有限公司 Text emotion analysis method and apparatus
CN106776581A (zh) * 2017-02-21 2017-05-31 浙江工商大学 Subjective text emotion analysis method based on deep learning
CN106844330A (zh) * 2016-11-15 2017-06-13 平安科技(深圳)有限公司 Method and apparatus for analyzing article emotion
CN107066449A (zh) * 2017-05-09 2017-08-18 北京京东尚科信息技术有限公司 Information pushing method and apparatus


Also Published As

Publication number Publication date
CN108733644B (zh) 2019-07-19
CN108733644A (zh) 2018-11-02

Similar Documents

Publication Publication Date Title
WO2019196208A1 (zh) Text emotion analysis method, readable storage medium, terminal device and apparatus
KR102019194B1 (ko) System and method for extracting key keywords from documents
CN107480143B (zh) Dialogue topic segmentation method and system based on contextual relevance
US20170330054A1 (en) Method And Apparatus Of Establishing Image Search Relevance Prediction Model, And Image Search Method And Apparatus
CN108228541B (zh) Method and apparatus for generating document abstracts
CN109189767B (zh) Data processing method and apparatus, electronic device, and storage medium
CN111221962B (zh) Text emotion analysis method based on new-word expansion and complex-sentence-pattern expansion
CN107229610A (zh) Method and apparatus for analyzing emotion data
JP2012524314A (ja) Method and apparatus for data retrieval and indexing
KR102217248B1 (ko) Feature extraction and learning method for text document summarization
US10915707B2 (en) Word replaceability through word vectors
CN110674865B (zh) Rule-learning classifier ensemble method for imbalanced software-defect class distributions
CN104361037B (zh) Microblog classification method and apparatus
US11645447B2 (en) Encoding textual information for text analysis
CN111104555A (zh) Attention-mechanism-based video hash retrieval method
KR20200007713A (ko) Method and apparatus for topic determination via sentiment analysis
CN110097096B (zh) Text classification method based on a TF-IDF matrix and capsule networks
CN112115716A (zh) Service discovery method, system, and device based on text matching with multi-dimensional word vectors
Pota et al. A subword-based deep learning approach for sentiment analysis of political tweets
CN112667979A (zh) Password generation method and apparatus, password recognition method and apparatus, and electronic device
CN115795030A (zh) Text classification method and apparatus, computer device, and storage medium
CN106681986A (zh) Multi-dimensional sentiment analysis system
CN111079011A (zh) Deep-learning-based information recommendation method
WO2017215244A1 (zh) Method and apparatus for providing related words
Rao et al. Result prediction for political parties using Twitter sentiment analysis

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18914768

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 26.01.2021)

122 Ep: pct application non-entry in european phase

Ref document number: 18914768

Country of ref document: EP

Kind code of ref document: A1