WO2020134008A1 - Method and apparatus for matching semantic text data with tags, and computer-readable storage medium storing instructions - Google Patents


Info

Publication number
WO2020134008A1
Authority
WO
WIPO (PCT)
Prior art keywords
text data
topic
label
semantic text
clustered
Prior art date
Application number
PCT/CN2019/094646
Other languages
English (en)
French (fr)
Inventor
王宇
邱雪涛
万四爽
佘萧寒
王阳
张琦
费志军
Original Assignee
中国银联股份有限公司
Priority date
Filing date
Publication date
Application filed by 中国银联股份有限公司 (China UnionPay Co., Ltd.)
Priority to JP2021501074A priority Critical patent/JP7164701B2/ja
Priority to US17/260,177 priority patent/US11586658B2/en
Priority to KR1020207028156A priority patent/KR20200127020A/ko
Publication of WO2020134008A1 publication Critical patent/WO2020134008A1/zh

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/103Formatting, i.e. changing of presentation of documents
    • G06F40/117Tagging; Marking up; Designating a block; Setting of attributes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the invention relates to a data processing method, and in particular to a method and apparatus for matching semantic text data with tags, and a computer-readable storage medium storing instructions.
  • the traditional short-text classification method mainly trains a classification model on a large number of user-labeled sample corpora. Its main features include: the user analyzes the sample corpus and manually defines a fixed classification label system; based on the defined business label system, every sample in the corpus is screened manually and given an appropriate label, building a sample data set for classification model training; a classification model is then trained on that data set. Short-text features are extracted using the vector space model, "frequent word set extraction", or term frequency-inverse document frequency (TF-IDF), and a classification algorithm such as SVM is then trained on the extracted features to form the final classification model.
  • TF-IDF: term frequency-inverse document frequency
  • the present invention proposes a method and apparatus for matching semantic text data with tags and a computer-readable storage medium storing instructions.
  • a method for matching semantic text data with tags, including: preprocessing a plurality of semantic text data to obtain original corpus data comprising a plurality of semantically independent members; determining the degree of association between any two of the semantically independent members according to their recurrence relationship in natural text, determining the topic corresponding to that association according to the degree of association, and thereby determining a mapping probability relationship between the semantic text data and the topics; selecting one of the semantically independent members corresponding to the association as the label of a topic, and mapping the semantic text data to the labels according to the determined mapping probability relationship between the semantic text data and the topics; and using the determined mapping relationship between the semantic text data and the labels as supervising material, and matching unmapped semantic text data to the labels according to the supervising material.
  • the preprocessing includes one or more of segmenting the plurality of semantic text data, removing stop words, removing non-Chinese characters, removing numeric signs, and performing word error correction.
  • the preprocessing includes extracting only the plurality of semantic text data containing negative semantics and/or interrogative semantics.
  • the recurrence relationship in natural text is the degree of contextual co-occurrence in the original corpus data and/or in a natural text corpus.
  • the determining the degree of association between any two of the plurality of semantically independent members includes: indexing all semantically independent members in the original corpus data; determining that the semantically independent members are in the original corpus The word vector in the data, and determine the similarity between any two of the semantically independent members; construct a similarity matrix of pairs of semantically independent members according to the index and the similarity.
  • determining the topic corresponding to the association according to the degree of association includes: performing Gibbs iterative sampling on the similarity matrix to obtain the mapping relationship between the original corpus data and the topics and the mapping relationship between the topics and the semantically independent member pairs, and thereby determining the mapping probability relationship between the semantic text data and the topics and between the topics and the semantically independent members.
  • selecting one of the semantically independent members corresponding to the association as the label of the topic includes: clustering the plurality of semantic text data, and determining the topic of the clustered semantic text data according to the mapping relationship between the semantic text data and the topics; and, according to the mapping probability relationship between the topics and the semantically independent members, mapping the topics of the clustered semantic text data to semantically independent members, which serve as the tags corresponding to the clustered topics.
  • determining the topic of the clustered semantic text data according to the mapping probability relationship between the semantic text data and the topics includes: determining the maximum-probability topic of each of the semantic text data; counting the occurrences of each maximum-probability topic in each cluster; and taking the most frequent maximum-probability topic in the cluster as the clustered topic.
  • a predetermined number of semantically independent members with the highest probability values corresponding to the clustered topic are determined, according to the mapping probability relationship between the topics and the semantically independent members, as the labels of the clustered topic.
  • if the tags of different clustered topics include the same tag, the probability values of that tag under the different clustered topics are compared, and the clustered topic in which it has the largest probability value keeps it as its label; the other clustered topics use, as their labels, semantically independent members whose probability values are lower than that of the shared tag.
  • an apparatus for matching semantic text data with tags includes: a preprocessing unit configured to preprocess a plurality of semantic text data to obtain original corpus data comprising a plurality of semantically independent members; a topic model unit configured to determine the degree of association between any two of the semantically independent members according to their recurrence relationship in natural text, determine the topic corresponding to that association, and thereby determine a mapping probability relationship between the semantic text data and the topics; a label determination unit configured to select one of the semantically independent members corresponding to the association as the label of a topic and map the semantic text data to the labels according to the determined mapping probability relationship; and a label matching unit configured to use the determined mapping relationship between the semantic text data and the labels as supervising material and to match unmapped semantic text data to the labels according to the supervising material.
  • the preprocessing includes one or more of segmenting the plurality of semantic text data, removing stop words, removing non-Chinese characters, removing numeric signs, and performing word error correction.
  • the preprocessing includes extracting only the plurality of semantic text data containing negative semantics and/or interrogative semantics.
  • the recurrence relationship in natural text is the degree of contextual co-occurrence in the original corpus data and/or in a natural text corpus.
  • the topic model unit is used to determine the degree of association between any two of the plurality of semantically independent members, including: indexing all semantically independent members in the original corpus data; determining the semantically independent The word vectors of the members in the original corpus data, and determine the similarity between any two of the semantically independent members; construct a similarity matrix of pairs of semantically independent members according to the index and the similarity.
  • the topic model unit determines the topic corresponding to the association according to the degree of association by: performing Gibbs iterative sampling on the similarity matrix to obtain the mapping relationship between the original corpus data and the topics and between the topics and the semantically independent member pairs, thereby determining the mapping probability relationship between the semantic text data and the topics and between the topics and the semantically independent members.
  • the label determination unit selects one of the semantically independent members corresponding to the association as the label of the topic by: clustering the plurality of semantic text data and determining the topic of the clustered semantic text data according to the mapping relationship between the semantic text data and the topics; and, according to the mapping probability relationship between the topics and the semantically independent members, mapping the topics of the clustered semantic text data to semantically independent members, which serve as the tags corresponding to the clustered topics.
  • the label determination unit determines the topic of the clustered semantic text data according to the mapping probability relationship between the semantic text data and the topics by: determining the maximum-probability topic of each of the semantic text data; counting the occurrences of each maximum-probability topic in each cluster; and taking the most frequent maximum-probability topic in the cluster as the clustered topic.
  • the label determination unit determines, according to the mapping probability relationship between the topics and the semantically independent members, a predetermined number of semantically independent members with the highest probability values for the clustered topic as the labels of that clustered topic.
  • if the tags of different clustered topics include the same tag, the probability values of that tag under the different clustered topics are compared, and the clustered topic in which it has the largest probability value keeps it as its label; the other clustered topics use, as their labels, semantically independent members whose probability values are lower than that of the shared tag.
  • a computer-readable storage medium storing instructions that, when executed by a processor, configure the processor to perform the method described herein.
  • FIG. 1 shows a flowchart of a method for matching semantic text data with tags according to an embodiment of the present invention.
  • FIG. 2 shows a flowchart of pre-processing according to an embodiment of the present invention.
  • FIG. 3 shows a flowchart of constructing a theme model according to an embodiment of the present invention.
  • FIG. 4 shows a flowchart of classification label learning according to an embodiment of the present invention.
  • FIG. 5 shows a flowchart of classification model training according to an embodiment of the present invention.
  • FIG. 6 shows a schematic diagram of K-means clustering according to an embodiment of the present invention.
  • FIG. 7 shows the prediction results of the SVM classifier for each category label according to an embodiment of the present invention.
  • the user review data is preprocessed.
  • the purpose of pre-processing is to process semantic text data such as user comments to obtain semantically independent members (such as English words, Chinese vocabulary and other morphemes) and raw corpus data.
  • Each semantically independent member is an independent unit for semantic analysis.
  • the semantically independent member can also be the smallest unit for semantic analysis.
  • word segmentation may be achieved through a Chinese word segmentation toolkit such as jieba (step 202). Then perform operations such as removing stop words, removing non-Chinese characters, removing digital symbols, and performing word error correction on the independent members after word segmentation (step 204).
  • a sentence containing the user's key intent can also be extracted (not shown in the figure). For example, from data platform user feedback serving as user review content, only the sentences that include a negative word or a question word may be extracted as the kernel sentences of the original sample, from which the semantically independent members and the original corpus data are further obtained; if such sentences are difficult to extract, this step can simply be skipped.
  • multiple semantically independent members are used to form original corpus data.
  • the topic model is determined.
  • the relevance of any two morphemes is determined according to the recurrence relationship of the morpheme in the natural text, and the topic corresponding to the relevance is determined according to the relevance, and then the mapping probability relationship between the morpheme and the topic is determined.
  • the recurrence relationship reflects the degree of semantic connection between morphemes. For example, if in a sentence (or a paragraph of text, etc.) the relevance of "payment" to the context semantics reaches a certain value X and the relevance of "swipe" to the context semantics reaches a certain value Y, with X ≈ Y, then "payment" and "swipe" can be considered strongly semantically related.
  • the relevance of "payment" to the context semantics can be derived statistically, for example; that is, it is determined statistically from its recurrence in natural text.
  • the natural text can be the target text (original corpus data in this article) for investigation and processing, or any meaningful natural text library, such as Baidu Encyclopedia, Wikipedia, Sogou Internet corpus and other natural text corpora.
  • step 104 may be implemented in the embodiment shown in FIG. 3.
  • a word vector is trained.
  • the gensim toolkit is used to train word vectors for subsequent short-text modeling. If little data has been collected and the word vector training results are mediocre, a large Chinese corpus such as the Sogou Internet corpus can be introduced as a supplement, or Google's open-source Chinese vector model can be used directly. Word vectors make up for the defect that TF-IDF cannot measure the semantic similarity between words.
  • a word-pair similarity matrix is created. An index of the distinct words in the text is created; the index serves as the identifier of each word.
  • a probability distribution matrix of word pair-topic may be first generated based on the Chinese restaurant process (CRP). Then count the number of word pairs that appear in each document according to the set of word pairs, and use a 1 ⁇ N-dimensional matrix to store the number of all word pairs that appear in the document.
  • a word pair is a pairing of any two words as basic morphemes.
  • the word pair similarity matrix Sim is created for subsequent processing.
  • in step 408, Gibbs iterative sampling is performed using the Sim matrix; the overall corpus-topic matrix and the topic-word pair matrix are obtained through Gibbs sampling in the word-pair topic model, and a text model is established.
  • the specific process is as follows: each word pair is assigned a topic according to the Chinese restaurant process, with

    p(D_n = i | D_-n) = d_i / (n - 1 + d_0)   for an existing topic i,
    p(D_n = new | D_-n) = d_0 / (n - 1 + d_0)   for a new topic,

    where d_i represents the number of word pairs already in topic i, n - 1 represents the total number of word pairs existing before the current word pair, d_0 is the initial parameter, and p(D_n = k | D_-n) represents the probability that word pair D_n is assigned to topic k.
  • in step 106, the classification labels are learned. Specifically, as shown in FIG. 4, a user comment-topic probability distribution matrix (step 604) and a topic-word probability distribution matrix (step 602) are generated through inference.
  • the short-text topic matrix is used to represent short texts, i.e. the probability distribution over topics represents the short-text features: d_i = (p(z_0 | d_i), p(z_1 | d_i), ..., p(z_{k-1} | d_i)), where p(z_i | d_i) represents the probability of topic z_i in short text d_i, and k is the number of topics over the whole short-text corpus.
  • in step 606, methods such as K-Means clustering can be used to cluster the entire corpus; the clustering algorithm uses the JS distance to measure the similarity of texts: D_JS(p, q) = (1/2) D_KL(p ‖ m) + (1/2) D_KL(q ‖ m), with m = (p + q) / 2 and D_KL the Kullback-Leibler divergence.
  • in step 608, traverse all the user review corpora in each cluster, find the maximum-probability topic of each review according to the user review-topic matrix, count the occurrences of the different maximum-probability topics, and extract the most frequent topic as the cluster's topic (step 610).
  • in step 612, the top n words with the highest probability values are selected from the topic-word probability matrix as the cluster's label information. The label keywords of each cluster are then checked for duplicates: if different clusters share a keyword, the keywords are re-selected under each cluster's corresponding topic by comparing the probability values of the shared keyword under the respective topics; the keyword with the smaller probability value is replaced by the word or phrase with the next-highest probability.
  • in step 108, the classification model is trained; specifically, see the embodiment shown in FIG. 5.
  • in step 802, the user review corpus is automatically given category labels according to the category information learned in step 106, yielding the mapping relationship between user reviews and labels.
  • in step 804, the user review corpus is obtained from the user reviews after topic clustering.
  • in step 806, TF-IDF values and word vectors are extracted as text features for each user review.
  • two classification algorithms, SVM and bidirectional LSTM, are used to train classification models (step 808); a voting classifier then aggregates their votes to construct the user comment classification model (step 810).
  • This embodiment mainly analyzes the feedback messages of data platform users: it first extracts the semantic feature information of the messages using the short-text feature extraction method proposed by the present invention, and then builds a classification model to classify the user feedback messages automatically.
  • the data source is the feedback message data of the data platform APP users of a certain month.
  • the original data is mainly stored in the form of text. For specific examples, please refer to Table 1:
  • the automatic classification of feedback messages from users of the data platform can be performed as follows, for example.
  • Step 1: Preprocess the feedback message data
  • Step 2: Feature representation of short feedback texts from data platform users
  • for the corpus preprocessed in Step 1, the Skip-gram model of the Word2Vec method proposed by Google is adopted, and the word2vec function in the gensim library is used for training; the word vector dimension is set to 200 and the Skip-gram window size to 5.
  • Table 2 shows exemplary results.
  • Baidu Encyclopedia vectors vs. private-domain vectors:

    Baidu Encyclopedia word vectors | Private-domain word vectors
    ('Flash Payment', 0.8876532316207886) | ('Cloud Payment', 0.7113977074623108)
    ('Tianyi mobile phone', 0.8041104674339294) | ('UnionPay Wallet', 0.6253437995910645)
    ('Dual network dual standby', 0.7926369905471802) | ('Cloud Flash', 0.5981202125549316)
    ('Dual standby', 0.7770497798919678) | ('Flash Payment', 0.5895633101463318)
    ('Mobile payment', 0.7767471075057983) | ('QR code', 0.5603029727935791)
    ('Card reader', 0.7745838761329651) | ('Mobile phone', 0.5016968250274658)
    ('Referring to pay pass', 0.7724637985229492) | ('App', 0.49683672189712524)
    ('WeChat end', 0.7695549 |
  • Word vectors can more accurately express payment domain knowledge, which provides more accurate semantic information for subsequent classifications.
  • Step 3: Extraction of classification labels for data platform user feedback messages
  • Step 4: Automatic classification of data platform user messages
  • FIG. 7 shows an example of the label prediction result according to this configuration.
  • the probability threshold of classification prediction can be set, and the data with a low prediction probability category is manually processed. Taking into account the model accuracy and recall rate, the threshold can be set to 0.6.
  • the automatic reply method for APP user comments proposed herein can, on the one hand, effectively mine hot topic categories in short-text data such as user comments and grasp users' main consultation hotspots during product use, and on the other hand can classify user comments automatically, which can greatly improve the operating efficiency of the APP.
  • the classification label system mentioned in the present invention is based on a self-learning method that does not require business personnel to manually analyze all the text information in the short-text corpus, and subsequent updating and maintenance of the label system is also completed automatically; this greatly reduces manual participation and workload and makes the system easier to apply in real scenarios.
  • the classification training corpus of the present invention is also generated in the process of classification labeling, so there is no need to manually mark the corpus.
  • the present invention merges the entire short text corpus for topic modeling to effectively alleviate the problem of text sparseness.
  • the similarity of word pairs is incorporated, so the contextual relationships of different word pairs in the text are taken into account; this extracts broader semantic features from the text and provides stronger semantic expressiveness.
  • the features of each short text include the features calculated by TF-IDF as well as the features extracted by the topic model, which not only reflects statistical considerations but also incorporates context information.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

A method for matching semantic text data with tags, characterized in that the method includes: preprocessing a plurality of semantic text data to obtain original corpus data comprising a plurality of semantically independent members; determining the degree of association between any two of the semantically independent members according to their recurrence relationship in natural text, determining the topic corresponding to that association according to the degree of association, and thereby determining a mapping probability relationship between the semantic text data and the topics; selecting one of the semantically independent members corresponding to the association as the label of a topic, and mapping the semantic text data to the labels according to the determined mapping probability relationship between the semantic text data and the topics; and using the determined mapping relationship between the semantic text data and the labels as supervising material, and matching unmapped semantic text data to the labels according to the supervising material.

Description

Method and apparatus for matching semantic text data with tags, and computer-readable storage medium storing instructions

Technical Field

The present invention relates to a data processing method, and in particular to a method and apparatus for matching semantic text data with tags, and a computer-readable storage medium storing instructions.

Background Art

With the development of the mobile Internet, people increasingly tend to express opinions or seek consultation on mobile devices, for example by leaving messages through an APP's self-service function or expressing ideas on social networks such as Weibo. Against this background, a large amount of unstructured short-text data is produced, and such data often contains users' core demands or their suggestions for improving products and services.

For such highly valuable data, the relevant departments usually begin their routine analysis with text classification, which is traditionally done mainly by manual labeling and is inefficient. Improving the ability to analyze and mine such data, especially automatically, therefore significantly reduces daily operating costs. Moreover, user comment data on today's mobile networks is short, heavily colloquial, fragmented in information value, and irregular in language style, and users of different personalities express themselves in different ways, all of which poses great challenges to traditional semantic feature extraction.

The traditional short-text classification method mainly trains a classification model on a large number of user-labeled sample corpora. Its main features include: the user analyzes the sample corpus and manually defines a fixed classification label system; based on the defined business label system, every sample in the corpus is screened manually and given an appropriate label, building a sample data set for classification model training; a classification model is then trained on that data set. Short-text features are extracted using the vector space model, "frequent word set extraction", or term frequency-inverse document frequency (TF-IDF), and a classification algorithm such as SVM is then trained on the extracted features to form the final classification model.
Summary of the Invention

In order to classify semantic text data such as user comments, the present invention proposes a method and apparatus for matching semantic text data with tags, and a computer-readable storage medium storing instructions.
According to one aspect of the present invention, there is provided a method for matching semantic text data with tags, including: preprocessing a plurality of semantic text data to obtain original corpus data comprising a plurality of semantically independent members; determining the degree of association between any two of the semantically independent members according to their recurrence relationship in natural text, determining the topic corresponding to that association according to the degree of association, and thereby determining a mapping probability relationship between the semantic text data and the topics; selecting one of the semantically independent members corresponding to the association as the label of a topic, and mapping the semantic text data to the labels according to the determined mapping probability relationship between the semantic text data and the topics; and using the determined mapping relationship between the semantic text data and the labels as supervising material, and matching unmapped semantic text data to the labels according to the supervising material.

Optionally, the preprocessing includes one or more of segmenting the semantic text data into words, removing stop words, removing non-Chinese characters, removing numeric symbols, and performing word error correction.

Optionally, the preprocessing includes extracting only the semantic text data that contains negative semantics and/or interrogative semantics.

Optionally, the recurrence relationship in natural text is the degree of contextual co-occurrence in the original corpus data and/or in a natural text corpus.

Optionally, determining the degree of association between any two of the semantically independent members includes: indexing all semantically independent members in the original corpus data; determining the word vectors of the semantically independent members in the original corpus data and determining the similarity between any two of them; and constructing a similarity matrix of semantically independent member pairs from the index and the similarities.

Optionally, determining the topic corresponding to the association according to the degree of association includes: performing Gibbs iterative sampling on the similarity matrix to obtain the mapping relationship between the original corpus data and the topics and the mapping relationship between the topics and the semantically independent member pairs, and thereby determining the mapping probability relationship between the semantic text data and the topics and between the topics and the semantically independent members.

Optionally, selecting one of the semantically independent members corresponding to the association as the label of the topic includes: clustering the semantic text data and determining the topic of the clustered semantic text data from the mapping relationship between the semantic text data and the topics; and mapping the topics of the clustered semantic text data to semantically independent members according to the mapping probability relationship between the topics and the semantically independent members, to serve as the labels of the clustered topics.

Optionally, determining the topic of the clustered semantic text data from the mapping probability relationship between the semantic text data and the topics includes: determining the maximum-probability topic of each semantic text datum; counting the occurrences of each maximum-probability topic within each cluster; and taking the most frequent maximum-probability topic in the cluster as the clustered topic.

Optionally, a predetermined number of semantically independent members with the highest probability values for a clustered topic are determined, according to the mapping probability relationship between the topics and the semantically independent members, to serve as the labels of the clustered topic.

Optionally, if the labels of different clustered topics include the same label, the probability values of that label under the different clustered topics are compared, and the clustered topic in which it has the largest probability value keeps it as its label; the other clustered topics use, as their labels, semantically independent members whose probability values are lower than that of the shared label.
According to another aspect of the present invention, there is provided an apparatus for matching semantic text data with tags, characterized in that the apparatus includes: a preprocessing unit for preprocessing a plurality of semantic text data to obtain original corpus data comprising a plurality of semantically independent members; a topic model unit for determining the degree of association between any two of the semantically independent members according to their recurrence relationship in natural text, determining the topic corresponding to that association according to the degree of association, and thereby determining a mapping probability relationship between the semantic text data and the topics; a label determination unit for selecting one of the semantically independent members corresponding to the association as the label of a topic and mapping the semantic text data to the labels according to the determined mapping probability relationship between the semantic text data and the topics; and a label matching unit for using the determined mapping relationship between the semantic text data and the labels as supervising material and matching unmapped semantic text data to the labels according to the supervising material.

Optionally, the preprocessing includes one or more of segmenting the semantic text data into words, removing stop words, removing non-Chinese characters, removing numeric symbols, and performing word error correction.

Optionally, the preprocessing includes extracting only the semantic text data that contains negative semantics and/or interrogative semantics.

Optionally, the recurrence relationship in natural text is the degree of contextual co-occurrence in the original corpus data and/or in a natural text corpus.

Optionally, the topic model unit determines the degree of association between any two of the semantically independent members by: indexing all semantically independent members in the original corpus data; determining the word vectors of the semantically independent members in the original corpus data and determining the similarity between any two of them; and constructing a similarity matrix of semantically independent member pairs from the index and the similarities.

Optionally, the topic model unit determines the topic corresponding to the association according to the degree of association by: performing Gibbs iterative sampling on the similarity matrix to obtain the mapping relationship between the original corpus data and the topics and between the topics and the semantically independent member pairs, thereby determining the mapping probability relationship between the semantic text data and the topics and between the topics and the semantically independent members.

Optionally, the label determination unit selects one of the semantically independent members corresponding to the association as the label of the topic by: clustering the semantic text data and determining the topic of the clustered semantic text data according to the mapping relationship between the semantic text data and the topics; and mapping the topics of the clustered semantic text data to semantically independent members according to the mapping probability relationship between the topics and the semantically independent members, to serve as the labels of the clustered topics.

Optionally, the label determination unit determines the topic of the clustered semantic text data according to the mapping probability relationship between the semantic text data and the topics by: determining the maximum-probability topic of each semantic text datum; counting the occurrences of each maximum-probability topic in each cluster; and taking the most frequent maximum-probability topic in the cluster as the clustered topic.

Optionally, the label determination unit determines, according to the mapping probability relationship between the topics and the semantically independent members, a predetermined number of semantically independent members with the highest probability values for a clustered topic as the labels of that clustered topic.

Optionally, if the labels of different clustered topics include the same label, the probability values of that label under the different clustered topics are compared, and the clustered topic in which it has the largest probability value keeps it as its label; the other clustered topics use, as their labels, semantically independent members whose probability values are lower than that of the shared label.

According to a further aspect of the present invention, there is provided a computer-readable storage medium storing instructions which, when executed by a processor, configure the processor to perform the method described herein.
Brief Description of the Drawings

The above and other objects and advantages of the present invention will become more complete and clear from the following detailed description taken in conjunction with the accompanying drawings, in which identical or similar elements are denoted by the same reference numerals.

FIG. 1 shows a flowchart of a method for matching semantic text data with tags according to an embodiment of the present invention.

FIG. 2 shows a flowchart of preprocessing according to an embodiment of the present invention.

FIG. 3 shows a flowchart of constructing a topic model according to an embodiment of the present invention.

FIG. 4 shows a flowchart of classification label learning according to an embodiment of the present invention.

FIG. 5 shows a flowchart of classification model training according to an embodiment of the present invention.

FIG. 6 shows a schematic diagram of K-means clustering according to an embodiment of the present invention.

FIG. 7 shows the prediction results of the SVM classifier for each category label according to an embodiment of the present invention.
Detailed Description

For brevity and illustrative purposes, the principles of the present invention are described herein mainly with reference to its exemplary embodiments. However, those skilled in the art will readily recognize that the same principles can be equivalently applied to all types of performance testing systems and/or performance testing methods for visual perception systems, and that these same or similar principles can be implemented therein; any such variation does not depart from the true spirit and scope of this patent application.

Embodiment 1

Referring to FIG. 1, which shows a flowchart of a method for matching semantic text data with tags according to an embodiment of the present invention. In step 102, user comment data is preprocessed. The purpose of preprocessing is to process semantic text data such as user comments to obtain semantically independent members (morphemes such as English words and Chinese vocabulary items) and original corpus data. Each semantically independent member is an independent unit for semantic analysis; in particular, a semantically independent member may also be the smallest unit of semantic analysis.

In the embodiment shown in FIG. 2, to obtain the semantically independent members, word segmentation can be performed with a Chinese word segmentation toolkit such as jieba (step 202). The segmented members are then subjected to operations such as stop-word removal, removal of non-Chinese characters, removal of numeric symbols, and word error correction (step 204). Next, as optional preprocessing, sentences containing the user's key intent can be extracted (not shown in the figure). For example, from data platform user feedback serving as user comment content, only the sentences containing a negative word or a question word may be extracted as the kernel sentences of the original sample, from which the semantically independent members and the original corpus data are further obtained; if such extraction is difficult, this step can simply be skipped. Finally, in step 206, the original corpus data is formed from the semantically independent members.
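The cleanup operations of step 204 can be sketched minimally as follows. The sketch assumes the text has already been segmented (e.g. by `jieba.lcut`) and uses a tiny hypothetical stop-word list; a real system would load a standard Chinese stop-word file:

```python
import re

# Hypothetical stop-word list for illustration only.
STOP_WORDS = {"的", "了", "吗", "啊"}

def preprocess(tokens):
    """Clean a list of segmented tokens: drop stop words,
    non-Chinese characters, and digit strings."""
    cleaned = []
    for tok in tokens:
        # keep only Chinese characters (CJK Unified Ideographs range)
        tok = re.sub(r"[^\u4e00-\u9fff]", "", tok)
        if tok and tok not in STOP_WORDS:
            cleaned.append(tok)
    return cleaned

# "了" is a stop word; "123" and "app" contain no Chinese characters
print(preprocess(["支付", "失败", "了", "123", "app"]))
```

Word error correction (also part of step 204) is omitted here, as it typically requires an external dictionary or language model.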
In step 104, the topic model is determined. The degree of association between any two morphemes is determined according to the recurrence relationship of the morphemes in natural text; the topic corresponding to the association is determined from the degree of association; and the mapping probability relationship between the morphemes and the topics is then determined. The recurrence relationship reflects the degree of semantic connection between morphemes. For example, if in a sentence (or a paragraph of text, etc.) the relevance of "payment" to the context semantics reaches a certain value X and the relevance of "card swiping" to the context semantics reaches a certain value Y, with X ≈ Y, then "payment" and "card swiping" can be considered strongly semantically related. The relevance of "payment" to the context semantics can be obtained statistically, for example; that is, it is determined statistically from its recurrence in natural text. The natural text may be the target text under investigation (the original corpus data herein), or any meaningful natural text library, such as Baidu Encyclopedia, Wikipedia, the Sogou Internet corpus, or other natural text corpora.
Specifically, step 104 can be implemented by the embodiment shown in FIG. 3. In step 402, word vectors are trained. For the preprocessed corpus, word vectors are trained using the gensim toolkit for subsequent short-text modeling. If little data has been collected and the word vector training results are mediocre, a large Chinese corpus such as the Sogou Internet corpus can be introduced as a supplement, or Google's open-source Chinese vector model can be used directly. Word vectors remedy the defect that TF-IDF cannot measure semantic similarity between words.

In step 404, a word-pair similarity matrix is created. An index of the distinct words in the text is built; the index serves as each word's identifier.

In step 406, a word pair-topic probability distribution matrix can first be generated based on the Chinese restaurant process (CRP). The number of word pairs occurring in each document is then counted from the word-pair set, and a 1×N-dimensional matrix stores the occurrence counts of all word pairs in the documents. A word pair is a pairing of any two words serving as basic morphemes. Finally, the word-pair similarity matrix Sim is created for subsequent processing.
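The indexing and similarity-matrix construction of step 404 can be sketched as below, assuming cosine similarity over the trained word vectors; the vector values here are illustrative stand-ins for vectors a gensim model would produce:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Toy word vectors (illustrative values, not real training output).
vectors = {"支付": [1.0, 0.2], "刷卡": [0.9, 0.3], "手机": [0.1, 1.0]}

# Index each distinct word; the index serves as the word's identifier.
index = {w: i for i, w in enumerate(vectors)}

# Sim[i][j] holds the similarity of the word pair (i, j).
n = len(index)
sim = [[0.0] * n for _ in range(n)]
for w1, i in index.items():
    for w2, j in index.items():
        sim[i][j] = cosine(vectors[w1], vectors[w2])
```

With these toy vectors, "支付" (payment) and "刷卡" (card swiping) come out far more similar than "支付" and "手机" (mobile phone), which is the kind of semantic signal TF-IDF alone cannot provide.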
在步骤408中,利用Sim矩阵,进行吉布斯迭代采样,通过词对主题模型中的吉布斯采样获得整体语料库-主题矩阵和主题-词对矩阵,建立文本模型。具体流程如下:
首先,设置词对主题模型的初始化参数:狄利克雷分布的先验参数α=0.5,β=0.1,迭代最大次数iteration=100,保存中间结果的步长savestep=10等。
其次,循环遍历语料库的词对集合,在每次采样过程中考虑到词对间的相似性,分配词对的主题,其中,词对相似性主要基于中国餐馆过程来生成:
p(D_n = k | D_-n) = d_k / (d_0 + n - 1)，k为已有主题
p(D_n = k | D_-n) = d_0 / (d_0 + n - 1)，k为新主题
其中，d_k表示主题k已有的词对数，n-1表示在当前词对之前已经分配的词对总数，d_0为初始参数，p(D_n=k|D_-n)表示词对D_n分配给主题k的概率。
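基于上述定义，中国餐馆过程的主题分配概率可草绘如下（纯Python示意，topic_counts等输入为假设数据）：

```python
def crp_topic_probs(topic_counts, d0):
    """中国餐馆过程（CRP）：给定各已有主题的词对数 topic_counts 与初始参数 d0，
    返回当前词对分配到各已有主题以及开辟新主题的概率。"""
    n_minus_1 = sum(topic_counts)                  # 当前词对之前已分配的词对总数 n-1
    denom = d0 + n_minus_1
    probs = [d_k / denom for d_k in topic_counts]  # 已有主题 k：d_k / (d0 + n - 1)
    probs.append(d0 / denom)                       # 新主题：d0 / (d0 + n - 1)
    return probs

probs = crp_topic_probs([5, 3, 2], d0=1.0)
print(probs)       # 最后一项为开辟新主题的概率
print(sum(probs))  # 各项概率之和为1
```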
再次,根据词对的主题分配,更新语料库-主题矩阵和主题-词对矩阵,再判断迭代次数是否达到savestep的整数倍,如果没有达到,继续遍历语料库的词对集合。
最后,保存语料库-主题矩阵与主题-词对矩阵,判断迭代次数是否达到最大迭代次数(100次),如果没有达到,继续遍历语料库的词对集合;保存最终生成的语料库-主题矩阵与主题-词对矩阵。
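上述迭代流程（循环遍历词对集合、每隔savestep保存一次中间结果、达到最大迭代次数后保存最终矩阵）的控制流可草绘如下（sample_topic为假设的采样函数占位，实际应按词对相似性与上述分布采样）：

```python
def gibbs_loop(biterms, sample_topic, iterations=100, savestep=10):
    """吉布斯迭代采样的控制流骨架：遍历词对集合并按 savestep 保存中间快照。"""
    assignments = [None] * len(biterms)  # 每个词对当前的主题分配
    checkpoints = []                     # 中间结果（对应语料库-主题/主题-词对矩阵的快照）
    for it in range(1, iterations + 1):
        for i, bt in enumerate(biterms):        # 循环遍历语料库的词对集合
            assignments[i] = sample_topic(bt, assignments)
        if it % savestep == 0:                  # 迭代次数达到 savestep 的整数倍时保存
            checkpoints.append(list(assignments))
    return assignments, checkpoints             # 最终分配与全部中间快照

# 用一个恒返回主题0的占位采样函数演示控制流
final, ckpts = gibbs_loop(["词对1", "词对2"], lambda bt, a: 0)
print(len(ckpts))  # 100 // 10 = 10 个中间快照
```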
回到图1,在步骤106中进行分类标签的学习。具体地,如图4所示出的,通过推理生成用户评论-主题概率分布矩阵(步骤604)以及主题-词概率分布矩阵(步骤602)。利用短文本主题矩阵来表示短文本,即使用主题的概率分布来表示短文本特征:
d_i = (p(z_0|d_i), p(z_1|d_i), ..., p(z_(k-1)|d_i))
其中，p(z_j|d_i)表示短文本d_i中主题z_j的概率，k为整个短文本语料上主题的个数。
在步骤606中,可以采用诸如K-Means聚类等方法对整个语料库进行聚类,聚类算法中采用JS距离来测量文本的相似度:
D_JS(d_i, d_j) = (1/2)·D_KL(d_i ‖ M) + (1/2)·D_KL(d_j ‖ M)
其中
M = (d_i + d_j) / 2，D_KL(P ‖ M) = Σ_k P(k)·log(P(k) / M(k))
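作为示意，JS距离可用纯Python实现如下（假设输入为归一化的主题概率分布向量）：

```python
import math

def kl_divergence(p, m):
    """KL散度 D_KL(P‖M) = Σ P(k)·log(P(k)/M(k))，约定 0·log0 = 0。"""
    return sum(pi * math.log(pi / mi) for pi, mi in zip(p, m) if pi > 0)

def js_distance(p, q):
    """JS距离：以均值分布 M=(P+Q)/2 为参照，取两侧KL散度的平均，具有对称性。"""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

p, q = [0.7, 0.2, 0.1], [0.1, 0.2, 0.7]
print(js_distance(p, p))                       # 同分布距离为0
print(js_distance(p, q) == js_distance(q, p))  # 对称：True
```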
在步骤608中，遍历聚簇中的所有用户评论语料，根据用户评论-主题矩阵，找到每条评论数据的最大概率主题，并统计不同最大概率主题的个数，提取个数最大的主题作为聚簇主题（步骤610）。在步骤612中，再从主题-词语的概率矩阵中，挑选概率值最高的前n个词语作为该聚簇的标签信息。针对每个聚簇的标签关键词进行验重：若不同聚簇有关键词重复，则在各自聚簇对应的主题下重新选取关键词，比较该相同的关键词在各自主题下的概率值，概率值小的关键词被概率值次高的词汇或短语替换。
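步骤608至612的标签提取流程可草绘如下（纯Python示意，文档-主题分布与主题-词概率均为假设数据）：

```python
from collections import Counter

def cluster_label(doc_topic_rows, topic_words, top_n=2):
    """doc_topic_rows：聚簇内每条评论的主题概率分布；
    topic_words：{主题编号: [(词, 概率), ...]}，按概率降序排列。"""
    # 步骤608：找到每条评论数据的最大概率主题
    max_topics = [row.index(max(row)) for row in doc_topic_rows]
    # 步骤610：统计不同最大概率主题的个数，取个数最大者作为聚簇主题
    cluster_topic = Counter(max_topics).most_common(1)[0][0]
    # 步骤612：从主题-词概率矩阵中挑选概率值最高的前 n 个词作为标签
    labels = [w for w, _ in topic_words[cluster_topic][:top_n]]
    return cluster_topic, labels

docs = [[0.7, 0.3], [0.6, 0.4], [0.2, 0.8]]
words = {0: [("支付", 0.30), ("刷卡", 0.25)], 1: [("查询", 0.28), ("账单", 0.22)]}
print(cluster_label(docs, words))  # (0, ['支付', '刷卡'])
```

若不同聚簇的标签出现重复，可按正文所述比较该词在各自主题下的概率值，概率较小的一方改用其主题下概率次高的词汇。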
回到图1，在步骤108中进行分类模型的训练。具体地，可按图5所示的实施例进行。在步骤802中，根据步骤106学习得出的分类类别信息，给用户评论语料自动打分类标签，进而获得用户评论与标签的映射关系。在步骤804中，根据聚簇主题后的用户评论得到用户评论语料。在步骤806中，针对每个用户评论语料提取TF-IDF以及词向量作为文本的特征。然后，采用SVM与双向LSTM两种分类算法训练分类模型（步骤808），再通过投票分类器进行投票聚合，构建用户评论分类模型（步骤810）。
实施例二
本实施例主要分析数据平台用户反馈留言：首先基于本发明提出的短文本特征提取方法抽取数据平台用户反馈留言的语义特征信息，然后构建分类模型，实现用户反馈留言的自动分类。数据来源为某月的数据平台APP用户反馈留言数据。原始数据主要以文本的形式保存，具体样例可以参见表1：
表1（数据平台APP用户反馈留言原始数据样例，原文以图片形式给出，内容从略）
数据平台用户反馈留言的自动分类可以诸如按照如下示例进行。
步骤一、反馈留言数据预处理
通过对大量数据分析,用户大多数情况下都会借助否定词或者疑问词来提出遇到的问题,因此为了进一步的提炼关键信息,我们采取以下方法来提取用户反馈留言的否定窗口:
1.1利用常见的中英符号(如全、半角的逗号、句号等)将句子分为若干个短句;
1.2找到第一个否定词或者疑问词所在的短句作为窗口;
1.3设置指定的窗口大小(本文设定的前后步长均是1),提取否定窗口。
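步骤1.1至1.3可草绘为如下纯Python函数（否定/疑问词表与样例留言均为假设）：

```python
import re

# 示例否定/疑问词表（假设的简化词表）
NEG_WORDS = ("不", "没", "无法", "为什么", "怎么")

def negation_window(text, step=1):
    """提取否定窗口：1.1 按中英标点分短句；1.2 定位首个含否定/疑问词的短句；
    1.3 按前后步长 step（本文为1）扩展窗口。"""
    clauses = [c for c in re.split(r"[，。！？；,.!?;]", text) if c]
    for i, clause in enumerate(clauses):
        if any(w in clause for w in NEG_WORDS):
            lo, hi = max(0, i - step), min(len(clauses), i + step + 1)
            return clauses[lo:hi]
    return clauses  # 未命中否定/疑问词时退回全部短句

print(negation_window("界面很好，速度也快，但是无法绑定银行卡，请尽快修复，谢谢"))
# ['速度也快', '但是无法绑定银行卡', '请尽快修复']
```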
步骤二、数据平台用户反馈短文本的特征表示
2.1针对步骤一预处理过的语料，采取Google提出的Word2Vec方法中的Skip-gram模型，利用gensim库中的word2vec函数进行训练，其中词向量维度设定为200，Skip-gram模型的窗口大小为5。表2示出了示例性的结果。
表2（词向量训练的示例性结果，原文以图片形式给出，内容从略）
2.2对百度百科词向量与专用领域词向量进行对比：
百度百科词向量 专用领域词向量
('闪付', 0.8876532316207886) ('云支付', 0.7113977074623108)
('天翼手机', 0.8041104674339294) ('银联钱包', 0.6253437995910645)
('双网双待', 0.7926369905471802) ('云闪', 0.5981202125549316)
('双待', 0.7770497798919678) ('闪付', 0.5895633101463318)
('手机支付', 0.7767471075057983) ('二维码', 0.5603029727935791)
('刷卡器', 0.7745838761329651) ('手机', 0.5016968250274658)
('指付通', 0.7724637985229492) ('app', 0.49683672189712524)
('微信端', 0.7695549130439758) ('单独', 0.4926530122756958)
('双模双待', 0.7687188386917114) ('闪', 0.490323543548584)
('智能电话', 0.7658545970916748) ('扫码', 0.4879230260848999)
表3
专用领域词向量可以更精准地表达支付领域知识，为后面的分类提供了更准确的语义信息。
采用吉布斯采样获得整体用户评论语料库-主题矩阵和主题-词对矩阵，其中狄利克雷分布的先验参数α=0.5，β=0.1，最大迭代次数为500，保存中间结果的步长为10。
步骤三、数据平台用户反馈留言分类标签的提取
3.1将上述得到的特征矩阵作为输入,利用scikit-learn机器学习工具包进行K-means聚类(图6)。需要注意的是,为了与后续的聚类合并方法配合使用,在这一场景下,我们将初始聚类个数设为60,最终的聚类个数由轮廓系数和S_Dbw来共同决定。
3.2遍历聚簇中的文本，根据文本-主题概率分布矩阵找到该文本的最大概率主题；统计该聚簇下各主题所占的比例，找到出现次数最多的主题；在主题-词矩阵中，找到上一步中统计出的出现次数最多的主题，然后取该主题下概率值排在前十的词汇或短语作为聚簇描述（如表4、表5所示）。
表4（交易查询类问题的聚簇描述，原文以图片形式给出，内容从略）
表5（功能咨询类问题的聚簇描述，原文以图片形式给出，内容从略）
步骤四、数据平台用户留言的自动分类
4.1使用sklearn包进行机器学习算法的分类实验，主要采用SVM算法，分类指标为准确率，并采用5折交叉验证保证结果的稳定性。
分类模型构建过程使用网格搜索（GridSearch）得到了最优SVM参数，即参数设置为C=3.276，kernel='rbf'，gamma=0.01。图7示出了根据该配置的标签预测结果的示例。
4.2在实际应用场景（譬如数据平台场景）中，为了提升模型的可用性，可设置分类预测的概率阈值，对于类别预测概率不高的数据，交由人工处理。综合考虑模型准确率与召回率，阈值可以设置为0.6。
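该阈值策略可草绘如下（纯Python示意，类别名称与概率均为假设数据）：

```python
def route_prediction(probs, labels, threshold=0.6):
    """取最大预测概率对应的类别；低于阈值时交由人工处理。"""
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] >= threshold:
        return labels[best]
    return "人工处理"

labels = ["交易查询", "功能咨询", "其他"]
print(route_prediction([0.75, 0.15, 0.10], labels))  # 交易查询
print(route_prediction([0.45, 0.35, 0.20], labels))  # 人工处理
```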
利用本文提出的APP用户评论自动分类方法，一方面可以有效地挖掘用户评论等短文本数据中的热点话题类别，掌握用户在使用产品过程中的主要咨询热点；另一方面可以实现用户评论的自动分类，能够大大提升APP的运营服务效率。
本发明提及的分类标签体系是基于自学习的方法，不需要业务人员人工分析短文本语料库中的所有文本信息，并且后续标签体系的更新与维护也是自动完成的，可以大大降低人工参与的工作量，更易在实际场景中应用落地。本发明的分类训练语料库也是在分类标签过程中产生的，因此不需要对语料库进行人工标记。本发明在分类标签提取过程中，将整个短文本语料合并进行主题建模，有效缓解文本语义稀疏的问题；在主题-词对采样过程中融合了词对的相似性，考虑到文本中不同词对的上下文关联关系，能够提取文本中更广泛的语义特征，语义表达能力更强。在文本分类过程中，每篇短文本的特征除了TF-IDF计算的特征之外也包含了主题模型提取的特征，不仅从统计角度考虑，而且融合了上下文信息。
以上例子主要说明了本公开的将语义文本数据与标签匹配的方法、将语义文本数据与标签匹配的装置以及一种储存指令的计算机可读存储介质。尽管只对其中一些本发明的实施方式进行了描述,但是本领域普通技术人员应当了解,本发明可以在不偏离其主旨与范围内以许多其他的形式实施。因此,所展示的例子与实施方式被视为示意性的而非限制性的,在不脱离如所附各权利要求所定义的本发明精神及范围的情况下,本发明可能涵盖各种的修改与替换。

Claims (21)

  1. 一种将语义文本数据与标签匹配的方法,其特征在于,所述方法包括:
    将多个语义文本数据进行预处理,以获得包括多个语义独立成员的原始语料数据;
    根据多个所述语义独立成员在自然文本中的重现关系确定多个所述语义独立成员中任意两者之间的关联度,并根据所述任意两者之间的关联度确定对应于该关联的主题,进而确定所述多个语义文本数据与所述主题的映射概率关系;
    选取对应于该关联的多个语义独立成员中的一者作为所述主题的标签,根据已经确定的所述多个语义文本数据与所述主题的映射概率关系,将所述多个语义文本数据映射到所述标签;以及
    将已经确定的所述多个语义文本数据与所述标签的映射关系作为监督材料,并且根据所述监督材料将未经映射的语义文本数据匹配所述标签。
  2. 根据权利要求1所述的方法,其特征在于:
    所述预处理包括将所述多个语义文本数据进行分词、去除停用词、去除非中文字符、去除数字符号以及进行词语纠错中的一者或多者。
  3. 根据权利要求1所述的方法,其特征在于:
    所述预处理包括只抽取包含否定语义和/或疑问语义的所述多个语义文本数据。
  4. 根据权利要求1所述的方法,其特征在于:
    所述在自然文本中的重现关系为在所述原始语料数据中和/或在自然文本语料库中的上下文重现关联程度。
  5. 根据权利要求1或4所述的方法,其特征在于,所述确定多个所述语义独立成员中任意两者之间的关联度包括:
    将所述原始语料数据中所有语义独立成员进行索引;
    确定所述语义独立成员在所述原始语料数据中的词向量,并确 定所述语义独立成员任意两者之间的相似性;
    根据所述索引以及所述相似性构建语义独立成员对的相似性矩阵。
  6. 根据权利要求5所述的方法,其特征在于,所述根据所述任意两者之间的关联度确定对应于该关联的主题包括:
    对所述相似性矩阵进行吉布斯迭代采样以获得所述原始语料数据与所述主题的映射关系以及所述主题与所述语义独立成员对的映射关系,进而确定所述多个语义文本数据与所述主题的映射概率关系以及所述主题与多个所述语义独立成员的映射概率关系。
  7. 根据权利要求6所述的方法,其特征在于,所述选取对应于该关联的多个语义独立成员中的一者作为所述主题的标签包括:
    将所述多个语义文本数据进行聚类,以及根据所述多个语义文本数据与所述主题的映射关系确定聚类后的所述多个语义文本数据的主题;
    根据所述主题与多个所述语义独立成员的映射概率关系将聚类后的所述多个语义文本数据的主题映射为语义独立成员,以作为对应于聚类后的主题的所述标签。
  8. 根据权利要求7所述的方法,其特征在于,所述根据所述多个语义文本数据与所述主题的映射概率关系确定聚类后的所述多个语义文本数据的主题包括:
    确定所述多个语义文本数据中的每一个的最大概率主题;
    确定每个聚类中的所述最大概率主题的数目;
    将聚类中数目最大的所述最大概率主题作为聚类后的主题。
  9. 根据权利要求8所述的方法,其特征在于,
    根据所述主题与多个所述语义独立成员的映射概率关系确定对应于聚类后的主题的概率值最高的预定数量的语义独立成员以作为所述聚类后的主题的标签。
  10. 根据权利要求9所述的方法,其特征在于,
    若不同的聚类后的主题的标签包括相同的标签，则比较所述相同的标签在所述不同的聚类后的主题中的概率值，保留概率值最大的标签作为所述概率值最大的标签所属的所述聚类后的主题的标签；
    对于除所述概率值最大的标签所属的所述聚类后的主题以外的聚类后的主题，使用概率值比所述相同的标签的概率值更低的语义独立成员作为所述聚类后的主题的标签。
  11. 一种将语义文本数据与标签匹配的装置,其特征在于,所述装置包括:
    预处理单元,其用于将多个语义文本数据进行预处理,以获得包括多个语义独立成员的原始语料数据;
    主题模型单元,其用于根据多个所述语义独立成员在自然文本中的重现关系确定多个所述语义独立成员中任意两者之间的关联度,并根据所述任意两者之间的关联度确定对应于该关联的主题,进而确定所述多个语义文本数据与所述主题的映射概率关系;
    标签确定单元,其用于选取对应于该关联的多个语义独立成员中的一者作为所述主题的标签,根据已经确定的所述多个语义文本数据与所述主题的映射概率关系,将所述多个语义文本数据映射到所述标签;以及
    标签匹配单元,其用于将已经确定的所述多个语义文本数据与所述标签的映射关系作为监督材料,并且根据所述监督材料将未经映射的语义文本数据匹配所述标签。
  12. 根据权利要求11所述的装置,其特征在于,
    所述预处理包括将所述多个语义文本数据进行分词、去除停用词、去除非中文字符、去除数字符号以及进行词语纠错中的一者或多者。
  13. 根据权利要求11所述的装置,其特征在于,
    所述预处理包括只抽取包含否定语义和/或疑问语义的所述多个语义文本数据。
  14. 根据权利要求11所述的装置,其特征在于,
    所述在自然文本中的重现关系为在所述原始语料数据中和/或在自然文本语料库中的上下文重现关联程度。
  15. 根据权利要求11或14所述的装置,其特征在于,所述主题模型单元用于确定多个所述语义独立成员中任意两者之间的关联度,包括:
    将所述原始语料数据中所有语义独立成员进行索引;
    确定所述语义独立成员在所述原始语料数据中的词向量,并确定所述语义独立成员任意两者之间的相似性;
    根据所述索引以及所述相似性构建语义独立成员对的相似性矩阵。
  16. 根据权利要求15所述的装置,其特征在于,所述主题模型单元用于根据所述任意两者之间的关联度确定对应于该关联的主题,包括:
    对所述相似性矩阵进行吉布斯迭代采样以获得所述原始语料数据与所述主题的映射关系以及所述主题与所述语义独立成员对的映射关系,进而确定所述多个语义文本数据与所述主题的映射概率关系以及所述主题与多个所述语义独立成员的映射概率关系。
  17. 根据权利要求16所述的装置,其特征在于,所述标签确定单元用于选取对应于该关联的多个语义独立成员中的一者作为所述主题的标签,包括:
    将所述多个语义文本数据进行聚类,以及根据所述多个语义文本数据与所述主题的映射关系确定聚类后的所述多个语义文本数据的主题;
    根据所述主题与多个所述语义独立成员的映射概率关系将聚类后的所述多个语义文本数据的主题映射为语义独立成员,以作为对应于聚类后的主题的所述标签。
  18. 根据权利要求17所述的装置,其特征在于,所述标签确定单元用于根据所述多个语义文本数据与所述主题的映射概率关系确定聚类后的所述多个语义文本数据的主题,包括:
    确定所述多个语义文本数据中的每一个的最大概率主题;
    确定每个聚类中的所述最大概率主题的数目;
    将聚类中数目最大的所述最大概率主题作为聚类后的主题。
  19. 根据权利要求18所述的装置,其特征在于,所述标签确定单元用于:
    根据所述主题与多个所述语义独立成员的映射概率关系确定对应于聚类后的主题的概率值最高的预定数量的语义独立成员以作为所述聚类后的主题的标签。
  20. 根据权利要求19所述的装置,其特征在于,所述标签确定单元用于:
    若不同的聚类后的主题的标签包括相同的标签，则比较所述相同的标签在所述不同的聚类后的主题中的概率值，保留概率值最大的标签作为所述概率值最大的标签所属的所述聚类后的主题的标签；
    对于除所述概率值最大的标签所属的所述聚类后的主题以外的聚类后的主题，使用概率值比所述相同的标签的概率值更低的语义独立成员作为所述聚类后的主题的标签。
  21. 一种储存指令的计算机可读存储介质，当所述指令由处理器执行时，将所述处理器配置为执行如权利要求1-10中任一项所述的方法。
PCT/CN2019/094646 2018-12-27 2019-07-04 一种将语义文本数据与标签匹配的方法、装置以及一种储存指令的计算机可读存储介质 WO2020134008A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
JP2021501074A JP7164701B2 (ja) 2018-12-27 2019-07-04 セマンティックテキストデータをタグとマッチングさせる方法、装置、及び命令を格納するコンピュータ読み取り可能な記憶媒体
US17/260,177 US11586658B2 (en) 2018-12-27 2019-07-04 Method and device for matching semantic text data with a tag, and computer-readable storage medium having stored instructions
KR1020207028156A KR20200127020A (ko) 2018-12-27 2019-07-04 의미 텍스트 데이터를 태그와 매칭시키는 방법, 장치 및 명령을 저장하는 컴퓨터 판독 가능한 기억 매체

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811610544.4A CN110032639B (zh) 2018-12-27 2018-12-27 将语义文本数据与标签匹配的方法、装置及存储介质
CN201811610544.4 2018-12-27

Publications (1)

Publication Number Publication Date
WO2020134008A1 true WO2020134008A1 (zh) 2020-07-02

Family

ID=67235412

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/094646 WO2020134008A1 (zh) 2018-12-27 2019-07-04 一种将语义文本数据与标签匹配的方法、装置以及一种储存指令的计算机可读存储介质

Country Status (5)

Country Link
US (1) US11586658B2 (zh)
JP (1) JP7164701B2 (zh)
KR (1) KR20200127020A (zh)
CN (1) CN110032639B (zh)
WO (1) WO2020134008A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114281928A (zh) * 2020-09-28 2022-04-05 中国移动通信集团广西有限公司 基于文本数据的模型生成方法、装置及设备

Families Citing this family (11)

Publication number Priority date Publication date Assignee Title
CN110515895B (zh) * 2019-08-30 2023-06-23 北京燕山电子设备厂 大数据存储系统中对数据文件进行关联存储的方法及系统
CN111274798B (zh) * 2020-01-06 2023-08-18 北京大米科技有限公司 一种文本主题词确定方法、装置、存储介质及终端
CN111310438B (zh) * 2020-02-20 2021-06-08 齐鲁工业大学 基于多粒度融合模型的中文句子语义智能匹配方法及装置
CN111311450B (zh) * 2020-02-28 2024-03-29 重庆百事得大牛机器人有限公司 用于法律咨询服务的大数据管理平台及方法
CN111695358B (zh) * 2020-06-12 2023-08-08 腾讯科技(深圳)有限公司 生成词向量的方法、装置、计算机存储介质和电子设备
CN112989971B (zh) * 2021-03-01 2024-03-22 武汉中旗生物医疗电子有限公司 一种不同数据源的心电数据融合方法及装置
CN112926339B (zh) * 2021-03-09 2024-02-09 北京小米移动软件有限公司 文本相似度确定方法、系统、存储介质以及电子设备
CN113934819A (zh) * 2021-10-14 2022-01-14 陈鹏 基于context的标签管理方法、装置、服务器及存储介质
CN114398968B (zh) * 2022-01-06 2022-09-20 北京博瑞彤芸科技股份有限公司 基于文件相似度对同类获客文件进行标注的方法和装置
CN114896398A (zh) * 2022-05-05 2022-08-12 南京邮电大学 一种基于特征选择的文本分类系统及方法
CN116151542A (zh) * 2022-11-30 2023-05-23 上海韵达高新技术有限公司 物流订单实时监控方法、装置、设备及存储介质

Citations (3)

Publication number Priority date Publication date Assignee Title
US20100030780A1 (en) * 2008-07-30 2010-02-04 Kave Eshghi Identifying related objects in a computer database
CN105975475A (zh) * 2016-03-31 2016-09-28 华南理工大学 基于中文短语串的细粒度主题信息抽取方法
CN106033445A (zh) * 2015-03-16 2016-10-19 北京国双科技有限公司 获取文章关联度数据的方法和装置

Family Cites Families (24)

Publication number Priority date Publication date Assignee Title
GB2391967A (en) * 2002-08-16 2004-02-18 Canon Kk Information analysing apparatus
JP4521343B2 (ja) * 2005-09-29 2010-08-11 株式会社東芝 文書処理装置及び文書処理方法
US10536728B2 (en) * 2009-08-18 2020-01-14 Jinni Content classification system
GB2488925A (en) 2009-12-09 2012-09-12 Ibm Method of searching for document data files based on keywords,and computer system and computer program thereof
JP5252593B2 (ja) 2010-08-12 2013-07-31 Necビッグローブ株式会社 最適タグ提案装置、最適タグ提案システム、最適タグ提案方法、およびプログラム
JP2014153977A (ja) 2013-02-12 2014-08-25 Mitsubishi Electric Corp コンテンツ解析装置、コンテンツ解析方法、コンテンツ解析プログラム、およびコンテンツ再生システム
US9311386B1 (en) * 2013-04-03 2016-04-12 Narus, Inc. Categorizing network resources and extracting user interests from network activity
KR101478016B1 (ko) 2013-09-04 2015-01-02 한국과학기술정보연구원 공기 정보를 이용한 문장 클러스터 기반의 정보 검색 장치 및 방법
US10510018B2 (en) * 2013-09-30 2019-12-17 Manyworlds, Inc. Method, system, and apparatus for selecting syntactical elements from information as a focus of attention and performing actions to reduce uncertainty
US10509814B2 (en) * 2014-12-19 2019-12-17 Universidad Nacional De Educacion A Distancia (Uned) System and method for the indexing and retrieval of semantically annotated data using an ontology-based information retrieval model
CN106156204B (zh) * 2015-04-23 2020-05-29 深圳市腾讯计算机系统有限公司 文本标签的提取方法和装置
CN104850650B (zh) * 2015-05-29 2018-04-10 清华大学 基于类标关系的短文本扩充方法
EP3151131A1 (en) 2015-09-30 2017-04-05 Hitachi, Ltd. Apparatus and method for executing an automated analysis of data, in particular social media data, for product failure detection
CN106055538B (zh) * 2016-05-26 2019-03-08 达而观信息科技(上海)有限公司 主题模型和语义分析相结合的文本标签自动抽取方法
KR101847847B1 (ko) 2016-11-15 2018-04-12 주식회사 와이즈넛 딥러닝을 이용한 비정형 텍스트 데이터의 문서 군집화 방법
CN107301199B (zh) * 2017-05-17 2021-02-12 北京融数云途科技有限公司 一种数据标签生成方法和装置
US10311454B2 (en) * 2017-06-22 2019-06-04 NewVoiceMedia Ltd. Customer interaction and experience system using emotional-semantic computing
CN107798043B (zh) * 2017-06-28 2022-05-03 贵州大学 基于狄利克雷多项混合模型的长文本辅助短文本的文本聚类方法
US10678816B2 (en) * 2017-08-23 2020-06-09 Rsvp Technologies Inc. Single-entity-single-relation question answering systems, and methods
CN107818153B (zh) * 2017-10-27 2020-08-21 中航信移动科技有限公司 数据分类方法和装置
CN108399228B (zh) * 2018-02-12 2020-11-13 平安科技(深圳)有限公司 文章分类方法、装置、计算机设备及存储介质
CN108763539B (zh) * 2018-05-31 2020-11-10 华中科技大学 一种基于词性分类的文本分类方法和系统
CN108959431B (zh) * 2018-06-11 2022-07-05 中国科学院上海高等研究院 标签自动生成方法、系统、计算机可读存储介质及设备
US11397859B2 (en) * 2019-09-11 2022-07-26 International Business Machines Corporation Progressive collocation for real-time discourse


Also Published As

Publication number Publication date
KR20200127020A (ko) 2020-11-09
US20210286835A1 (en) 2021-09-16
JP2021518027A (ja) 2021-07-29
CN110032639A (zh) 2019-07-19
US11586658B2 (en) 2023-02-21
CN110032639B (zh) 2023-10-31
JP7164701B2 (ja) 2022-11-01

Similar Documents

Publication Publication Date Title
WO2020134008A1 (zh) 一种将语义文本数据与标签匹配的方法、装置以及一种储存指令的计算机可读存储介质
US11403680B2 (en) Method, apparatus for evaluating review, device and storage medium
CN113011533A (zh) 文本分类方法、装置、计算机设备和存储介质
Duwairi et al. Sentiment analysis for Arabizi text
US20180025121A1 (en) Systems and methods for finer-grained medical entity extraction
US20230385549A1 (en) Systems and methods for colearning custom syntactic expression types for suggesting next best corresponence in a communication environment
Diamantini et al. A negation handling technique for sentiment analysis
CN113961685A (zh) 信息抽取方法及装置
CN107133212B (zh) 一种基于集成学习和词句综合信息的文本蕴涵识别方法
US11347944B2 (en) Systems and methods for short text identification
CN111753082A (zh) 基于评论数据的文本分类方法及装置、设备和介质
CN111310467B (zh) 一种在长文本中结合语义推断的主题提取方法及系统
Sazzed A hybrid approach of opinion mining and comparative linguistic analysis of restaurant reviews
CN114398943B (zh) 样本增强方法及其装置
CN115062621A (zh) 标签提取方法、装置、电子设备和存储介质
KR20220074576A (ko) 마케팅 지식 그래프 구축을 위한 딥러닝 기반 신조어 추출 방법 및 그 장치
US20210117448A1 (en) Iterative sampling based dataset clustering
Chen et al. Learning the chinese sentence representation with LSTM autoencoder
CN113051396B (zh) 文档的分类识别方法、装置和电子设备
Kang et al. Sentiment analysis on Malaysian airlines with BERT
WO2021189291A1 (en) Methods and systems for extracting self-created terms in professional area
CN114238586A (zh) 基于联邦学习框架的Bert结合卷积神经网络的情感分类方法
Nsaif et al. Political Post Classification based on Firefly and XG Boost
JP2024518458A (ja) テキスト内の自動トピック検出のシステム及び方法
KR20220074572A (ko) 딥러닝 기반 신조어 추출 방법 및 그 장치

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19902446

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 20207028156

Country of ref document: KR

Kind code of ref document: A

Ref document number: 2021501074

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19902446

Country of ref document: EP

Kind code of ref document: A1