WO2020134008A1 - Method and apparatus for matching semantic text data with tags, and computer-readable storage medium storing instructions - Google Patents
Method and apparatus for matching semantic text data with tags, and computer-readable storage medium storing instructions
- Publication number
- WO2020134008A1 (PCT/CN2019/094646)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- text data
- topic
- label
- semantic text
- clustered
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/103—Formatting, i.e. changing of presentation of documents
- G06F40/117—Tagging; Marking up; Designating a block; Setting of attributes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Definitions
- the invention relates to a data processing method, in particular to a method and device for matching semantic text data with tags, and a computer-readable storage medium storing instructions.
- the traditional short text classification method mainly relies on a large number of user-labeled sample corpora to train the classification model. Its main features include: the user analyzes the sample corpus and manually defines a fixed sample classification label system; based on the defined business classification label system, each sample in the sample corpus is manually screened and appropriately labeled, and a sample data set for classification model training is constructed; a classification model is then trained on the constructed sample data set. Short text features are extracted based on the vector space model, "frequent word set extraction", or term frequency-inverse document frequency (TF-IDF), and a classification algorithm such as SVM is then trained on the extracted text features to form the final classification model (a minimal sketch of this traditional pipeline is given below).
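- As an illustration of this traditional pipeline only (not the method of the present invention), a minimal scikit-learn sketch might look as follows; the pre-segmented sample texts and their labels are hypothetical placeholders.

```python
# Minimal sketch of the traditional TF-IDF + SVM short text classification pipeline.
# The pre-segmented sample texts and their manually defined labels are placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

samples = ["无法 绑定 银行卡", "闪付 怎么 开通", "登录 总是 失败"]  # pre-segmented short texts
labels = ["binding", "payment", "login"]                              # manually defined label system

model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(samples, labels)
print(model.predict(["绑定 银行卡 失败"]))
```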
- TF-IDF: term frequency-inverse document frequency
- the present invention proposes a method and apparatus for matching semantic text data with tags and a computer-readable storage medium storing instructions.
- a method for matching semantic text data with tags includes: preprocessing a plurality of semantic text data to obtain original corpus data including a plurality of semantically independent members; determining the degree of association between any two of the plurality of semantically independent members according to the recurrence relationship of the semantically independent members in natural text, determining the topic corresponding to the association according to that degree of association, and thereby determining a mapping probability relationship between the plurality of semantic text data and the topic; selecting one of the plurality of semantically independent members corresponding to the association as the label of the topic, and mapping the plurality of semantic text data to the labels based on the determined mapping probability relationship between the plurality of semantic text data and the topic; and using the determined mapping relationship between the plurality of semantic text data and the labels as supervising material, and matching unmapped semantic text data to the labels according to the supervising material.
- the preprocessing includes one or more of segmenting the plurality of semantic text data, removing stop words, removing non-Chinese characters, removing numeric signs, and performing word error correction.
- the preprocessing includes extracting only the plurality of semantic text data containing negative semantics and/or interrogative semantics.
- the recurrence relationship in natural text is the degree of contextual relevance in the original corpus data and/or in the natural text corpus.
- the determining the degree of association between any two of the plurality of semantically independent members includes: indexing all semantically independent members in the original corpus data; determining the word vectors of the semantically independent members in the original corpus data, and determining the similarity between any two of the semantically independent members; and constructing a similarity matrix of semantically independent member pairs according to the index and the similarity.
- the determining the topic corresponding to the association according to the degree of association between the two includes: performing Gibbs iterative sampling on the similarity matrix to obtain the mapping relationship between the original corpus data and the topic and the mapping relationship between the topic and the semantically independent member pairs, and thereby determining the mapping probability relationship between the plurality of semantic text data and the topic and the mapping probability relationship between the topic and the plurality of semantically independent members.
- the selecting one of the plurality of semantically independent members corresponding to the association as the label of the topic includes: clustering the plurality of semantic text data, and determining the clustered topics of the plurality of semantic text data according to the mapping relationship between the plurality of semantic text data and the topic; and, according to the mapping probability relationship between the topic and the plurality of semantically independent members, mapping the clustered topics of the plurality of semantic text data to semantically independent members as the labels corresponding to the clustered topics.
- the determining the clustered topics of the plurality of semantic text data according to the mapping probability relationship between the plurality of semantic text data and the topic includes: determining the maximum-probability topic of each of the plurality of semantic text data; counting the number of each maximum-probability topic in each cluster; and taking the most frequent maximum-probability topic in a cluster as the clustered topic.
- a predetermined number of semantically independent members with the highest probability values corresponding to a clustered topic are determined, according to the mapping probability relationship between the topic and the plurality of semantically independent members, as the labels of that clustered topic.
- if the labels of different clustered topics include the same label, the probability values of the same label in the different clustered topics are compared, and the label with the largest probability value is retained as a label of the clustered topic to which it belongs; for the other clustered topics, a semantically independent member with a probability value lower than that of the shared label is used as the label of the clustered topic.
- an apparatus for matching semantic text data with tags includes: a preprocessing unit configured to preprocess a plurality of semantic text data to obtain original corpus data including a plurality of semantically independent members; a topic model unit configured to determine the degree of association between any two of the plurality of semantically independent members according to the recurrence relationship of the plurality of semantically independent members in natural text, to determine the topic corresponding to the association according to that degree of association, and thereby to determine the mapping probability relationship between the plurality of semantic text data and the topic; a label determination unit configured to select one of the plurality of semantically independent members corresponding to the association as the label of the topic, and to map the plurality of semantic text data to the labels according to the determined mapping probability relationship between the plurality of semantic text data and the topic; and a label matching unit configured to use the determined mapping relationship between the plurality of semantic text data and the labels as supervising material, and to match unmapped semantic text data to the labels according to the supervising material.
- the preprocessing includes one or more of segmenting the plurality of semantic text data, removing stop words, removing non-Chinese characters, removing numeric signs, and performing word error correction.
- the preprocessing includes extracting only the plurality of semantic text data containing negative semantics and/or interrogative semantics.
- the recurrence relationship in natural text is the degree of contextual relevance in the original corpus data and/or in the natural text corpus.
- the topic model unit is used to determine the degree of association between any two of the plurality of semantically independent members, including: indexing all semantically independent members in the original corpus data; determining the word vectors of the semantically independent members in the original corpus data, and determining the similarity between any two of the semantically independent members; and constructing a similarity matrix of semantically independent member pairs according to the index and the similarity.
- the topic model unit is used to determine the topic corresponding to the association according to the degree of association between the two, including: performing Gibbs iterative sampling on the similarity matrix to obtain the mapping relationship between the original corpus data and the topic and the mapping relationship between the topic and the semantically independent member pairs, thereby determining the mapping probability relationship between the plurality of semantic text data and the topic and the mapping probability relationship between the topic and the plurality of semantically independent members.
- the label determining unit is used to select one of the plurality of semantically independent members corresponding to the association as the label of the topic, including: clustering the plurality of semantic text data, and determining the clustered topics of the plurality of semantic text data according to the mapping relationship between the plurality of semantic text data and the topic; and, according to the mapping probability relationship between the topic and the plurality of semantically independent members, mapping the clustered topics of the plurality of semantic text data to semantically independent members as the labels corresponding to the clustered topics.
- the label determining unit is configured to determine the clustered topics of the plurality of semantic text data according to the mapping probability relationship between the plurality of semantic text data and the topic, including: determining the maximum-probability topic of each of the plurality of semantic text data; counting the number of each maximum-probability topic in each cluster; and taking the most frequent maximum-probability topic in a cluster as the clustered topic.
- the label determining unit is configured to determine, according to the mapping probability relationship between the topic and the plurality of semantically independent members, a predetermined number of semantically independent members with the highest probability values corresponding to a clustered topic as the labels of that clustered topic.
- if the labels of different clustered topics include the same label, the probability values of the same label in the different clustered topics are compared, and the label with the largest probability value is retained as a label of the clustered topic to which it belongs; for the other clustered topics, a semantically independent member with a probability value lower than that of the shared label is used as the label of the clustered topic.
- a computer-readable storage medium storing instructions that, when executed by a processor, configure the processor to perform the method described herein.
- FIG. 1 shows a flowchart of a method for matching semantic text data with tags according to an embodiment of the present invention.
- FIG. 2 shows a flowchart of pre-processing according to an embodiment of the present invention.
- FIG. 3 shows a flowchart of constructing a topic model according to an embodiment of the present invention.
- FIG. 4 shows a flowchart of classification label learning according to an embodiment of the present invention.
- FIG. 5 shows a flowchart of classification model training according to an embodiment of the present invention.
- FIG. 6 shows a schematic diagram of K-means clustering according to an embodiment of the present invention.
- FIG. 7 shows the prediction results for each category label of the SVM classifier according to an embodiment of the present invention.
- the user review data is preprocessed.
- the purpose of pre-processing is to process semantic text data such as user comments to obtain semantically independent members (such as English words, Chinese vocabulary and other morphemes) and raw corpus data.
- semantically independent members such as English words, Chinese vocabulary and other morphemes
- Each semantically independent member is an independent unit for semantic analysis.
- the semantically independent member can also be the smallest unit for semantic analysis.
- word segmentation may be achieved through a Chinese word segmentation toolkit such as jieba (step 202). Then, operations such as removing stop words, removing non-Chinese characters, removing numeric symbols, and performing word error correction are performed on the independent members after word segmentation (step 204); a minimal sketch of this preprocessing is given below.
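- A minimal sketch of this preprocessing (for illustration only; the stop-word list and regular expression are simplified placeholders, and word error correction is omitted):

```python
# Minimal preprocessing sketch: jieba segmentation, stop-word removal, and
# removal of non-Chinese characters and digits. The stop-word list is a placeholder;
# word error correction is omitted for brevity.
import re
import jieba

STOP_WORDS = {"的", "了", "是", "我", "在"}  # placeholder stop-word list

def preprocess(comment: str) -> list[str]:
    # keep only Chinese characters (this drops digits, punctuation and Latin letters)
    cleaned = re.sub(r"[^\u4e00-\u9fa5]", " ", comment)
    tokens = jieba.lcut(cleaned)  # Chinese word segmentation (step 202)
    return [t for t in tokens if t.strip() and t not in STOP_WORDS]  # step 204

print(preprocess("云闪付App无法绑定银行卡，怎么办？"))
```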
- a sentence containing the user's key intentions can also be extracted (not shown in the figure). For example, for user feedback information on a content data platform, only the sentences that include a negative word or a question word may be extracted as the kernel sentences of the original sample, from which the semantically independent members and the original corpus data are then obtained; if such sentences are difficult to extract directly, this step is simply skipped.
- multiple semantically independent members are used to form original corpus data.
- the topic model is determined.
- the relevance of any two morphemes is determined according to the recurrence relationship of the morphemes in natural text, the topic corresponding to that relevance is determined, and the mapping probability relationship between the morphemes and the topic is then determined.
- the recurrence relationship reflects the degree of semantic connection between morphemes. For example, if in a sentence (or a paragraph of text, etc.) the correlation between "payment" and the context semantics reaches a certain value X, the correlation between "swipe" and the context semantics reaches a certain value Y, and X ≈ Y, then it can be considered that there is a strong semantic correlation between "payment" and "swipe".
- the relevance of "payment” and context semantics can be derived from statistics, for example, so the relevance of "payment” and context semantics is statistically determined based on its reproduction in natural text.
- the natural text can be the target text (original corpus data in this article) for investigation and processing, or any meaningful natural text library, such as Baidu Encyclopedia, Wikipedia, Sogou Internet corpus and other natural text corpora.
- step 104 may be implemented in the embodiment shown in FIG. 3.
- a word vector is trained.
- the gensim toolkit is used to train word vectors for subsequent short text modeling. If relatively little data has been collected, the word vector training effect is mediocre; in that case, large Chinese corpora such as the Sogou Internet corpus can be introduced as a supplement, or the Google open-source Chinese word vector model can be used directly. Word vectors make up for the deficiency that TF-IDF cannot measure the semantic similarity between words.
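- As an illustration, a minimal gensim sketch for training such word vectors is given below; the tiny corpus is a placeholder, and the 200-dimensional Skip-gram model with window size 5 matches the concrete embodiment described later in this document.

```python
# Sketch: train Skip-gram (Word2Vec) word vectors with gensim on the segmented corpus.
# Requires gensim >= 4.0; the two-sentence corpus is only a placeholder.
from gensim.models import Word2Vec

corpus = [["闪付", "无法", "开通"], ["绑定", "银行卡", "失败"]]  # token lists from preprocessing

model = Word2Vec(sentences=corpus, vector_size=200, window=5, sg=1, min_count=1)
wv = model.wv  # KeyedVectors reused in the similarity-matrix sketch below
print(wv.most_similar("闪付", topn=3))
```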
- a word-pair similarity matrix is created. First, an index of the distinct words in the text is created; the index serves as the identifier of each word.
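- A minimal sketch of building the word index and the pairwise cosine-similarity matrix from the trained word vectors (assuming the `wv` KeyedVectors from the sketch above; illustrative only):

```python
# Sketch: index the vocabulary and build the word-pair cosine-similarity matrix Sim.
import numpy as np

vocab = list(wv.index_to_key)                      # distinct words of the corpus
index = {word: i for i, word in enumerate(vocab)}  # word -> integer index (its identifier)

vectors = np.stack([wv[w] for w in vocab])
unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
Sim = unit @ unit.T                                # Sim[i, j] = cosine similarity of words i and j
```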
- a word pair-topic probability distribution matrix may first be generated based on the Chinese restaurant process (CRP). Then the number of occurrences of each word pair in each document is counted according to the set of word pairs, and a 1 × N-dimensional matrix is used to store the counts of all word pairs appearing in the document.
- a word pair is a pairing of any two words as basic morphemes.
- the word pair similarity matrix Sim is created for subsequent processing.
- in step 408, the Sim matrix is used to perform Gibbs iterative sampling; the overall corpus-topic matrix and the topic-word pair matrix are obtained by Gibbs sampling in the word-pair topic model, and a text model is thereby established.
- the specific process is as follows. Under the Chinese restaurant process, the probability that the current word pair $D_n$ is assigned to topic $k$, given the assignments $D_{-n}$ of all other word pairs, is

  $$p(D_n = k \mid D_{-n}) = \begin{cases} \dfrac{d_i}{n - 1 + d_0}, & k \text{ is an existing topic } i, \\[1ex] \dfrac{d_0}{n - 1 + d_0}, & k \text{ is a new topic,} \end{cases}$$

  where $d_i$ represents the number of word pairs already in topic $i$, $n-1$ represents the total number of word pairs already existing before the current word pair, and $d_0$ is the initial parameter.
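- Purely as an illustration of the assignment probability above (a sketch of a single CRP draw under the standard form just given, not the full Gibbs sampler of the word-pair topic model):

```python
# Sketch: sample a topic assignment for one word pair under the CRP prior above.
import random

def sample_crp_topic(topic_counts: dict[int, int], d0: float) -> int:
    """topic_counts[i] = number of word pairs already assigned to topic i; d0 = initial parameter."""
    denom = sum(topic_counts.values()) + d0               # (n - 1) + d0
    topics = list(topic_counts)
    weights = [topic_counts[i] / denom for i in topics]   # existing topics
    topics.append(max(topic_counts, default=-1) + 1)      # a brand-new topic
    weights.append(d0 / denom)
    return random.choices(topics, weights=weights, k=1)[0]

print(sample_crp_topic({0: 3, 1: 7}, d0=1.0))
```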
- in step 106, the classification labels are learned. Specifically, as shown in FIG. 4, a user comment-topic probability distribution matrix (step 604) and a topic-word probability distribution matrix (step 602) are generated through inference.
- the short text-topic matrix is used to represent the short texts, that is, the probability distribution over topics is used to represent the short text features:

  $$d_i = \big(p(z_1 \mid d_i),\; p(z_2 \mid d_i),\; \dots,\; p(z_K \mid d_i)\big),$$

  where $p(z_k \mid d_i)$ represents the probability of topic $z_k$ in short text $d_i$, and $K$ is the number of topics over the whole short text corpus.
- in step 606, methods such as K-means clustering can be used to cluster the entire corpus, and the Jensen-Shannon (JS) distance is used in the clustering algorithm to measure the similarity between texts (its standard definition is recalled below).
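- For reference (a standard definition, not reproduced from the original text), the Jensen-Shannon divergence between two topic distributions $P$ and $Q$, on which the JS distance is based, is

  $$\mathrm{JS}(P \,\|\, Q) = \tfrac{1}{2}\,\mathrm{KL}(P \,\|\, M) + \tfrac{1}{2}\,\mathrm{KL}(Q \,\|\, M), \qquad M = \tfrac{1}{2}(P + Q),$$

  where $\mathrm{KL}$ denotes the Kullback-Leibler divergence; the JS distance is commonly taken as the square root of this value.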
- in step 608, all user review corpora in each cluster are traversed; the maximum-probability topic of each review is found according to the user review-topic matrix, the numbers of the different maximum-probability topics are counted, and the most frequent topic is extracted as the clustered topic of that cluster (step 610).
- in step 612, the top n words with the highest probability values are selected from the topic-word probability matrix as the label information of the cluster. The label keywords of each cluster are then checked for duplicates: if different clusters share a repeated keyword, the keywords under the corresponding topic of each cluster are re-selected by comparing the probability values of the shared keyword under the respective topics; in the clusters where it has the smaller probability value, it is replaced by the next word with the highest probability. A minimal sketch of this label selection is given below.
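- A minimal, illustrative sketch of this label selection (assuming a review-topic matrix `doc_topic`, a topic-word matrix `topic_word`, a vocabulary list `vocab` and per-review cluster assignments `cluster_of_doc`; all names are placeholders, and the cross-cluster de-duplication described above is only noted in a comment):

```python
# Sketch: choose each cluster's topic by majority vote of maximum-probability topics,
# then take the top-n words of that topic as the cluster's candidate labels.
import numpy as np
from collections import Counter

def cluster_labels(doc_topic, topic_word, vocab, cluster_of_doc, n_labels=3):
    max_topic = doc_topic.argmax(axis=1)                 # maximum-probability topic per review
    assignments = np.asarray(cluster_of_doc)
    labels = {}
    for c in set(cluster_of_doc):
        topics_in_c = max_topic[assignments == c]
        cluster_topic = Counter(topics_in_c).most_common(1)[0][0]  # most frequent topic in cluster
        top_words = topic_word[cluster_topic].argsort()[::-1][:n_labels]
        labels[c] = [vocab[i] for i in top_words]
    # keywords repeated across clusters would then be re-selected by comparing their
    # probability values under the respective topics, as described above
    return labels
```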
- in step 108, the classification model is trained, specifically according to the embodiment shown in FIG. 5.
- the user comment corpus is automatically labeled with a category label, and then the mapping relationship between the user comment and the label is obtained.
- the labeled user review corpus is obtained from the user reviews according to their clustered topics.
- TF-IDF and word vectors are extracted as text features for each user review corpus.
- two classification algorithms, SVM and bidirectional LSTM, are used to train classification models (step 808), and vote aggregation is then performed by a voting classifier to construct the user comment classification model (step 810); a sketch of these steps is given below.
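- Purely as an illustration of the feature construction and voting steps (a scikit-learn sketch under simplifying assumptions: the bidirectional LSTM branch is replaced by a logistic-regression stand-in, since a full LSTM would not fit a short sketch; `labeled_reviews`, `labels` and the `wv` word vectors are placeholders):

```python
# Sketch: TF-IDF features concatenated with mean word-vector features,
# then soft voting over two classifiers (the BiLSTM is replaced by a stand-in model).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import VotingClassifier

labeled_reviews = ["闪付 无法 开通", "闪付 开通 失败", "开通 闪付",
                   "绑定 银行卡 失败", "银行卡 无法 绑定", "绑定 失败"]  # placeholder reviews
labels = ["payment", "payment", "payment", "binding", "binding", "binding"]  # from clustering step

X_tfidf = TfidfVectorizer().fit_transform(labeled_reviews).toarray()

def mean_vector(text):
    tokens = [t for t in text.split() if t in wv]          # average word vector of a review
    return np.mean([wv[t] for t in tokens], axis=0) if tokens else np.zeros(wv.vector_size)

X = np.hstack([X_tfidf, np.stack([mean_vector(t) for t in labeled_reviews])])

clf = VotingClassifier(
    estimators=[("svm", SVC(probability=True)), ("stand_in", LogisticRegression())],
    voting="soft",
)
clf.fit(X, labels)
```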
- This embodiment mainly analyzes the feedback messages of data platform users: it first extracts semantic feature information from the feedback messages based on the short text feature extraction method proposed by the present invention, and then builds a classification model to realize automatic classification of the user feedback messages.
- the data source is the feedback message data of the data platform APP users of a certain month.
- the original data is mainly stored in the form of text. For specific examples, please refer to Table 1:
- the automatic classification of feedback messages from users of the data platform can be performed as follows, for example.
- Step 1 Preprocess the feedback message data
- Step 2 Characteristic representation of short text feedback from data platform users
- for the corpus preprocessed in Step 1, the Skip-gram model of the Word2Vec method proposed by Google is adopted, and the word2vec implementation in the gensim library is used for training, where the word vector dimension is set to 200 and the Skip-gram window size is set to 5.
- Table 2 shows exemplary results.
| Baidu Encyclopedia word vectors | Domain-specific word vectors |
|---|---|
| ('Flash Payment', 0.8876532316207886) | ('Cloud Payment', 0.7113977074623108) |
| ('Tianyi mobile phone', 0.8041104674339294) | ('UnionPay Wallet', 0.6253437995910645) |
| ('Dual-network dual-standby', 0.7926369905471802) | ('Cloud Flash', 0.5981202125549316) |
| ('Dual standby', 0.7770497798919678) | ('Flash Payment', 0.5895633101463318) |
| ('Mobile payment', 0.7767471075057983) | ('QR code', 0.5603029727935791) |
| ('Card reader', 0.7745838761329651) | ('Mobile phone', 0.5016968250274658) |
| ('Zhifutong', 0.7724637985229492) | ('App', 0.49683672189712524) |
| ('WeChat client', 0.7695549130439758) | ('Separate', 0.4926530122756958) |
| ('Dual-mode dual-standby', 0.7687188386917114) | ('Flash', 0.490323543548584) |
| ('Smartphone', 0.7658545970916748) | ('Scan code', 0.4879230260848999) |
- word vectors trained on the domain corpus can more accurately express payment domain knowledge, which provides more accurate semantic information for subsequent classification.
- Step 3 Extraction of classification labels from data platform user feedback messages
- Step 4 Automatic classification of data platform user messages
- FIG. 7 shows an example of the label prediction result according to this configuration.
- a probability threshold for classification prediction can be set, and data whose predicted category probability is low is processed manually. Taking into account the model accuracy and recall rate, the threshold can be set to 0.6 (see the sketch below).
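- A minimal illustration of this routing rule (reusing the voting classifier `clf`, the feature matrix `X` and the review list from the sketch above; the 0.6 threshold is the one mentioned here):

```python
# Sketch: route low-confidence predictions to manual processing.
THRESHOLD = 0.6

proba = clf.predict_proba(X)                     # per-class probabilities from the voting classifier
confidence = proba.max(axis=1)
predicted = clf.classes_[proba.argmax(axis=1)]

for text, label, conf in zip(labeled_reviews, predicted, confidence):
    if conf >= THRESHOLD:
        print(f"auto:   {text} -> {label} ({conf:.2f})")
    else:
        print(f"manual: {text} ({conf:.2f})")
```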
- the automatic reply method for APP user comments proposed herein can, on the one hand, effectively mine hot topic categories in short text data such as user comments and grasp the main consultation hotspots of users during product use, and on the other hand, realize automatic classification of user comments, which can greatly improve the operating efficiency of the APP.
- the classification label system mentioned in the present invention is based on a self-learning method, which does not require business personnel to manually analyze all text information in the short text corpus; subsequent updating and maintenance of the label system is also completed automatically, which greatly reduces the manual workload and makes the method easier to apply to actual scenarios.
- the classification training corpus of the present invention is also generated in the process of classification labeling, so there is no need to manually mark the corpus.
- the present invention merges the entire short text corpus for topic modeling to effectively alleviate the problem of text sparseness.
- the similarity of word pairs is incorporated, so the contextual relationship between different word pairs in the text is taken into account; this allows wider semantic features to be extracted from the text and gives the representation stronger semantic expression ability.
- the features of each short text include both the features calculated by TF-IDF and the features extracted by the topic model, which not only captures statistical information but also incorporates contextual information.
Claims (21)
- A method for matching semantic text data with tags, characterized in that the method comprises: preprocessing a plurality of semantic text data to obtain original corpus data comprising a plurality of semantically independent members; determining the degree of association between any two of the plurality of semantically independent members according to the recurrence relationship of the plurality of semantically independent members in natural text, determining the topic corresponding to the association according to the degree of association between the two, and thereby determining a mapping probability relationship between the plurality of semantic text data and the topic; selecting one of the plurality of semantically independent members corresponding to the association as the tag of the topic, and mapping the plurality of semantic text data to the tag according to the determined mapping probability relationship between the plurality of semantic text data and the topic; and using the determined mapping relationship between the plurality of semantic text data and the tag as supervising material, and matching unmapped semantic text data to the tag according to the supervising material.
- The method according to claim 1, characterized in that the preprocessing comprises one or more of: word-segmenting the plurality of semantic text data, removing stop words, removing non-Chinese characters, removing numeric symbols, and performing word error correction.
- The method according to claim 1, characterized in that the preprocessing comprises extracting only the plurality of semantic text data containing negative semantics and/or interrogative semantics.
- The method according to claim 1, characterized in that the recurrence relationship in natural text is the degree of contextual recurrence association in the original corpus data and/or in a natural text corpus.
- The method according to claim 1 or 4, characterized in that determining the degree of association between any two of the plurality of semantically independent members comprises: indexing all semantically independent members in the original corpus data; determining the word vectors of the semantically independent members in the original corpus data, and determining the similarity between any two of the semantically independent members; and constructing a similarity matrix of semantically independent member pairs according to the index and the similarity.
- The method according to claim 5, characterized in that determining the topic corresponding to the association according to the degree of association between the two comprises: performing Gibbs iterative sampling on the similarity matrix to obtain the mapping relationship between the original corpus data and the topic and the mapping relationship between the topic and the semantically independent member pairs, and thereby determining the mapping probability relationship between the plurality of semantic text data and the topic and the mapping probability relationship between the topic and the plurality of semantically independent members.
- The method according to claim 6, characterized in that selecting one of the plurality of semantically independent members corresponding to the association as the tag of the topic comprises: clustering the plurality of semantic text data, and determining the topics of the clustered plurality of semantic text data according to the mapping relationship between the plurality of semantic text data and the topic; and mapping the topics of the clustered plurality of semantic text data to semantically independent members according to the mapping probability relationship between the topic and the plurality of semantically independent members, as the tags corresponding to the clustered topics.
- The method according to claim 7, characterized in that determining the topics of the clustered plurality of semantic text data according to the mapping probability relationship between the plurality of semantic text data and the topic comprises: determining the maximum-probability topic of each of the plurality of semantic text data; determining the number of each maximum-probability topic in each cluster; and taking the maximum-probability topic with the largest count in a cluster as the clustered topic.
- The method according to claim 8, characterized in that a predetermined number of semantically independent members with the highest probability values corresponding to a clustered topic are determined, according to the mapping probability relationship between the topic and the plurality of semantically independent members, as the tags of the clustered topic.
- The method according to claim 9, characterized in that, if the tags of different clustered topics include the same tag, the probability values of the same tag in the different clustered topics are compared, and the tag with the largest probability value is retained as a tag of the clustered topic to which it belongs; for the clustered topics other than the one to which the tag with the largest probability value belongs, semantically independent members with probability values lower than the probability value of the same tag are used as tags of the clustered topics.
- An apparatus for matching semantic text data with tags, characterized in that the apparatus comprises: a preprocessing unit configured to preprocess a plurality of semantic text data to obtain original corpus data comprising a plurality of semantically independent members; a topic model unit configured to determine the degree of association between any two of the plurality of semantically independent members according to the recurrence relationship of the plurality of semantically independent members in natural text, to determine the topic corresponding to the association according to the degree of association between the two, and thereby to determine the mapping probability relationship between the plurality of semantic text data and the topic; a tag determination unit configured to select one of the plurality of semantically independent members corresponding to the association as the tag of the topic, and to map the plurality of semantic text data to the tag according to the determined mapping probability relationship between the plurality of semantic text data and the topic; and a tag matching unit configured to use the determined mapping relationship between the plurality of semantic text data and the tag as supervising material, and to match unmapped semantic text data to the tag according to the supervising material.
- The apparatus according to claim 11, characterized in that the preprocessing comprises one or more of: word-segmenting the plurality of semantic text data, removing stop words, removing non-Chinese characters, removing numeric symbols, and performing word error correction.
- The apparatus according to claim 11, characterized in that the preprocessing comprises extracting only the plurality of semantic text data containing negative semantics and/or interrogative semantics.
- The apparatus according to claim 11, characterized in that the recurrence relationship in natural text is the degree of contextual recurrence association in the original corpus data and/or in a natural text corpus.
- The apparatus according to claim 11 or 14, characterized in that the topic model unit is configured to determine the degree of association between any two of the plurality of semantically independent members by: indexing all semantically independent members in the original corpus data; determining the word vectors of the semantically independent members in the original corpus data, and determining the similarity between any two of the semantically independent members; and constructing a similarity matrix of semantically independent member pairs according to the index and the similarity.
- The apparatus according to claim 15, characterized in that the topic model unit is configured to determine the topic corresponding to the association according to the degree of association between the two by: performing Gibbs iterative sampling on the similarity matrix to obtain the mapping relationship between the original corpus data and the topic and the mapping relationship between the topic and the semantically independent member pairs, and thereby determining the mapping probability relationship between the plurality of semantic text data and the topic and the mapping probability relationship between the topic and the plurality of semantically independent members.
- The apparatus according to claim 16, characterized in that the tag determination unit is configured to select one of the plurality of semantically independent members corresponding to the association as the tag of the topic by: clustering the plurality of semantic text data, and determining the topics of the clustered plurality of semantic text data according to the mapping relationship between the plurality of semantic text data and the topic; and mapping the topics of the clustered plurality of semantic text data to semantically independent members according to the mapping probability relationship between the topic and the plurality of semantically independent members, as the tags corresponding to the clustered topics.
- The apparatus according to claim 17, characterized in that the tag determination unit is configured to determine the topics of the clustered plurality of semantic text data according to the mapping probability relationship between the plurality of semantic text data and the topic by: determining the maximum-probability topic of each of the plurality of semantic text data; determining the number of each maximum-probability topic in each cluster; and taking the maximum-probability topic with the largest count in a cluster as the clustered topic.
- The apparatus according to claim 18, characterized in that the tag determination unit is configured to determine, according to the mapping probability relationship between the topic and the plurality of semantically independent members, a predetermined number of semantically independent members with the highest probability values corresponding to a clustered topic as the tags of the clustered topic.
- The apparatus according to claim 19, characterized in that the tag determination unit is configured to: if the tags of different clustered topics include the same tag, compare the probability values of the same tag in the different clustered topics, and retain the tag with the largest probability value as a tag of the clustered topic to which it belongs; and, for the clustered topics other than the one to which the tag with the largest probability value belongs, use semantically independent members with probability values lower than the probability value of the same tag as tags of the clustered topics.
- A computer-readable storage medium storing instructions which, when executed by a processor, configure the processor to perform the method according to any one of claims 1-10.
Priority Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2021501074A JP7164701B2 (ja) | 2018-12-27 | 2019-07-04 | Method and device for matching semantic text data with tags, and computer-readable storage medium storing instructions |
US17/260,177 US11586658B2 (en) | 2018-12-27 | 2019-07-04 | Method and device for matching semantic text data with a tag, and computer-readable storage medium having stored instructions |
KR1020207028156A KR20200127020A (ko) | 2018-12-27 | 2019-07-04 | Method and apparatus for matching semantic text data with tags, and computer-readable storage medium storing instructions |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811610544.4A CN110032639B (zh) | 2018-12-27 | 2018-12-27 | Method, device and storage medium for matching semantic text data with tags |
CN201811610544.4 | 2018-12-27 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2020134008A1 true WO2020134008A1 (zh) | 2020-07-02 |
Family
ID=67235412
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2019/094646 WO2020134008A1 (zh) | 2018-12-27 | 2019-07-04 | 一种将语义文本数据与标签匹配的方法、装置以及一种储存指令的计算机可读存储介质 |
Country Status (5)
Country | Link |
---|---|
US (1) | US11586658B2 (zh) |
JP (1) | JP7164701B2 (zh) |
KR (1) | KR20200127020A (zh) |
CN (1) | CN110032639B (zh) |
WO (1) | WO2020134008A1 (zh) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114281928A (zh) * | 2020-09-28 | 2022-04-05 | 中国移动通信集团广西有限公司 | Model generation method, device and equipment based on text data |
Families Citing this family (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110515895B (zh) * | 2019-08-30 | 2023-06-23 | 北京燕山电子设备厂 | Method and system for associated storage of data files in a big data storage system |
CN111274798B (zh) * | 2020-01-06 | 2023-08-18 | 北京大米科技有限公司 | Text topic word determination method, device, storage medium and terminal |
CN111310438B (zh) * | 2020-02-20 | 2021-06-08 | 齐鲁工业大学 | Intelligent semantic matching method and device for Chinese sentences based on a multi-granularity fusion model |
CN111311450B (zh) * | 2020-02-28 | 2024-03-29 | 重庆百事得大牛机器人有限公司 | Big data management platform and method for legal consulting services |
CN111695358B (zh) * | 2020-06-12 | 2023-08-08 | 腾讯科技(深圳)有限公司 | Word vector generation method and device, computer storage medium and electronic device |
CN112989971B (zh) * | 2021-03-01 | 2024-03-22 | 武汉中旗生物医疗电子有限公司 | Electrocardiogram data fusion method and device for different data sources |
CN112926339B (zh) * | 2021-03-09 | 2024-02-09 | 北京小米移动软件有限公司 | Text similarity determination method, system, storage medium and electronic device |
CN113934819A (zh) * | 2021-10-14 | 2022-01-14 | 陈鹏 | Context-based tag management method, device, server and storage medium |
CN114398968B (zh) * | 2022-01-06 | 2022-09-20 | 北京博瑞彤芸科技股份有限公司 | Method and device for labeling similar customer-acquisition documents based on document similarity |
CN114896398A (zh) * | 2022-05-05 | 2022-08-12 | 南京邮电大学 | Text classification system and method based on feature selection |
CN116151542A (zh) * | 2022-11-30 | 2023-05-23 | 上海韵达高新技术有限公司 | Real-time logistics order monitoring method, device, equipment and storage medium |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20100030780A1 (en) * | 2008-07-30 | 2010-02-04 | Kave Eshghi | Identifying related objects in a computer database |
CN105975475A (zh) * | 2016-03-31 | 2016-09-28 | 华南理工大学 | 基于中文短语串的细粒度主题信息抽取方法 |
CN106033445A (zh) * | 2015-03-16 | 2016-10-19 | 北京国双科技有限公司 | 获取文章关联度数据的方法和装置 |
Family Cites Families (24)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
GB2391967A (en) * | 2002-08-16 | 2004-02-18 | Canon Kk | Information analysing apparatus |
JP4521343B2 (ja) * | 2005-09-29 | 2010-08-11 | 株式会社東芝 | 文書処理装置及び文書処理方法 |
US10536728B2 (en) * | 2009-08-18 | 2020-01-14 | Jinni | Content classification system |
GB2488925A (en) | 2009-12-09 | 2012-09-12 | Ibm | Method of searching for document data files based on keywords,and computer system and computer program thereof |
JP5252593B2 (ja) | 2010-08-12 | 2013-07-31 | Necビッグローブ株式会社 | 最適タグ提案装置、最適タグ提案システム、最適タグ提案方法、およびプログラム |
JP2014153977A (ja) | 2013-02-12 | 2014-08-25 | Mitsubishi Electric Corp | コンテンツ解析装置、コンテンツ解析方法、コンテンツ解析プログラム、およびコンテンツ再生システム |
US9311386B1 (en) * | 2013-04-03 | 2016-04-12 | Narus, Inc. | Categorizing network resources and extracting user interests from network activity |
KR101478016B1 (ko) | 2013-09-04 | 2015-01-02 | 한국과학기술정보연구원 | 공기 정보를 이용한 문장 클러스터 기반의 정보 검색 장치 및 방법 |
US10510018B2 (en) * | 2013-09-30 | 2019-12-17 | Manyworlds, Inc. | Method, system, and apparatus for selecting syntactical elements from information as a focus of attention and performing actions to reduce uncertainty |
US10509814B2 (en) * | 2014-12-19 | 2019-12-17 | Universidad Nacional De Educacion A Distancia (Uned) | System and method for the indexing and retrieval of semantically annotated data using an ontology-based information retrieval model |
CN106156204B (zh) * | 2015-04-23 | 2020-05-29 | 深圳市腾讯计算机系统有限公司 | 文本标签的提取方法和装置 |
CN104850650B (zh) * | 2015-05-29 | 2018-04-10 | 清华大学 | 基于类标关系的短文本扩充方法 |
EP3151131A1 (en) | 2015-09-30 | 2017-04-05 | Hitachi, Ltd. | Apparatus and method for executing an automated analysis of data, in particular social media data, for product failure detection |
CN106055538B (zh) * | 2016-05-26 | 2019-03-08 | 达而观信息科技(上海)有限公司 | 主题模型和语义分析相结合的文本标签自动抽取方法 |
KR101847847B1 (ko) | 2016-11-15 | 2018-04-12 | 주식회사 와이즈넛 | 딥러닝을 이용한 비정형 텍스트 데이터의 문서 군집화 방법 |
CN107301199B (zh) * | 2017-05-17 | 2021-02-12 | 北京融数云途科技有限公司 | 一种数据标签生成方法和装置 |
US10311454B2 (en) * | 2017-06-22 | 2019-06-04 | NewVoiceMedia Ltd. | Customer interaction and experience system using emotional-semantic computing |
CN107798043B (zh) * | 2017-06-28 | 2022-05-03 | 贵州大学 | 基于狄利克雷多项混合模型的长文本辅助短文本的文本聚类方法 |
US10678816B2 (en) * | 2017-08-23 | 2020-06-09 | Rsvp Technologies Inc. | Single-entity-single-relation question answering systems, and methods |
CN107818153B (zh) * | 2017-10-27 | 2020-08-21 | 中航信移动科技有限公司 | 数据分类方法和装置 |
CN108399228B (zh) * | 2018-02-12 | 2020-11-13 | 平安科技(深圳)有限公司 | 文章分类方法、装置、计算机设备及存储介质 |
CN108763539B (zh) * | 2018-05-31 | 2020-11-10 | 华中科技大学 | 一种基于词性分类的文本分类方法和系统 |
CN108959431B (zh) * | 2018-06-11 | 2022-07-05 | 中国科学院上海高等研究院 | 标签自动生成方法、系统、计算机可读存储介质及设备 |
US11397859B2 (en) * | 2019-09-11 | 2022-07-26 | International Business Machines Corporation | Progressive collocation for real-time discourse |
Also Published As
Publication number | Publication date |
---|---|
KR20200127020A (ko) | 2020-11-09 |
US20210286835A1 (en) | 2021-09-16 |
JP2021518027A (ja) | 2021-07-29 |
CN110032639A (zh) | 2019-07-19 |
US11586658B2 (en) | 2023-02-21 |
CN110032639B (zh) | 2023-10-31 |
JP7164701B2 (ja) | 2022-11-01 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 19902446; Country of ref document: EP; Kind code of ref document: A1 |
| | ENP | Entry into the national phase | Ref document number: 20207028156; Country of ref document: KR; Kind code of ref document: A. Ref document number: 2021501074; Country of ref document: JP; Kind code of ref document: A |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | 122 | Ep: pct application non-entry in european phase | Ref document number: 19902446; Country of ref document: EP; Kind code of ref document: A1 |