WO2023134074A1 - Method, apparatus, device and storage medium for generating text topics (文本主题的生成方法、装置、设备及存储介质) - Google Patents

Method, apparatus, device and storage medium for generating text topics

Info

Publication number
WO2023134074A1
WO2023134074A1 · PCT/CN2022/090162 · CN2022090162W
Authority
WO
WIPO (PCT)
Prior art keywords
text
target
encoding
vector
clustering
Prior art date
Application number
PCT/CN2022/090162
Other languages
English (en)
French (fr)
Inventor
陈浩
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2023134074A1

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the present application relates to the technical field of artificial intelligence, and in particular to a method, an apparatus, a device, and a storage medium for generating text topics.
  • Topic modeling is often used when humans cannot reasonably read and sort large volumes of text. Given a corpus consisting of many texts, a topic model will discover the underlying semantic structure and themes present in the texts. Topics can then be used to find high-level summaries of large collections of texts, search for documents of interest, group similar documents, etc.
  • LDA: Latent Dirichlet Allocation.
  • PLSA: Probabilistic Latent Semantic Analysis.
  • the main purpose of this application is to provide a method, apparatus, device, and storage medium for generating text topics, aiming to solve a technical problem of the prior-art LDA and PLSA models: they usually require complex operations such as setting a custom stop-word list and converting traditional Chinese characters to simplified ones, and they ignore the order and semantics of words, resulting in low accuracy in determining topics.
  • the present application proposes a method for generating a text topic, the method comprising:
  • a target text topic is generated for each of the encoding vector clustering sets.
  • the present application also proposes a device for generating a text topic, the device comprising:
  • a target text acquisition module configured to acquire multiple target texts
  • a vocabulary generation module configured to perform word segmentation and word deduplication on each of the target texts to obtain a vocabulary;
  • a text encoding vector determination module configured to encode each of the target texts to obtain a text encoding vector
  • a word encoding vector determination module used to encode each word in the vocabulary to obtain a word encoding vector
  • a clustering module configured to cluster each of the text encoding vectors to obtain a plurality of encoding vector clustering sets
  • a clustering set topic vector determination module configured to average each of the encoding vector clustering sets to obtain a clustering set topic vector;
  • a target similarity determination module configured to calculate the similarity between each of the word encoding vectors and each of the clustering set topic vectors to obtain the target similarity
  • a target text topic generating module configured to generate a target text topic for each of the encoding vector clusters according to the vocabulary and each of the target similarities.
  • the present application also proposes a computer device, including a memory and a processor, the memory storing a computer program, and the processor, when executing the computer program, implementing a method for generating a text topic, wherein the method includes: obtaining a plurality of target texts; performing word segmentation and word deduplication on each of the target texts to obtain a vocabulary; encoding each of the target texts to obtain a text encoding vector; encoding each word in the vocabulary to obtain a word encoding vector; clustering each of the text encoding vectors to obtain a plurality of encoding vector clustering sets; averaging each of the encoding vector clustering sets to obtain a clustering set topic vector; calculating the similarity between each of the word encoding vectors and each of the clustering set topic vectors to obtain a target similarity; and generating, according to the vocabulary and each of the target similarities, a target text topic for each of the encoding vector clustering sets.
  • the present application also proposes a computer-readable storage medium on which a computer program is stored, and the computer program, when executed by a processor, implements a method for generating a text topic, wherein the method includes: obtaining a plurality of target texts; performing word segmentation and word deduplication on each of the target texts to obtain a vocabulary; encoding each of the target texts to obtain a text encoding vector; encoding each word in the vocabulary to obtain a word encoding vector; clustering each of the text encoding vectors to obtain a plurality of encoding vector clustering sets; averaging each of the encoding vector clustering sets to obtain a clustering set topic vector; calculating the similarity between each of the word encoding vectors and each of the clustering set topic vectors to obtain a target similarity; and generating, according to the vocabulary and each of the target similarities, a target text topic for each of the encoding vector clustering sets.
  • in the method, apparatus, computer device, and storage medium for generating text topics of the present application, the method obtains multiple target texts; performs word segmentation and word deduplication on each of the target texts to obtain a vocabulary; encodes each of the target texts to obtain a text encoding vector; encodes each word in the vocabulary to obtain a word encoding vector; clusters each of the text encoding vectors to obtain a plurality of encoding vector clustering sets; averages each of the encoding vector clustering sets to obtain a clustering set topic vector; calculates the similarity between each of the word encoding vectors and each of the clustering set topic vectors to obtain a target similarity; and, according to the vocabulary and each of the target similarities, generates a target text topic for each of the encoding vector clustering sets.
  • by encoding the text into a vector representation, the semantic information of the text is captured and the order information between words is preserved; after clustering based on the text encoding vectors, the clustering set topic vector of each clustering set is determined, and words are also encoded into vector representations, so that the text encoding vectors, the word encoding vectors, and the clustering set topic vectors are mapped to the same vector space; determining the text topic within this shared vector space improves the accuracy of the text topic, and there is no need for complex operations such as setting a custom stop-word list or converting traditional Chinese characters to simplified ones.
  • FIG. 1 is a schematic flow diagram of a method for generating a text theme according to an embodiment of the present application
  • Fig. 2 is a structural schematic block diagram of a device for generating text topics according to an embodiment of the present application
  • FIG. 3 is a schematic block diagram of a computer device according to an embodiment of the present application.
  • a method for generating a text topic is provided in an embodiment of the present application, which relates to the field of artificial intelligence technology, and the method includes:
  • S8 Perform target text topic generation for each of the encoding vector clustering sets according to the vocabulary and each of the target similarities.
  • the semantic information of the text is captured by encoding the text into a vector representation, and the order information between words is retained; after clustering based on the text encoding vectors, the clustering set topic vector of each clustering set is determined, and words are encoded into vector representations, mapping the text encoding vectors, word encoding vectors, and clustering set topic vectors to the same vector space; determining the text topic based on this same vector space improves the accuracy of the text topic, and there is no need for complex operations such as setting a custom stop-word list or converting traditional Chinese characters to simplified ones.
  • multiple target texts input by the user can be obtained, multiple target texts can also be obtained from a database, and multiple target texts can also be obtained from a third-party application system.
  • the target text is a text containing one or more sentences.
  • each of the target texts is merged into one document to obtain a document to be processed; word segmentation is performed on the document to be processed to obtain a word set; the words in the word set are deduplicated, and the deduplicated set of words serves as the vocabulary.
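The merge, segment, and deduplicate steps above can be sketched as follows. This is a minimal illustration: whitespace splitting stands in for a real word segmenter (Chinese text would need a segmentation tool such as jieba), and the function name is ours, not the patent's.

```python
def build_vocabulary(target_texts):
    """Toy sketch of step S2: merge target texts into one document,
    segment it into words, and deduplicate the word set."""
    # Merge all target texts into one document to be processed.
    document = " ".join(target_texts)
    # Word segmentation (toy version: whitespace split).
    words = document.split()
    # Deduplicate while preserving first-seen order.
    return list(dict.fromkeys(words))

texts = ["the cat sat", "the dog sat"]
vocab = build_vocabulary(texts)  # ["the", "cat", "sat", "dog"]
```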
  • each target text is encoded, and the encoded data is used as a text encoding vector, that is, the text encoding vector corresponds to the target text one-to-one.
  • each target text into a preset encoding model for encoding, and use the encoded data as a text encoding vector.
  • the coding model is a model based on neural network training.
  • the encoding model is a model trained based on the Bert model.
  • each word in the vocabulary is encoded, and the encoded data is used as a word encoding vector, that is, the word encoding vector is in one-to-one correspondence with the words in the vocabulary.
  • each word in the vocabulary is input into the encoding model for encoding, and the encoded data is used as a word encoding vector.
  • step S3 and step S4 adopt the same coding model.
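To make the shared-encoder idea concrete, here is a self-contained sketch. The character-trigram hash encoder below is purely a stand-in assumption for the Bert-based model the patent describes; it matters only that the same function maps both whole texts (step S3) and single words (step S4) into the same vector space.

```python
import hashlib

DIM = 8  # toy embedding dimension; a Bert-based encoder would output e.g. 768

def encode(text):
    """Deterministic stand-in for the Bert-based encoding model: hashes
    character trigrams into a fixed-size count vector. Illustrative only."""
    vec = [0.0] * DIM
    padded = f"#{text}#"
    for i in range(len(padded) - 2):
        gram = padded[i:i + 3]
        h = int(hashlib.md5(gram.encode("utf-8")).hexdigest(), 16)
        vec[h % DIM] += 1.0
    return vec

# The same encoder maps both whole texts and single words into one space,
# so their vectors are directly comparable later.
text_vec = encode("the cat sat")
word_vec = encode("cat")
```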
  • each of the text encoding vectors is clustered, and each set obtained by clustering is used as a clustering set of encoding vectors.
  • each of the text encoding vectors is clustered by using the K-Means clustering algorithm and a preset number of clusters, and each set obtained by clustering is used as an encoding vector clustering set.
  • K-Means refers to the K-means clustering algorithm.
  • the average value is calculated for each of the text encoding vectors in each of the encoding vector clusters, and the calculated vector is used as a cluster set topic vector. That is to say, there is a one-to-one correspondence between the cluster set topic vectors and the encoding vector cluster sets.
  • the target similarity is used to measure the similarity between a coding vector of a word and a subject vector of the cluster set.
  • from each of the target similarities, the one or more target similarities indicating the words most similar to each of the encoding vector clustering sets are found; for the same encoding vector clustering set, each word in the vocabulary corresponding to each of those target similarities is used as the target text topic of that encoding vector clustering set.
  • the target text topic is the text topic of each target text corresponding to the encoding vector cluster set.
  • the above-mentioned step of obtaining multiple target texts includes:
  • S12 Perform blank character deletion processing, repeated punctuation deletion processing, and special symbol deletion processing on each of the news texts to obtain the target text.
  • this embodiment uses the news text, after blank character deletion, repeated punctuation deletion, and special symbol deletion, as the target text, so that the target text topic determined by this application can be used for news classification; data cleaning reduces noise, which improves the accuracy of identifying the target text topics.
  • multiple news texts input by the user may be obtained, multiple news texts may be obtained from a database, or multiple news texts may be obtained from a third-party application system.
  • the news text is the text of a piece of news.
  • the news text includes: the news title, the news summary, and the news body.
  • for S12, preset regular expressions are adopted to perform blank character deletion, repeated punctuation deletion, and special symbol deletion on each of the news texts, and each processed news text is used as a target text.
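The three deletions of step S12 could look like the following. The specific patterns are illustrative assumptions; the patent only states that preset regular expressions are used.

```python
import re

def clean_news_text(text):
    """Sketch of step S12 with example regular expressions (the patterns
    themselves are our assumption, not the patent's)."""
    # Blank-character deletion (removes all whitespace, as is common for
    # Chinese text, which has no inter-word spaces).
    text = re.sub(r"\s+", "", text)
    # Repeated-punctuation deletion: collapse runs of the same mark.
    text = re.sub(r"([,.!?;:，。！？；：])\1+", r"\1", text)
    # Special-symbol deletion.
    text = re.sub(r"[#*@^~`|\\<>$%&]", "", text)
    return text

cleaned = clean_news_text("Breaking!!  news## here")  # "Breaking!newshere"
```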
  • the above-mentioned step of deleting blank characters, deleting repeated punctuation and deleting special symbols for each of the news texts to obtain the target text includes:
  • S121 Perform blank character deletion processing, repeated punctuation deletion processing, and special symbol deletion processing on each of the news texts to obtain text to be processed;
  • S122 Find each of the to-be-processed texts whose number of words is greater than a preset number of words from each of the to-be-processed texts as the target text.
  • the preset regular expressions are used to perform blank character deletion, repeated punctuation deletion, and special symbol deletion on each of the news texts, and each processed news text is used as one text to be processed.
  • each text to be processed whose number of words is greater than a preset number of words is found from each of the texts to be processed as the target text, so as to eliminate the texts to be processed whose number of words is less than or equal to the preset number of words.
  • the preset word count is set to 1000.
  • the above step of clustering each of the text encoding vectors to obtain a plurality of encoding vector clustering sets includes:
  • S55 Use the target vector average value as the cluster center of the to-be-judged cluster set corresponding to the target vector average value, wherein the target vector average value is any one of the vector average values;
  • this embodiment uses the K-Means clustering algorithm and a preset number of clusters to cluster each of the text encoding vectors, which carry the semantic information of the text and the order information between words, thereby improving the accuracy of the encoding vector clustering sets obtained by clustering.
  • the number of cluster centers is set to be the same as the preset number of clusters, that is, the number of cluster centers is the same as the number of clusters.
  • the average value of the target vector is used as the cluster center of the cluster set to be judged corresponding to the average value of the target vector, thereby realizing updating of the cluster centers.
  • each of the clustering sets to be judged is used as a clustering set of encoding vectors, thereby realizing clustering of each of the encoding vectors of the text having semantic information of the text and sequence information between words.
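The clustering loop described above (initialize centers, assign by cosine distance, update centers to cluster means, stop when assignments settle) can be sketched in plain Python. Initializing the centers from the first k points is our simplification; a real implementation might use random or k-means++ initialization.

```python
import math

def cosine_distance(u, v):
    # 1 - cosine similarity; the patent uses cosine similarity as the
    # clustering distance measure (step S521).
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u)) or 1.0
    nv = math.sqrt(sum(b * b for b in v)) or 1.0
    return 1.0 - dot / (nu * nv)

def mean_vector(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def kmeans_cosine(vectors, k, max_iters=100):
    """Sketch of steps S51-S57: assign each text encoding vector to the
    nearest cluster center by cosine distance, update each center to the
    mean of its cluster, and stop when the centers no longer change."""
    centers = [list(vectors[i]) for i in range(k)]  # init from first k points
    assignment = None
    for _ in range(max_iters):
        new_assignment = [
            min(range(k), key=lambda c: cosine_distance(v, centers[c]))
            for v in vectors
        ]
        if new_assignment == assignment:  # assignments settled
            break
        assignment = new_assignment
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                centers[c] = mean_vector(members)
    return assignment, centers

vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels, centers = kmeans_cosine(vecs, k=2)
```

Note that the per-cluster `mean_vector` here is the same averaging that later produces the clustering set topic vectors.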
  • the step of calculating the distance between each of the text encoding vectors and each of the cluster centers to obtain the distance to be analyzed includes:
  • S521 Using a cosine similarity algorithm, calculate the cosine similarity between each of the text encoding vectors and each of the cluster centers to obtain the distance to be analyzed.
  • the cosine similarity algorithm is used as the vector measurement index of the clustering algorithm, thereby better measuring the distance between vectors and improving the accuracy of clustering.
  • the cosine similarity algorithm is used to calculate the cosine similarity between each of the text encoding vectors and each of the cluster centers, and the calculated cosine similarity is used as the distance to be analyzed.
  • the step of calculating the similarity between each of the word encoding vectors and each of the clustering set topic vectors to obtain the target similarity includes:
  • the cosine similarity algorithm is used as the measure index of the distance between the word encoding vector and the cluster subject vector, so as to better measure the distance between the vectors and improve the accuracy of the target similarity.
  • the cosine similarity algorithm is used to calculate the cosine similarity between each of the word encoding vectors and each of the clustering topic vectors, and use the calculated similarity as the target similarity.
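The target-similarity computation is one cosine similarity per (word encoding vector, clustering set topic vector) pair; a minimal sketch with made-up 2-D vectors:

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u)) or 1.0
    nv = math.sqrt(sum(b * b for b in v)) or 1.0
    return dot / (nu * nv)

# Illustrative word encoding vectors and clustering set topic vectors.
word_vecs = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}
topic_vecs = [[0.9, 0.1], [0.1, 0.9]]

# One target similarity per (word, cluster set) pair.
target_similarity = {
    (word, c): cosine_similarity(vec, topic)
    for word, vec in word_vecs.items()
    for c, topic in enumerate(topic_vecs)
}
```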
  • the above step of generating target text topics for each of the encoding vector clustering sets according to the vocabulary and each of the target similarities includes:
  • a preset number of words corresponding to the smallest target similarities are obtained as the target text topic, so that the text topic is determined based on the same vector space, which improves the accuracy of the text topic.
  • the method of acquiring from the beginning is adopted, that is, the smallest target similarity is obtained first, which provides a basis for finding the most similar words.
  • using the method of acquiring from the beginning, a preset number of target similarities are found from the similarity set, and the found target similarities are used as the hit similarity set.
  • when the number of target similarities in the similarity set is less than the preset number, the number of target similarities in the hit similarity set will also be less than the preset number.
  • the step of finding a preset number of target similarities from the similarity set by the method of acquiring from the beginning to obtain the hit similarity set includes: dividing the similarity set by a preset similarity threshold to obtain a first set and a second set; and finding a preset number of target similarities from the first set by the method of acquiring from the beginning to obtain the hit similarity set. Therefore, the target similarities in the hit similarity set are all smaller than the preset similarity threshold, which further improves the accuracy of text topics.
  • the target similarities in the first set are all smaller than a preset similarity threshold, and the target similarities in the second set are all greater than or equal to a preset similarity threshold.
  • each word in the vocabulary corresponding to the hit similarity set is used as the target text topic corresponding to the target coding vector clustering set, thereby realizing the determination of the text topic based on the same vector space, Improved accuracy of text themes.
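The selection just described (sort ascending, split by threshold, take the first preset number) can be sketched as below. Taking the *smallest* values follows the patent's "acquire from the beginning" description, which presumes the target similarity behaves like a distance; the function name and example values are ours.

```python
def select_topic_words(similarities, preset_number, threshold):
    """Sketch of topic-word selection for one encoding vector clustering
    set. similarities: dict mapping word -> target similarity."""
    # Sort the target similarities in ascending order (the similarity set).
    ordered = sorted(similarities.items(), key=lambda kv: kv[1])
    # First set: target similarities below the preset similarity threshold.
    first_set = [(w, s) for w, s in ordered if s < threshold]
    # Hit similarity set: acquire the first preset_number from the beginning.
    return [w for w, _ in first_set[:preset_number]]

sims = {"alpha": 0.05, "beta": 0.30, "gamma": 0.10, "delta": 0.90}
topic = select_topic_words(sims, preset_number=2, threshold=0.5)
# ["alpha", "gamma"]
```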
  • the present application also proposes a device for generating text topics, the device comprising:
  • a target text acquisition module 100 configured to acquire a plurality of target texts
  • a vocabulary generation module 200 configured to perform word segmentation and word deduplication on each of the target texts to obtain a vocabulary;
  • a text encoding vector determination module 300 configured to encode each of the target texts to obtain a text encoding vector
  • a word encoding vector determination module 400 configured to encode each word in the vocabulary to obtain a word encoding vector;
  • a clustering module 500 configured to cluster each of the text encoding vectors to obtain a plurality of encoding vector clustering sets
  • a clustering set topic vector determination module 600 configured to average each of the encoding vector clustering sets to obtain a clustering set topic vector;
  • a target similarity determining module 700 configured to calculate the similarity between each of the word encoding vectors and each of the clustering subject vectors to obtain the target similarity
  • the target text topic generating module 800 is configured to generate a target text topic for each of the encoding vector clustering sets according to the vocabulary and each of the target similarities.
  • the semantic information of the text is captured by encoding the text into a vector representation, and the order information between words is retained; after clustering based on the text encoding vectors, the clustering set topic vector of each clustering set is determined, and words are encoded into vector representations, mapping the text encoding vectors, word encoding vectors, and clustering set topic vectors to the same vector space; determining the text topic based on this same vector space improves the accuracy of the text topic, and there is no need for complex operations such as setting a custom stop-word list or converting traditional Chinese characters to simplified ones.
  • the above-mentioned target text acquisition module 100 includes: a news text acquisition submodule and a target text determination submodule;
  • the news text acquisition submodule is used to acquire multiple news texts
  • the target text determination sub-module is used to respectively perform blank character deletion processing, repeated punctuation deletion processing and special symbol deletion processing on each of the news texts to obtain the target text.
  • the above-mentioned target text determination submodule includes: a text to be processed determination unit and a screening unit;
  • the text-to-be-processed determining unit is configured to perform blank character deletion processing, repeated punctuation deletion processing, and special symbol deletion processing on each of the news texts to obtain the text to be processed;
  • the screening unit is configured to find each text to be processed whose number of words is greater than a preset number of words from each text to be processed as the target text.
  • the clustering module 500 includes: a cluster center setting submodule, a distance calculation submodule to be analyzed, a cluster set determination submodule to be judged, a vector average calculation submodule, a cluster center update submodule, The loop control submodule and the encoding vector clustering set determination submodule;
  • the cluster center setting submodule is used to set the cluster centers whose number is the same as the preset number of clusters, and initialize each of the cluster centers;
  • the to-be-analyzed distance calculation submodule is used to calculate the distance between each of the text encoding vectors and each of the cluster centers to obtain the to-be-analyzed distance;
  • the sub-module for determining the cluster set to be judged is used to assign, according to each distance to be analyzed and the minimum distance principle, each of the text encoding vectors to the cluster set to be judged corresponding to the nearest cluster center;
  • the vector average calculation submodule is used to calculate the vector average value for each cluster set to be judged
  • the cluster center update submodule is configured to use the target vector average value as the cluster center of the cluster set to be judged corresponding to the target vector average value, wherein the target vector average value is any a said vector average;
  • the loop control submodule is used to repeatedly execute the step of calculating the distance between each of the text encoding vectors and each of the cluster centers to obtain the distance to be analyzed until each of the clusters to be judged The cluster centers corresponding to the set will no longer change;
  • the encoding vector clustering set determining submodule is configured to use each of the clustering sets to be judged as one encoding vector clustering set.
  • the above-mentioned distance calculation submodule to be analyzed includes: a cosine similarity calculation unit;
  • the cosine similarity calculation unit is configured to use a cosine similarity algorithm to calculate the cosine similarity between each of the text encoding vectors and each of the cluster centers to obtain the distance to be analyzed.
  • the target similarity determination module 700 includes: a similarity calculation unit;
  • the similarity calculation unit is configured to use a cosine similarity algorithm to calculate the cosine similarity between each of the word encoding vectors and each of the clustering topic vectors to obtain the target similarity.
  • the target text topic generation module 800 includes: a target encoding vector clustering set determination submodule, a similarity set determination submodule, a hit similarity set determination submodule and a target text topic determination submodule;
  • the target encoding vector clustering set determination submodule is used to use any one of the encoding vector clustering sets as the target encoding vector clustering set;
  • the similarity set determination submodule is used to sort the target similarities corresponding to the target encoding vector clustering set in ascending order to obtain a similarity set;
  • the hit similarity set determining submodule is used to find a preset number of target similarities from the similarity set by using the method of acquisition from the beginning to obtain a hit similarity set;
  • the target text topic determination submodule is configured to use each word in the vocabulary corresponding to the hit similarity set as the target text topic corresponding to the target coding vector clustering set.
  • an embodiment of the present application also provides a computer device, which may be a server, and its internal structure may be as shown in FIG. 3 .
  • the computer device includes a processor, a memory, a network interface, and a database connected by a system bus, wherein the processor of the computer device is configured to provide computing and control capabilities.
  • the memory of the computer device includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium stores an operating system, computer programs and databases.
  • the memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage medium.
  • the database of the computer device is used to store data used by the method for generating text topics.
  • the network interface of the computer device is used to communicate with an external terminal via a network connection.
  • the method for generating the text topic includes: obtaining a plurality of target texts; performing word segmentation and word deduplication on each of the target texts to obtain a vocabulary; encoding each of the target texts to obtain a text encoding vector; encoding each word in the vocabulary to obtain a word encoding vector; clustering each of the text encoding vectors to obtain a plurality of encoding vector clustering sets; averaging each of the encoding vector clustering sets to obtain a clustering set topic vector; calculating the similarity between each of the word encoding vectors and each of the clustering set topic vectors to obtain a target similarity; and generating, according to the vocabulary and each of the target similarities, a target text topic for each of the encoding vector clustering sets.
  • the semantic information of the text is captured by encoding the text into a vector representation, and the order information between words is retained; after clustering based on the text encoding vectors, the clustering set topic vector of each clustering set is determined, and words are encoded into vector representations, mapping the text encoding vectors, word encoding vectors, and clustering set topic vectors to the same vector space; determining the text topic based on this same vector space improves the accuracy of the text topic, and there is no need for complex operations such as setting a custom stop-word list or converting traditional Chinese characters to simplified ones.
  • the above-mentioned step of acquiring multiple target texts includes: acquiring multiple news texts; performing blank character deletion processing, repeated punctuation deletion processing, and special symbol deletion processing on each of the news texts to obtain the target text.
  • the step of performing blank character deletion processing, repeated punctuation deletion processing, and special symbol deletion processing on each of the news texts to obtain the target text includes: performing Blank character deletion processing, repeated punctuation deletion processing, and special symbol deletion processing to obtain the text to be processed; find each text to be processed whose number of words is greater than the preset number of words from each text to be processed as the target text.
  • the above-mentioned step of clustering each of the text encoding vectors to obtain a plurality of encoding vector clustering sets includes: setting a number of cluster centers equal to the preset number of clusters, and initializing each of the cluster centers; calculating the distance between each of the text encoding vectors and each of the cluster centers to obtain the distances to be analyzed; according to each of the distances to be analyzed, assigning each of the text encoding vectors, by the minimum distance principle, to the cluster set to be judged corresponding to the nearest cluster center; calculating the vector average value for each cluster set to be judged; using the target vector average value as the cluster center of the cluster set to be judged corresponding to the target vector average value, wherein the target vector average value is any one of the vector average values; repeating the step of calculating the distance between each of the text encoding vectors and each of the cluster centers to obtain the distances to be analyzed, until the cluster centers corresponding to each of the cluster sets to be judged no longer change; and using each of the cluster sets to be judged as one encoding vector clustering set.
  • the step of calculating the distance between each of the text encoding vectors and each of the cluster centers to obtain the distance to be analyzed includes: using a cosine similarity algorithm to calculate the distance between each of the text encoding vectors The cosine similarity between the vector and each cluster center is used to obtain the distance to be analyzed.
  • the step of calculating the similarity between each of the word encoding vectors and each of the clustering set topic vectors to obtain the target similarity includes: using the cosine similarity algorithm to calculate the similarity of each of the clustering set The cosine similarity between the word encoding vector and each of the cluster set topic vectors is used to obtain the target similarity.
  • the above-mentioned step of generating target text topics for each of the encoding vector clustering sets according to the vocabulary and each of the target similarities includes: clustering any one of the encoding vector clustering sets As a clustering set of target coded vectors; sort each target similarity corresponding to the target coded vector clustering set in positive order to obtain a similarity set; adopt the method of obtaining from the beginning, find out from the similarity set Get a preset number of target similarities to obtain a hit similarity set; use each word in the vocabulary corresponding to the hit similarity set as the target corresponding to the target coding vector clustering set text subject.
  • An embodiment of the present application also provides a computer-readable storage medium, which is a volatile or non-volatile storage medium storing a computer program; when executed by a processor, the computer program implements a method for generating text topics, comprising the steps of: obtaining a plurality of target texts; performing word segmentation and word deduplication on each target text to obtain a vocabulary; encoding each target text to obtain a text encoding vector; encoding each word in the vocabulary to obtain a word encoding vector; clustering the text encoding vectors to obtain a plurality of encoding vector cluster sets; averaging each encoding vector cluster set to obtain a cluster-set topic vector; computing the similarity between each word encoding vector and each cluster-set topic vector to obtain target similarities; and generating a target text topic for each encoding vector cluster set according to the vocabulary and the target similarities.
  • the method for generating text topics executed above captures the semantic information of the text and preserves the order information between words by encoding the text into a vector representation; after clustering based on the text encoding vectors, the cluster-set topic vector of each cluster set is determined.
  • Determining each cluster set's topic vector and encoding words into vector representations map the text encoding vectors, word encoding vectors, and cluster-set topic vectors into the same vector space; determining text topics in that shared space improves topic accuracy, and no complex pre-modeling operations, such as configuring a custom stop-word list or converting traditional Chinese to simplified Chinese, are needed.
  • the above step of acquiring multiple target texts includes: acquiring multiple news texts; and performing blank character deletion, repeated punctuation deletion, and special symbol deletion on each news text to obtain the target texts.
  • the step of performing blank character deletion, repeated punctuation deletion, and special symbol deletion on each of the news texts to obtain the target texts includes: performing blank character deletion, repeated punctuation deletion, and special symbol deletion on each of the news texts to obtain texts to be processed; and finding, from the texts to be processed, each text whose word count is greater than a preset word count, as the target texts.
  • the step of clustering the text encoding vectors to obtain a plurality of encoding vector cluster sets includes: setting a number of cluster centers equal to a preset number of clusters, and initializing each cluster center; computing the distance between each text encoding vector and each cluster center to obtain distances to be analyzed; according to the distances to be analyzed, assigning each text encoding vector, following the minimum-distance principle, to the cluster set to be judged corresponding to its nearest cluster center; computing the vector average of each cluster set to be judged; taking a target vector average as the cluster center of the cluster set to be judged corresponding to that target vector average, wherein the target vector average is any one of the vector averages; repeating the step of computing the distance between each text encoding vector and each cluster center to obtain distances to be analyzed, until the cluster center of every cluster set to be judged no longer changes; and taking each cluster set to be judged as one encoding vector cluster set.
  • the step of computing the distance between each text encoding vector and each cluster center to obtain the distances to be analyzed includes: using a cosine similarity algorithm to compute the cosine similarity between each text encoding vector and each cluster center to obtain the distances to be analyzed.
  • the step of computing the similarity between each word encoding vector and each cluster-set topic vector to obtain the target similarities includes: using the cosine similarity algorithm to compute the cosine similarity between each word encoding vector and each cluster-set topic vector to obtain the target similarities.
  • the step of generating a target text topic for each encoding vector cluster set according to the vocabulary and the target similarities includes: taking any encoding vector cluster set as a target encoding vector cluster set; sorting the target similarities corresponding to the target encoding vector cluster set in ascending order to obtain a similarity set; taking a preset number of target similarities from the beginning of the similarity set to obtain a hit similarity set; and taking the words in the vocabulary corresponding to the hit similarity set as the target text topic corresponding to the target encoding vector cluster set.
  • Nonvolatile memory can include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory can include random access memory (RAM) or external cache memory.
  • RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link (Synchlink) DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), etc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

This application relates to the field of artificial intelligence and discloses a method, apparatus, device, and storage medium for generating text topics. The method includes: performing word segmentation and word deduplication on each target text to obtain a vocabulary; encoding each target text to obtain a text encoding vector; encoding each word in the vocabulary to obtain a word encoding vector; clustering the text encoding vectors to obtain multiple encoding vector cluster sets; averaging each encoding vector cluster set to obtain a cluster-set topic vector; computing the similarity between each word encoding vector and each cluster-set topic vector to obtain target similarities; and generating a target text topic for each encoding vector cluster set according to the vocabulary and the target similarities. The method captures the semantic information of the text, preserves the order information between words, improves the accuracy of the text topics, and requires no complex pre-modeling operations such as configuring a custom stop-word list or converting traditional Chinese to simplified Chinese.

Description

Method, apparatus, device, and storage medium for generating text topics
This application claims priority to Chinese patent application No. 202210033712.8, filed with the Chinese Patent Office on January 12, 2022 and entitled "Method, apparatus, device, and storage medium for generating text topics", the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of artificial intelligence, and in particular to a method, apparatus, device, and storage medium for generating text topics.
Background Art
In the field of natural language processing (NLP), searching and summarizing large volumes of text has long been a common problem. Topic modeling is often used when it is impractical to read and sort massive text collections manually. Given a corpus of many texts, a topic model discovers the latent semantic structures and topics present in the texts; the topics can then be used to produce high-level summaries of large text collections, search for documents of interest, group similar documents, and so on.
At present, the most widely used topic methods are the LDA (Latent Dirichlet Allocation) model and the PLSA (Probabilistic Latent Semantic Analysis) model. The inventor found that, although they are popular and widely used in NLP, obtaining the best results with them usually requires operations before modeling such as setting the number of topics, configuring a custom stop-word list, and converting traditional Chinese to simplified Chinese; moreover, these methods ignore word order and semantics, so the accuracy of the determined topics is low.
Technical Problem
The main purpose of this application is to provide a method, apparatus, device, and storage medium for generating text topics, aiming to solve the technical problem that the prior-art LDA and PLSA models usually require complex pre-modeling operations such as configuring a custom stop-word list and converting traditional Chinese to simplified Chinese, and ignore word order and semantics, resulting in low accuracy of the determined topics.
Technical Solution
This application proposes a method for generating text topics, the method comprising:
obtaining multiple target texts;
performing word segmentation and word deduplication on each of the target texts to obtain a vocabulary;
encoding each of the target texts to obtain a text encoding vector;
encoding each word in the vocabulary to obtain a word encoding vector;
clustering the text encoding vectors to obtain multiple encoding vector cluster sets;
averaging each of the encoding vector cluster sets to obtain a cluster-set topic vector;
computing the similarity between each word encoding vector and each cluster-set topic vector to obtain target similarities;
generating a target text topic for each encoding vector cluster set according to the vocabulary and the target similarities.
This application also proposes an apparatus for generating text topics, the apparatus comprising:
a target text obtaining module, configured to obtain multiple target texts;
a vocabulary generation module, configured to perform word segmentation and word deduplication on each of the target texts to obtain a vocabulary;
a text encoding vector determination module, configured to encode each of the target texts to obtain a text encoding vector;
a word encoding vector determination module, configured to encode each word in the vocabulary to obtain a word encoding vector;
a clustering module, configured to cluster the text encoding vectors to obtain multiple encoding vector cluster sets;
a cluster-set topic vector determination module, configured to average each of the encoding vector cluster sets to obtain a cluster-set topic vector;
a target similarity determination module, configured to compute the similarity between each word encoding vector and each cluster-set topic vector to obtain target similarities;
a target text topic generation module, configured to generate a target text topic for each encoding vector cluster set according to the vocabulary and the target similarities.
This application also proposes a computer device, comprising a memory and a processor, the memory storing a computer program; when executing the computer program, the processor implements a method for generating text topics, wherein the method comprises: obtaining multiple target texts; performing word segmentation and word deduplication on each of the target texts to obtain a vocabulary; encoding each of the target texts to obtain a text encoding vector; encoding each word in the vocabulary to obtain a word encoding vector; clustering the text encoding vectors to obtain multiple encoding vector cluster sets; averaging each of the encoding vector cluster sets to obtain a cluster-set topic vector; computing the similarity between each word encoding vector and each cluster-set topic vector to obtain target similarities; and generating a target text topic for each encoding vector cluster set according to the vocabulary and the target similarities.
This application also proposes a computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements a method for generating text topics, wherein the method comprises: obtaining multiple target texts; performing word segmentation and word deduplication on each of the target texts to obtain a vocabulary; encoding each of the target texts to obtain a text encoding vector; encoding each word in the vocabulary to obtain a word encoding vector; clustering the text encoding vectors to obtain multiple encoding vector cluster sets; averaging each of the encoding vector cluster sets to obtain a cluster-set topic vector; computing the similarity between each word encoding vector and each cluster-set topic vector to obtain target similarities; and generating a target text topic for each encoding vector cluster set according to the vocabulary and the target similarities.
Beneficial Effects
In the method, apparatus, device, and storage medium for generating text topics of this application, the method obtains multiple target texts; performs word segmentation and word deduplication on each target text to obtain a vocabulary; encodes each target text to obtain a text encoding vector; encodes each word in the vocabulary to obtain a word encoding vector; clusters the text encoding vectors to obtain multiple encoding vector cluster sets; averages each encoding vector cluster set to obtain a cluster-set topic vector; computes the similarity between each word encoding vector and each cluster-set topic vector to obtain target similarities; and generates a target text topic for each encoding vector cluster set according to the vocabulary and the target similarities. By encoding texts into vector representations, the semantic information of the text is captured and the order information between words is preserved; determining each cluster set's topic vector after clustering the text encoding vectors, and encoding words into vector representations, map the text encoding vectors, word encoding vectors, and cluster-set topic vectors into the same vector space; determining the text topics in this shared vector space improves topic accuracy; and no complex pre-modeling operations, such as configuring a custom stop-word list or converting traditional Chinese to simplified Chinese, are needed.
Brief Description of the Drawings
Fig. 1 is a schematic flowchart of a method for generating text topics according to an embodiment of this application;
Fig. 2 is a schematic structural block diagram of an apparatus for generating text topics according to an embodiment of this application;
Fig. 3 is a schematic structural block diagram of a computer device according to an embodiment of this application.
Best Mode for Carrying Out the Invention
Referring to Fig. 1, an embodiment of this application provides a method for generating text topics, relating to the field of artificial intelligence. The method includes:
S1: obtaining multiple target texts;
S2: performing word segmentation and word deduplication on each of the target texts to obtain a vocabulary;
S3: encoding each of the target texts to obtain a text encoding vector;
S4: encoding each word in the vocabulary to obtain a word encoding vector;
S5: clustering the text encoding vectors to obtain multiple encoding vector cluster sets;
S6: averaging each of the encoding vector cluster sets to obtain a cluster-set topic vector;
S7: computing the similarity between each word encoding vector and each cluster-set topic vector to obtain target similarities;
S8: generating a target text topic for each encoding vector cluster set according to the vocabulary and the target similarities.
By encoding texts into vector representations, this embodiment captures the semantic information of the text and preserves the order information between words; determining each cluster set's topic vector after clustering the text encoding vectors, and encoding words into vector representations, map the text encoding vectors, word encoding vectors, and cluster-set topic vectors into the same vector space; determining the text topics in this shared vector space improves topic accuracy; and no complex pre-modeling operations, such as configuring a custom stop-word list or converting traditional Chinese to simplified Chinese, are needed.
For S1, the multiple target texts may be obtained from user input, from a database, or from a third-party application system.
A target text is a text containing one or more sentences.
For S2, the target texts are merged into one document to obtain a document to be processed; the document to be processed is segmented into words to obtain a word set; the words in the word set are deduplicated, and the deduplicated word set is taken as the vocabulary.
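The merge, segment, and deduplicate step for S2 can be sketched as follows. The patent does not name a segmenter (for Chinese text, a tool such as jieba would typically supply `tokenize`), so a simple whitespace tokenizer stands in here as an assumption:

```python
def build_vocabulary(texts, tokenize=None):
    """Segment every text into words and deduplicate, preserving the
    order in which words are first seen; the result is the vocabulary."""
    # Whitespace tokenization is a stand-in; a real Chinese pipeline
    # would plug in a word segmenter here.
    tokenize = tokenize or (lambda t: t.split())
    seen, vocab = set(), []
    for text in texts:
        for word in tokenize(text):
            if word not in seen:
                seen.add(word)
                vocab.append(word)
    return vocab

print(build_vocabulary(["the cat sat", "the dog sat down"]))
# → ['the', 'cat', 'sat', 'dog', 'down']
```

The order-preserving deduplication keeps a stable word-to-index mapping, which later lets similarity rows be matched back to vocabulary words.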
For S3, each target text is encoded, and the encoded data is taken as a text encoding vector; that is, text encoding vectors correspond one-to-one to target texts.
Optionally, each target text is input into a preset encoding model for encoding, and the encoded data is taken as the text encoding vector.
The encoding model is a model obtained by training a neural network.
Optionally, the encoding model is a model trained on the basis of the BERT model.
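The patent only specifies that the encoder is a trained neural model (optionally BERT-based) and that texts and vocabulary words are encoded by the same model. As a self-contained stand-in, the sketch below averages deterministic per-token vectors; `encode` is a toy placeholder, not the patent's encoder, and a real system would replace it with a trained model's output:

```python
import zlib

import numpy as np

def encode(text, dim=8):
    """Toy stand-in for a neural text encoder: the mean of deterministic
    per-token vectors (seeded by a CRC32 of the token). Texts and
    vocabulary words pass through the same function, so both land in
    the same vector space, as steps S3 and S4 require."""
    tokens = text.split() or [text]
    vecs = [
        np.random.default_rng(zlib.crc32(t.encode("utf-8"))).standard_normal(dim)
        for t in tokens
    ]
    return np.mean(vecs, axis=0)
```

Because the per-token vectors are seeded deterministically, the same input always yields the same vector, which is the property the later clustering and similarity steps rely on.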
For S4, each word in the vocabulary is encoded, and the encoded data is taken as a word encoding vector; that is, word encoding vectors correspond one-to-one to the words in the vocabulary.
Optionally, each word in the vocabulary is input into the encoding model for encoding, and the encoded data is taken as the word encoding vector.
That is, steps S3 and S4 use the same encoding model.
For S5, the text encoding vectors are clustered, and each resulting set is taken as one encoding vector cluster set.
Optionally, the K-means clustering algorithm and a preset number of clusters are used to cluster the text encoding vectors, and each resulting set is taken as one encoding vector cluster set.
The K-Means clustering algorithm is the standard K-means clustering algorithm.
For S6, the text encoding vectors in each encoding vector cluster set are averaged, and the computed vector is taken as one cluster-set topic vector. That is, cluster-set topic vectors correspond one-to-one to encoding vector cluster sets.
For S7, the similarity between each word encoding vector and each cluster-set topic vector is computed, and the computed similarity is taken as a target similarity. That is, a target similarity measures the similarity between one word's encoding vector and one cluster-set topic vector.
For S8, from the target similarities, the one or more target similarities most similar to each encoding vector cluster set are found, and the words in the vocabulary corresponding to the target similarities found for the same encoding vector cluster set are taken as the target text topic of that encoding vector cluster set.
It can be understood that the target text topic is the text topic of the target texts corresponding to that encoding vector cluster set.
In one embodiment, the above step of obtaining multiple target texts includes:
S11: obtaining multiple news texts;
S12: performing blank character deletion, repeated punctuation deletion, and special symbol deletion on each news text to obtain the target texts.
This embodiment takes news texts after blank character deletion, repeated punctuation deletion, and special symbol deletion as target texts, so that the target text topics determined by this application can be used for news classification; the data cleaning reduces noise and improves the accuracy of the determined target text topics.
For S11, the multiple news texts may be obtained from user input, from a database, or from a third-party application system.
A news text is the text of one news article and includes the news headline, the news summary, and the news body.
For S12, preset regular expressions are used to perform blank character deletion, repeated punctuation deletion, and special symbol deletion on each news text, and each processed news text is taken as one target text.
In one embodiment, the above step of performing blank character deletion, repeated punctuation deletion, and special symbol deletion on each news text to obtain the target texts includes:
S121: performing blank character deletion, repeated punctuation deletion, and special symbol deletion on each news text to obtain texts to be processed;
S122: finding, from the texts to be processed, each text whose word count is greater than a preset word count, as the target texts.
By first performing blank character deletion, repeated punctuation deletion, and special symbol deletion, and then discarding texts whose word count is at or below the preset word count, this embodiment reduces noise and prevents very short texts from degrading the accuracy of the determined text topics.
For S121, preset regular expressions are used to perform blank character deletion, repeated punctuation deletion, and special symbol deletion on each news text, and each processed news text is taken as one text to be processed.
For S122, each text to be processed whose word count is greater than the preset word count is found from the texts to be processed as a target text, so that texts at or below the preset word count are discarded.
Optionally, the preset word count is set to 1000.
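Steps S121–S122 can be sketched as below. The patent names only the three deletion operations and the preset word count (1000 by default); the specific regular expressions are illustrative assumptions, and `min_chars` is lowered in the example purely for brevity:

```python
import re

def clean_and_filter(news_texts, min_chars=1000):
    """Delete blank characters, collapse repeated punctuation, delete
    special symbols, then keep only texts longer than min_chars."""
    cleaned = []
    for text in news_texts:
        text = re.sub(r"\s+", "", text)                        # blank characters
        text = re.sub(r"([,.!?;:、,。!?;:])\1+", r"\1", text)  # repeated punctuation
        text = re.sub(r"[#*@^~|$]+", "", text)                 # special symbols
        cleaned.append(text)
    return [t for t in cleaned if len(t) > min_chars]

print(clean_and_filter(["breaking   news!!!", "ok"], min_chars=5))
# → ['breakingnews!']
```

The length filter runs after cleaning, so whitespace and stripped symbols do not count toward the threshold.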
In one embodiment, the above step of clustering the text encoding vectors to obtain multiple encoding vector cluster sets includes:
S51: setting a number of cluster centers equal to the preset number of clusters, and initializing each cluster center;
S52: computing the distance between each text encoding vector and each cluster center to obtain distances to be analyzed;
S53: according to the distances to be analyzed, assigning each text encoding vector, following the minimum-distance principle, to the cluster set to be judged corresponding to its nearest cluster center;
S54: computing the vector average of each cluster set to be judged;
S55: taking a target vector average as the cluster center of the cluster set to be judged corresponding to that target vector average, where the target vector average is any one of the vector averages;
S56: repeating the step of computing the distance between each text encoding vector and each cluster center to obtain distances to be analyzed, until the cluster center of every cluster set to be judged no longer changes;
S57: taking each cluster set to be judged as one encoding vector cluster set.
This embodiment uses the K-means clustering algorithm and a preset number of clusters to cluster text encoding vectors that carry the text's semantic information and word-order information, improving the accuracy of the resulting encoding vector cluster sets.
For S51, a number of cluster centers equal to the preset number of clusters is set; that is, the number of cluster centers equals the number of clusters.
The method of initializing each cluster center is not described in detail here.
For S52, the distance between each text encoding vector and each cluster center is computed; that is, the number of distances to be analyzed equals the product of the number of text encoding vectors and the number of cluster centers.
For S53, any text encoding vector is taken as an encoding vector to be processed; among the distances to be analyzed corresponding to it, the smallest is found as the target distance; the encoding vector to be processed is then assigned to the cluster set to be judged corresponding to the cluster center of the target distance.
For S54, the vector average of the text encoding vectors in each cluster set to be judged is computed.
For S55, the target vector average is taken as the cluster center of the corresponding cluster set to be judged, thereby updating the cluster centers.
For S56, the step of computing the distance between each text encoding vector and each cluster center is repeated — that is, steps S52 to S56 are repeated — until the cluster center of every cluster set to be judged no longer changes. When no cluster center changes, the optimal clustering has been reached.
For S57, each cluster set to be judged is taken as one encoding vector cluster set, thereby clustering the text encoding vectors that carry the text's semantic information and word-order information.
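Steps S51–S57 amount to K-means with cosine similarity as the assignment metric. A minimal sketch follows; random data points serve as the initial centers (the patent leaves the initialization method open), and vectors are unit-normalized so a dot product is the cosine similarity:

```python
import numpy as np

def cosine_kmeans(vectors, k, max_iter=100, seed=0):
    """K-means over unit-normalized vectors: assign each vector to the
    center with the highest cosine similarity (S52-S53), recompute each
    center as its cluster's mean (S54-S55), and stop when no center
    changes (S56)."""
    X = np.asarray(vectors, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(max_iter):
        labels = (X @ centers.T).argmax(axis=1)   # most similar center per vector
        new_centers = centers.copy()
        for i in range(k):
            members = X[labels == i]
            if len(members):                      # leave empty clusters' centers alone
                mean = members.mean(axis=0)
                new_centers[i] = mean / np.linalg.norm(mean)
        if np.allclose(new_centers, centers):     # centers no longer change
            break
        centers = new_centers
    return labels, centers
```

Re-normalizing each new center keeps the dot-product assignment equivalent to cosine similarity on the next iteration.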
In one embodiment, the above step of computing the distance between each text encoding vector and each cluster center to obtain distances to be analyzed includes:
S521: using the cosine similarity algorithm to compute the cosine similarity between each text encoding vector and each cluster center to obtain the distances to be analyzed.
This embodiment uses cosine similarity as the vector metric of the clustering algorithm, which measures the distance between vectors well and improves clustering accuracy.
For S521, the cosine similarity algorithm is used to compute the cosine similarity between each text encoding vector and each cluster center, and the computed cosine similarity is taken as the distance to be analyzed.
In one embodiment, the above step of computing the similarity between each word encoding vector and each cluster-set topic vector to obtain target similarities includes:
S71: using the cosine similarity algorithm to compute the cosine similarity between each word encoding vector and each cluster-set topic vector to obtain the target similarities.
This embodiment uses cosine similarity as the metric of the distance between word encoding vectors and cluster-set topic vectors, which measures the distance between vectors well and improves the accuracy of the target similarities.
For S71, the cosine similarity algorithm is used to compute the cosine similarity between each word encoding vector and each cluster-set topic vector, and the computed similarity is taken as the target similarity.
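Step S71 can be sketched as a single matrix product over unit-normalized vectors:

```python
import numpy as np

def target_similarities(word_vecs, topic_vecs):
    """Cosine similarity between every word encoding vector and every
    cluster-set topic vector; entry [i, j] is the target similarity of
    word i with cluster set j."""
    W = np.asarray(word_vecs, dtype=float)
    T = np.asarray(topic_vecs, dtype=float)
    W = W / np.linalg.norm(W, axis=1, keepdims=True)
    T = T / np.linalg.norm(T, axis=1, keepdims=True)
    return W @ T.T  # cosine similarity reduces to a dot product of unit vectors

sims = target_similarities([[1, 0], [0, 2]], [[3, 0]])
# first word is parallel to the topic vector (similarity 1.0),
# second word is orthogonal to it (similarity 0.0)
```

Because cosine similarity ignores vector magnitude, the scale of the word and topic vectors (the `2` and `3` above) does not affect the result.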
In one embodiment, the above step of generating a target text topic for each encoding vector cluster set according to the vocabulary and the target similarities includes:
S81: taking any encoding vector cluster set as a target encoding vector cluster set;
S82: sorting the target similarities corresponding to the target encoding vector cluster set in ascending order to obtain a similarity set;
S83: taking a preset number of target similarities from the beginning of the similarity set to obtain a hit similarity set;
S84: taking the words in the vocabulary corresponding to the hit similarity set as the target text topic corresponding to the target encoding vector cluster set.
This embodiment takes the words corresponding to the preset number of smallest target similarities as the target text topic, determining the text topic in the same vector space and improving its accuracy.
For S82, the target similarities corresponding to the target encoding vector cluster set are sorted in ascending order, and the sorted target similarities are taken as the similarity set, so that the target similarities in the similarity set are arranged from smallest to largest.
For S83, taking from the beginning means obtaining the smallest target similarity first, which provides the basis for finding the most similar words.
Specifically, a preset number of target similarities are taken from the beginning of the similarity set, and the found target similarities are taken as the hit similarity set.
It can be understood that when the number of target similarities in the similarity set is smaller than the preset number, the number of target similarities in the hit similarity set is smaller than the preset number.
Optionally, the step of taking a preset number of target similarities from the beginning of the similarity set to obtain a hit similarity set includes: dividing the similarity set by a preset similarity threshold to obtain a first set and a second set; and taking a preset number of target similarities from the beginning of the first set to obtain the hit similarity set. The target similarities in the hit similarity set are thus all smaller than the preset similarity threshold, further improving the accuracy of the text topics.
That is, the target similarities in the first set are all smaller than the preset similarity threshold, and the target similarities in the second set are all greater than or equal to the preset similarity threshold.
For S84, the words in the vocabulary corresponding to the hit similarity set are taken as the target text topic corresponding to the target encoding vector cluster set, thereby determining the text topic in the same vector space and improving its accuracy.
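Steps S81–S84 for one target cluster set can be sketched as below. The code mirrors the patent exactly: ascending sort, take from the beginning (i.e., the smallest target-similarity values are treated as the best matches), with the optional threshold variant of S83 included as a flagged option:

```python
def topic_words(vocab, similarities, preset_count=3, threshold=None):
    """Sort (similarity, word) pairs in ascending order, optionally keep
    only values below a preset similarity threshold (the 'first set'),
    then take the first preset_count entries; the matching vocabulary
    words are the target text topic for this cluster set."""
    pairs = sorted(zip(similarities, vocab))
    if threshold is not None:
        pairs = [(s, w) for s, w in pairs if s < threshold]
    return [w for _, w in pairs[:preset_count]]

print(topic_words(["economy", "sports", "weather"], [0.7, 0.1, 0.4], preset_count=2))
# → ['sports', 'weather']
```

If fewer than `preset_count` similarities survive the threshold, the result is correspondingly shorter, matching the patent's remark about an undersized hit similarity set.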
Referring to Fig. 2, this application also proposes an apparatus for generating text topics, the apparatus comprising:
a target text obtaining module 100, configured to obtain multiple target texts;
a vocabulary generation module 200, configured to perform word segmentation and word deduplication on each of the target texts to obtain a vocabulary;
a text encoding vector determination module 300, configured to encode each of the target texts to obtain a text encoding vector;
a word encoding vector determination module 400, configured to encode each word in the vocabulary to obtain a word encoding vector;
a clustering module 500, configured to cluster the text encoding vectors to obtain multiple encoding vector cluster sets;
a cluster-set topic vector determination module 600, configured to average each of the encoding vector cluster sets to obtain a cluster-set topic vector;
a target similarity determination module 700, configured to compute the similarity between each word encoding vector and each cluster-set topic vector to obtain target similarities;
a target text topic generation module 800, configured to generate a target text topic for each encoding vector cluster set according to the vocabulary and the target similarities.
By encoding texts into vector representations, this embodiment captures the semantic information of the text and preserves the order information between words; determining each cluster set's topic vector after clustering the text encoding vectors, and encoding words into vector representations, map the text encoding vectors, word encoding vectors, and cluster-set topic vectors into the same vector space; determining the text topics in this shared vector space improves topic accuracy; and no complex pre-modeling operations, such as configuring a custom stop-word list or converting traditional Chinese to simplified Chinese, are needed.
In one embodiment, the target text obtaining module 100 includes a news text obtaining submodule and a target text determination submodule;
the news text obtaining submodule is configured to obtain multiple news texts;
the target text determination submodule is configured to perform blank character deletion, repeated punctuation deletion, and special symbol deletion on each news text to obtain the target texts.
In one embodiment, the target text determination submodule includes a text-to-be-processed determination unit and a filtering unit;
the text-to-be-processed determination unit is configured to perform blank character deletion, repeated punctuation deletion, and special symbol deletion on each news text to obtain texts to be processed;
the filtering unit is configured to find, from the texts to be processed, each text whose word count is greater than a preset word count, as the target texts.
In one embodiment, the clustering module 500 includes: a cluster center setting submodule, a distance-to-be-analyzed computation submodule, a cluster-set-to-be-judged determination submodule, a vector average computation submodule, a cluster center update submodule, a loop control submodule, and an encoding vector cluster set determination submodule;
the cluster center setting submodule is configured to set a number of cluster centers equal to the preset number of clusters and to initialize each cluster center;
the distance-to-be-analyzed computation submodule is configured to compute the distance between each text encoding vector and each cluster center to obtain distances to be analyzed;
the cluster-set-to-be-judged determination submodule is configured to assign, according to the distances to be analyzed and following the minimum-distance principle, each text encoding vector to the cluster set to be judged corresponding to its nearest cluster center;
the vector average computation submodule is configured to compute the vector average of each cluster set to be judged;
the cluster center update submodule is configured to take a target vector average as the cluster center of the cluster set to be judged corresponding to that target vector average, where the target vector average is any one of the vector averages;
the loop control submodule is configured to repeat the step of computing the distance between each text encoding vector and each cluster center to obtain distances to be analyzed, until the cluster center of every cluster set to be judged no longer changes;
the encoding vector cluster set determination submodule is configured to take each cluster set to be judged as one encoding vector cluster set.
In one embodiment, the distance-to-be-analyzed computation submodule includes a cosine similarity computation unit;
the cosine similarity computation unit is configured to use the cosine similarity algorithm to compute the cosine similarity between each text encoding vector and each cluster center to obtain the distances to be analyzed.
In one embodiment, the target similarity determination module 700 includes a similarity computation unit;
the similarity computation unit is configured to use the cosine similarity algorithm to compute the cosine similarity between each word encoding vector and each cluster-set topic vector to obtain the target similarities.
In one embodiment, the target text topic generation module 800 includes: a target encoding vector cluster set determination submodule, a similarity set determination submodule, a hit similarity set determination submodule, and a target text topic determination submodule;
the target encoding vector cluster set determination submodule is configured to take any encoding vector cluster set as a target encoding vector cluster set;
the similarity set determination submodule is configured to sort the target similarities corresponding to the target encoding vector cluster set in ascending order to obtain a similarity set;
the hit similarity set determination submodule is configured to take a preset number of target similarities from the beginning of the similarity set to obtain a hit similarity set;
the target text topic determination submodule is configured to take the words in the vocabulary corresponding to the hit similarity set as the target text topic corresponding to the target encoding vector cluster set.
Referring to Fig. 3, an embodiment of this application also provides a computer device, which may be a server whose internal structure may be as shown in Fig. 3. The computer device includes a processor, a memory, a network interface, and a database connected via a system bus. The processor of the computer device provides computing and control capability. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for running the operating system and the computer program in the non-volatile storage medium. The database of the computer device stores data such as that used by the method for generating text topics. The network interface of the computer device communicates with external terminals over a network connection. When executed by the processor, the computer program implements a method for generating text topics. The method for generating text topics includes: obtaining multiple target texts; performing word segmentation and word deduplication on each of the target texts to obtain a vocabulary; encoding each of the target texts to obtain a text encoding vector; encoding each word in the vocabulary to obtain a word encoding vector; clustering the text encoding vectors to obtain multiple encoding vector cluster sets; averaging each of the encoding vector cluster sets to obtain a cluster-set topic vector; computing the similarity between each word encoding vector and each cluster-set topic vector to obtain target similarities; and generating a target text topic for each encoding vector cluster set according to the vocabulary and the target similarities.
By encoding texts into vector representations, this embodiment captures the semantic information of the text and preserves the order information between words; determining each cluster set's topic vector after clustering the text encoding vectors, and encoding words into vector representations, map the text encoding vectors, word encoding vectors, and cluster-set topic vectors into the same vector space; determining the text topics in this shared vector space improves topic accuracy; and no complex pre-modeling operations, such as configuring a custom stop-word list or converting traditional Chinese to simplified Chinese, are needed.
In one embodiment, the above step of obtaining multiple target texts includes: obtaining multiple news texts; and performing blank character deletion, repeated punctuation deletion, and special symbol deletion on each news text to obtain the target texts.
In one embodiment, the above step of performing blank character deletion, repeated punctuation deletion, and special symbol deletion on each news text to obtain the target texts includes: performing blank character deletion, repeated punctuation deletion, and special symbol deletion on each news text to obtain texts to be processed; and finding, from the texts to be processed, each text whose word count is greater than a preset word count, as the target texts.
In one embodiment, the above step of clustering the text encoding vectors to obtain multiple encoding vector cluster sets includes: setting a number of cluster centers equal to the preset number of clusters, and initializing each cluster center; computing the distance between each text encoding vector and each cluster center to obtain distances to be analyzed; according to the distances to be analyzed, assigning each text encoding vector, following the minimum-distance principle, to the cluster set to be judged corresponding to its nearest cluster center; computing the vector average of each cluster set to be judged; taking a target vector average as the cluster center of the cluster set to be judged corresponding to that target vector average, where the target vector average is any one of the vector averages; repeating the step of computing the distance between each text encoding vector and each cluster center to obtain distances to be analyzed, until the cluster center of every cluster set to be judged no longer changes; and taking each cluster set to be judged as one encoding vector cluster set.
In one embodiment, the above step of computing the distance between each text encoding vector and each cluster center to obtain distances to be analyzed includes: using the cosine similarity algorithm to compute the cosine similarity between each text encoding vector and each cluster center to obtain the distances to be analyzed.
In one embodiment, the above step of computing the similarity between each word encoding vector and each cluster-set topic vector to obtain target similarities includes: using the cosine similarity algorithm to compute the cosine similarity between each word encoding vector and each cluster-set topic vector to obtain the target similarities.
In one embodiment, the above step of generating a target text topic for each encoding vector cluster set according to the vocabulary and the target similarities includes: taking any encoding vector cluster set as a target encoding vector cluster set; sorting the target similarities corresponding to the target encoding vector cluster set in ascending order to obtain a similarity set; taking a preset number of target similarities from the beginning of the similarity set to obtain a hit similarity set; and taking the words in the vocabulary corresponding to the hit similarity set as the target text topic corresponding to the target encoding vector cluster set.
An embodiment of this application also provides a computer-readable storage medium, which is a volatile or non-volatile storage medium storing a computer program; when executed by a processor, the computer program implements a method for generating text topics, comprising the steps of: obtaining multiple target texts; performing word segmentation and word deduplication on each of the target texts to obtain a vocabulary; encoding each of the target texts to obtain a text encoding vector; encoding each word in the vocabulary to obtain a word encoding vector; clustering the text encoding vectors to obtain multiple encoding vector cluster sets; averaging each of the encoding vector cluster sets to obtain a cluster-set topic vector; computing the similarity between each word encoding vector and each cluster-set topic vector to obtain target similarities; and generating a target text topic for each encoding vector cluster set according to the vocabulary and the target similarities.
By encoding texts into vector representations, the executed method for generating text topics captures the semantic information of the text and preserves the order information between words; determining each cluster set's topic vector after clustering the text encoding vectors, and encoding words into vector representations, map the text encoding vectors, word encoding vectors, and cluster-set topic vectors into the same vector space; determining the text topics in this shared vector space improves topic accuracy; and no complex pre-modeling operations, such as configuring a custom stop-word list or converting traditional Chinese to simplified Chinese, are needed.
In one embodiment, the above step of obtaining multiple target texts includes: obtaining multiple news texts; and performing blank character deletion, repeated punctuation deletion, and special symbol deletion on each news text to obtain the target texts.
In one embodiment, the above step of performing blank character deletion, repeated punctuation deletion, and special symbol deletion on each news text to obtain the target texts includes: performing blank character deletion, repeated punctuation deletion, and special symbol deletion on each news text to obtain texts to be processed; and finding, from the texts to be processed, each text whose word count is greater than a preset word count, as the target texts.
In one embodiment, the above step of clustering the text encoding vectors to obtain multiple encoding vector cluster sets includes: setting a number of cluster centers equal to the preset number of clusters, and initializing each cluster center; computing the distance between each text encoding vector and each cluster center to obtain distances to be analyzed; according to the distances to be analyzed, assigning each text encoding vector, following the minimum-distance principle, to the cluster set to be judged corresponding to its nearest cluster center; computing the vector average of each cluster set to be judged; taking a target vector average as the cluster center of the cluster set to be judged corresponding to that target vector average, where the target vector average is any one of the vector averages; repeating the step of computing the distance between each text encoding vector and each cluster center to obtain distances to be analyzed, until the cluster center of every cluster set to be judged no longer changes; and taking each cluster set to be judged as one encoding vector cluster set.
In one embodiment, the above step of computing the distance between each text encoding vector and each cluster center to obtain distances to be analyzed includes: using the cosine similarity algorithm to compute the cosine similarity between each text encoding vector and each cluster center to obtain the distances to be analyzed.
In one embodiment, the above step of computing the similarity between each word encoding vector and each cluster-set topic vector to obtain target similarities includes: using the cosine similarity algorithm to compute the cosine similarity between each word encoding vector and each cluster-set topic vector to obtain the target similarities.
In one embodiment, the above step of generating a target text topic for each encoding vector cluster set according to the vocabulary and the target similarities includes: taking any encoding vector cluster set as a target encoding vector cluster set; sorting the target similarities corresponding to the target encoding vector cluster set in ascending order to obtain a similarity set; taking a preset number of target similarities from the beginning of the similarity set to obtain a hit similarity set; and taking the words in the vocabulary corresponding to the hit similarity set as the target text topic corresponding to the target encoding vector cluster set.
Those of ordinary skill in the art can understand that all or part of the processes in the above embodiment methods can be completed by instructing the relevant hardware through a computer program, which may be stored in a non-volatile computer-readable storage medium; when executed, the computer program may include the processes of the embodiments of the above methods. Any reference to memory, storage, database, or other media provided by this application and used in the embodiments may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link (Synchlink) DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), etc.

Claims (20)

  1. A method for generating text topics, wherein the method comprises:
    obtaining a plurality of target texts;
    performing word segmentation and word deduplication on each of the target texts to obtain a vocabulary;
    encoding each of the target texts to obtain a text encoding vector;
    encoding each word in the vocabulary to obtain a word encoding vector;
    clustering the text encoding vectors to obtain a plurality of encoding vector cluster sets;
    averaging each of the encoding vector cluster sets to obtain a cluster-set topic vector;
    computing the similarity between each word encoding vector and each cluster-set topic vector to obtain target similarities;
    generating a target text topic for each encoding vector cluster set according to the vocabulary and the target similarities.
  2. The method for generating text topics according to claim 1, wherein the step of obtaining a plurality of target texts comprises:
    obtaining a plurality of news texts;
    performing blank character deletion, repeated punctuation deletion, and special symbol deletion on each of the news texts to obtain the target texts.
  3. The method for generating text topics according to claim 2, wherein the step of performing blank character deletion, repeated punctuation deletion, and special symbol deletion on each of the news texts to obtain the target texts comprises:
    performing blank character deletion, repeated punctuation deletion, and special symbol deletion on each of the news texts to obtain texts to be processed;
    finding, from the texts to be processed, each text whose word count is greater than a preset word count, as the target texts.
  4. The method for generating text topics according to claim 1, wherein the step of clustering the text encoding vectors to obtain a plurality of encoding vector cluster sets comprises:
    setting a number of cluster centers equal to a preset number of clusters, and initializing each of the cluster centers;
    computing the distance between each text encoding vector and each cluster center to obtain distances to be analyzed;
    according to the distances to be analyzed, assigning each text encoding vector, following the minimum-distance principle, to the cluster set to be judged corresponding to its nearest cluster center;
    computing the vector average of each cluster set to be judged;
    taking a target vector average as the cluster center of the cluster set to be judged corresponding to the target vector average, wherein the target vector average is any one of the vector averages;
    repeating the step of computing the distance between each text encoding vector and each cluster center to obtain distances to be analyzed, until the cluster center of every cluster set to be judged no longer changes;
    taking each cluster set to be judged as one encoding vector cluster set.
  5. The method for generating text topics according to claim 4, wherein the step of computing the distance between each text encoding vector and each cluster center to obtain distances to be analyzed comprises:
    using a cosine similarity algorithm to compute the cosine similarity between each text encoding vector and each cluster center to obtain the distances to be analyzed.
  6. The method for generating text topics according to claim 1, wherein the step of computing the similarity between each word encoding vector and each cluster-set topic vector to obtain target similarities comprises:
    using a cosine similarity algorithm to compute the cosine similarity between each word encoding vector and each cluster-set topic vector to obtain the target similarities.
  7. The method for generating text topics according to claim 1, wherein the step of generating a target text topic for each encoding vector cluster set according to the vocabulary and the target similarities comprises:
    taking any one of the encoding vector cluster sets as a target encoding vector cluster set;
    sorting the target similarities corresponding to the target encoding vector cluster set in ascending order to obtain a similarity set;
    taking a preset number of target similarities from the beginning of the similarity set to obtain a hit similarity set;
    taking the words in the vocabulary corresponding to the hit similarity set as the target text topic corresponding to the target encoding vector cluster set.
  8. An apparatus for generating text topics, wherein the apparatus comprises:
    a target text obtaining module, configured to obtain a plurality of target texts;
    a vocabulary generation module, configured to perform word segmentation and word deduplication on each of the target texts to obtain a vocabulary;
    a text encoding vector determination module, configured to encode each of the target texts to obtain a text encoding vector;
    a word encoding vector determination module, configured to encode each word in the vocabulary to obtain a word encoding vector;
    a clustering module, configured to cluster the text encoding vectors to obtain a plurality of encoding vector cluster sets;
    a cluster-set topic vector determination module, configured to average each of the encoding vector cluster sets to obtain a cluster-set topic vector;
    a target similarity determination module, configured to compute the similarity between each word encoding vector and each cluster-set topic vector to obtain target similarities;
    a target text topic generation module, configured to generate a target text topic for each encoding vector cluster set according to the vocabulary and the target similarities.
  9. A computer device, comprising a memory and a processor, the memory storing a computer program, wherein the processor, when executing the computer program, implements a method for generating text topics;
    wherein the method for generating text topics comprises:
    obtaining a plurality of target texts;
    performing word segmentation and word deduplication on each of the target texts to obtain a vocabulary;
    encoding each of the target texts to obtain a text encoding vector;
    encoding each word in the vocabulary to obtain a word encoding vector;
    clustering the text encoding vectors to obtain a plurality of encoding vector cluster sets;
    averaging each of the encoding vector cluster sets to obtain a cluster-set topic vector;
    computing the similarity between each word encoding vector and each cluster-set topic vector to obtain target similarities;
    generating a target text topic for each encoding vector cluster set according to the vocabulary and the target similarities.
  10. The computer device according to claim 9, wherein the step of obtaining a plurality of target texts comprises:
    obtaining a plurality of news texts;
    performing blank character deletion, repeated punctuation deletion, and special symbol deletion on each of the news texts to obtain the target texts.
  11. The computer device according to claim 10, wherein the step of performing blank character deletion, repeated punctuation deletion, and special symbol deletion on each of the news texts to obtain the target texts comprises:
    performing blank character deletion, repeated punctuation deletion, and special symbol deletion on each of the news texts to obtain texts to be processed;
    finding, from the texts to be processed, each text whose word count is greater than a preset word count, as the target texts.
  12. The computer device according to claim 9, wherein the step of clustering the text encoding vectors to obtain a plurality of encoding vector cluster sets comprises:
    setting a number of cluster centers equal to a preset number of clusters, and initializing each of the cluster centers;
    computing the distance between each text encoding vector and each cluster center to obtain distances to be analyzed;
    according to the distances to be analyzed, assigning each text encoding vector, following the minimum-distance principle, to the cluster set to be judged corresponding to its nearest cluster center;
    computing the vector average of each cluster set to be judged;
    taking a target vector average as the cluster center of the cluster set to be judged corresponding to the target vector average, wherein the target vector average is any one of the vector averages;
    repeating the step of computing the distance between each text encoding vector and each cluster center to obtain distances to be analyzed, until the cluster center of every cluster set to be judged no longer changes;
    taking each cluster set to be judged as one encoding vector cluster set.
  13. The computer device according to claim 12, wherein the step of computing the distance between each text encoding vector and each cluster center to obtain distances to be analyzed comprises:
    using a cosine similarity algorithm to compute the cosine similarity between each text encoding vector and each cluster center to obtain the distances to be analyzed.
  14. The computer device according to claim 9, wherein the step of computing the similarity between each word encoding vector and each cluster-set topic vector to obtain target similarities comprises:
    using a cosine similarity algorithm to compute the cosine similarity between each word encoding vector and each cluster-set topic vector to obtain the target similarities.
  15. The computer device according to claim 9, wherein the step of generating a target text topic for each encoding vector cluster set according to the vocabulary and the target similarities comprises:
    taking any one of the encoding vector cluster sets as a target encoding vector cluster set;
    sorting the target similarities corresponding to the target encoding vector cluster set in ascending order to obtain a similarity set;
    taking a preset number of target similarities from the beginning of the similarity set to obtain a hit similarity set;
    taking the words in the vocabulary corresponding to the hit similarity set as the target text topic corresponding to the target encoding vector cluster set.
  16. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements a method for generating text topics, wherein the method for generating text topics comprises:
    obtaining a plurality of target texts;
    performing word segmentation and word deduplication on each of the target texts to obtain a vocabulary;
    encoding each of the target texts to obtain a text encoding vector;
    encoding each word in the vocabulary to obtain a word encoding vector;
    clustering the text encoding vectors to obtain a plurality of encoding vector cluster sets;
    averaging each of the encoding vector cluster sets to obtain a cluster-set topic vector;
    computing the similarity between each word encoding vector and each cluster-set topic vector to obtain target similarities;
    generating a target text topic for each encoding vector cluster set according to the vocabulary and the target similarities.
  17. The computer-readable storage medium according to claim 16, wherein the step of obtaining a plurality of target texts comprises:
    obtaining a plurality of news texts;
    performing blank character deletion, repeated punctuation deletion, and special symbol deletion on each of the news texts to obtain the target texts.
  18. The computer-readable storage medium according to claim 17, wherein the step of performing blank character deletion, repeated punctuation deletion, and special symbol deletion on each of the news texts to obtain the target texts comprises:
    performing blank character deletion, repeated punctuation deletion, and special symbol deletion on each of the news texts to obtain texts to be processed;
    finding, from the texts to be processed, each text whose word count is greater than a preset word count, as the target texts.
  19. The computer-readable storage medium according to claim 16, wherein the step of clustering the text encoding vectors to obtain a plurality of encoding vector cluster sets comprises:
    setting a number of cluster centers equal to a preset number of clusters, and initializing each of the cluster centers;
    computing the distance between each text encoding vector and each cluster center to obtain distances to be analyzed;
    according to the distances to be analyzed, assigning each text encoding vector, following the minimum-distance principle, to the cluster set to be judged corresponding to its nearest cluster center;
    computing the vector average of each cluster set to be judged;
    taking a target vector average as the cluster center of the cluster set to be judged corresponding to the target vector average, wherein the target vector average is any one of the vector averages;
    repeating the step of computing the distance between each text encoding vector and each cluster center to obtain distances to be analyzed, until the cluster center of every cluster set to be judged no longer changes;
    taking each cluster set to be judged as one encoding vector cluster set.
  20. The computer-readable storage medium according to claim 19, wherein the step of computing the distance between each text encoding vector and each cluster center to obtain distances to be analyzed comprises:
    using a cosine similarity algorithm to compute the cosine similarity between each text encoding vector and each cluster center to obtain the distances to be analyzed.
PCT/CN2022/090162 2022-01-12 2022-04-29 Method, apparatus, device, and storage medium for generating text topics WO2023134074A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210033712.8 2022-01-12
CN202210033712.8A CN114492429B (zh) 2022-01-12 2022-01-12 Method, apparatus, device, and storage medium for generating text topics

Publications (1)

Publication Number Publication Date
WO2023134074A1 true WO2023134074A1 (zh) 2023-07-20

Family

ID=81511312

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/090162 WO2023134074A1 (zh) 2022-01-12 2022-04-29 Method, apparatus, device, and storage medium for generating text topics

Country Status (2)

Country Link
CN (1) CN114492429B (zh)
WO (1) WO2023134074A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117235137A (zh) * 2023-11-10 2023-12-15 深圳市一览网络股份有限公司 一种基于向量数据库的职业信息查询方法及装置

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116361470B (zh) * 2023-04-03 2024-05-14 北京中科闻歌科技股份有限公司 一种基于话题描述的文本聚类清洗和合并方法

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180046708A1 (en) * 2016-08-11 2018-02-15 International Business Machines Corporation System and Method for Automatic Detection and Clustering of Articles Using Multimedia Information
CN109558482A (zh) * 2018-07-27 2019-04-02 中山大学 一种基于Spark框架的文本聚类模型PW-LDA的并行化方法
CN111061877A (zh) * 2019-12-10 2020-04-24 厦门市美亚柏科信息股份有限公司 文本主题提取方法和装置
CN111241282A (zh) * 2020-01-14 2020-06-05 北京百度网讯科技有限公司 文本主题生成方法、装置及电子设备

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104657472A (zh) * 2015-02-13 2015-05-27 南京邮电大学 一种基于进化算法的英文文本聚类方法
EP3591545A1 (en) * 2018-07-06 2020-01-08 Universite Paris Descartes Method for co-clustering senders and receivers based on text or image data files
CN109271520B (zh) * 2018-10-25 2022-02-08 北京星选科技有限公司 数据提取方法、数据提取装置、存储介质和电子设备
CN111639175B (zh) * 2020-05-29 2023-05-02 电子科技大学 一种自监督的对话文本摘要方法及系统
CN112597769B (zh) * 2020-12-15 2022-06-03 中山大学 一种基于狄利克雷变分自编码器的短文本主题识别方法

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180046708A1 (en) * 2016-08-11 2018-02-15 International Business Machines Corporation System and Method for Automatic Detection and Clustering of Articles Using Multimedia Information
CN109558482A (zh) * 2018-07-27 2019-04-02 中山大学 一种基于Spark框架的文本聚类模型PW-LDA的并行化方法
CN111061877A (zh) * 2019-12-10 2020-04-24 厦门市美亚柏科信息股份有限公司 文本主题提取方法和装置
CN111241282A (zh) * 2020-01-14 2020-06-05 北京百度网讯科技有限公司 文本主题生成方法、装置及电子设备

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117235137A (zh) * 2023-11-10 2023-12-15 深圳市一览网络股份有限公司 一种基于向量数据库的职业信息查询方法及装置
CN117235137B (zh) * 2023-11-10 2024-04-02 深圳市一览网络股份有限公司 一种基于向量数据库的职业信息查询方法及装置

Also Published As

Publication number Publication date
CN114492429B (zh) 2023-07-18
CN114492429A (zh) 2022-05-13

Similar Documents

Publication Publication Date Title
CN110413780B (zh) 文本情感分析方法和电子设备
CN109960724B (zh) 一种基于tf-idf的文本摘要方法
CN107085581B (zh) 短文本分类方法和装置
US10354170B2 (en) Method and apparatus of establishing image search relevance prediction model, and image search method and apparatus
CN110750640B (zh) 基于神经网络模型的文本数据分类方法、装置及存储介质
US20230039496A1 (en) Question-and-answer processing method, electronic device and computer readable medium
TW201837746A (zh) 特徵向量的產生、搜索方法、裝置及電子設備
WO2023134074A1 (zh) 文本主题的生成方法、装置、设备及存储介质
CN111985228B (zh) 文本关键词提取方法、装置、计算机设备和存储介质
CN112487190B (zh) 基于自监督和聚类技术从文本中抽取实体间关系的方法
CN110134777B (zh) 问题去重方法、装置、电子设备和计算机可读存储介质
CN112819023A (zh) 样本集的获取方法、装置、计算机设备和存储介质
WO2023065642A1 (zh) 语料筛选方法、意图识别模型优化方法、设备及存储介质
CN110858217A (zh) 微博敏感话题的检测方法、装置及可读存储介质
CN113987174A (zh) 分类标签的核心语句提取方法、系统、设备及存储介质
WO2023134075A1 (zh) 基于人工智能的文本主题生成方法、装置、设备及介质
CN111506726B (zh) 基于词性编码的短文本聚类方法、装置及计算机设备
CN109543036A (zh) 基于语义相似度的文本聚类方法
CN114491062B (zh) 一种融合知识图谱和主题模型的短文本分类方法
CN111325033A (zh) 实体识别方法、装置、电子设备及计算机可读存储介质
CN113486670B (zh) 基于目标语义的文本分类方法、装置、设备及存储介质
CN111523311B (zh) 一种搜索意图识别方法及装置
CN115858780A (zh) 一种文本聚类方法、装置、设备及介质
CN115203206A (zh) 数据内容搜索方法、装置、计算机设备及可读存储介质
CN115600595A (zh) 一种实体关系抽取方法、系统、设备及可读存储介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22919715

Country of ref document: EP

Kind code of ref document: A1