WO2023134075A1 - Artificial-intelligence-based text topic generation method, apparatus, device and medium


Info

Publication number
WO2023134075A1
Authority
WO
WIPO (PCT)
Prior art keywords
vector
sentence
clustering
target
cluster
Prior art date
Application number
PCT/CN2022/090163
Other languages
English (en)
French (fr)
Inventor
陈浩
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2023134075A1


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/205 Parsing
    • G06F 40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/23 Clustering techniques
    • G06F 18/232 Non-hierarchical techniques
    • G06F 18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F 18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/30 Semantic analysis
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the present application relates to the technical field of artificial intelligence, in particular to an artificial intelligence-based text topic generation method, device, equipment and medium.
  • the text topic model has always been one of the most widely used models in the industry. Through the text topic model, massive documents can be classified, which is convenient for daily screening, management and application.
  • LDA: Latent Dirichlet Allocation.
  • PLSA: Probabilistic Latent Semantic Analysis.
  • The main purpose of this application is to provide an artificial-intelligence-based text topic generation method, device, equipment and medium, aiming to solve the technical problem that the prior-art LDA and PLSA models rely only on statistical methods and cannot capture the semantic information contained in the text, leading to lower accuracy of the generated text topics.
  • To this end, the present application proposes a method for generating text topics based on artificial intelligence, the method comprising:
  • obtaining a target text set;
  • generating a sentence vector for each of the target texts in the target text set;
  • clustering each of the sentence vectors using the K-Means clustering algorithm and a preset number of clusters to obtain a plurality of sentence vector clustering sets;
  • performing TF-IDF weight calculation and word extraction on each of the target texts corresponding to a specified sentence vector clustering set to obtain the target text topic corresponding to that set, wherein the specified sentence vector clustering set is any one of the sentence vector clustering sets.
  • the application also proposes a device for generating text topics based on artificial intelligence, said device comprising:
  • a data acquisition module configured to acquire a target text set
  • a sentence vector generating module configured to generate a sentence vector for each of the target texts in the target text set
  • a clustering module configured to cluster each of the sentence vectors using the K-Means clustering algorithm and a preset number of clusters to obtain a plurality of sentence vector clustering sets;
  • a target text topic generation module configured to perform TF-IDF weight calculation and word extraction on each of the target texts corresponding to a specified sentence vector clustering set, to obtain the target text topic corresponding to that set, wherein the specified sentence vector clustering set is any one of the sentence vector clustering sets.
  • The present application also proposes a computer device, including a memory and a processor, the memory storing a computer program; when the processor executes the computer program, an artificial-intelligence-based text topic generation method is implemented, the method including: obtaining a target text set; generating a sentence vector for each of the target texts in the target text set; clustering each of the sentence vectors using the K-Means clustering algorithm and a preset number of clusters to obtain a plurality of sentence vector clustering sets; and performing TF-IDF weight calculation and word extraction on each of the target texts corresponding to a specified sentence vector clustering set, to obtain the target text topic corresponding to that set, wherein the specified sentence vector clustering set is any one of the sentence vector clustering sets.
  • The present application also proposes a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, an artificial-intelligence-based text topic generation method is implemented, the method including: obtaining a target text set; generating a sentence vector for each of the target texts in the target text set; clustering each of the sentence vectors using the K-Means clustering algorithm and a preset number of clusters to obtain a plurality of sentence vector clustering sets; and performing TF-IDF weight calculation and word extraction on each of the target texts corresponding to a specified sentence vector clustering set, to obtain the target text topic corresponding to that set, wherein the specified sentence vector clustering set is any one of the sentence vector clustering sets.
  • In the artificial-intelligence-based text topic generation method, device, equipment and medium of the present application, the method obtains a target text set; generates a sentence vector for each of the target texts in the target text set; clusters each of the sentence vectors using the K-Means clustering algorithm and a preset number of clusters to obtain a plurality of sentence vector clustering sets; and performs TF-IDF weight calculation and word extraction on each of the target texts corresponding to a specified sentence vector clustering set, to obtain the target text topic corresponding to that set, wherein the specified sentence vector clustering set is any one of the sentence vector clustering sets.
  • Sentence vector generation is performed on the target text to extract the semantic information contained in the target text, so that in the subsequent clustering, the sentence vectors with the same semantic information are clustered into the same cluster set, which effectively improves the clustering efficiency.
  • Semantic effects of clusters the TF-IDF algorithm is used to extract text topics from each target text corresponding to each cluster with semantic effects, which realizes the combination of statistical methods and methods based on semantic information, and improves the general simplification, improving the accuracy of identified text topics.
  • Fig. 1 is a schematic flow chart of an artificial intelligence-based text topic generation method according to an embodiment of the present application
  • Fig. 2 is a structural schematic block diagram of an artificial intelligence-based text topic generation device according to an embodiment of the present application
  • FIG. 3 is a schematic block diagram of a computer device according to an embodiment of the present application.
  • An embodiment of the present application provides an artificial-intelligence-based text topic generation method, which relates to the technical field of artificial intelligence and includes:
  • S1 Obtain a target text set;
  • S2 Generate a sentence vector for each of the target texts in the target text set;
  • S3 Cluster each of the sentence vectors using the K-Means clustering algorithm and a preset number of clusters to obtain a plurality of sentence vector clustering sets;
  • S4 Perform TF-IDF weight calculation and word extraction on each of the target texts corresponding to the specified sentence vector clustering set, and obtain the target text topic corresponding to the specified sentence vector clustering set, wherein the specified sentence vector clustering set is any one of the sentence vector clustering sets.
  • Sentence vector generation is performed on the target texts to extract the semantic information they contain, so that in the subsequent clustering, sentence vectors with the same semantic information are grouped into the same clustering set, effectively improving the semantic quality of the clustering sets.
  • The TF-IDF algorithm is then used to extract text topics from the target texts corresponding to each semantically coherent clustering set, combining statistical methods with semantics-based methods, which improves generalization and the accuracy of the identified text topics.
  • The target text set may be input by the user, obtained from a database, or obtained from a third-party application system.
  • the target text set includes one or more target texts.
  • the target text is a text containing one or more sentences.
  • each target text in the target text set is subjected to sentence vector generation, so that the sentence vector extracts the semantic information contained in the target text.
  • the K-Means clustering algorithm is used to cluster each of the sentence vectors into cluster sets whose number is the same as the number of clusters, and each cluster set obtained by clustering is regarded as a sentence vector cluster set. Because the sentence vectors have semantic information, the sentence vectors with the same semantic information are clustered into the same cluster set, which effectively improves the semantic effect of the cluster set.
  • the number of clusters is an integer greater than 1.
  • K-Means clustering algorithm: the K-means clustering algorithm.
  • The TF-IDF algorithm is used to calculate a TF-IDF weight value for the words in each of the target texts corresponding to the specified sentence vector clustering set; one or more words are then extracted according to the TF-IDF weight values, and the extracted words are used as the target text topic corresponding to the specified sentence vector clustering set.
  • Using the TF-IDF algorithm to extract text topics from the target texts of each semantically coherent clustering set combines statistical methods with semantics-based methods, improving generalization and the accuracy of the identified text topics.
  • TF-IDF (term frequency–inverse document frequency) is a commonly used weighting technique for information retrieval and data mining, and is often used to mine keywords in articles.
  • the target text topic corresponding to the specified sentence vector cluster set is the text topic of each target text corresponding to the specified sentence vector cluster set.
  • the above-mentioned step of obtaining the target text set includes:
  • This embodiment uses novel introduction texts, after data cleaning, as the target texts, so that the target text topics determined in this application can be used for novel classification and recommendation; data cleaning reduces noise interference and improves the accuracy of the determined target text topics.
  • multiple novel brief introduction texts input by the user may be acquired, multiple novel brief introduction texts may be acquired from a database, or multiple novel brief introduction texts may be acquired from a third-party application system.
  • the novel introduction text is the introduction text of a novel.
  • the introduction text of the novel is a text larger than a preset word count.
  • the preset word count is set to 1024.
  • The target texts corresponding to the novel introduction texts are used as the target text set, so that noise-free target texts form the target text set; extracting text topics from a noise-free target text set improves the accuracy of the identified text topics.
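The data-cleaning step described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the cleaning rules (stripping HTML remnants, normalizing whitespace) and the `min_chars` length threshold are assumptions standing in for the unspecified cleaning logic and the preset word count.

```python
import re

def clean_text(raw: str) -> str:
    """Remove HTML tags, line breaks and redundant whitespace
    from a novel introduction text (illustrative cleaning rules)."""
    text = re.sub(r"<[^>]+>", "", raw)      # strip HTML remnants
    text = re.sub(r"[\r\n\t]+", " ", text)  # normalize line breaks
    text = re.sub(r"\s{2,}", " ", text)     # collapse repeated spaces
    return text.strip()

def build_target_text_set(intro_texts, min_chars=20):
    """Clean every introduction text and keep those above a length
    threshold (the patent mentions a preset word count, e.g. 1024)."""
    cleaned = (clean_text(t) for t in intro_texts)
    return [t for t in cleaned if len(t) >= min_chars]

intros = ["<p>A young  swordsman\nsets out to avenge his master...</p>",
          "too short"]
print(build_target_text_set(intros))
```

The cleaned, sufficiently long texts form the target text set; the short fragment is filtered out as noise.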
  • the above step of generating sentence vectors for each of the target texts in the target text set includes:
  • S21 Input each target text in the target text set into a preset sentence vector generation model to generate the sentence vector, wherein the sentence vector generation model is a model trained based on the Bert model.
  • sentence vectors are generated using a model trained based on the Bert model, which helps to improve the extraction of semantic information contained in the target text, and further improves the accuracy of the determined text topic.
  • For S21, each target text in the target text set is input into a preset sentence vector generation model, and the sentence vector output by the encoding layer of the sentence vector generation model is obtained.
  • Optionally, a sentence vector generation model matching the text type of the target text set may be obtained from a model library, and the obtained model is used to generate a sentence vector for each of the target texts in the target text set. This further improves the extraction of the semantic information contained in the target texts.
  • Bert: Bidirectional Encoder Representations from Transformers.
  • Optionally, the Bert Base model is used.
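The Bert-based sentence vector generation can be illustrated by the pooling step alone. The sketch below assumes the Bert Base encoder has already produced per-token embeddings and shows one common way (mean pooling over non-padding tokens) of obtaining a single 768-dimensional sentence vector; the patent does not specify its pooling strategy, so this detail is an assumption.

```python
import numpy as np

HIDDEN = 768  # hidden size of the Bert Base encoder

def sentence_vector(token_embeddings: np.ndarray,
                    attention_mask: np.ndarray) -> np.ndarray:
    """Mean-pool the encoder outputs over the non-padding tokens.
    token_embeddings: (seq_len, HIDDEN); attention_mask: (seq_len,)."""
    mask = attention_mask[:, None].astype(float)    # (seq_len, 1)
    summed = (token_embeddings * mask).sum(axis=0)  # ignore padding
    count = mask.sum()
    return summed / count

# toy example: 4 token embeddings, the last token is padding
rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, HIDDEN))
mask = np.array([1, 1, 1, 0])
vec = sentence_vector(tokens, mask)
print(vec.shape)  # (768,)
```

In practice the token embeddings would come from the encoding layer of the trained Bert model; only the pooling into one fixed-size vector is shown here.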
  • the steps of clustering each of the sentence vectors by using the K-Means clustering algorithm and a preset number of clusters to obtain a plurality of sentence vector clusters include:
  • S311 Set a number of cluster centers equal to the number of clusters, and initialize each of the cluster centers;
  • S312 Calculate the vector distance between each of the sentence vectors and each of the cluster centers;
  • S313 According to each of the vector distances, assign each of the sentence vectors to the initial cluster set corresponding to the nearest cluster center, following the minimum-distance principle;
  • S314 Perform a vector average calculation on each of the initial cluster sets to obtain the vector average corresponding to each of the initial cluster sets;
  • S315 Use a designated vector average as the cluster center of the initial cluster set corresponding to that designated vector average, wherein the designated vector average is any one of the vector averages;
  • S316 Repeat the step of calculating the vector distance between each of the sentence vectors and each of the cluster centers until the cluster centers no longer change, and use each of the initial cluster sets as a sentence vector clustering set.
  • This embodiment uses the K-Means clustering algorithm and the preset number of clusters to cluster each of the sentence vectors with semantic effects, so that each sentence vector in the sentence vector clustering set obtained by clustering has the same semantics information.
  • The number of cluster centers is set to be the same as the number of clusters.
  • Each sentence vector being processed is assigned to the initial cluster set of the cluster center nearest to it.
  • vector average calculation is performed on each of the sentence vectors in each of the initial clustering sets.
  • the specified vector average value is used as the cluster center of the initial cluster set corresponding to the specified vector average value, thereby realizing updating of the cluster centers.
  • Each of the initial cluster sets is used as a sentence vector clustering set, thereby obtaining clustering sets whose members share the same semantic information.
  • the step of calculating the vector distance between each of the sentence vectors and each of the cluster centers includes:
  • the cosine similarity algorithm is used as the vector measurement index of the clustering algorithm, thereby better measuring the distance between sentence vectors and improving the accuracy of clustering.
  • the cosine similarity algorithm is used to calculate the cosine similarity between each of the sentence vectors and each of the cluster centers, and the calculated cosine similarity is used as the vector distance.
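Steps S311 to S316, with cosine similarity as the distance measure, can be sketched as a small K-Means loop. The function names and the toy data below are illustrative, not from the patent:

```python
import numpy as np

def cosine_distance(a, b):
    """1 - cosine similarity; smaller means more similar."""
    return 1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def kmeans_cosine(vectors, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # S311: initialize k cluster centers from the data
    centers = vectors[rng.choice(len(vectors), size=k, replace=False)]
    for _ in range(max_iter):
        # S312/S313: assign each vector to its nearest center
        labels = np.array([
            min(range(k), key=lambda j: cosine_distance(v, centers[j]))
            for v in vectors
        ])
        # S314/S315: recompute each center as its cluster's mean vector
        new_centers = np.array([
            vectors[labels == j].mean(axis=0) if np.any(labels == j)
            else centers[j]
            for j in range(k)
        ])
        # S316: stop once the centers no longer change
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# two well-separated directions should yield two clusters
data = np.array([[1.0, 0.1], [0.9, 0.0], [0.0, 1.0], [0.1, 0.9]])
labels, _ = kmeans_cosine(data, k=2)
print(labels)
```

Because cosine distance depends only on direction, sentence vectors pointing the same semantic "way" land in the same clustering set regardless of their magnitudes.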
  • the above-mentioned steps of clustering each of the sentence vectors by using the K-Means clustering algorithm and a preset number of clusters to obtain a plurality of sentence vector clusters also include:
  • This embodiment first performs dimensionality reduction on each of the sentence vectors and then clusters the dimension-reduced sentence vectors, so that dimensionality reduction lessens the sparseness of the sentence vectors, improves the clustering effect, and further improves the accuracy of the determined text topics.
  • a preset dimensionality reduction algorithm is used to perform dimensionality reduction processing on each of the sentence vectors, so as to reduce the sparsity of the sentence vectors.
  • UMAP algorithm: a dimensionality-reduction manifold-learning algorithm (Uniform Manifold Approximation and Projection).
  • the UMAP algorithm is used to perform dimension reduction processing on each of the sentence vectors.
  • The sentence vector is generated by a model trained on the Bert model and has 768 dimensions; after the UMAP algorithm reduces its dimensionality, the sentence vector becomes a vector with far fewer than 768 dimensions.
  • To cluster the dimension-reduced sentence vectors using the K-Means clustering algorithm and the number of clusters, the method of steps S311 to S316 can be used; that is, the sentence vectors in steps S311 to S316 are replaced with the dimension-reduced sentence vectors.
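The patent reduces the 768-dimensional sentence vectors with the UMAP algorithm (available in the umap-learn package). To keep the sketch below dependency-free, it uses a PCA projection via SVD purely to illustrate the reduction step; substituting `umap.UMAP(n_components=...).fit_transform(vectors)` would match the patent's actual choice.

```python
import numpy as np

def reduce_dim(vectors: np.ndarray, n_components: int) -> np.ndarray:
    """Project vectors onto their top principal components.
    Stands in for UMAP here only as a dependency-free illustration
    of shrinking 768-dim sentence vectors before clustering."""
    centered = vectors - vectors.mean(axis=0)
    # SVD yields the principal axes without forming a covariance matrix
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

rng = np.random.default_rng(1)
sentence_vectors = rng.normal(size=(50, 768))  # 768-dim Bert vectors
reduced = reduce_dim(sentence_vectors, n_components=5)
print(reduced.shape)  # (50, 5)
```

The reduced vectors are then fed to the K-Means loop of steps S311 to S316 in place of the original sentence vectors.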
  • the calculation of the TF-IDF weight value and the word extraction are respectively performed from each of the target texts corresponding to the specified sentence vector clustering set to obtain the target text topic corresponding to the specified sentence vector clustering set steps, including:
  • S41 Merge each of the target texts corresponding to the specified sentence vector clustering set into one document to obtain a target document;
  • S42 Perform word segmentation on the target document to obtain an initial word set;
  • Each of the target texts corresponding to the specified sentence vector clustering set is merged into one document; the TF-IDF algorithm is then used to calculate the TF-IDF weight values of the words in that document, and finally words are extracted according to the calculated TF-IDF weight values as the target text topic corresponding to the specified sentence vector clustering set. Using the TF-IDF algorithm to extract text topics from the target texts of each semantically coherent clustering set combines statistical methods with semantics-based methods, improving generalization and the accuracy of the determined text topics.
  • word segmentation is performed on the target document, and each word obtained by word segmentation is used as an initial word set.
  • deduplication is performed on the initial word set, and the initial word set after deduplication is used as the target word set, that is to say, the words in the target word set are unique.
  • the TF-IDF weight values are sorted in reverse order, and the TF-IDF weight values sorted in reverse order are used as a TF-IDF weight value set.
  • Extraction starts from the head of the reverse-sorted TF-IDF weight value sequence, so that extraction begins with the highest TF-IDF weight value; the number of extracted values equals the preset number of words, and the extracted TF-IDF weight values are used as the TF-IDF weight value set.
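The topic-extraction steps (merge, segment, deduplicate, weight, reverse-sort, take the top words) can be sketched as follows. Whitespace splitting stands in for proper word segmentation, and each cluster's merged document is weighted against the other clusters' merged documents; both are illustrative assumptions rather than the patent's exact implementation.

```python
import math
from collections import Counter

def tfidf_topic_words(cluster_docs, n_words=3):
    """For each cluster's merged document, compute TF-IDF weights
    against the other merged documents and return the n_words
    highest-weighted words as that cluster's topic."""
    tokenized = [doc.lower().split() for doc in cluster_docs]  # toy segmentation
    n_docs = len(tokenized)
    # document frequency over deduplicated words
    df = Counter(w for tokens in tokenized for w in set(tokens))
    topics = []
    for tokens in tokenized:
        tf = Counter(tokens)
        total = len(tokens)
        weights = {
            w: (tf[w] / total) * math.log(n_docs / df[w])
            for w in set(tokens)               # deduplicated target word set
        }
        ranked = sorted(weights, key=weights.get, reverse=True)  # reverse sort
        topics.append(ranked[:n_words])        # take words from the head
    return topics

# one merged document per sentence vector clustering set (toy data)
docs = ["sword master revenge sword duel",
        "space fleet warp space drive",
        "romance city cafe romance letter"]
print(tfidf_topic_words(docs, n_words=2))
```

Each returned word list plays the role of the target text topic for the corresponding clustering set.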
  • The present application also proposes an artificial-intelligence-based text topic generation device, the device comprising:
  • a data acquisition module 100 configured to acquire a target text set
  • a sentence vector generating module 200 configured to generate a sentence vector for each of the target texts in the target text set
  • a clustering module 300 configured to cluster each of the sentence vectors using the K-Means clustering algorithm and a preset number of clusters to obtain a plurality of sentence vector clustering sets;
  • a target text topic generation module 400 configured to perform TF-IDF weight calculation and word extraction on each of the target texts corresponding to the specified sentence vector clustering set, to obtain the target text topic corresponding to the specified sentence vector clustering set.
  • Sentence vector generation is performed on the target texts to extract the semantic information they contain, so that in the subsequent clustering, sentence vectors with the same semantic information are grouped into the same clustering set, effectively improving the semantic quality of the clustering sets. The TF-IDF algorithm is then used to extract text topics from the target texts corresponding to each semantically coherent clustering set, combining statistical methods with semantics-based methods, which improves generalization and the accuracy of the identified text topics.
  • the above-mentioned data acquisition module 100 includes: a novel introduction text acquisition submodule, a data cleaning submodule and a target text set determination submodule;
  • the novel brief introduction text acquisition submodule is used to obtain multiple novel brief introduction texts
  • the data cleaning sub-module is used to perform data cleaning on each of the novel introduction texts to obtain the target text corresponding to each of the novel introduction texts;
  • the target text set determination submodule is used to use the target texts corresponding to each of the novel introduction texts as the target text set.
  • the sentence vector generation module 200 includes: a sentence vector determination submodule
  • the sentence vector determination submodule is used to input each of the target texts in the target text set into a preset sentence vector generation model to generate the sentence vector, wherein the sentence vector generation model is based on the Bert model The model is trained.
  • the clustering module 300 includes: a cluster center setting submodule, a vector distance calculation submodule, an initial cluster set generation submodule, a vector average calculation submodule, a cluster center update submodule, a loop control Submodules and sentence vector clustering sets determine submodules;
  • the cluster center setting submodule is used to set the number of cluster centers equal to the number of clusters, and initialize each of the cluster centers;
  • the vector distance calculation submodule is used to calculate the vector distance between each of the sentence vectors and each of the cluster centers;
  • the initial clustering set generation submodule is used to assign each of the sentence vectors to the initial clustering set corresponding to the nearest cluster center according to the minimum distance principle according to the vector distance;
  • the vector average calculation submodule is used to perform vector average calculation on each of the initial cluster sets to obtain a vector average value corresponding to each of the initial cluster sets;
  • the cluster center update submodule is configured to use the specified vector average value as the cluster center of the initial cluster set corresponding to the specified vector average value, wherein the specified vector average value is any said vector average;
  • the loop control submodule is configured to repeatedly perform the step of calculating the vector distance between each of the sentence vectors and each of the cluster centers until each of the initial cluster sets corresponds to the cluster The class center no longer changes;
  • the sentence vector clustering set determining submodule is configured to use each of the initial clustering sets as one sentence vector clustering set.
  • the above-mentioned vector distance calculation submodule includes: a vector distance calculation unit;
  • the vector distance calculation unit is configured to calculate the vector distance between each of the sentence vectors and each of the cluster centers by using a cosine similarity algorithm.
  • the above-mentioned clustering module 300 further includes: a dimensionality reduction processing submodule and a clustering submodule;
  • the dimensionality reduction processing submodule is used to perform dimensionality reduction processing on each of the sentence vectors by using a preset dimensionality reduction algorithm
  • the clustering sub-module is used to cluster the sentence vectors after dimension reduction processing by using the K-Means clustering algorithm and the number of clusters to obtain multiple clustering sets of the sentence vectors.
  • the target text topic generation module 400 includes: a target document determination submodule, an initial word set determination submodule, a target word set determination submodule, a TF-IDF weight value calculation submodule, a reverse sorting submodule, a TF -IDF weight value set determines the submodule and the target text topic determines the submodule;
  • the target document determination submodule is used to combine each of the target texts corresponding to the specified sentence vector clustering set into one document to obtain the target document;
  • the initial word set determination submodule is used to perform word segmentation on the target document to obtain an initial word set
  • the target word set determination submodule is used to deduplicate the words in the initial word set to obtain the target word set;
  • the TF-IDF weight value calculation submodule is used to calculate the TF-IDF weight value for each word in the target word set by using the TF-IDF algorithm and the initial word set;
  • the reverse sorting submodule is used to sort each of the TF-IDF weight values in reverse order to obtain a set of TF-IDF weight values
  • the TF-IDF weight value set determination submodule is used to obtain, starting from the head of the reverse-sorted sequence, a number of TF-IDF weight values equal to the preset number of words, thereby obtaining the TF-IDF weight value set;
  • the target text topic determination submodule is configured to use each word corresponding to the TF-IDF weight value set as the target text topic corresponding to the specified sentence vector clustering set.
  • an embodiment of the present application further provides a computer device, which may be a server, and its internal structure may be as shown in FIG. 3 .
  • The computer device includes a processor, a memory, a network interface and a database connected by a system bus, wherein the processor of the computer device provides computing and control capabilities.
  • the memory of the computer device includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium stores an operating system, computer programs and databases.
  • The internal memory provides an environment for running the operating system and the computer programs stored in the non-volatile storage medium.
  • the database of the computer device is used to store data such as the method for generating text topics based on artificial intelligence.
  • the network interface of the computer device is used to communicate with an external terminal via a network connection.
  • When the processor executes the computer program, an artificial-intelligence-based text topic generation method is realized.
  • The artificial-intelligence-based text topic generation method includes: obtaining a target text set; generating a sentence vector for each of the target texts in the target text set; clustering each of the sentence vectors using the K-Means clustering algorithm and a preset number of clusters to obtain a plurality of sentence vector clustering sets; and performing TF-IDF weight calculation and word extraction on each of the target texts corresponding to the specified sentence vector clustering set, to obtain the target text topic corresponding to the specified sentence vector clustering set, wherein the specified sentence vector clustering set is any one of the sentence vector clustering sets.
  • Sentence vector generation is performed on the target texts to extract the semantic information they contain, so that in the subsequent clustering, sentence vectors with the same semantic information are grouped into the same clustering set, effectively improving the semantic quality of the clustering sets. The TF-IDF algorithm is then used to extract text topics from the target texts corresponding to each semantically coherent clustering set, combining statistical methods with semantics-based methods, which improves generalization and the accuracy of the identified text topics.
  • The above-mentioned step of acquiring the target text set includes: acquiring a plurality of novel introduction texts; performing data cleaning on each of the novel introduction texts to obtain the target text corresponding to each novel introduction text; and using the target texts corresponding to the novel introduction texts as the target text set.
  • The step of generating sentence vectors for each of the target texts in the target text set includes: inputting each of the target texts in the target text set into a preset sentence vector generation model to generate the sentence vectors, wherein the sentence vector generation model is a model trained based on the Bert model.
  • The above-mentioned step of clustering each of the sentence vectors using the K-Means clustering algorithm and a preset number of clusters to obtain a plurality of sentence vector clustering sets includes: setting a number of cluster centers equal to the number of clusters and initializing each of the cluster centers; calculating the vector distance between each of the sentence vectors and each of the cluster centers; assigning, according to each of the vector distances and the minimum-distance principle, each of the sentence vectors to the initial cluster set corresponding to the nearest cluster center; performing a vector average calculation on each of the initial cluster sets to obtain the vector average corresponding to each of the initial cluster sets; using a specified vector average as the cluster center of the initial cluster set corresponding to that specified vector average, wherein the specified vector average is any one of the vector averages; repeating the step of calculating the vector distance between each of the sentence vectors and each of the cluster centers until the cluster centers corresponding to the initial cluster sets no longer change; and using each of the initial cluster sets as one sentence vector clustering set.
  • the step of calculating the vector distance between each sentence vector and each cluster center includes: using a cosine similarity algorithm to calculate the vector distance between each sentence vector and each cluster center.
  • the above step of clustering the sentence vectors using the K-Means clustering algorithm and a preset number of clusters to obtain a plurality of sentence vector cluster sets further includes: performing dimensionality reduction on each sentence vector using a preset dimensionality reduction algorithm; and clustering the dimension-reduced sentence vectors using the K-Means clustering algorithm and the number of clusters to obtain the plurality of sentence vector cluster sets.
  • the step of performing TF-IDF weight calculation and word extraction on the target texts corresponding to the specified sentence vector cluster set to obtain the target text topic corresponding to the specified sentence vector cluster set includes: merging the target texts corresponding to the specified sentence vector cluster set into one document to obtain a target document; performing word segmentation on the target document to obtain an initial word set; deduplicating the words in the initial word set to obtain a target word set; calculating a TF-IDF weight value for each word in the target word set using the TF-IDF algorithm and the initial word set; sorting the TF-IDF weight values in descending order to obtain a TF-IDF weight value set; taking, from the head of the TF-IDF weight value set, a number of TF-IDF weight values equal to a preset word count; and taking the words corresponding to the extracted TF-IDF weight values as the target text topic corresponding to the specified sentence vector cluster set.
  • An embodiment of the present application also provides a computer-readable storage medium, where the storage medium is a volatile or non-volatile storage medium on which a computer program is stored; when the computer program is executed by a processor, an artificial-intelligence-based text topic generation method is implemented, comprising the steps of: acquiring a target text set; generating a sentence vector for each target text in the target text set; clustering the sentence vectors using the K-Means clustering algorithm and a preset number of clusters to obtain a plurality of sentence vector cluster sets; and performing TF-IDF weight calculation and word extraction on the target texts corresponding to a specified sentence vector cluster set to obtain the target text topic corresponding to the specified sentence vector cluster set, where the specified sentence vector cluster set is any one of the sentence vector cluster sets.
  • this embodiment generates sentence vectors for the target texts to extract the semantic information they contain, so that in subsequent clustering, sentence vectors with the same semantic information are clustered into the same cluster set, effectively improving the semantic quality of the cluster sets; the TF-IDF algorithm is used to extract text topics from the target texts corresponding to each semantically coherent cluster set, combining statistical methods with methods based on semantic information, which improves generalization and the accuracy of the identified text topics.
  • the above step of acquiring the target text set includes: acquiring a plurality of novel introduction texts; performing data cleaning on each novel introduction text to obtain the target text corresponding to each novel introduction text; and taking the target texts corresponding to the novel introduction texts as the target text set.
  • the step of generating a sentence vector for each target text in the target text set includes: inputting each target text in the target text set into a preset sentence vector generation model for sentence vector generation, where the sentence vector generation model is a model trained on the basis of the Bert model.
  • the above step of clustering the sentence vectors using the K-Means clustering algorithm and a preset number of clusters to obtain a plurality of sentence vector cluster sets includes: setting cluster centers equal in number to the number of clusters, and initializing each cluster center; calculating the vector distance between each sentence vector and each cluster center; assigning each sentence vector, according to the vector distances and the minimum-distance principle, to the initial cluster set corresponding to its nearest cluster center; calculating the vector mean of each initial cluster set to obtain the vector mean corresponding to each initial cluster set; taking a specified vector mean as the cluster center of the initial cluster set corresponding to the specified vector mean, where the specified vector mean is any one of the vector means; repeating the step of calculating the vector distance between each sentence vector and each cluster center until the cluster center corresponding to each initial cluster set no longer changes; and taking each initial cluster set as one sentence vector cluster set.
  • the step of calculating the vector distance between each sentence vector and each cluster center includes: using a cosine similarity algorithm to calculate the vector distance between each sentence vector and each cluster center.
  • the above step of clustering the sentence vectors using the K-Means clustering algorithm and a preset number of clusters to obtain a plurality of sentence vector cluster sets further includes: performing dimensionality reduction on each sentence vector using a preset dimensionality reduction algorithm; and clustering the dimension-reduced sentence vectors using the K-Means clustering algorithm and the number of clusters to obtain the plurality of sentence vector cluster sets.
  • the step of performing TF-IDF weight calculation and word extraction on the target texts corresponding to the specified sentence vector cluster set to obtain the target text topic corresponding to the specified sentence vector cluster set includes: merging the target texts corresponding to the specified sentence vector cluster set into one document to obtain a target document; performing word segmentation on the target document to obtain an initial word set; deduplicating the words in the initial word set to obtain a target word set; calculating a TF-IDF weight value for each word in the target word set using the TF-IDF algorithm and the initial word set; sorting the TF-IDF weight values in descending order to obtain a TF-IDF weight value set; taking, from the head of the TF-IDF weight value set, a number of TF-IDF weight values equal to a preset word count; and taking the words corresponding to the extracted TF-IDF weight values as the target text topic corresponding to the specified sentence vector cluster set.
  • Nonvolatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory can include random access memory (RAM) or external cache memory.
  • RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link (Synchlink) DRAM (SLDRAM), Rambus direct RAM (RDRAM), and direct Rambus dynamic RAM (DRDRAM).


Abstract

This application relates to the field of artificial intelligence, and discloses an artificial-intelligence-based text topic generation method, apparatus, device, and medium. The method includes: acquiring a target text set; generating a sentence vector for each target text in the target text set; clustering the sentence vectors using the K-Means clustering algorithm and a preset number of clusters to obtain a plurality of sentence vector cluster sets; and performing TF-IDF weight calculation and word extraction on the target texts corresponding to a specified sentence vector cluster set to obtain the target text topic corresponding to the specified sentence vector cluster set, where the specified sentence vector cluster set is any one of the sentence vector cluster sets. This combines statistical methods with methods based on semantic information, improving generalization and the accuracy of the determined text topics.

Description

Artificial-intelligence-based text topic generation method, apparatus, device, and medium
This application claims priority to the Chinese patent application filed with the China Patent Office on January 12, 2022, with application number 202210033713.2 and invention title "基于人工智能的文本主题生成方法、装置、设备及介质" (Artificial-intelligence-based text topic generation method, apparatus, device, and medium), the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of artificial intelligence, and in particular to an artificial-intelligence-based text topic generation method, apparatus, device, and medium.
Background
In everyday natural language processing tasks, text topic models have long been among the most widely used models in industry; they allow massive document collections to be categorized, facilitating day-to-day filtering, management, and use.
The text topic models most used in industry today are still the LDA (Latent Dirichlet Allocation) model and the PLSA (Probabilistic Latent Semantic Analysis) model, both of which are based on word frequency statistics. The inventor found that although both models are widely applied in industry, both require the number of topics to be set in advance and rely solely on statistical methods, failing to capture the semantic information contained in the text, which results in low accuracy of the text topics.
Technical Problem
The main purpose of this application is to provide an artificial-intelligence-based text topic generation method, apparatus, device, and medium, aiming to solve the technical problem that the LDA and PLSA models of the prior art rely solely on statistical methods and fail to capture the semantic information contained in the text, resulting in low accuracy of the text topics.
Technical Solution
To achieve the above purpose, this application proposes an artificial-intelligence-based text topic generation method, the method including:
acquiring a target text set;
generating a sentence vector for each target text in the target text set;
clustering the sentence vectors using the K-Means clustering algorithm and a preset number of clusters to obtain a plurality of sentence vector cluster sets; and
performing TF-IDF weight calculation and word extraction on the target texts corresponding to a specified sentence vector cluster set to obtain the target text topic corresponding to the specified sentence vector cluster set, where the specified sentence vector cluster set is any one of the sentence vector cluster sets.
This application also proposes an artificial-intelligence-based text topic generation apparatus, the apparatus including:
a data acquisition module for acquiring a target text set;
a sentence vector generation module for generating a sentence vector for each target text in the target text set;
a clustering module for clustering the sentence vectors using the K-Means clustering algorithm and a preset number of clusters to obtain a plurality of sentence vector cluster sets; and
a target text topic generation module for performing TF-IDF weight calculation and word extraction on the target texts corresponding to a specified sentence vector cluster set to obtain the target text topic corresponding to the specified sentence vector cluster set, where the specified sentence vector cluster set is any one of the sentence vector cluster sets.
This application also proposes a computer device, including a memory and a processor, the memory storing a computer program; when the processor executes the computer program, an artificial-intelligence-based text topic generation method is implemented, the method including: acquiring a target text set; generating a sentence vector for each target text in the target text set; clustering the sentence vectors using the K-Means clustering algorithm and a preset number of clusters to obtain a plurality of sentence vector cluster sets; and performing TF-IDF weight calculation and word extraction on the target texts corresponding to a specified sentence vector cluster set to obtain the target text topic corresponding to the specified sentence vector cluster set, where the specified sentence vector cluster set is any one of the sentence vector cluster sets.
This application also proposes a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, an artificial-intelligence-based text topic generation method is implemented, the method including: acquiring a target text set; generating a sentence vector for each target text in the target text set; clustering the sentence vectors using the K-Means clustering algorithm and a preset number of clusters to obtain a plurality of sentence vector cluster sets; and performing TF-IDF weight calculation and word extraction on the target texts corresponding to a specified sentence vector cluster set to obtain the target text topic corresponding to the specified sentence vector cluster set, where the specified sentence vector cluster set is any one of the sentence vector cluster sets.
Beneficial Effects
In the artificial-intelligence-based text topic generation method, apparatus, device, and medium of this application, the method acquires a target text set; generates a sentence vector for each target text in the target text set; clusters the sentence vectors using the K-Means clustering algorithm and a preset number of clusters to obtain a plurality of sentence vector cluster sets; and performs TF-IDF weight calculation and word extraction on the target texts corresponding to a specified sentence vector cluster set to obtain the target text topic corresponding to the specified sentence vector cluster set, where the specified sentence vector cluster set is any one of the sentence vector cluster sets. Generating sentence vectors for the target texts extracts the semantic information they contain, so that in subsequent clustering, sentence vectors with the same semantic information are clustered into the same cluster set, effectively improving the semantic quality of the cluster sets; the TF-IDF algorithm is used to extract text topics from the target texts corresponding to each semantically coherent cluster set, combining statistical methods with methods based on semantic information, which improves generalization and the accuracy of the determined text topics.
Brief Description of the Drawings
Fig. 1 is a schematic flowchart of an artificial-intelligence-based text topic generation method according to an embodiment of this application;
Fig. 2 is a schematic structural block diagram of an artificial-intelligence-based text topic generation apparatus according to an embodiment of this application;
Fig. 3 is a schematic structural block diagram of a computer device according to an embodiment of this application.
Best Mode for Carrying Out the Invention
Referring to Fig. 1, an embodiment of this application provides an artificial-intelligence-based text topic generation method, relating to the field of artificial intelligence, the method including:
S1: acquiring a target text set;
S2: generating a sentence vector for each target text in the target text set;
S3: clustering the sentence vectors using the K-Means clustering algorithm and a preset number of clusters to obtain a plurality of sentence vector cluster sets;
S4: performing TF-IDF weight calculation and word extraction on the target texts corresponding to a specified sentence vector cluster set to obtain the target text topic corresponding to the specified sentence vector cluster set, where the specified sentence vector cluster set is any one of the sentence vector cluster sets.
This embodiment generates sentence vectors for the target texts to extract the semantic information they contain, so that in subsequent clustering, sentence vectors with the same semantic information are clustered into the same cluster set, effectively improving the semantic quality of the cluster sets; the TF-IDF algorithm is used to extract text topics from the target texts corresponding to each semantically coherent cluster set, combining statistical methods with methods based on semantic information, which improves generalization and the accuracy of the determined text topics.
For S1, the target text set may be obtained from user input, from a database, or from a third-party application system.
The target text set includes one or more target texts. A target text is a text containing one or more sentences.
For S2, a sentence vector is generated for each target text in the target text set, so that the sentence vectors capture the semantic information contained in the target texts.
For S3, the K-Means clustering algorithm is used to cluster the sentence vectors into a number of cluster sets equal to the number of clusters, and each resulting cluster set is taken as a sentence vector cluster set. Because sentence vectors carry semantic information, sentence vectors with the same semantic information are clustered into the same cluster set, effectively improving the semantic quality of the cluster sets.
The number of clusters is an integer greater than 1.
The K-Means clustering algorithm is the K-means clustering algorithm.
For S4, the TF-IDF algorithm is used to calculate TF-IDF weight values for the target texts corresponding to the specified sentence vector cluster set; one or more TF-IDF weight values are extracted according to the calculated weight values, and the words corresponding to the extracted weight values are taken as the target text topic corresponding to the specified sentence vector cluster set. Text topics are thus extracted from the target texts of each semantically coherent cluster set using the TF-IDF algorithm, combining statistical methods with methods based on semantic information, improving generalization and the accuracy of the determined text topics.
TF-IDF (term frequency-inverse document frequency) is a common weighting technique for information retrieval and data mining, often used to mine keywords from articles.
It can be understood that the target text topic corresponding to the specified sentence vector cluster set is the text topic of the target texts corresponding to the specified sentence vector cluster set.
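The TF-IDF weight described above can be stated concretely. A minimal sketch follows; the smoothing constant in the IDF term is a common convention, not something specified by this application:

```python
import math

def tf_idf(word, document, corpus):
    """TF-IDF weight of `word` in `document` relative to `corpus`.

    `document` is a list of words; `corpus` is a list of such documents.
    TF is the relative frequency of the word in the document; IDF is
    log(N / (1 + df)) + 1, where df is the number of corpus documents
    containing the word (the +1 terms are a common smoothing convention).
    """
    tf = document.count(word) / len(document)
    df = sum(1 for doc in corpus if word in doc)
    idf = math.log(len(corpus) / (1 + df)) + 1
    return tf * idf

corpus = [["clustering", "of", "sentence", "vectors"],
          ["topic", "extraction", "with", "tf", "idf"],
          ["sentence", "topic", "models"]]
weight = tf_idf("sentence", corpus[0], corpus)  # 0.25 with this convention
```

A word that appears in few documents but often in one document receives a high weight, which is why the highest-weighted words of a cluster's merged document serve as its topic.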
In one embodiment, the above step of acquiring the target text set includes:
S11: acquiring a plurality of novel introduction texts;
S12: performing data cleaning on each novel introduction text to obtain the target text corresponding to each novel introduction text;
S13: taking the target texts corresponding to the novel introduction texts as the target text set.
In this embodiment, the novel introduction texts are data-cleaned and then used as target texts, so that the target text topics determined by this application can be used for novel classification and novel recommendation; data cleaning reduces noise interference and improves the accuracy of the determined target text topics.
For S11, the novel introduction texts may be obtained from user input, from a database, or from a third-party application system.
A novel introduction text is the introduction text of a novel.
Optionally, a novel introduction text is a text longer than a preset character count.
Optionally, the preset character count is set to 1024.
For S12, novel introduction texts contain a large number of useless characters, such as book-title marks, repeated decorative punctuation, whitespace characters, and link characters; these useless characters affect the accuracy of the semantic information carried by the generated sentence vectors. Therefore, data cleaning is performed on each novel introduction text, and the cleaned novel introduction text is taken as the target text.
Specifically, a preset regular expression is used to delete useless characters from each novel introduction text, and each cleaned novel introduction text is taken as a target text, yielding noise-free texts.
For S13, the target texts corresponding to the novel introduction texts are taken as the target text set, so that the noise-free target texts form the target text set; extracting text topics from a noise-free target text set improves the accuracy of the determined text topics.
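The cleaning in S12 can be sketched with Python's standard `re` module. The particular patterns below (URLs, book-title marks, repeated punctuation, whitespace) are illustrative examples, since the application does not fix the regular expression:

```python
import re

def clean_intro(text: str) -> str:
    """Remove characters that add no semantic value to a novel introduction."""
    text = re.sub(r"https?://\S+", "", text)            # link characters
    text = re.sub(r"[《》【】]", "", text)                # book-title / bracket marks
    text = re.sub(r"([!?。！？~…\-])\1+", r"\1", text)   # collapse decorative repeats
    text = re.sub(r"\s+", " ", text)                    # collapse whitespace
    return text.strip()

cleaned = clean_intro("《三体》  is a sci-fi classic!!!   see https://example.com ")
# -> "三体 is a sci-fi classic! see"
```

In practice the pattern list would be tuned to the corpus; the point is that the cleaned text, not the raw introduction, is what feeds the sentence vector model.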
In one embodiment, the above step of generating a sentence vector for each target text in the target text set includes:
S21: inputting each target text in the target text set into a preset sentence vector generation model for sentence vector generation, where the sentence vector generation model is a model trained on the basis of the Bert model.
This embodiment generates sentence vectors with a model trained on the basis of the Bert model, which helps extract the semantic information contained in the target texts and further improves the accuracy of the determined text topics.
For S21, each target text in the target text set is input into the preset sentence vector generation model, and the sentence vector output by the model's encoding layer is obtained.
Optionally, a sentence vector generation model matching the text type of the target text set is obtained from a model library, and this model is used to generate the sentence vectors for the target text set. Using a sentence vector generation model of the same text type further improves the extraction of the semantic information contained in the target texts.
Optionally, the Bert (Bidirectional Encoder Representations from Transformers) model is the Bert Base model.
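The encoding layer of a Bert-style model outputs one vector per token; a common way to obtain a single sentence vector (the application does not specify the pooling) is to average them. A sketch with toy token embeddings standing in for real encoder output:

```python
def mean_pool(token_embeddings):
    """Average token-level vectors into one sentence vector."""
    dim = len(token_embeddings[0])
    n = len(token_embeddings)
    return [sum(vec[i] for vec in token_embeddings) / n for i in range(dim)]

# Toy 4-dimensional "token embeddings" for a 3-token sentence;
# a real Bert Base encoder would emit 768-dimensional vectors.
tokens = [[1.0, 0.0, 2.0, 4.0],
          [3.0, 2.0, 0.0, 0.0],
          [2.0, 4.0, 1.0, 2.0]]
sentence_vector = mean_pool(tokens)  # one 4-dimensional vector
```

Other poolings (e.g. taking the [CLS] token's vector) are equally consistent with "the sentence vector output by the encoding layer"; mean pooling is shown only because it is simple and widely used.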
In one embodiment, the above step of clustering the sentence vectors using the K-Means clustering algorithm and a preset number of clusters to obtain a plurality of sentence vector cluster sets includes:
S311: setting cluster centers equal in number to the number of clusters, and initializing each cluster center;
S312: calculating the vector distance between each sentence vector and each cluster center;
S313: assigning each sentence vector, according to the vector distances and the minimum-distance principle, to the initial cluster set corresponding to its nearest cluster center;
S314: calculating the vector mean of each initial cluster set to obtain the vector mean corresponding to each initial cluster set;
S315: taking a specified vector mean as the cluster center of the initial cluster set corresponding to the specified vector mean, where the specified vector mean is any one of the vector means;
S316: repeating the step of calculating the vector distance between each sentence vector and each cluster center until the cluster center corresponding to each initial cluster set no longer changes;
S317: taking each initial cluster set as one sentence vector cluster set.
This embodiment clusters the semantically meaningful sentence vectors using the K-Means clustering algorithm and a preset number of clusters, so that the sentence vectors within each resulting sentence vector cluster set share the same semantic information.
For S311, cluster centers equal in number to the number of clusters are set; that is, the number of cluster centers equals the number of clusters.
The method of initializing each cluster center is not elaborated here.
For S312, the vector distance between each sentence vector and each cluster center is calculated; that is, the number of vector distances equals the product of the number of sentence vectors and the number of cluster centers.
For S313, any sentence vector is taken as the sentence vector to be processed; among the vector distances corresponding to that sentence vector, the smallest is taken as the target vector distance; the sentence vector to be processed is then assigned to the initial cluster set corresponding to the cluster center associated with the target vector distance.
For S314, the vector mean of the sentence vectors in each initial cluster set is calculated.
For S315, a specified vector mean is taken as the cluster center of the initial cluster set corresponding to the specified vector mean, thereby updating the cluster centers.
For S316, the step of calculating the vector distance between each sentence vector and each cluster center is repeated, that is, steps S312 to S316 are repeated, until the cluster center corresponding to each initial cluster set no longer changes. When the cluster centers no longer change, optimal clustering has been achieved.
For S317, each initial cluster set is taken as one sentence vector cluster set, yielding cluster sets whose members share the same semantic information.
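Steps S311-S317 can be sketched directly. This toy version uses squared Euclidean distance and first-points initialization for brevity; the application leaves the initialization open, and a later embodiment uses cosine similarity as the metric instead:

```python
def kmeans(vectors, k, max_iter=100):
    """Assign each vector to its nearest center, recompute centers as means,
    and repeat until the centers no longer change (steps S311-S317)."""
    centers = [list(v) for v in vectors[:k]]          # S311: simple initialization
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for v in vectors:                             # S312-S313: nearest center
            dists = [sum((a - b) ** 2 for a, b in zip(v, c)) for c in centers]
            clusters[dists.index(min(dists))].append(v)
        new_centers = [
            [sum(col) / len(cl) for col in zip(*cl)] if cl else centers[i]
            for i, cl in enumerate(clusters)          # S314-S315: means become centers
        ]
        if new_centers == centers:                    # S316: centers are stable
            break
        centers = new_centers
    return clusters                                   # S317: each cluster is one set

points = [[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]]
clusters = kmeans(points, k=2)
```

On this toy input the two well-separated groups end up in separate cluster sets; with sentence vectors, the same loop groups texts with the same semantic information.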
In one embodiment, the above step of calculating the vector distance between each sentence vector and each cluster center includes:
S3121: using a cosine similarity algorithm to calculate the vector distance between each sentence vector and each cluster center.
This embodiment uses the cosine similarity algorithm as the vector metric of the clustering algorithm, which measures the distance between sentence vectors well and improves clustering accuracy.
For S3121, the cosine similarity between each sentence vector and each cluster center is calculated, and the calculated cosine similarity is taken as the vector distance.
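The cosine similarity of S3121 can be sketched as follows. Note that cosine similarity grows with closeness, so when it stands in for a distance, the assignment of S313 selects the center with the largest similarity (equivalently, it minimizes 1 - similarity):

```python
import math

def cosine_similarity(u, v):
    """cos(u, v) = (u . v) / (|u| * |v|), in [-1, 1] for nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

same = cosine_similarity([1.0, 2.0], [2.0, 4.0])        # parallel vectors -> 1.0
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])  # orthogonal vectors -> 0.0
```

Because cosine similarity ignores vector length, it compares the direction of sentence vectors only, which is often what matters for semantic closeness.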
In one embodiment, the above step of clustering the sentence vectors using the K-Means clustering algorithm and a preset number of clusters to obtain a plurality of sentence vector cluster sets further includes:
S321: performing dimensionality reduction on each sentence vector using a preset dimensionality reduction algorithm;
S322: clustering the dimension-reduced sentence vectors using the K-Means clustering algorithm and the number of clusters to obtain the plurality of sentence vector cluster sets.
Because sentence vectors are high-dimensional, and high-dimensional vectors are usually sparse, clustering quality suffers. To solve this problem, this embodiment first reduces the dimensionality of each sentence vector and then clusters the dimension-reduced sentence vectors, reducing the sparsity of the sentence vectors through dimensionality reduction to improve clustering quality and further improve the accuracy of the determined text topics.
For S321, a preset dimensionality reduction algorithm is used to reduce the dimensionality of each sentence vector, lowering its sparsity.
Optionally, the UMAP algorithm (a manifold-learning dimensionality reduction algorithm) is used to reduce the dimensionality of each sentence vector.
For example, when a sentence vector is generated by a model trained on the basis of the Bert model, it has 768 dimensions; after dimensionality reduction with the UMAP algorithm, the sentence vector becomes a vector of far fewer than 768 dimensions.
For S322, clustering the dimension-reduced sentence vectors using the K-Means clustering algorithm and the number of clusters may follow the method of steps S311 to S316; that is, the sentence vectors in steps S311 to S316 are replaced with the dimension-reduced sentence vectors.
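UMAP itself is provided by the third-party `umap-learn` package. As a self-contained illustration of the reduction step only, the sketch below uses Gaussian random projection instead; this is not UMAP and preserves far less structure, but it shows the shape of the operation (768 in, a handful of dimensions out, with the sizes here being example values):

```python
import random

def random_projection(vectors, out_dim, seed=0):
    """Project vectors onto `out_dim` random Gaussian directions (a stand-in
    for a real manifold-learning reducer such as UMAP)."""
    rng = random.Random(seed)
    in_dim = len(vectors[0])
    directions = [[rng.gauss(0.0, 1.0) for _ in range(in_dim)]
                  for _ in range(out_dim)]
    return [[sum(a * b for a, b in zip(v, d)) for d in directions]
            for v in vectors]

# 768-dimensional toy "sentence vectors" reduced to 8 dimensions
high_dim = [[float(i % 7) for i in range(768)] for _ in range(3)]
low_dim = random_projection(high_dim, out_dim=8)
```

With `umap-learn` installed, the same step would typically be a call along the lines of `umap.UMAP(n_components=8).fit_transform(vectors)` per that package's documented API.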
In one embodiment, the above step of performing TF-IDF weight calculation and word extraction on the target texts corresponding to the specified sentence vector cluster set to obtain the target text topic corresponding to the specified sentence vector cluster set includes:
S41: merging the target texts corresponding to the specified sentence vector cluster set into one document to obtain a target document;
S42: performing word segmentation on the target document to obtain an initial word set;
S43: deduplicating the words in the initial word set to obtain a target word set;
S44: calculating a TF-IDF weight value for each word in the target word set using the TF-IDF algorithm and the initial word set;
S45: sorting the TF-IDF weight values in descending order to obtain a TF-IDF weight value set;
S46: taking, from the head of the TF-IDF weight value set, a number of TF-IDF weight values equal to a preset word count to obtain an extracted TF-IDF weight value set;
S47: taking the words corresponding to the extracted TF-IDF weight value set as the target text topic corresponding to the specified sentence vector cluster set.
In this embodiment, the target texts corresponding to the specified sentence vector cluster set are merged into one document; the TF-IDF algorithm is then used to calculate TF-IDF weight values for the words in that document; finally, words are extracted according to the calculated TF-IDF weight values as the target text topic corresponding to the specified sentence vector cluster set. Text topics are thus extracted from the target texts of each semantically coherent cluster set using the TF-IDF algorithm, combining statistical methods with methods based on semantic information, improving generalization and the accuracy of the determined text topics.
For S41, the target texts corresponding to the specified sentence vector cluster set are merged into one document, and the merged document is taken as the target document.
For S42, the target document is segmented into words, and the resulting words form the initial word set.
For S43, the initial word set is deduplicated, and the deduplicated set is taken as the target word set; that is, the words in the target word set are unique.
For S44, the method of calculating a TF-IDF weight value for each word in the target word set using the TF-IDF algorithm and the initial word set is not elaborated here.
For S45, the TF-IDF weight values are sorted in descending order, and the sorted values form the TF-IDF weight value set.
For S46, extraction starts from the head of the TF-IDF weight value set, i.e., from the highest TF-IDF weight value, and a number of TF-IDF weight values equal to the preset word count is extracted; the extracted values form the extracted TF-IDF weight value set.
For S47, the words corresponding to the TF-IDF weight values in the extracted set are taken as the target text topic corresponding to the specified sentence vector cluster set, thereby extracting text topics from the target texts of each semantically coherent cluster set using the TF-IDF algorithm.
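Steps S41-S47 can be combined into one sketch. The whitespace tokenizer stands in for a real word segmenter (Chinese text would need a segmentation tool), and treating each original target text as one document for the IDF statistic is one reasonable reading of the embodiment, not the only one:

```python
import math

def cluster_topic(target_texts, num_words):
    """S41-S47: merge a cluster's texts, weight words by TF-IDF, return top words."""
    merged = " ".join(target_texts)                      # S41: target document
    initial_words = merged.split()                       # S42: word segmentation
    target_words = list(dict.fromkeys(initial_words))    # S43: deduplicate, keep order
    n_docs = len(target_texts)
    weights = {}
    for word in target_words:                            # S44: TF-IDF per unique word
        tf = initial_words.count(word) / len(initial_words)
        df = sum(1 for text in target_texts if word in text.split())
        weights[word] = tf * (math.log(n_docs / (1 + df)) + 1)
    ranked = sorted(weights, key=weights.get, reverse=True)  # S45: descending order
    return ranked[:num_words]                            # S46-S47: top words as topic

texts = ["dragon sword quest", "dragon magic quest", "dragon sword duel"]
topic = cluster_topic(texts, num_words=2)  # -> ['dragon', 'sword']
```

The `num_words` argument plays the role of the preset word count of S46; the returned words are the cluster's target text topic.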
Referring to Fig. 2, this application also proposes an artificial-intelligence-based text topic generation apparatus, the apparatus including:
a data acquisition module 100 for acquiring a target text set;
a sentence vector generation module 200 for generating a sentence vector for each target text in the target text set;
a clustering module 300 for clustering the sentence vectors using the K-Means clustering algorithm and a preset number of clusters to obtain a plurality of sentence vector cluster sets; and
a target text topic generation module 400 for performing TF-IDF weight calculation and word extraction on the target texts corresponding to a specified sentence vector cluster set to obtain the target text topic corresponding to the specified sentence vector cluster set, where the specified sentence vector cluster set is any one of the sentence vector cluster sets.
This embodiment generates sentence vectors for the target texts to extract the semantic information they contain, so that in subsequent clustering, sentence vectors with the same semantic information are clustered into the same cluster set, effectively improving the semantic quality of the cluster sets; the TF-IDF algorithm is used to extract text topics from the target texts corresponding to each semantically coherent cluster set, combining statistical methods with methods based on semantic information, which improves generalization and the accuracy of the determined text topics.
In one embodiment, the data acquisition module 100 includes: a novel introduction text acquisition submodule, a data cleaning submodule, and a target text set determination submodule;
the novel introduction text acquisition submodule is used for acquiring a plurality of novel introduction texts;
the data cleaning submodule is used for performing data cleaning on each novel introduction text to obtain the target text corresponding to each novel introduction text;
the target text set determination submodule is used for taking the target texts corresponding to the novel introduction texts as the target text set.
In one embodiment, the sentence vector generation module 200 includes: a sentence vector determination submodule;
the sentence vector determination submodule is used for inputting each target text in the target text set into a preset sentence vector generation model for sentence vector generation, where the sentence vector generation model is a model trained on the basis of the Bert model.
In one embodiment, the clustering module 300 includes: a cluster center setting submodule, a vector distance calculation submodule, an initial cluster set generation submodule, a vector mean calculation submodule, a cluster center updating submodule, a loop control submodule, and a sentence vector cluster set determination submodule;
the cluster center setting submodule is used for setting cluster centers equal in number to the number of clusters and initializing each cluster center;
the vector distance calculation submodule is used for calculating the vector distance between each sentence vector and each cluster center;
the initial cluster set generation submodule is used for assigning each sentence vector, according to the vector distances and the minimum-distance principle, to the initial cluster set corresponding to its nearest cluster center;
the vector mean calculation submodule is used for calculating the vector mean of each initial cluster set to obtain the vector mean corresponding to each initial cluster set;
the cluster center updating submodule is used for taking a specified vector mean as the cluster center of the initial cluster set corresponding to the specified vector mean, where the specified vector mean is any one of the vector means;
the loop control submodule is used for repeating the step of calculating the vector distance between each sentence vector and each cluster center until the cluster center corresponding to each initial cluster set no longer changes;
the sentence vector cluster set determination submodule is used for taking each initial cluster set as one sentence vector cluster set.
In one embodiment, the vector distance calculation submodule includes: a vector distance calculation unit;
the vector distance calculation unit is used for calculating, with a cosine similarity algorithm, the vector distance between each sentence vector and each cluster center.
In one embodiment, the clustering module 300 further includes: a dimensionality reduction submodule and a clustering submodule;
the dimensionality reduction submodule is used for performing dimensionality reduction on each sentence vector using a preset dimensionality reduction algorithm;
the clustering submodule is used for clustering the dimension-reduced sentence vectors using the K-Means clustering algorithm and the number of clusters to obtain the plurality of sentence vector cluster sets.
In one embodiment, the target text topic generation module 400 includes: a target document determination submodule, an initial word set determination submodule, a target word set determination submodule, a TF-IDF weight calculation submodule, a descending sort submodule, a TF-IDF weight value set determination submodule, and a target text topic determination submodule;
the target document determination submodule is used for merging the target texts corresponding to the specified sentence vector cluster set into one document to obtain a target document;
the initial word set determination submodule is used for performing word segmentation on the target document to obtain an initial word set;
the target word set determination submodule is used for deduplicating the words in the initial word set to obtain a target word set;
the TF-IDF weight calculation submodule is used for calculating a TF-IDF weight value for each word in the target word set using the TF-IDF algorithm and the initial word set;
the descending sort submodule is used for sorting the TF-IDF weight values in descending order to obtain a TF-IDF weight value set;
the TF-IDF weight value set determination submodule is used for taking, from the head of the TF-IDF weight value set, a number of TF-IDF weight values equal to a preset word count to obtain an extracted TF-IDF weight value set;
the target text topic determination submodule is used for taking the words corresponding to the extracted TF-IDF weight value set as the target text topic corresponding to the specified sentence vector cluster set.
Referring to Fig. 3, an embodiment of this application also provides a computer device, which may be a server whose internal structure may be as shown in Fig. 3. The computer device includes a processor, a memory, a network interface, and a database connected via a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for running the operating system and the computer program in the non-volatile storage medium. The database of the computer device stores data used by the artificial-intelligence-based text topic generation method. The network interface of the computer device communicates with external terminals via a network connection. When executed by the processor, the computer program implements an artificial-intelligence-based text topic generation method. The artificial-intelligence-based text topic generation method includes: acquiring a target text set; generating a sentence vector for each target text in the target text set; clustering the sentence vectors using the K-Means clustering algorithm and a preset number of clusters to obtain a plurality of sentence vector cluster sets; and performing TF-IDF weight calculation and word extraction on the target texts corresponding to a specified sentence vector cluster set to obtain the target text topic corresponding to the specified sentence vector cluster set, where the specified sentence vector cluster set is any one of the sentence vector cluster sets.
This embodiment generates sentence vectors for the target texts to extract the semantic information they contain, so that in subsequent clustering, sentence vectors with the same semantic information are clustered into the same cluster set, effectively improving the semantic quality of the cluster sets; the TF-IDF algorithm is used to extract text topics from the target texts corresponding to each semantically coherent cluster set, combining statistical methods with methods based on semantic information, which improves generalization and the accuracy of the determined text topics.
In one embodiment, the above step of acquiring the target text set includes: acquiring a plurality of novel introduction texts; performing data cleaning on each novel introduction text to obtain the target text corresponding to each novel introduction text; and taking the target texts corresponding to the novel introduction texts as the target text set.
In one embodiment, the above step of generating a sentence vector for each target text in the target text set includes: inputting each target text in the target text set into a preset sentence vector generation model for sentence vector generation, where the sentence vector generation model is a model trained on the basis of the Bert model.
In one embodiment, the above step of clustering the sentence vectors using the K-Means clustering algorithm and a preset number of clusters to obtain a plurality of sentence vector cluster sets includes: setting cluster centers equal in number to the number of clusters, and initializing each cluster center; calculating the vector distance between each sentence vector and each cluster center; assigning each sentence vector, according to the vector distances and the minimum-distance principle, to the initial cluster set corresponding to its nearest cluster center; calculating the vector mean of each initial cluster set to obtain the vector mean corresponding to each initial cluster set; taking a specified vector mean as the cluster center of the initial cluster set corresponding to the specified vector mean, where the specified vector mean is any one of the vector means; repeating the step of calculating the vector distance between each sentence vector and each cluster center until the cluster center corresponding to each initial cluster set no longer changes; and taking each initial cluster set as one sentence vector cluster set.
In one embodiment, the above step of calculating the vector distance between each sentence vector and each cluster center includes: using a cosine similarity algorithm to calculate the vector distance between each sentence vector and each cluster center.
In one embodiment, the above step of clustering the sentence vectors using the K-Means clustering algorithm and a preset number of clusters to obtain a plurality of sentence vector cluster sets further includes: performing dimensionality reduction on each sentence vector using a preset dimensionality reduction algorithm; and clustering the dimension-reduced sentence vectors using the K-Means clustering algorithm and the number of clusters to obtain the plurality of sentence vector cluster sets.
In one embodiment, the above step of performing TF-IDF weight calculation and word extraction on the target texts corresponding to the specified sentence vector cluster set to obtain the target text topic corresponding to the specified sentence vector cluster set includes: merging the target texts corresponding to the specified sentence vector cluster set into one document to obtain a target document; performing word segmentation on the target document to obtain an initial word set; deduplicating the words in the initial word set to obtain a target word set; calculating a TF-IDF weight value for each word in the target word set using the TF-IDF algorithm and the initial word set; sorting the TF-IDF weight values in descending order to obtain a TF-IDF weight value set; taking, from the head of the TF-IDF weight value set, a number of TF-IDF weight values equal to a preset word count to obtain an extracted TF-IDF weight value set; and taking the words corresponding to the extracted TF-IDF weight value set as the target text topic corresponding to the specified sentence vector cluster set.
An embodiment of this application also provides a computer-readable storage medium, where the storage medium is a volatile or non-volatile storage medium on which a computer program is stored; when the computer program is executed by a processor, an artificial-intelligence-based text topic generation method is implemented, including the steps of: acquiring a target text set; generating a sentence vector for each target text in the target text set; clustering the sentence vectors using the K-Means clustering algorithm and a preset number of clusters to obtain a plurality of sentence vector cluster sets; and performing TF-IDF weight calculation and word extraction on the target texts corresponding to a specified sentence vector cluster set to obtain the target text topic corresponding to the specified sentence vector cluster set, where the specified sentence vector cluster set is any one of the sentence vector cluster sets.
In the artificial-intelligence-based text topic generation method executed above, this embodiment generates sentence vectors for the target texts to extract the semantic information they contain, so that in subsequent clustering, sentence vectors with the same semantic information are clustered into the same cluster set, effectively improving the semantic quality of the cluster sets; the TF-IDF algorithm is used to extract text topics from the target texts corresponding to each semantically coherent cluster set, combining statistical methods with methods based on semantic information, which improves generalization and the accuracy of the determined text topics.
In one embodiment, the above step of acquiring the target text set includes: acquiring a plurality of novel introduction texts; performing data cleaning on each novel introduction text to obtain the target text corresponding to each novel introduction text; and taking the target texts corresponding to the novel introduction texts as the target text set.
In one embodiment, the above step of generating a sentence vector for each target text in the target text set includes: inputting each target text in the target text set into a preset sentence vector generation model for sentence vector generation, where the sentence vector generation model is a model trained on the basis of the Bert model.
In one embodiment, the above step of clustering the sentence vectors using the K-Means clustering algorithm and a preset number of clusters to obtain a plurality of sentence vector cluster sets includes: setting cluster centers equal in number to the number of clusters, and initializing each cluster center; calculating the vector distance between each sentence vector and each cluster center; assigning each sentence vector, according to the vector distances and the minimum-distance principle, to the initial cluster set corresponding to its nearest cluster center; calculating the vector mean of each initial cluster set to obtain the vector mean corresponding to each initial cluster set; taking a specified vector mean as the cluster center of the initial cluster set corresponding to the specified vector mean, where the specified vector mean is any one of the vector means; repeating the step of calculating the vector distance between each sentence vector and each cluster center until the cluster center corresponding to each initial cluster set no longer changes; and taking each initial cluster set as one sentence vector cluster set.
In one embodiment, the above step of calculating the vector distance between each sentence vector and each cluster center includes: using a cosine similarity algorithm to calculate the vector distance between each sentence vector and each cluster center.
In one embodiment, the above step of clustering the sentence vectors using the K-Means clustering algorithm and a preset number of clusters to obtain a plurality of sentence vector cluster sets further includes: performing dimensionality reduction on each sentence vector using a preset dimensionality reduction algorithm; and clustering the dimension-reduced sentence vectors using the K-Means clustering algorithm and the number of clusters to obtain the plurality of sentence vector cluster sets.
In one embodiment, the above step of performing TF-IDF weight calculation and word extraction on the target texts corresponding to the specified sentence vector cluster set to obtain the target text topic corresponding to the specified sentence vector cluster set includes: merging the target texts corresponding to the specified sentence vector cluster set into one document to obtain a target document; performing word segmentation on the target document to obtain an initial word set; deduplicating the words in the initial word set to obtain a target word set; calculating a TF-IDF weight value for each word in the target word set using the TF-IDF algorithm and the initial word set; sorting the TF-IDF weight values in descending order to obtain a TF-IDF weight value set; taking, from the head of the TF-IDF weight value set, a number of TF-IDF weight values equal to a preset word count to obtain an extracted TF-IDF weight value set; and taking the words corresponding to the extracted TF-IDF weight value set as the target text topic corresponding to the specified sentence vector cluster set.
Those of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments can be completed by instructing the relevant hardware through a computer program, which can be stored in a non-volatile computer-readable storage medium; when executed, the computer program may include the processes of the embodiments of the above methods. Any reference to memory, storage, a database, or other media provided in this application and used in the embodiments may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link (Synchlink) DRAM (SLDRAM), Rambus direct RAM (RDRAM), and direct Rambus dynamic RAM (DRDRAM).

Claims (19)

  1. An artificial-intelligence-based text topic generation method, wherein the method comprises:
    acquiring a target text set;
    generating a sentence vector for each target text in the target text set;
    clustering the sentence vectors using the K-Means clustering algorithm and a preset number of clusters to obtain a plurality of sentence vector cluster sets; and
    performing TF-IDF weight calculation and word extraction on the target texts corresponding to a specified sentence vector cluster set to obtain a target text topic corresponding to the specified sentence vector cluster set, wherein the specified sentence vector cluster set is any one of the sentence vector cluster sets.
  2. The artificial-intelligence-based text topic generation method according to claim 1, wherein the step of acquiring the target text set comprises:
    acquiring a plurality of novel introduction texts;
    performing data cleaning on each novel introduction text to obtain the target text corresponding to each novel introduction text; and
    taking the target texts corresponding to the novel introduction texts as the target text set.
  3. The artificial-intelligence-based text topic generation method according to claim 1, wherein the step of generating a sentence vector for each target text in the target text set comprises:
    inputting each target text in the target text set into a preset sentence vector generation model for sentence vector generation, wherein the sentence vector generation model is a model trained on the basis of the Bert model.
  4. The artificial-intelligence-based text topic generation method according to claim 1, wherein the step of clustering the sentence vectors using the K-Means clustering algorithm and a preset number of clusters to obtain a plurality of sentence vector cluster sets comprises:
    setting cluster centers equal in number to the number of clusters, and initializing each cluster center;
    calculating the vector distance between each sentence vector and each cluster center;
    assigning each sentence vector, according to the vector distances and the minimum-distance principle, to the initial cluster set corresponding to its nearest cluster center;
    calculating the vector mean of each initial cluster set to obtain the vector mean corresponding to each initial cluster set;
    taking a specified vector mean as the cluster center of the initial cluster set corresponding to the specified vector mean, wherein the specified vector mean is any one of the vector means;
    repeating the step of calculating the vector distance between each sentence vector and each cluster center until the cluster center corresponding to each initial cluster set no longer changes; and
    taking each initial cluster set as one sentence vector cluster set.
  5. The artificial-intelligence-based text topic generation method according to claim 4, wherein the step of calculating the vector distance between each sentence vector and each cluster center comprises:
    using a cosine similarity algorithm to calculate the vector distance between each sentence vector and each cluster center.
  6. The artificial-intelligence-based text topic generation method according to claim 1, wherein the step of clustering the sentence vectors using the K-Means clustering algorithm and a preset number of clusters to obtain a plurality of sentence vector cluster sets further comprises:
    performing dimensionality reduction on each sentence vector using a preset dimensionality reduction algorithm; and
    clustering the dimension-reduced sentence vectors using the K-Means clustering algorithm and the number of clusters to obtain the plurality of sentence vector cluster sets.
  7. The artificial-intelligence-based text topic generation method according to claim 1, wherein the step of performing TF-IDF weight calculation and word extraction on the target texts corresponding to the specified sentence vector cluster set to obtain the target text topic corresponding to the specified sentence vector cluster set comprises:
    merging the target texts corresponding to the specified sentence vector cluster set into one document to obtain a target document;
    performing word segmentation on the target document to obtain an initial word set;
    deduplicating the words in the initial word set to obtain a target word set;
    calculating a TF-IDF weight value for each word in the target word set using the TF-IDF algorithm and the initial word set;
    sorting the TF-IDF weight values in descending order to obtain a TF-IDF weight value set;
    taking, from the head of the TF-IDF weight value set, a number of TF-IDF weight values equal to a preset word count to obtain an extracted TF-IDF weight value set; and
    taking the words corresponding to the extracted TF-IDF weight value set as the target text topic corresponding to the specified sentence vector cluster set.
  8. An artificial-intelligence-based text topic generation apparatus, wherein the apparatus comprises:
    a data acquisition module for acquiring a target text set;
    a sentence vector generation module for generating a sentence vector for each target text in the target text set;
    a clustering module for clustering the sentence vectors using the K-Means clustering algorithm and a preset number of clusters to obtain a plurality of sentence vector cluster sets; and
    a target text topic generation module for performing TF-IDF weight calculation and word extraction on the target texts corresponding to a specified sentence vector cluster set to obtain a target text topic corresponding to the specified sentence vector cluster set, wherein the specified sentence vector cluster set is any one of the sentence vector cluster sets.
  9. A computer device, comprising a memory and a processor, the memory storing a computer program, wherein the processor, when executing the computer program, implements an artificial-intelligence-based text topic generation method; wherein the artificial-intelligence-based text topic generation method comprises:
    acquiring a target text set;
    generating a sentence vector for each target text in the target text set;
    clustering the sentence vectors using the K-Means clustering algorithm and a preset number of clusters to obtain a plurality of sentence vector cluster sets; and
    performing TF-IDF weight calculation and word extraction on the target texts corresponding to a specified sentence vector cluster set to obtain a target text topic corresponding to the specified sentence vector cluster set, wherein the specified sentence vector cluster set is any one of the sentence vector cluster sets.
  10. The computer device according to claim 9, wherein the step of acquiring the target text set comprises:
    acquiring a plurality of novel introduction texts;
    performing data cleaning on each novel introduction text to obtain the target text corresponding to each novel introduction text; and
    taking the target texts corresponding to the novel introduction texts as the target text set.
  11. The computer device according to claim 9, wherein the step of generating a sentence vector for each target text in the target text set comprises:
    inputting each target text in the target text set into a preset sentence vector generation model for sentence vector generation, wherein the sentence vector generation model is a model trained on the basis of the Bert model.
  12. The computer device according to claim 9, wherein the step of clustering the sentence vectors using the K-Means clustering algorithm and a preset number of clusters to obtain a plurality of sentence vector cluster sets comprises:
    setting cluster centers equal in number to the number of clusters, and initializing each cluster center;
    calculating the vector distance between each sentence vector and each cluster center;
    assigning each sentence vector, according to the vector distances and the minimum-distance principle, to the initial cluster set corresponding to its nearest cluster center;
    calculating the vector mean of each initial cluster set to obtain the vector mean corresponding to each initial cluster set;
    taking a specified vector mean as the cluster center of the initial cluster set corresponding to the specified vector mean, wherein the specified vector mean is any one of the vector means;
    repeating the step of calculating the vector distance between each sentence vector and each cluster center until the cluster center corresponding to each initial cluster set no longer changes; and
    taking each initial cluster set as one sentence vector cluster set.
  13. The computer device according to claim 12, wherein the step of calculating the vector distance between each sentence vector and each cluster center comprises:
    using a cosine similarity algorithm to calculate the vector distance between each sentence vector and each cluster center.
  14. The computer device according to claim 9, wherein the step of clustering the sentence vectors using the K-Means clustering algorithm and a preset number of clusters to obtain a plurality of sentence vector cluster sets further comprises:
    performing dimensionality reduction on each sentence vector using a preset dimensionality reduction algorithm; and
    clustering the dimension-reduced sentence vectors using the K-Means clustering algorithm and the number of clusters to obtain the plurality of sentence vector cluster sets.
  15. The computer device according to claim 9, wherein the step of performing TF-IDF weight calculation and word extraction on the target texts corresponding to the specified sentence vector cluster set to obtain the target text topic corresponding to the specified sentence vector cluster set comprises:
    merging the target texts corresponding to the specified sentence vector cluster set into one document to obtain a target document;
    performing word segmentation on the target document to obtain an initial word set;
    deduplicating the words in the initial word set to obtain a target word set;
    calculating a TF-IDF weight value for each word in the target word set using the TF-IDF algorithm and the initial word set;
    sorting the TF-IDF weight values in descending order to obtain a TF-IDF weight value set;
    taking, from the head of the TF-IDF weight value set, a number of TF-IDF weight values equal to a preset word count to obtain an extracted TF-IDF weight value set; and
    taking the words corresponding to the extracted TF-IDF weight value set as the target text topic corresponding to the specified sentence vector cluster set.
  16. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements an artificial-intelligence-based text topic generation method; wherein the artificial-intelligence-based text topic generation method comprises:
    acquiring a target text set;
    generating a sentence vector for each target text in the target text set;
    clustering the sentence vectors using the K-Means clustering algorithm and a preset number of clusters to obtain a plurality of sentence vector cluster sets; and
    performing TF-IDF weight calculation and word extraction on the target texts corresponding to a specified sentence vector cluster set to obtain a target text topic corresponding to the specified sentence vector cluster set, wherein the specified sentence vector cluster set is any one of the sentence vector cluster sets.
  17. The computer-readable storage medium according to claim 16, wherein the step of acquiring the target text set comprises:
    acquiring a plurality of novel introduction texts;
    performing data cleaning on each novel introduction text to obtain the target text corresponding to each novel introduction text; and
    taking the target texts corresponding to the novel introduction texts as the target text set.
  18. The computer-readable storage medium according to claim 16, wherein the step of generating a sentence vector for each target text in the target text set comprises:
    inputting each target text in the target text set into a preset sentence vector generation model for sentence vector generation, wherein the sentence vector generation model is a model trained on the basis of the Bert model.
  19. The computer-readable storage medium according to claim 16, wherein the step of clustering the sentence vectors using the K-Means clustering algorithm and a preset number of clusters to obtain a plurality of sentence vector cluster sets comprises:
    setting cluster centers equal in number to the number of clusters, and initializing each cluster center;
    calculating the vector distance between each sentence vector and each cluster center;
    assigning each sentence vector, according to the vector distances and the minimum-distance principle, to the initial cluster set corresponding to its nearest cluster center;
    calculating the vector mean of each initial cluster set to obtain the vector mean corresponding to each initial cluster set;
    taking a specified vector mean as the cluster center of the initial cluster set corresponding to the specified vector mean, wherein the specified vector mean is any one of the vector means;
    repeating the step of calculating the vector distance between each sentence vector and each cluster center until the cluster center corresponding to each initial cluster set no longer changes; and
    taking each initial cluster set as one sentence vector cluster set.
  20. The computer-readable storage medium according to claim 19, wherein the step of calculating the vector distance between each sentence vector and each cluster center comprises:
    using a cosine similarity algorithm to calculate the vector distance between each sentence vector and each cluster center.
PCT/CN2022/090163 2022-01-12 2022-04-29 Artificial-intelligence-based text topic generation method, apparatus, device and medium WO2023134075A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210033713.2A CN114510923B (zh) 2022-01-12 2022-01-12 Artificial-intelligence-based text topic generation method, apparatus, device and medium
CN202210033713.2 2022-01-12

Publications (1)

Publication Number Publication Date
WO2023134075A1 true WO2023134075A1 (zh) 2023-07-20

Family

ID=81550709

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/090163 WO2023134075A1 (zh) 2022-01-12 2022-04-29 Artificial-intelligence-based text topic generation method, apparatus, device and medium

Country Status (2)

Country Link
CN (1) CN114510923B (zh)
WO (1) WO2023134075A1 (zh)


Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115658879A (zh) * 2022-12-29 2023-01-31 北京天际友盟信息技术有限公司 Automated threat intelligence text clustering method and system

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130041652A1 (en) * 2006-10-10 2013-02-14 Abbyy Infopoisk Llc Cross-language text clustering
CN110413986A (zh) * 2019-04-12 2019-11-05 上海晏鼠计算机技术股份有限公司 Text clustering multi-document automatic summarization method and system based on an improved word vector model
CN113239150A (zh) * 2021-05-17 2021-08-10 平安科技(深圳)有限公司 Text matching method, system, and device
CN113407679A (zh) * 2021-06-30 2021-09-17 竹间智能科技(上海)有限公司 Text topic mining method, apparatus, electronic device, and storage medium
CN113779246A (zh) * 2021-08-25 2021-12-10 华东计算技术研究所(中国电子科技集团公司第三十二研究所) Text clustering analysis method and system based on sentence vectors

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP7131130B2 (ja) * 2018-06-29 2022-09-06 富士通株式会社 Classification method, apparatus, and program
CN110347835B (zh) * 2019-07-11 2021-08-24 招商局金融科技有限公司 Text clustering method, electronic apparatus, and storage medium
CN111832289B (zh) * 2020-07-13 2023-08-11 重庆大学 Service discovery method based on clustering and Gaussian LDA


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117391071A (zh) * 2023-12-04 2024-01-12 中电科大数据研究院有限公司 News topic data mining method, apparatus, and storage medium
CN117391071B (zh) * 2023-12-04 2024-02-27 中电科大数据研究院有限公司 News topic data mining method, apparatus, and storage medium

Also Published As

Publication number Publication date
CN114510923A (zh) 2022-05-17
CN114510923B (zh) 2023-08-15


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22919716

Country of ref document: EP

Kind code of ref document: A1