CN113342980A - PPT text mining method and device, computer equipment and storage medium - Google Patents

PPT text mining method and device, computer equipment and storage medium

Info

Publication number
CN113342980A
CN113342980A
Authority
CN
China
Prior art keywords
sentence
ppt
sentences
article
characters
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110731612.8A
Other languages
Chinese (zh)
Inventor
马建 (Ma Jian)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Life Insurance Company of China Ltd
Original Assignee
Ping An Life Insurance Company of China Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Life Insurance Company of China Ltd filed Critical Ping An Life Insurance Company of China Ltd
Priority to CN202110731612.8A priority Critical patent/CN113342980A/en
Publication of CN113342980A publication Critical patent/CN113342980A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 - Clustering; Classification
    • G06F16/355 - Class or cluster creation or modification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/34 - Browsing; Visualisation therefor
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40 - Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/41 - Indexing; Data structures therefor; Storage structures
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/205 - Parsing
    • G06F40/211 - Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars

Abstract

The invention discloses a PPT text mining method in the technical field of artificial intelligence, which solves the technical problem that conventional PPT presentation methods cannot mine the displayed content of a PPT in a cold-start manner to broaden the user's knowledge perspective. The method comprises: identifying and arranging the characters contained in each PPT page, and dividing the arranged characters into sentences to obtain first sentences; converting the first sentences into first sentence vectors; dividing the articles stored in a database into sentences to obtain a plurality of second sentences; converting the second sentences into second sentence vectors; associating with the PPT the article containing the second sentence whose second sentence vector is most similar to a first sentence vector; clustering the articles stored in the database to obtain clusters; calculating the scores of the vocabularies in the articles contained in each cluster, and taking the vocabulary with the highest score as the label of the corresponding cluster; and displaying, in the PPT, the article associated with the PPT and the label of the cluster in which the associated article is located.

Description

PPT text mining method and device, computer equipment and storage medium
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a method and a device for PPT text mining, computer equipment and a storage medium.
Background
Text mining is a process of acquiring a text in which a user is interested from unstructured text information, and is mainly used for extracting unknown knowledge from an original unprocessed text, and main sources of extracted data are news documents, research papers, books, digital libraries, e-mails, Web pages and the like. As text information in electronic form has grown at a rapid pace, text mining has become a research hotspot in the information field.
In the current use scenario of text mining, the intention or the content of interest of a user is intelligently analyzed according to keywords input by the user in a text box, and then relevant documents are matched in a database for recommendation and display.
However, in some scenarios, for example while a PPT slideshow is being played, the PPT generally shows only content pre-authored by the user, so existing means cannot directly determine which questions viewers of the PPT may be concerned about, and therefore cannot recommend corresponding solutions. The present invention provides a method for cold-start mining of questions from the content displayed in a PPT, and recommends articles that answer the mined questions.
Disclosure of Invention
The embodiments of the invention provide a method, an apparatus, a computer device and a storage medium for PPT text mining, aiming to solve the technical problem that existing PPT presentation methods cannot mine the displayed content of a PPT in a cold-start manner to broaden the user's knowledge perspective.
A method of PPT text mining, the method comprising:
identifying characters contained in each PPT page, and arranging the characters according to the positions of the characters;
the characters after arrangement are divided into sentences to obtain at least one first sentence;
inputting the first sentences into a pre-trained Chinese pre-training model respectively to obtain a first sentence vector corresponding to each first sentence;
performing sentence segmentation on articles stored in a database to obtain a plurality of second sentences;
inputting the second sentences into the Chinese pre-training model respectively to obtain second sentence vectors corresponding to each second sentence;
associating with the PPT the article containing the second sentence whose second sentence vector is most similar to the first sentence vector;
clustering the articles stored in the database to obtain clusters;
calculating the scores of all vocabularies in the articles contained in each cluster, and taking the vocabulary with the highest score as the label of the corresponding cluster;
displaying, in the PPT, the article associated with the PPT and the label of the cluster in which the associated article is located.
An apparatus for PPT text mining, the apparatus comprising:
the recognition module is used for recognizing characters contained in each page of PPT and arranging the characters according to the positions where the characters appear;
the first sentence dividing module is used for dividing sentences of the arranged characters to obtain at least one first sentence;
the first input module is used for respectively inputting the first sentences into a pre-trained Chinese pre-training model to obtain first sentence vectors corresponding to each first sentence;
the second sentence dividing module is used for dividing sentences of the articles stored in the database to obtain a plurality of second sentences;
the second input module is used for respectively inputting the second sentences into the Chinese pre-training model to obtain second sentence vectors corresponding to each second sentence;
the association module is used for associating with the PPT the article containing the second sentence whose second sentence vector is most similar to the first sentence vector;
the clustering module is used for clustering the articles stored in the database to obtain clusters;
the calculation module is used for calculating the scores of all vocabularies in the articles contained in each cluster, and the vocabulary with the highest score is used as the label of the corresponding cluster;
and the display module is used for displaying the article associated with the PPT and the label of the cluster where the associated article is located in the PPT.
A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the above-described method of PPT text mining when executing the computer program.
A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the above-mentioned steps of the method of PPT text mining.
The invention provides a method, an apparatus, a computer device and a storage medium for PPT text mining. On one hand, the characters contained in each PPT page are identified and arranged according to the positions where they appear, the arranged characters are divided into sentences to obtain at least one first sentence, and the first sentences are input into a pre-trained Chinese pre-training model to obtain a first sentence vector corresponding to each first sentence. On the other hand, the articles stored in the database are divided into sentences to obtain a plurality of second sentences, and the second sentences are input into the Chinese pre-training model to obtain a second sentence vector corresponding to each second sentence. The article containing the second sentence whose second sentence vector is most similar to a first sentence vector is then associated with the PPT. The articles stored in the database are clustered to obtain clusters, the scores of the vocabularies in the articles of each cluster are calculated, the vocabulary with the highest score is taken as the label of the corresponding cluster, and finally the article associated with the PPT and the label of the cluster in which the associated article is located are displayed in the PPT. In this way, possible questions or knowledge points in the PPT are mined according to the content displayed in the PPT, corresponding labels and articles in the database are matched and displayed according to the mined questions, the user's intention and possible questions are intelligently analyzed in a cold-start manner, corresponding solutions are recommended, and the user's knowledge perspective during explanation is enriched.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments of the present invention will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to these drawings without inventive labor.
FIG. 1 is a diagram illustrating an application environment of a method for PPT text mining according to an embodiment of the present invention;
FIG. 2 is a flow chart of a method of PPT text mining in an embodiment of the present invention;
FIG. 3 is a flow chart of an implementation of training the Chinese pre-training model according to an embodiment of the present invention;
FIG. 4 is a flowchart illustrating an implementation of step S102 in FIG. 2 according to an embodiment of the present invention;
FIG. 5 is a schematic view of a sliding window in accordance with an embodiment of the present invention;
FIG. 6 is a schematic diagram of an apparatus for PPT text mining in accordance with an embodiment of the present invention;
FIG. 7 is a schematic diagram of a computer device according to an embodiment of the invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The method for PPT text mining provided by the application can be applied to the application environment as shown in FIG. 1, wherein the computer device can communicate with an external device, for example, an external server, through a network. Wherein the computer device may be, but is not limited to, various personal computers, laptops, smartphones, tablets, and portable wearable devices. The server may be implemented as a stand-alone server or as a server cluster consisting of a plurality of servers.
In an embodiment, as shown in fig. 2, a method for PPT text mining is provided, which is described by taking the method as an example for being applied to the computer device in fig. 1, and includes the following steps S101 to S109.
S101, identifying characters contained in each PPT page, and arranging the characters according to the positions where the characters appear.
In one embodiment, the text contained in each page of PPT may be recognized by Optical Character Recognition (OCR). OCR technology optically converts the characters in a paper document or picture into an image file of a black-and-white dot matrix, and converts the characters in the image into a text format through recognition software, so that the characters can be further edited and processed by word processing software.
It is understood that the characters are arranged according to the positions where they appear: for example, characters with the same word spacing and the same line spacing are arranged together, or characters whose spacing falls within a preset range are arranged together. In the arranging process, the arranged characters can be organized into sentences in left-to-right, top-to-bottom order.
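For illustration only, a minimal Python sketch of this recognition-and-arrangement step is given below. The patent does not name a specific OCR engine, so the use of pytesseract, the `chi_sim` language pack, and the `line_gap` grouping threshold are all assumptions rather than the disclosed implementation.

```python
# Minimal sketch (assumptions: pytesseract as the OCR engine, slides exported
# as images, and a fixed pixel threshold for grouping characters into lines).
import pytesseract
from PIL import Image

def arrange_slide_text(image_path: str, line_gap: int = 12) -> str:
    """Recognize the text on one slide image and arrange it left-to-right,
    top-to-bottom, grouping boxes whose tops differ by at most line_gap px."""
    data = pytesseract.image_to_data(
        Image.open(image_path), lang="chi_sim",
        output_type=pytesseract.Output.DICT)
    boxes = [(data["top"][i], data["left"][i], data["text"][i])
             for i in range(len(data["text"])) if data["text"][i].strip()]
    boxes.sort()                       # top first, then left
    lines, current, last_top = [], [], None
    for top, left, text in boxes:
        if last_top is not None and top - last_top > line_gap:
            lines.append("".join(t for _, t in sorted(current)))
            current = []
        current.append((left, text))
        last_top = top
    if current:
        lines.append("".join(t for _, t in sorted(current)))
    return "".join(lines)
```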
S102, the arranged characters are divided into sentences to obtain at least one first sentence.
In one embodiment, the arranged text may be divided into sentences by a sliding-window method. Fig. 4 is a flowchart illustrating a specific implementation of step S102 in fig. 2 according to an embodiment of the present invention; as shown in fig. 4, obtaining the first sentence specifically includes the following steps S401 and S402:
S401, obtaining a preset step length and a preset character length of a sliding window;
S402, taking the first arranged character as the starting character, determining the span of characters equal in number to the character length of the sliding window as a first sentence, and determining the remaining first sentences according to the step length and the character length.
Fig. 5 is a schematic diagram of a sliding window in an embodiment of the present invention, and in one embodiment, when the step size is 5 and the length of the sliding window is 8, a process of dividing the arranged text into sentences through the sliding window is shown in fig. 5.
Referring to fig. 5, it can be understood that when the step size is 5 and the length of the sliding window is 8, the first sentences obtained are ABCDEFGH, FGHIJKLM, KLMNOP…, respectively; the remaining first sentences are obtained in the same manner.
In other embodiments, the arranged text may also be divided by identifying punctuation marks in the text, and in particular, the text separated by periods or semicolons may be divided into the first sentence.
In one embodiment, the step length is less than or equal to the character length of the sliding window, so that adjacent windows overlap and no characters are skipped.
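For illustration only, a minimal Python sketch of this sliding-window segmentation follows; the default step length of 5 and window length of 8 match the example of fig. 5.

```python
def sliding_window_sentences(text: str, step: int = 5, window: int = 8):
    """Divide arranged text into first sentences with a sliding window."""
    sentences = []
    for start in range(0, len(text), step):
        sentences.append(text[start:start + window])
        if start + window >= len(text):   # the window has reached the end
            break
    return sentences

print(sliding_window_sentences("ABCDEFGHIJKLMNOP"))
# ['ABCDEFGH', 'FGHIJKLM', 'KLMNOP']  -- as in fig. 5
```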
S103, inputting the first sentences into a pre-trained Chinese pre-training model respectively to obtain first sentence vectors corresponding to each first sentence.
Fig. 3 is a flowchart illustrating an implementation of training the Chinese pre-training model in an embodiment of the present invention, where the Chinese pre-training model is a RoBERTa model, and the step of training the RoBERTa Chinese pre-training model includes the following steps S301 to S303:
S301, obtaining a sample group comprising a first sample sentence and a second sample sentence, wherein the second sample sentence comprises sentences stored in articles in a database, and the sample group carries a mark indicating whether the first sample sentence and the second sample sentence are similar;
S302, training the Chinese pre-training model through the sample group carrying the similarity mark;
S303, when the loss function of the Chinese pre-training model converges, obtaining the trained Chinese pre-training model.
In one embodiment, whether the first sample sentence and the second sample sentence are similar may be manually tagged. Further, similarity may be represented by "0" and dissimilarity may be represented by "1".
Further, the loss function of the RoBERTa chinese pre-training model may be a cross-entropy loss function.
In one embodiment, a fully-connected layer can be added to the RoBERTa Chinese pre-training model to reduce the feature data from 712 dimensions to 64 dimensions, speeding up subsequent processing.
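For illustration, a minimal sketch of sentence encoding with such a fully-connected reduction layer is given below. The checkpoint name `hfl/chinese-roberta-wwm-ext` and the use of the [CLS] hidden state as the sentence vector are assumptions not taken from the disclosure; the encoder's hidden size is read from its config rather than hard-coded, since the 712 figure above is as stated in the text.

```python
# Minimal sketch (assumptions: Hugging Face transformers, the
# hfl/chinese-roberta-wwm-ext checkpoint, and [CLS] pooling; the reduction
# to 64 dimensions follows the text above).
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("hfl/chinese-roberta-wwm-ext")
encoder = AutoModel.from_pretrained("hfl/chinese-roberta-wwm-ext")
reduce_fc = torch.nn.Linear(encoder.config.hidden_size, 64)

def sentence_vector(sentence: str) -> torch.Tensor:
    """Encode one sentence and reduce it to a 64-dimensional vector."""
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state[:, 0]  # [CLS] state
    return reduce_fc(hidden).squeeze(0)
```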
And S104, carrying out sentence segmentation on the articles stored in the database to obtain a plurality of second sentences.
It will be appreciated that the articles stored in the database cover various fields and various types of questions; the fields include, but are not limited to, finance, science, medicine, computing, microbiology, and so forth.
In one embodiment, the articles stored in the database may be divided into sentences by the sliding window method, so as to obtain the second sentence. The step of sentence splitting the articles stored in the database to obtain a plurality of second sentences specifically comprises:
acquiring the preset step length and the character length of a sliding window;
and taking the first character in the article as the starting character, determining the span of characters in the article equal in number to the character length of the sliding window as a second sentence, and determining the remaining second sentences in the article according to the step length and the character length.
When the step length is 5 and the character length of the sliding window is 8, an example of dividing the articles stored in the database into a plurality of second sentences is shown in fig. 5. In other embodiments, the text of an article may also be divided by identifying punctuation marks; specifically, text separated by periods or semicolons may be divided into second sentences.
S105, the second sentences are respectively input into the Chinese pre-training model, and second sentence vectors corresponding to the second sentences are obtained.
It will be appreciated that the second sentence vectors are produced by the RoBERTa Chinese pre-training model trained in the above-described steps. Through the Chinese pre-training model, each sentence in the articles stored in the database can be converted into a corresponding second sentence vector.
S106, associating with the PPT the article containing the second sentence whose second sentence vector is most similar to the first sentence vector.
Optionally, when querying a second sentence vector most similar to the first sentence vector, in order to increase the querying speed, the step of associating, with the PPT, an article in which a second sentence corresponding to the second sentence vector most similar to the first sentence vector is located includes:
taking all the second sentence vectors included in one article as a tree unit, taking the article title as the root node and the second sentence vectors as child nodes corresponding to the root node, and storing all the second sentence vectors into an Annoy tree index;
calculating the similarity between the first sentence vector and each child node in the Annoy tree;
and associating with the PPT the root node of the tree containing the child node with the highest similarity to the first sentence vector.
It will be appreciated that the articles associated with the PPT are the articles that need to be recalled, i.e., the articles most relevant to the content presented by the PPT.
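A minimal sketch of the per-article Annoy index described above follows; the `angular` metric (which is monotone in cosine similarity) and the tree count of 10 are assumptions, since the disclosure names only the Annoy tree index itself.

```python
# Minimal sketch of the Annoy tree index (assumptions: 64-d vectors from the
# reduction layer above, the "angular" metric, and 10 trees).
from annoy import AnnoyIndex

DIM = 64

def build_article_index(second_sentence_vectors):
    """Index one article: the article title acts as the root node and each
    second sentence vector becomes a child node with id i."""
    index = AnnoyIndex(DIM, "angular")
    for i, vector in enumerate(second_sentence_vectors):
        index.add_item(i, vector)
    index.build(10)            # more trees: better recall, slower build
    return index

def most_similar_child(index, first_sentence_vector):
    """Id of the child node most similar to the PPT sentence vector."""
    return index.get_nns_by_vector(first_sentence_vector, 1)[0]
```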
In one embodiment, the second sentence vector most similar to the first sentence vector is determined by cosine similarity. Specifically, the step of judging the second sentence vector most similar to the first sentence vector comprises the following steps:
respectively calculating the cosine similarity of each second sentence vector and the first sentence vector;
and taking the second sentence vector with the maximum value of the calculated cosine similarity as the second sentence vector which is most similar to the first sentence vector.
Specifically, the cosine similarity of the second sentence vector and the first sentence vector is calculated by the following formula (1):

cos(A, B) = (A · B) / (‖A‖ × ‖B‖)  (1)

wherein A represents the first sentence vector and B represents the second sentence vector.
It will be appreciated that the second sentence vectors include vectors corresponding to sentences belonging to different articles.
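Formula (1) can be checked with a few lines of numpy (a minimal sketch; the argmax selection over all second sentence vectors follows the steps above):

```python
import numpy as np

def cosine_similarity(a, b) -> float:
    """Formula (1): cos(A, B) = (A . B) / (||A|| * ||B||)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def most_similar(first_vector, second_vectors):
    """Index of the second sentence vector most similar to first_vector."""
    return max(range(len(second_vectors)),
               key=lambda i: cosine_similarity(first_vector, second_vectors[i]))
```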
And S107, clustering the articles stored in the database to obtain clusters.
In one embodiment, the articles stored in the database may be clustered by the DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm. DBSCAN is a representative density-based clustering algorithm: unlike partitioning and hierarchical clustering methods, it defines a cluster as the largest set of density-connected points, can partition regions of sufficiently high density into clusters, and can find clusters of arbitrary shape in a spatial database containing noise. By clustering the articles stored in the database, articles belonging to the same category are grouped together; each group is also called a cluster.
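For illustration, a minimal clustering sketch with scikit-learn is given below; the TF-IDF document representation and the `eps`/`min_samples` values are assumptions, since the disclosure names only the DBSCAN algorithm itself.

```python
# Minimal sketch (assumptions: scikit-learn, TF-IDF article vectors, and
# illustrative eps/min_samples values).
from sklearn.cluster import DBSCAN
from sklearn.feature_extraction.text import TfidfVectorizer

def cluster_articles(articles):
    """articles: list of article texts. Returns one cluster label per
    article; the label -1 marks noise points outside every cluster."""
    vectors = TfidfVectorizer().fit_transform(articles)
    return DBSCAN(eps=0.8, min_samples=2, metric="cosine").fit_predict(vectors)
```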
And S108, calculating the scores of all the vocabularies in the articles contained in all the clusters, and taking the vocabulary with the highest score as the label of the corresponding cluster.
In one embodiment, the scores of the words in the article can be calculated by a TF-IDF (Term Frequency-Inverse Document Frequency) algorithm.
When the articles stored in the database are clustered by the DBSCAN clustering algorithm, subject words are extracted from each cluster by the TF-IDF algorithm according to the clustering result: the articles belonging to the same cluster are concatenated into one document, and the frequency of each word in each cluster is then extracted and normalized, which serves as a regularized form of the word frequencies within the cluster.
Further, the score of each vocabulary in the articles contained in each cluster is calculated by the following formulas (2) to (4):

score = tf_ij × idf_i  (2)

tf_ij = n_ij / Σ_k n_kj  (3)

idf_i = log( N / (1 + Σ_d I(i, d)) )  (4)

wherein tf_ij represents the frequency of the i-th vocabulary in the j-th article, idf_i represents the inverse document frequency of the i-th vocabulary, n_ij represents the number of times vocabulary i appears in article j, Σ_k n_kj represents the total number of occurrences of all vocabularies k in article j, N represents the number of clusters, and I(i, d) indicates whether cluster d contains vocabulary i.
It can be understood that, when the word frequency of the inverse document of the ith vocabulary is calculated, the addition of 1 to the denominator represents the smoothing strategy, so that the calculated result is more accurate.
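A minimal sketch implementing formulas (2) to (4) is given below, treating each cluster's concatenated articles as one document as described above; the tokenization of each document into a word list is assumed to have been done already.

```python
# Minimal sketch of formulas (2)-(4): score = tf * idf, with the +1
# smoothing term in the idf denominator (assumption: each cluster's
# articles have already been concatenated and tokenized into one list).
import math
from collections import Counter

def cluster_labels(cluster_docs):
    """cluster_docs: one token list per cluster. Returns the highest-scoring
    vocabulary of each cluster, i.e. its label."""
    n_clusters = len(cluster_docs)
    df = Counter()                         # clusters containing each word
    for doc in cluster_docs:
        df.update(set(doc))
    labels = []
    for doc in cluster_docs:
        counts, total = Counter(doc), len(doc)
        score = lambda w: (counts[w] / total) * math.log(n_clusters / (1 + df[w]))
        labels.append(max(counts, key=score))
    return labels
```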
It can be understood that the subject-word label corresponding to each cluster can be calculated through this step, but the obtained subject-word labels are dispersed, and the hierarchical meaning within the topics is not fully mined.
In one embodiment, in order to make the obtained subject word labels more concentrated, after the step of using the vocabulary with the highest score as the label of the corresponding cluster, the method further comprises:
and merging the synonyms in the labels corresponding to the clusters.
Because the number of clusters obtained by clustering is limited, the labels corresponding to the clusters are also limited, and the synonyms in the labels corresponding to the clusters can be merged by manual judgment. Merging the synonyms in the labels of the clusters makes the obtained subject-word labels more concentrated.
The representation of the topics is updated by comparing the tf-idf vectors between topics, merging the most similar vectors, and finally recalculating the tf-idf vector values. The result is finally mapped back to the content of each PPT page to obtain the matched questions and corresponding labels.
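A minimal sketch of this merge step follows; the 0.9 similarity threshold and the single-pair merge loop are illustrative assumptions, not values from the disclosure.

```python
# Minimal sketch: merge the most similar pair of topic tf-idf vectors
# (assumption: an illustrative 0.9 cosine-similarity threshold).
import numpy as np

def merge_closest_topics(topics, threshold=0.9):
    """topics: dict {cluster_id: tf-idf vector}. Merges the most similar
    pair in place if their cosine similarity reaches the threshold."""
    ids = list(topics)
    best_sim, best_pair = -1.0, None
    for i in range(len(ids)):
        for j in range(i + 1, len(ids)):
            a, b = topics[ids[i]], topics[ids[j]]
            sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
            if sim > best_sim:
                best_sim, best_pair = sim, (ids[i], ids[j])
    if best_pair and best_sim >= threshold:
        keep, drop = best_pair
        topics[keep] = topics[keep] + topics.pop(drop)  # recompute tf-idf after
    return topics
```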
S109, displaying the article associated with the PPT and the label of the cluster where the associated article is located in the PPT.
It can be understood that the displayed labels are the knowledge points mined according to the PPT, and the displayed articles associated with the PPT are articles recalled according to the knowledge points of the PPT.
The method mines the content of the PPT offline, makes full use of a RoBERTa Chinese pre-training model whose word embeddings and sentence embeddings carry contextual information, performs question matching and clustering with the generated sentence vectors, and extracts subject words with tf-idf, thereby ensuring matching accuracy, obtaining the topic of each page of PPT content, providing high-quality labels for the material in the PPT, and solving the cold-start problem of PPT content recommendation.
The method for PPT text mining provided by this embodiment first identifies the characters contained in each page of the PPT and arranges them according to the positions where they appear, then divides the arranged characters into sentences to obtain at least one first sentence, and inputs the first sentences into a pre-trained Chinese pre-training model to obtain a first sentence vector corresponding to each first sentence. On the other hand, the articles stored in the database are divided into sentences to obtain a plurality of second sentences, and the second sentences are input into the Chinese pre-training model to obtain a second sentence vector corresponding to each second sentence. The article containing the second sentence whose second sentence vector is most similar to a first sentence vector is then associated with the PPT. The articles stored in the database are clustered to obtain clusters, the scores of the vocabularies in the articles of each cluster are calculated, and the vocabulary with the highest score is taken as the label of the corresponding cluster. Finally, the article associated with the PPT and the label of the cluster in which the associated article is located are displayed in the PPT. In this way, possible questions or knowledge points in the PPT are mined from the displayed content, corresponding labels and articles in the database are matched and displayed according to the mined questions, the user's intention and possible questions are intelligently analyzed in a cold-start manner, corresponding solutions are recommended, and the user's knowledge perspective during explanation is enriched.
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present invention.
In one embodiment, a PPT text mining device is provided, which corresponds to the PPT text mining method in the above embodiments one to one. As shown in fig. 6, the apparatus 100 for PPT text mining includes a recognition module 11, a first sentence segmentation module 12, a first input module 13, a second sentence segmentation module 14, a second input module 15, an association module 16, a clustering module 17, a calculation module 18, and a display module 19. The functional modules are explained in detail as follows:
and the identification module 11 is configured to identify characters included in each PPT page, and arrange the characters according to positions where the characters appear.
In one embodiment, the text contained in each PPT page can be recognized by Optical Character Recognition (OCR). The OCR technology adopts an optical mode to convert characters in a paper document or a picture into an image file with a black-white dot matrix, and converts the characters in the image into a text format through recognition software, so that the characters can be further edited and processed by character processing software.
It is understood that the characters are arranged according to the positions where they appear: for example, characters with the same word spacing and the same line spacing are arranged together, or characters whose spacing falls within a preset range are arranged together. In the arranging process, the arranged characters can be organized into sentences in left-to-right, top-to-bottom order.
The first sentence dividing module 12 is configured to divide the arranged text into at least one first sentence.
The first input module 13 is configured to input the first sentences into a pre-trained chinese pre-training model, respectively, to obtain first sentence vectors corresponding to each of the first sentences.
And the second sentence splitting module 14 is configured to split the articles stored in the database to obtain a plurality of second sentences.
It will be appreciated that the articles stored in the database cover various fields and various types of questions; the fields include, but are not limited to, finance, science, medicine, computing, microbiology, and so forth.
The second input module 15 is configured to input the second sentences into the chinese pre-training model respectively, so as to obtain second sentence vectors corresponding to each of the second sentences.
It will be appreciated that the second sentence vectors are produced by the RoBERTa Chinese pre-training model trained in the above-described steps. Through the Chinese pre-training model, each sentence in the articles stored in the database can be converted into a corresponding second sentence vector.
And the association module 16 is configured to associate with the PPT the article containing the second sentence whose second sentence vector is most similar to the first sentence vector.
And the clustering module 17 is configured to cluster the articles stored in the database to obtain clusters.
In one embodiment, the articles stored in the database may be clustered by the DBSCAN clustering algorithm. The DBSCAN clustering algorithm is a representative density-based clustering algorithm: unlike partitioning and hierarchical clustering methods, it defines a cluster as the largest set of density-connected points, can partition regions of sufficiently high density into clusters, and can find clusters of arbitrary shape in a spatial database containing noise.
And the calculating module 18 is used for calculating the scores of all the vocabularies in the articles contained in each cluster, and taking the vocabulary with the highest score as the label of the corresponding cluster.
In one embodiment, the scores of the words in the article may be calculated by the TF-IDF algorithm.
When the articles stored in the database are clustered by the DBSCAN clustering algorithm, subject words are extracted from each cluster by the TF-IDF algorithm according to the clustering result: the articles belonging to the same cluster are concatenated into one document, and the frequency of each word in each cluster is then extracted and normalized, which serves as a regularized form of the word frequencies within the cluster.
And the display module 19 is configured to display the article associated with the PPT and the label of the cluster where the associated article is located in the PPT.
It can be understood that the displayed labels are the knowledge points mined according to the PPT, and the displayed articles associated with the PPT are articles recalled according to the knowledge points of the PPT.
The apparatus mines the content of the PPT offline, makes full use of a RoBERTa Chinese pre-training model whose word embeddings and sentence embeddings carry contextual information, performs question matching and clustering with the generated sentence vectors, and extracts subject words with tf-idf, thereby ensuring matching accuracy, obtaining the topic of each page of PPT content, providing high-quality labels for the material in the PPT, and solving the cold-start problem of PPT content recommendation.
In one embodiment, the first sentence dividing module 12 specifically includes:
the first sliding window acquisition unit is used for acquiring the preset step length and the preset character length of the sliding window;
and the first sentence determining unit is used for taking the first arranged character as the starting character, determining the span of characters equal in number to the character length of the sliding window as a first sentence, and determining the remaining first sentences according to the step length and the character length.
Fig. 5 is a schematic diagram of a sliding window in an embodiment of the present invention, and in one embodiment, when the step size is 5 and the length of the sliding window is 8, a process of dividing the arranged text into sentences through the sliding window is shown in fig. 5.
In other embodiments, the arranged text may also be divided by identifying punctuation marks in the text, and in particular, the text separated by periods or semicolons may be divided into the first sentence.
In one embodiment, the step length is less than or equal to the character length of the sliding window, so that adjacent windows overlap and no characters are skipped.
In one embodiment, the apparatus 100 for PPT text mining further comprises:
a sample group acquisition module for acquiring a sample group including a first sample sentence and a second sample sentence, the second sample sentence including a sentence stored in an article in a database, the sample group carrying a flag indicating whether the first sample sentence and the second sample sentence are similar;
the training module is used for training the Chinese pre-training model through the sample group carrying the similar mark;
and the convergence module is used for obtaining the trained Chinese pre-training model when the loss function of the Chinese pre-training model converges.
In one embodiment, whether the first sample sentence and the second sample sentence are similar may be manually tagged. Further, similarity may be represented by "0" and dissimilarity may be represented by "1". The loss function of the RoBERTa Chinese pre-training model may be a cross-entropy loss function.
In one embodiment, a fully-connected layer can be added to the RoBERTa Chinese pre-training model to reduce the feature data from 712 dimensions to 64 dimensions, speeding up subsequent processing.
In one embodiment, the second sentence dividing module 14 specifically includes:
the second sliding window acquisition unit is used for acquiring the preset step length and the preset character length of the sliding window;
and the second sentence determining unit is used for taking the first character in the article as the starting character, determining the span of characters in the article equal in number to the character length of the sliding window as a second sentence, and determining the remaining second sentences in the article according to the step length and the character length.
When the step length is 5 and the character length of the sliding window is 8, an example of dividing the articles stored in the database into a plurality of second sentences is shown in fig. 5. In other embodiments, the text of an article may also be divided by identifying punctuation marks; specifically, text separated by periods or semicolons may be divided into second sentences.
In one embodiment, the second sentence vector most similar to the first sentence vector can be determined by cosine similarity, and the association module 16 specifically includes:
the similarity calculation unit is used for calculating the cosine similarity of each second sentence vector and the first sentence vector respectively;
and the maximum value determining unit is used for taking the second sentence vector with the maximum calculated cosine similarity as the second sentence vector which is most similar to the first sentence vector.
Specifically, the cosine similarity of the second sentence vector and the first sentence vector is calculated by the following formula (1):

cos(A, B) = (A · B) / (‖A‖ × ‖B‖)  (1)

wherein A represents the first sentence vector and B represents the second sentence vector.
It will be appreciated that the second sentence vectors include vectors corresponding to sentences belonging to different articles.
In one embodiment, the calculation module 18 is specifically configured to:
calculating the score of each word in the article contained in each cluster according to the following formula:
score = tf_ij × idf_i

tf_ij = n_ij / Σ_k n_kj

idf_i = log( N / (1 + Σ_d I(i, d)) )

wherein tf_ij represents the frequency of the i-th vocabulary in the j-th article, idf_i represents the inverse document frequency of the i-th vocabulary, n_ij represents the number of times vocabulary i appears in article j, Σ_k n_kj represents the total number of occurrences of all vocabularies k in article j, N represents the number of clusters, and I(i, d) indicates whether cluster d contains vocabulary i.
It can be understood that, when the word frequency of the inverse document of the ith vocabulary is calculated, the addition of 1 to the denominator represents the smoothing strategy, so that the calculated result is more accurate.
It can be understood that the subject-word label corresponding to each cluster can be calculated through this step, but the obtained subject-word labels are relatively dispersed and the hierarchical meaning within the topics is not sufficiently mined. In order to make the subject-word labels more concentrated, in one embodiment, the apparatus 100 for PPT text mining further includes:
and the merging module is used for merging the synonym words in the labels corresponding to the clusters.
Because the number of clusters obtained by clustering is limited, the labels corresponding to the clusters are also limited, and the synonyms in the labels corresponding to the clusters can be merged by manual judgment.
The representation of the topics is updated by comparing the tf-idf vectors between topics, merging the most similar vectors, and finally recalculating the tf-idf vector values. The result is finally mapped back to the content of each PPT page to obtain the matched questions and corresponding labels.
In one embodiment, optionally, when querying a second sentence vector that is most similar to the first sentence vector, in order to increase the querying speed, the associating module 16 specifically includes:
the index unit is used for taking all the second sentence vectors included in one article as a tree unit, taking the article title as the root node and the second sentence vectors as child nodes, and storing all the second sentence vectors into an Annoy tree index;
the node similarity calculation unit is used for calculating the similarity between the first sentence vector and each child node in the Annoy tree;
and the association unit is used for associating the root node of the tree where the child node with the highest similarity to the first sentence vector is located with the PPT.
It will be appreciated that the articles associated with the PPT are the articles that need to be recalled, i.e., the articles most relevant to the content presented by the PPT.
The PPT text mining apparatus 100 provided by this embodiment mines possible questions or knowledge points in the PPT from the displayed content in combination with the articles stored in the database, and matches and displays corresponding labels and articles from the database according to the mined questions, so that the user's intention and possible questions can be intelligently analyzed in a cold-start manner, corresponding solutions can be recommended, and the user's knowledge perspective while explaining the PPT is enriched.
Wherein the meaning of "first" and "second" in the above modules/units is only to distinguish different modules/units, and is not used to define which module/unit has higher priority or other defining meaning. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or modules is not necessarily limited to those steps or modules explicitly listed, but may include other steps or modules not explicitly listed or inherent to such process, method, article, or apparatus, and such that a division of modules presented in this application is merely a logical division and may be implemented in a practical application in a further manner.
Specific limitations on the apparatus for PPT text mining can be referred to the above limitations on the method for PPT text mining, and are not described herein again. All or part of the modules in the PPT text mining device can be realized by software, hardware and a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a terminal, and its internal structure diagram may be as shown in fig. 7. The computer device includes a processor, a memory, a network interface, a display screen, and an input device connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The network interface of the computer device is used for communicating with an external server through a network connection. The computer program is executed by a processor to implement a method of PPT text mining.
In one embodiment, a computer device is provided, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, and when the processor executes the computer program, the steps of the method for PPT text mining in the above-described embodiments are implemented, such as steps 101 to 109 shown in fig. 2 and other extensions of the method and related steps. Alternatively, the processor, when executing the computer program, implements the functions of the modules/units of the PPT text mining apparatus in the above embodiments, such as the functions of the modules 11 to 19 shown in fig. 6. To avoid repetition, further description is omitted here.
The Processor may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor; it is the control center of the computer device and connects the various parts of the overall computer device using various interfaces and lines.
The memory may be used to store the computer programs and/or modules, and the processor may implement various functions of the computer device by running or executing the computer programs and/or modules stored in the memory and invoking data stored in the memory. The memory may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data (such as audio data, video data, etc.) created according to the use of the cellular phone, etc.
The memory may be integrated in the processor or may be provided separately from the processor.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored which, when executed by a processor, implements the steps of the method of PPT text mining in the above-described embodiments, such as steps 101 to 109 shown in fig. 2 and other extensions of the method and related steps. Alternatively, the computer program, when executed by the processor, implements the functions of the modules/units of the PPT text mining apparatus in the above-described embodiments, such as the functions of the modules 11 to 19 shown in fig. 6. To avoid repetition, further description is omitted here.
According to the PPT text mining method, apparatus, computer device and storage medium provided above, possible questions or knowledge points in the PPT can be mined according to the content displayed in the PPT, corresponding labels and articles are matched and displayed from the database according to the mined questions, the user's intention and possible questions can be intelligently analyzed in a cold-start manner, corresponding solutions are recommended, and the user's knowledge perspective when explaining the PPT is enriched.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware; the computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), Rambus Direct RAM (RDRAM), Direct Rambus Dynamic RAM (DRDRAM), and Rambus Dynamic RAM (RDRAM).
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-mentioned functions.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present invention, and are intended to be included within the scope of the present invention.

Claims (10)

1. A method of PPT text mining, the method comprising:
identifying characters contained in each PPT page, and arranging the characters according to the positions of the characters;
the arranged characters are divided into sentences to obtain at least one first sentence;
inputting the first sentences into a pre-trained Chinese pre-training model respectively to obtain first sentence vectors corresponding to each first sentence;
performing sentence segmentation on articles stored in a database to obtain a plurality of second sentences;
inputting the second sentences into the Chinese pre-training model respectively to obtain second sentence vectors corresponding to each second sentence;
associating with the PPT the article containing the second sentence whose second sentence vector is most similar to the first sentence vector;
clustering the articles stored in the database to obtain clusters;
calculating the scores of all vocabularies in the articles contained in all the clusters, and taking the vocabulary with the highest score as the label of the corresponding cluster;
displaying, in the PPT, an article associated with the PPT and a label of a cluster in which the associated article is located.
2. The method of PPT text mining as recited in claim 1, wherein said step of dividing said arranged text into sentences to obtain at least one first sentence specifically comprises:
acquiring the preset step length and the character length of a sliding window;
and taking the first arranged character as the starting character, determining the span of characters equal in number to the character length of the sliding window as a first sentence, and determining the remaining first sentences according to the step length and the character length.
3. The method for PPT text mining according to claim 1, wherein the step of training the chinese pre-trained model specifically comprises:
obtaining a sample group comprising a first sample sentence and a second sample sentence, wherein the second sample sentence comprises sentences stored in articles in a database, and the sample group carries a mark indicating whether the first sample sentence and the second sample sentence are similar or not;
training the Chinese pre-training model through the sample group carrying the similar mark;
and when the loss function of the Chinese pre-training model is converged, obtaining the trained Chinese pre-training model.
4. The method of PPT text mining as recited in claim 1, wherein said step of sentence-splitting the articles stored in the database to obtain a plurality of second sentences further comprises:
acquiring the preset step length and the character length of a sliding window;
and taking the first character in the article as the starting character, determining the span of characters in the article equal in number to the character length of the sliding window as a second sentence, and determining the remaining second sentences in the article according to the step length and the character length.
5. The method for PPT text mining according to claim 1, wherein the step of determining the second sentence vector that is most similar to the first sentence vector specifically comprises:
respectively calculating the cosine similarity of each second sentence vector and the first sentence vector;
and taking the second sentence vector with the maximum value of the calculated cosine similarity as the second sentence vector which is most similar to the first sentence vector.
6. The method of PPT text mining according to any of claims 1-5, wherein the score of each vocabulary in the articles contained in each cluster is calculated by the following formula:
score = tf_ij × idf_i

tf_ij = n_ij / Σ_k n_kj

idf_i = log( N / (1 + Σ_d I(i, d)) )

wherein tf_ij represents the frequency of the i-th vocabulary in the j-th article, idf_i represents the inverse document frequency of the i-th vocabulary, n_ij represents the number of times vocabulary i appears in article j, Σ_k n_kj represents the total number of occurrences of all vocabularies k in article j, N represents the number of clusters, and I(i, d) indicates whether cluster d contains vocabulary i.
7. The method of PPT text mining as recited in claim 6, wherein after said step of taking the vocabulary with the highest score as the label of the corresponding cluster, said method further comprises:
and merging the synonyms in the labels corresponding to the clusters.
8. An apparatus for PPT text mining, the apparatus comprising:
the recognition module is used for recognizing characters contained in each page of PPT and arranging the characters according to the positions of the characters;
the first sentence dividing module is used for dividing sentences of the arranged characters to obtain at least one first sentence;
the first input module is used for respectively inputting the first sentences into a pre-trained Chinese pre-training model to obtain first sentence vectors corresponding to each first sentence;
the second sentence dividing module is used for dividing sentences of the articles stored in the database to obtain a plurality of second sentences;
the second input module is used for respectively inputting the second sentences into the Chinese pre-training model to obtain second sentence vectors corresponding to each second sentence;
the association module is used for associating with the PPT the article containing the second sentence whose second sentence vector is most similar to the first sentence vector;
the clustering module is used for clustering the articles stored in the database to obtain clusters;
the calculation module is used for calculating the scores of all vocabularies in the articles contained in all the clusters and taking the vocabularies with the highest scores as the labels of the corresponding clusters;
the display module is used for displaying the article associated with the PPT and the label of the cluster where the associated article is located in the PPT.
9. A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the steps of the method of PPT text mining according to any one of claims 1 to 7.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method of PPT text mining according to any one of claims 1 to 7.
CN202110731612.8A 2021-06-29 2021-06-29 PPT text mining method and device, computer equipment and storage medium Pending CN113342980A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110731612.8A CN113342980A (en) 2021-06-29 2021-06-29 PPT text mining method and device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110731612.8A CN113342980A (en) 2021-06-29 2021-06-29 PPT text mining method and device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113342980A true CN113342980A (en) 2021-09-03

Family

ID=77481543

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110731612.8A Pending CN113342980A (en) 2021-06-29 2021-06-29 PPT text mining method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113342980A (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108959329A (en) * 2017-05-27 2018-12-07 腾讯科技(北京)有限公司 A kind of file classification method, device, medium and equipment
WO2019218660A1 (en) * 2018-05-15 2019-11-21 北京三快在线科技有限公司 Article generation
CN111581162A (en) * 2020-05-06 2020-08-25 上海海事大学 Ontology-based clustering method for mass literature data
CN111753167A (en) * 2020-06-22 2020-10-09 北京百度网讯科技有限公司 Search processing method, search processing device, computer equipment and medium
CN112364068A (en) * 2021-01-14 2021-02-12 平安科技(深圳)有限公司 Course label generation method, device, equipment and medium

Similar Documents

Publication Publication Date Title
CN111753060B (en) Information retrieval method, apparatus, device and computer readable storage medium
US11403680B2 (en) Method, apparatus for evaluating review, device and storage medium
US20210157984A1 (en) Intelligent system that dynamically improves its knowledge and code-base for natural language understanding
CN108829893B (en) Method and device for determining video label, storage medium and terminal equipment
Sebastiani Classification of text, automatic
CN109685056B (en) Method and device for acquiring document information
CN112395506A (en) Information recommendation method and device, electronic equipment and storage medium
CN112215008B (en) Entity identification method, device, computer equipment and medium based on semantic understanding
US8874590B2 (en) Apparatus and method for supporting keyword input
CN113569135B (en) Recommendation method, device, computer equipment and storage medium based on user portrait
US11023503B2 (en) Suggesting text in an electronic document
US20210103622A1 (en) Information search method, device, apparatus and computer-readable medium
CN111324771B (en) Video tag determination method and device, electronic equipment and storage medium
CN111324713B (en) Automatic replying method and device for conversation, storage medium and computer equipment
WO2021139274A1 (en) Document classification method and apparatus based on deep learning model, and computer device
CN109634436B (en) Method, device, equipment and readable storage medium for associating input method
CN113343108B (en) Recommended information processing method, device, equipment and storage medium
CN112380866A (en) Text topic label generation method, terminal device and storage medium
CN106570196B (en) Video program searching method and device
CN111859950A (en) Method for automatically generating lecture notes
CN114842982B (en) Knowledge expression method, device and system for medical information system
Yao et al. A unified approach to researcher profiling
CN115712700A (en) Hot word extraction method, system, computer device and storage medium
CN113342980A (en) PPT text mining method and device, computer equipment and storage medium
CN113495964A (en) Method, device and equipment for screening triples and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination