WO2022143069A1 - Text clustering method and apparatus, electronic device, and storage medium


Info

Publication number
WO2022143069A1
Authority
WO
WIPO (PCT)
Prior art keywords
word
text data
target
frequency
piece
Application number
PCT/CN2021/136677
Other languages
English (en)
Chinese (zh)
Inventor
封江涛
陈家泽
周浩
李磊
Original Assignee
北京有竹居网络技术有限公司
Application filed by 北京有竹居网络技术有限公司
Publication of WO2022143069A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval of unstructured textual data
    • G06F 16/35 Clustering; Classification
    • G06F 16/33 Querying
    • G06F 16/3331 Query processing
    • G06F 16/334 Query execution
    • G06F 16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F 16/374 Thesaurus

Definitions

  • the embodiments of the present disclosure relate to the field of computer technology, for example, to a text clustering method, apparatus, electronic device, and storage medium.
  • Text clustering divides similar text data into the same cluster and distinguishes between different clusters of text; a cluster may also be called a "class cluster".
  • Clustering methods are applied in different fields such as networking, medicine, biology, computer vision, and natural language processing.
  • The text clustering method in the related art represents each text as a feature vector, calculates the similarity between texts from the feature vectors corresponding to the texts, and finally clusters the texts according to the similarities between them. It can be seen that the related-art method must first represent each text as a feature vector before the similarity between texts can be calculated from those vectors, which makes the calculation process of text clustering complicated and inefficient.
  • Embodiments of the present disclosure provide a text clustering method, apparatus, electronic device, and storage medium, which can effectively improve the efficiency and accuracy of text clustering.
  • an embodiment of the present disclosure provides a text clustering method, including:
  • the target text data set includes at least one piece of target text data
  • for each word sequence to be searched, a pre-built dictionary tree is searched for a target word sequence adapted to the word sequence to be searched; wherein the target word sequence is a subsequence of the word sequence to be searched;
  • the target text data corresponding to the at least one target word sequence is clustered according to the at least one target word sequence, respectively, to obtain a text clustering result.
  • an embodiment of the present disclosure further provides a text clustering apparatus, including:
  • a text data acquisition module configured to acquire a target text data set to be clustered; wherein, the target text data set includes at least one piece of target text data;
  • a search word sequence generation module configured to calculate the first importance score of at least one word in each piece of target text data for each piece of target text data in the target text data set, and based on the first importance score Sort at least one word in each piece of target text data, and generate a word sequence to be searched corresponding to each piece of target text data;
  • the target word sequence determination module is configured to search a pre-built dictionary tree for a target word sequence adapted to each to-be-searched word sequence for each to-be-searched word sequence; wherein, the target word sequence belongs to the a subsequence of each sequence of words to be searched;
  • the text clustering module is configured to cluster the target text data corresponding to the at least one target word sequence according to the at least one target word sequence, respectively, to obtain a text clustering result.
  • an embodiment of the present disclosure further provides an electronic device, the electronic device comprising:
  • at least one processing device;
  • a storage device configured to store at least one program
  • when the at least one program is executed by the at least one processing device, the at least one processing device implements the text clustering method according to the embodiments of the present disclosure.
  • an embodiment of the present disclosure further provides a computer-readable medium on which a computer program is stored, and when the program is executed by a processing apparatus, implements the text clustering method according to the embodiment of the present disclosure.
  • FIG. 1 is a flowchart of a text clustering method in an embodiment of the present disclosure
  • FIG. 2 is a schematic diagram of a dictionary tree in an embodiment of the present disclosure
  • FIG. 3 is a flowchart of another text clustering method in an embodiment of the present disclosure.
  • FIG. 4 is a flowchart of yet another text clustering method in an embodiment of the present disclosure.
  • FIG. 5 is a flowchart of yet another text clustering method in an embodiment of the present disclosure.
  • FIG. 6 is a schematic structural diagram of a text clustering apparatus in an embodiment of the present disclosure.
  • FIG. 7 is a schematic structural diagram of an electronic device in an embodiment of the present disclosure.
  • the term "including" and variations thereof are open-ended inclusions, i.e., "including but not limited to".
  • the term “based on” is “based at least in part on.”
  • the term “one embodiment” means “at least one embodiment”; the term “another embodiment” means “at least one additional embodiment”; the term “some embodiments” means “at least some embodiments”. Relevant definitions of other terms will be given in the description below.
  • FIG. 1 is a flowchart of a text clustering method provided by an embodiment of the present disclosure.
  • the embodiments of the present disclosure can be applied to text clustering scenarios. The method may be executed by a text clustering apparatus, which may be implemented in hardware and/or software and can generally be integrated into a device with a text clustering function, such as a server, a mobile terminal, or a server cluster.
  • the method includes the following steps:
  • Step 110 Obtain a target text data set to be clustered; wherein, the target text data set includes at least one piece of target text data.
  • the target text data set includes at least one piece of target text data, where the target text data may be various types of text data, such as news, advertising, network, natural-language, or medical text data.
  • the categories of at least one piece of target text data in the target text data set may be the same or different.
  • the target text data may be English text, Chinese text, or Korean text.
  • the target text data to be clustered can be collected through a web crawler technology, and the target text data can also be obtained through optical character recognition, speech recognition, handwriting recognition, and the like.
  • the target text data input by the user may be collected in real time, and the collected text data may be used as the text data to be clustered.
  • Step 120 For each piece of target text data in the target text data set, calculate the first importance score of at least one word in the target text data, sort at least one word of the target text data based on the first importance score, and generate a word sequence to be searched corresponding to the target text data.
  • word segmentation processing is performed on each piece of target text data in the target text data set, so as to divide each piece of target text data into at least one word.
  • word segmentation preprocessing may also be performed on each piece of target text data, such as removing punctuation and stop words. Then, the first importance score of at least one word in each piece of target text data is calculated; the first importance score reflects the importance of each word in the target text data: the larger the first importance score, the more important the word is in the target text data; conversely, the smaller the first importance score, the less important the word is in the target text data.
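To make the preprocessing concrete, here is a minimal Python sketch of the word segmentation step described above; the tokenizer and stop-word list are illustrative assumptions, not part of the disclosure:

```python
import re

STOP_WORDS = {"the", "a", "of", "and"}  # illustrative stop-word list

def preprocess(text):
    """Word segmentation preprocessing: strip punctuation, split the
    text into words, and drop stop words."""
    words = re.findall(r"\w+", text.lower())
    return [w for w in words if w not in STOP_WORDS]

print(preprocess("The middle of the market, and a quotient!"))
# -> ['middle', 'market', 'quotient']
```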
  • the number of occurrences of each word in the target text data may be counted, and the number of occurrences of the word in the target text data may be used as the first importance score.
  • the word frequency-inverse document frequency of a word in the target text data may be used as the first importance score of the word.
  • In an embodiment, calculating the first importance score of at least one word in the target text data includes: for each piece of target text data in the target text data set, respectively calculating the first word frequency-inverse document frequency of at least one word in the target text data; and respectively calculating the first importance score of at least one word in the target text data according to at least one first word frequency-inverse document frequency. It should be noted that the embodiments of the present disclosure do not limit the calculation method of the first importance score of at least one word in the target text data.
  • The at least one word in the target text data is sorted based on the first importance score; for example, the at least one word in the target text data may be sorted in descending order of the first importance score, and the sequence composed of the sorted words is used as the word sequence to be searched corresponding to the target text data. It can be understood that the higher a word ranks in the word sequence to be searched, the greater its first importance score, indicating that the word is more important in the target text data and better indicates the meaning or content that the target text data expresses, or the category of the target text data.
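As a minimal sketch of this scoring-and-sorting step, the Python snippet below uses the occurrence count of each word as its first importance score (one of the options mentioned above; tf-idf is the other) and sorts the words in descending order of that score to form the word sequence to be searched. The function name is hypothetical:

```python
from collections import Counter

def build_search_sequence(words):
    """Order the words of one piece of target text data by descending
    first importance score; here the score is the occurrence count."""
    counts = Counter(words)
    # Break ties by first position so the result is deterministic.
    first_pos = {w: i for i, w in reversed(list(enumerate(words)))}
    return sorted(counts, key=lambda w: (-counts[w], first_pos[w]))

print(build_search_sequence(["middle", "quotient", "middle", "pin"]))
# -> ['middle', 'quotient', 'pin']
```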
  • Step 130 For each word sequence to be searched, search a pre-built dictionary tree for a target word sequence adapted to the word sequence to be searched; wherein the target word sequence is a subsequence of the word sequence to be searched.
  • a pre-built dictionary tree is acquired, wherein the dictionary tree is constructed based on a pre-configured target corpus.
  • a dictionary tree is searched for a target word sequence adapted to the word sequence to be searched.
  • In an embodiment, searching a pre-built dictionary tree for a target word sequence adapted to the word sequence to be searched includes: for each word sequence to be searched, searching the pre-built dictionary tree, in the order from the root node to the child nodes, for the target word sequence adapted to the word sequence to be searched.
  • In an embodiment, the dictionary tree is first searched for a first target node that matches the first word in the word sequence to be searched; then all child nodes connected to the first target node are searched for a second target node that matches the second word in the word sequence to be searched; then all child nodes connected to the second target node are searched for a third target node that matches the third word in the word sequence to be searched; and so on, until no node matching the (p+1)-th word in the word sequence to be searched is found among the child nodes connected to the p-th target node. The sequence composed of the words of the matched target nodes, that is, the sequence consisting of the words in the word sequence to be searched for which matching nodes can be found in the dictionary tree, is taken as the target word sequence.
  • the target word sequence is a subsequence of the word sequence to be searched.
  • For example, suppose the word sequence to be searched is [A-B-C-D-E], where A, B, C, D, and E respectively represent words in the word sequence to be searched. Searching the dictionary tree in the order from the root node to the child nodes may find target nodes matching A, B, and C: a first target node matching A is found in the dictionary tree, a second target node matching B is found among the child nodes connected to the first target node, and a third target node matching C is found among the child nodes connected to the second target node; however, no node matching D is found among the child nodes connected to the third target node. The sequence consisting of A, B, and C is then taken as the target word sequence.
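A sketch of this search, assuming the dictionary tree is represented as nested dicts (an empty dict for a node with no children); it walks the word sequence to be searched from the root node and returns the matched prefix as the target word sequence:

```python
def search_target_sequence(trie, search_sequence):
    """Walk the dictionary tree from the root node toward the child
    nodes, consuming the words of the word sequence to be searched in
    order; stop at the first word without a matching child node."""
    target, node = [], trie
    for word in search_sequence:
        if word not in node:
            break
        target.append(word)
        node = node[word]
    return target

# The [A-B-C-D-E] example: a tree containing the path A -> B -> C
# yields the target word sequence [A, B, C].
trie = {"A": {"B": {"C": {}}}}
print(search_target_sequence(trie, ["A", "B", "C", "D", "E"]))  # ['A', 'B', 'C']
```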
  • Step 140 Cluster the target text data corresponding to the at least one target word sequence according to the at least one target word sequence, respectively, to obtain a text clustering result.
  • The target text data corresponding to the at least one target word sequence is clustered according to the at least one target word sequence. It can be understood that the target word sequence can intuitively reflect the category or content of the target text data: if the target word sequences corresponding to pieces of target text data are the same or highly similar, the categories or contents of those pieces of target text data are the same or similar. Therefore, the target text data can be clustered according to the target word sequences.
  • Target text data with the same target word sequence can be clustered into the same cluster, and target text data with different target word sequences into different clusters. Alternatively, the similarity between target word sequences can be calculated, with target text data whose similarity is greater than a preset threshold clustered into the same cluster and target text data whose similarity is less than the preset threshold clustered into different clusters. It should be noted that the embodiments of the present disclosure do not limit the manner of clustering the corresponding target text data according to the target word sequences.
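A sketch of the first clustering rule above, grouping target text data whose target word sequences are identical; the similarity-threshold variant would replace the exact key match with a pairwise similarity computation:

```python
from collections import defaultdict

def cluster_by_target_sequence(texts_with_sequences):
    """Cluster target text data whose target word sequences are
    identical into the same cluster."""
    clusters = defaultdict(list)
    for text, target_sequence in texts_with_sequences:
        clusters[tuple(target_sequence)].append(text)
    return list(clusters.values())

pairs = [("ad one", ["A", "B"]), ("ad two", ["A", "B"]), ("news", ["C"])]
print(cluster_by_target_sequence(pairs))
# -> [['ad one', 'ad two'], ['news']]
```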
  • In summary, a target text data set to be clustered is obtained, the target text data set including at least one piece of target text data; for each piece of target text data in the target text data set, the first importance score of at least one word in the target text data is calculated, at least one word in the target text data is sorted based on the first importance score, and a word sequence to be searched corresponding to the target text data is generated; for each word sequence to be searched, a pre-built dictionary tree is searched for a target word sequence adapted to the word sequence to be searched, the target word sequence being a subsequence of the word sequence to be searched; and the target text data corresponding to at least one target word sequence is clustered according to the at least one target word sequence to obtain a text clustering result.
  • The text clustering method thus calculates the importance score of at least one word in the text data to be clustered, sorts the at least one word based on the importance score to generate the word sequence to be searched, and then finds the adapted target word sequence based on the pre-built dictionary tree, so that the text data is clustered based on the target word sequence. This simplifies the process of text clustering, greatly reduces the time complexity of text clustering, and effectively improves the efficiency and accuracy of text clustering.
  • In an embodiment, calculating the first importance score of at least one word in the target text data includes: for each piece of target text data in the target text data set, respectively calculating the first word frequency-inverse document frequency of at least one word in the target text data; and respectively calculating the first importance score of at least one word in the target text data according to at least one first word frequency-inverse document frequency.
  • In an embodiment, the first word frequency-inverse document frequency can indirectly reflect the importance of each word in the target text data. Therefore, the first word frequency-inverse document frequency of each word in the target text data can be calculated, and then the first importance score of each word in the target text data can be calculated according to each first word frequency-inverse document frequency.
  • In an embodiment, calculating the first word frequency-inverse document frequency of at least one word in the target text data includes: respectively determining the first word frequency and the first inverse document frequency of each word in the target text data; and calculating the first word frequency-inverse document frequency of the corresponding word according to the first word frequency and the first inverse document frequency; wherein the first word frequency-inverse document frequency is the product of the first word frequency and the first inverse document frequency.
  • In an embodiment, respectively determining the first word frequency and the first inverse document frequency of each word in the target text data includes: determining the number of occurrences of each word in the target text data, and using the number of occurrences as the first word frequency of the corresponding word; obtaining parameter configuration information corresponding to the dictionary tree, wherein the parameter configuration information includes an inverse document frequency list that includes the inverse document frequency of each word contained in the dictionary tree; and searching the inverse document frequency list for the inverse document frequency corresponding to at least one word in the target text data, as the first inverse document frequency of at least one word in the target text data.
  • the parameter configuration information corresponding to the dictionary tree is acquired, wherein the parameter configuration information is the parameter information determined in the process of constructing the dictionary tree based on the target corpus.
  • the parameter configuration information may include an inverse document frequency list composed of an inverse document frequency (Inverse Document Frequency, IDF) of each word contained in the dictionary tree.
  • In an embodiment, the parameter configuration information further includes a distribution deviation list, wherein the distribution deviation list includes the distribution deviation of each word contained in the dictionary tree. Before the first importance score of at least one word in the target text data is calculated according to at least one first word frequency-inverse document frequency, the method further includes: searching the distribution deviation list for the distribution deviation corresponding to each word in the target text data, as the first distribution deviation of each word in the target text data. Calculating the first importance score of at least one word in the target text data according to at least one first word frequency-inverse document frequency then includes: calculating the first importance score of each word in the target text data according to each first word frequency-inverse document frequency and the corresponding first distribution deviation; wherein the first importance score is the product of the first word frequency-inverse document frequency and the first distribution deviation.
  • The parameter configuration information may further include a distribution deviation list consisting of the distribution deviation of each word in the dictionary tree. It is understandable that in the process of constructing a dictionary tree based on the target corpus, it is necessary to calculate not only the inverse document frequency of each word in the target corpus but also the distribution deviation of each word in the target corpus, and then to build the dictionary tree based on the inverse document frequencies and distribution deviations of multiple words. The distribution deviation reflects the deviation between the distribution of each word in the target corpus and its distribution in the total corpus.
  • In an embodiment, the distribution deviation list corresponding to the dictionary tree is searched for the distribution deviation corresponding to each word in the target text data, and the found distribution deviation of each word is used as the first distribution deviation of that word in the target text data. Then, the first importance score of the word in the target text data is calculated according to the first word frequency-inverse document frequency and the corresponding first distribution deviation, wherein the first importance score is the product of the first word frequency-inverse document frequency and the first distribution deviation.
  • In an embodiment, before acquiring the target text data set to be clustered, the method further includes: acquiring a total corpus and a target corpus, wherein the total corpus includes the target corpus, and the target corpus contains at least one piece of sample text data; calculating the second distribution deviation of each word contained in the target corpus relative to the total corpus; for each piece of sample text data in the target corpus, calculating the second importance score of the corresponding word according to the second distribution deviation, sorting at least one word in each piece of sample text data in descending order of the second importance score, and generating a sample word sequence corresponding to the sample text data; and constructing the dictionary tree based on at least one sample word sequence. This setting can accurately and quickly build a dictionary tree corresponding to the target corpus.
  • the target corpus may be a corpus belonging to a certain category or a certain field.
  • the target corpus may be an advertising corpus, a network corpus, a legal corpus, or a medical corpus.
  • The total corpus is a larger corpus that includes the target corpus.
  • For example, if the target corpus is an advertising corpus, the total corpus may include a network corpus, a legal corpus, a medical corpus, and the advertising corpus.
  • the target corpus includes at least one piece of sample text data.
  • the total corpus and the target corpus can be obtained through web crawling technology. It should be noted that, the embodiment of the present disclosure does not limit the type of the target corpus, nor does it limit other corpus contents in the total corpus except the target corpus.
  • The second distribution deviation of each word contained in the target corpus relative to the total corpus can then be calculated, wherein the second distribution deviation reflects the difference between the distribution of each word in the target corpus and its distribution in the total corpus.
  • In an embodiment, calculating the second distribution deviation of each word contained in the target corpus relative to the total corpus includes calculating, for each word w, the second distribution deviation according to the following formula:
  • b(w) = freq_a(w) / freq(w), where freq_a(w) = t / M and freq(w) = t' / M'
  • where b(w) represents the second distribution deviation of the word w in the target corpus relative to the total corpus; freq_a(w) represents the frequency of occurrence of the word w in the target corpus; freq(w) represents the frequency of occurrence of the word w in the total corpus; t represents the number of occurrences of the word w in the target corpus; M represents the total number of words contained in the target corpus; t' represents the number of occurrences of the word w in the total corpus; and M' represents the total number of words contained in the total corpus.
  • For example, if the total number of words contained in the target corpus is 1000 and the word "movement" appears 100 times in the target corpus, the frequency of occurrence of "movement" in the target corpus is 100 / 1000 = 0.1.
  • If the total number of words contained in the total corpus is 5000 and the word "movement" appears 120 times in the total corpus, the frequency of "movement" in the total corpus is 120 / 5000 = 0.024.
  • The second distribution deviation of "movement" is then 0.1 / 0.024 ≈ 4.17.
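The same arithmetic as a Python sketch, assuming the ratio form of the second distribution deviation reconstructed above:

```python
def distribution_deviation(t, M, t_prime, M_prime):
    """b(w) = freq_a(w) / freq(w), under the ratio reconstruction."""
    freq_target = t / M              # frequency of w in the target corpus
    freq_total = t_prime / M_prime   # frequency of w in the total corpus
    return freq_target / freq_total

# The "movement" example: 100 of 1000 target-corpus words,
# 120 of 5000 total-corpus words.
print(distribution_deviation(100, 1000, 120, 5000))  # 0.1 / 0.024 ≈ 4.17
```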
  • The second importance score of the corresponding word is calculated according to the second distribution deviation of at least one word in the sample text data, wherein the second importance score reflects the importance of each word in the sample text data: the larger the second importance score, the more important the word is in the sample text data; conversely, the smaller the second importance score, the less important the word. Then at least one word in the sample text data is sorted in descending order of the second importance score, and the sequence consisting of the sorted words is used as the sample word sequence corresponding to the sample text data.
  • a dictionary tree is constructed based on a sample word sequence corresponding to at least one piece of sample text data in the target corpus.
  • In an embodiment, an empty node is used as the root node of the dictionary tree; the first word of each sample word sequence is used as a child node of the root node; the second word of each sample word sequence is used as a child node of the node where the first word of the same sample word sequence is located; the third word of each sample word sequence is used as a child node of the node where the second word of the same sample word sequence is located; and so on, until all words in all sample word sequences are filled into nodes.
  • Alternatively, the first word of the sample word sequences may be used as the root node of the dictionary tree, the second word of each sample word sequence as a child node of the root node, the third word of each sample word sequence as a child node of the node where the second word of the same sample word sequence is located, and so on, until all words in all sample word sequences are filled into nodes.
  • For example, suppose the sample word sequences corresponding to five pieces of sample text data in the target corpus are [intermediate commodity], [intermediate bigbuy], [intermediate business Korean], [business middle], and [behind the middle]; the dictionary tree constructed from these five sample word sequences is shown in Figure 2.
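A sketch of this construction, assuming nested dicts with an implicit empty root. The five sequences below are one reading of the translated example, chosen so that the first-level and second-level occurrence counts match those discussed for FIG. 2; the actual example words are Chinese and the English renderings here are assumptions:

```python
def build_dictionary_tree(sample_word_sequences):
    """Nested-dict dictionary tree: the empty root's children are the
    first words of the sample word sequences, their children the
    second words of the same sequences, and so on."""
    root = {}
    for sequence in sample_word_sequences:
        node = root
        for word in sequence:
            node = node.setdefault(word, {})
    return root

sequences = [["middle", "quotient", "pin"], ["middle", "big", "bu"],
             ["middle", "quotient", "han"], ["quotient", "middle"],
             ["behind", "middle"]]
tree = build_dictionary_tree(sequences)
print(list(tree))  # first level: ['middle', 'quotient', 'behind']
```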
  • In an embodiment, calculating the second importance score of the corresponding word according to the second distribution deviation of each word in the sample text data includes: for each piece of sample text data in the target corpus, calculating the second word frequency-inverse document frequency of each word in the sample text data; and calculating the second importance score of each word in the sample text data according to each second word frequency-inverse document frequency and the corresponding second distribution deviation.
  • the second word frequency-inverse document frequency can indirectly reflect the importance of each word in the sample text data.
  • The second word frequency-inverse document frequency of each word in the sample text data can be calculated, and the second importance score of each word in the sample text data is then calculated according to each second word frequency-inverse document frequency and the corresponding second distribution deviation.
  • the second importance score is the product of the second word frequency-inverse document frequency and the corresponding second distribution deviation.
  • In an embodiment, calculating the second importance score of each word in the sample text data according to each second word frequency-inverse document frequency and the corresponding second distribution deviation includes calculating, for each word w, the second importance score according to the following formula:
  • s(w) = tf-idf_a(w) × b(w)
  • where s(w) represents the second importance score of the word w in the sample text data; tf-idf_a(w) represents the second word frequency-inverse document frequency of the word w in the sample text data; and b(w) represents the second distribution deviation of the word w.
  • In an embodiment, respectively determining the second word frequency and the second inverse document frequency of each word in the sample text data includes calculating them according to the following formulas:
  • tf(w) = m, idf(w) = log(N / n)
  • Calculating the second word frequency-inverse document frequency of the corresponding word in the sample text data according to the second word frequency and the second inverse document frequency includes calculating, for each word in the sample text data:
  • tf-idf(w) = tf(w) × idf(w)
  • where w represents any word in the sample text data; tf(w) represents the second word frequency of the word w in the sample text data; idf(w) represents the second inverse document frequency of the word w in the sample text data; tf-idf(w) represents the second word frequency-inverse document frequency of the word w in the sample text data; m represents the number of times the word w appears in the sample text data; n represents the number of pieces of sample text data in the target corpus that contain the word w; and N represents the total number of pieces of sample text data contained in the target corpus.
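A sketch of these calculations, assuming the standard reconstructions tf(w) = m, idf(w) = log(N / n), and s(w) = tf-idf(w) × b(w); the deviation mapping and all names are assumptions for illustration:

```python
import math

def second_importance_scores(sample_text, corpus, deviation):
    """Score each word of one piece of sample text data; `corpus` is a
    list of word lists and `deviation` maps each word to its second
    distribution deviation b(w)."""
    N = len(corpus)  # total number of pieces of sample text data
    scores = {}
    for w in set(sample_text):
        m = sample_text.count(w)               # occurrences of w in this text
        n = sum(w in text for text in corpus)  # pieces of text containing w
        tf_idf = m * math.log(N / n)           # second tf-idf of w
        scores[w] = tf_idf * deviation.get(w, 1.0)
    return scores

corpus = [["middle", "quotient", "pin"], ["middle", "big"], ["behind"]]
print(second_importance_scores(corpus[0], corpus, {"pin": 4.17}))
```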
  • In an embodiment, after constructing the dictionary tree based on at least one sample word sequence, the method further includes: determining the number of occurrences of the word of each node in the dictionary tree at the same position in all sample word sequences; and pruning the dictionary tree according to the number of occurrences of the word of each node at the same position in all sample word sequences, until the number of nodes contained in the dictionary tree reaches a preset number.
  • This setting can effectively improve the search speed of the target word sequence on the premise that the target word sequence corresponding to the target text data can be accurately determined based on the dictionary tree, thereby improving the efficiency of text clustering.
  • In an embodiment, after the dictionary tree is constructed, the number of occurrences of the word of each node in the dictionary tree at the same position in all sample word sequences is determined.
  • For example, for the dictionary tree shown in FIG. 2, in the order from the root node to the child nodes: the number of occurrences of the word "middle" at the first level at the same position in all sample word sequences is 3, the number of occurrences of the word "quotient" at the first level is 1, and the number of occurrences of the word "behind" at the first level is 1; the number of occurrences of the word "quotient" at the second level is 2, the number of occurrences of the word "big" at the second level is 1, and the number of occurrences of the word "middle" at the second level is 2; and the numbers of occurrences of the words "pin", "bu", and "han" at the third level are all 1. The dictionary tree is pruned according to the number of occurrences of the word of each node at the same position in all sample word sequences, until the number of nodes contained in the dictionary tree reaches the preset number.
  • In an embodiment, pruning the dictionary tree according to the number of occurrences of the word of each node at the same position in all sample word sequences, until the number of nodes contained in the dictionary tree reaches a preset number, includes: deleting the nodes corresponding to the same number of occurrences in the dictionary tree in turn, in ascending order of the number of occurrences of the words of the nodes at the same position in all sample word sequences, until the number of nodes contained in the dictionary tree reaches the preset number.
  • For example, the nodes whose words appear once at the same position in all sample word sequences may be deleted first, and then the nodes whose words appear twice at the same position in all sample word sequences, and so on.
  • Among the nodes corresponding to the same number of occurrences, the nodes may be deleted in sequence in the order from the root node to the child nodes.
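A simplified pruning sketch. Unlike the root-to-child order described above, it deletes only childless nodes, lowest occurrence count first, so that the remaining structure is always a valid tree; this leaf-first restriction is an added assumption, not the disclosure's exact procedure:

```python
def prune_dictionary_tree(root, occurrence, target_size):
    """Prune until at most `target_size` nodes remain; `occurrence`
    maps a (depth, word) position to the number of times that word
    appears at that position across all sample word sequences."""
    def nodes(node, depth=0):
        for word, child in node.items():
            yield depth, word, node
            yield from nodes(child, depth + 1)

    while True:
        all_nodes = list(nodes(root))
        if len(all_nodes) <= target_size:
            return root
        # Only childless nodes are candidates, so the tree stays connected.
        leaves = [(d, w, p) for d, w, p in all_nodes if not p[w]]
        d, w, parent = min(leaves, key=lambda n: occurrence[(n[0], n[1])])
        del parent[w]
```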
  • FIG. 3 is a flowchart of a text clustering method in another embodiment of the present disclosure. As shown in FIG. 3 , the method includes the following steps:
  • Step 310 Obtain a target text data set to be clustered; wherein, the target text data set includes at least one piece of target text data.
  • Step 320 Obtain parameter configuration information corresponding to the pre-built dictionary tree; wherein the parameter configuration information includes an inverse document frequency list and a distribution deviation list; the inverse document frequency list includes the inverse document frequency of each word contained in the dictionary tree, and the distribution deviation list includes the distribution deviation of each word contained in the dictionary tree.
  • Step 330 For each piece of target text data in the target text data set, determine the number of occurrences of each word in the target text data, and use the number of occurrences as the first word frequency of the corresponding word.
  • Step 340 in the inverse document frequency list, search for the inverse document frequency corresponding to at least one word in the target text data, respectively, as the first inverse document frequency of the at least one word in the target text data.
  • Step 350 Calculate the first word frequency-inverse document frequency of the corresponding word according to the first word frequency and the first inverse document frequency; wherein the first word frequency-inverse document frequency is the product of the first word frequency and the first inverse document frequency.
  • Step 360 in the distribution deviation list, search for the distribution deviation corresponding to each word in the target text data, as the first distribution deviation of each word in the target text data.
  • Step 370 Calculate the first importance score of each word in the target text data according to each first word frequency-inverse document frequency and the corresponding first distribution deviation; wherein, the first importance score is the first word frequency - The product of the inverse document frequency and the deviation of the first distribution.
  • Step 380 Rank at least one word in the target text data based on the first importance score, and generate a to-be-searched word sequence corresponding to the target text data.
  • Step 390 For each word sequence to be searched, in the pre-built dictionary tree, search for a target word sequence adapted to the word sequence to be searched in the order from the root node to the child nodes.
  • Step 3100 Cluster the target text data corresponding to the at least one target word sequence according to the at least one target word sequence, respectively, to obtain a text clustering result.
  • The technical solution of the embodiments of the present disclosure calculates the importance score of each word by determining the word frequency, inverse document frequency, and distribution deviation of each word in the text data to be clustered; sorts at least one word in the text data to be clustered based on the importance scores to generate a word sequence to be searched; and then searches a pre-built dictionary tree for the target word sequence adapted to the word sequence to be searched, so as to cluster the text data based on the target word sequence. This simplifies the text clustering process, greatly reduces the time complexity of text clustering, and effectively improves the efficiency and accuracy of text clustering.
  • FIG. 4 is a flowchart of a text clustering method in another embodiment of the present disclosure. As shown in FIG. 4 , the method includes the following steps:
  • Step 410 Obtain a total corpus and a target corpus; wherein, the total corpus includes a target corpus, and the target corpus includes at least one piece of sample text data.
  • Step 420 Calculate the second distribution deviation of each word contained in the target corpus relative to the total corpus.
  • In an embodiment, calculating the second distribution deviation of each word contained in the target corpus relative to the total corpus includes calculating it according to the following formula:
  • b(w) = freq_a(w) / freq(w), where freq_a(w) = t / M and freq(w) = t' / M'
  • where b(w) represents the second distribution deviation of the word w of the target corpus relative to the total corpus; freq_a(w) represents the frequency of the word w in the target corpus; freq(w) represents the frequency of the word w in the total corpus; t represents the number of occurrences of the word w in the target corpus; M represents the total number of words contained in the target corpus; t' represents the number of occurrences of the word w in the total corpus; and M' represents the total number of words contained in the total corpus.
  • Step 430 For each piece of sample text data in the target corpus, calculate the second importance score of the corresponding word according to the second distribution deviation of each word in the sample text data, sort at least one word in each piece of sample text data in descending order of the second importance score, and generate a sample word sequence corresponding to the sample text data.
  • In an embodiment, calculating the second importance score of the corresponding word according to the second distribution deviation of each word in the sample text data includes: for each piece of sample text data in the target corpus, respectively calculating the second word frequency-inverse document frequency of each word in the sample text data; and calculating the second importance score of each word in the sample text data according to each second word frequency-inverse document frequency and the corresponding second distribution deviation.
  • In an embodiment, respectively calculating the second word frequency-inverse document frequency of each word in the sample text data includes: respectively determining the second word frequency and the second inverse document frequency of each word in the sample text data; and calculating the second word frequency-inverse document frequency of the corresponding word in the sample text data according to the second word frequency and the second inverse document frequency.
  • In an embodiment, respectively determining the second word frequency and the second inverse document frequency of each word in the sample text data includes calculating them according to the following formulas:
  • tf(w) = m, idf(w) = log(N / n)
  • Calculating the second word frequency-inverse document frequency of the corresponding word in the sample text data according to the second word frequency and the second inverse document frequency includes calculating, for each word in the sample text data:
  • tf-idf(w) = tf(w) × idf(w)
  • where w represents any word in the sample text data; tf(w) represents the second word frequency of the word w in the sample text data; idf(w) represents the second inverse document frequency of the word w in the sample text data; tf-idf(w) represents the second word frequency-inverse document frequency of the word w in the sample text data; m represents the number of times the word w appears in the sample text data; n represents the number of pieces of sample text data in the target corpus that contain the word w; and N represents the total number of pieces of sample text data contained in the target corpus.
  • In an embodiment, calculating the second importance score of each word in the sample text data according to each second word frequency-inverse document frequency and the corresponding second distribution deviation includes calculating:
  • s(w) = tf-idf_a(w) × b(w)
  • where s(w) represents the second importance score of the word w in the sample text data, and tf-idf_a(w) represents the second word frequency-inverse document frequency of the word w in the sample text data.
  • Step 440 construct a dictionary tree based on at least one sample word sequence.
  • Step 450 Determine the number of occurrences of the word of each node in the dictionary tree at the same position in all sample word sequences.
  • Step 460 Delete the nodes corresponding to the same number of occurrences in the dictionary tree in turn, in ascending order of the number of occurrences of the word of each node at the same position in all sample word sequences, until the number of nodes contained in the dictionary tree reaches the preset number.
  • Step 470 Obtain a target text data set to be clustered; wherein, the target text data set includes at least one piece of target text data.
  • Step 480 For each piece of target text data in the target text data set, calculate the first importance score of at least one word in the target text data, sort at least one word in the target text data based on the first importance score, and generate the word sequence to be searched corresponding to the target text data.
  • Step 490 For each word sequence to be searched, search the pre-built dictionary tree, in the order from the root node to the child nodes, for a target word sequence adapted to the word sequence to be searched; wherein the target word sequence is a subsequence of the word sequence to be searched.
  • Step 4100 Cluster the target text data corresponding to the at least one target word sequence according to the at least one target word sequence, respectively, to obtain a text clustering result.
  • With the text clustering method, a dictionary tree matching the target corpus can be built and pruned; the importance score of at least one word in the text data to be clustered is then calculated, at least one word in the text data to be clustered is sorted based on the importance score to generate a word sequence to be searched, and the dictionary tree is searched for the target word sequence adapted to the word sequence to be searched, so that the text data is clustered based on the target word sequence. Pruning the dictionary tree reduces its depth, which effectively improves the search speed of the target word sequence, greatly reduces the time complexity of text clustering, and effectively improves the efficiency and accuracy of text clustering.
  • FIG. 5 is a flowchart of a text clustering method in another embodiment of the present disclosure. As shown in FIG. 5 , the method includes the following steps:
  • Step 510 Obtain a total corpus and a target corpus; wherein, the total corpus includes a target corpus, and the target corpus includes at least one piece of sample text data.
  • Step 520 Calculate the second distribution deviation of each word contained in the target corpus relative to the total corpus.
  • Step 530 For each piece of sample text data in the target corpus, determine the second word frequency and the second inverse document frequency of each word in the sample text data, respectively.
  • Step 540 Calculate the second word frequency-inverse document frequency of the corresponding word in the sample text data according to the second word frequency and the second inverse document frequency.
  • Step 550 Calculate the second importance score of each word in the sample text data according to each second word frequency-inverse document frequency and the corresponding second distribution deviation.
  • Step 560 Sort at least one word in each piece of sample text data in descending order of the second importance score to generate a sample word sequence corresponding to the sample text data.
  • Step 570 construct a dictionary tree based on at least one sample word sequence.
  • Step 580 Store the distribution deviation list composed of at least one second distribution deviation and the inverse document frequency list composed of at least one second inverse document frequency as the parameter configuration information of the dictionary tree.
  • Step 590 Obtain a target text data set to be clustered; wherein, the target text data set includes at least one piece of target text data.
  • Step 5100 For each piece of target text data in the target text data set, determine the number of occurrences of each word in the target text data, and use the number of occurrences as the first word frequency of each word.
  • Step 5110 In the inverse document frequency list, search for the inverse document frequency corresponding to each word in the target text data, as the first inverse document frequency of each word in the target text data.
  • Step 5120 Calculate the first word frequency-inverse document frequency of the corresponding word according to the first word frequency and the first inverse document frequency; wherein the first word frequency-inverse document frequency is the product of the first word frequency and the first inverse document frequency.
  • Step 5130 In the distribution deviation list, search for the distribution deviation corresponding to each word in the target text data, as the first distribution deviation of each word in the target text data.
  • Step 5140 Calculate the first importance score of each word in the target text data according to each first word frequency-inverse document frequency and the corresponding first distribution deviation; wherein, the first importance score is the first word frequency - The product of the inverse document frequency and the deviation of the first distribution.
  • Step 5150 Sort at least one word in the target text data based on the first importance score, and generate a to-be-searched word sequence corresponding to the target text data.
  • Step 5160 For each word sequence to be searched, a pre-built dictionary tree is searched for a target word sequence adapted to the word sequence to be searched; wherein the target word sequence belongs to a subsequence of the word sequence to be searched.
  • Step 5170 Cluster the target text data corresponding to the at least one target word sequence according to the at least one target word sequence, respectively, to obtain a text clustering result.
  • The text clustering method provided by the embodiments of the present disclosure clusters text data based on a dictionary tree, which simplifies the process of text clustering, greatly reduces the time complexity of text clustering, and effectively improves the efficiency and accuracy of text clustering.
  • FIG. 6 is a schematic structural diagram of a text clustering apparatus according to another embodiment of the present disclosure. As shown in FIG. 6 , the apparatus includes: a text data acquisition module 610 , a search word sequence generation module 620 , a target word sequence determination module 630 and a text clustering module 640 .
  • the text data acquisition module 610 is configured to acquire the target text data set to be clustered; wherein, the target text data set includes at least one piece of target text data;
  • the search word sequence generation module 620 is configured to, for each piece of target text data in the target text data set, calculate the first importance score of at least one word in the target text data, and based on the first importance score Sort at least one word in the target text data to generate a to-be-searched word sequence corresponding to the target text data;
  • the target word sequence determination module 630 is configured to search a pre-built dictionary tree for a target word sequence that is adapted to the to-be-searched word sequence for each to-be-searched word sequence; wherein, the target word sequence belongs to the to-be-searched word sequence. search for subsequences of word sequences;
  • the text clustering module 640 is configured to cluster the target text data corresponding to the at least one target word sequence according to the at least one target word sequence, respectively, to obtain a text clustering result.
  • With the text clustering apparatus, a target text data set to be clustered is obtained, the target text data set including at least one piece of target text data; for each piece of target text data in the target text data set, the first importance score of at least one word in the target text data is calculated, at least one word in the target text data is sorted based on the first importance score, and a word sequence to be searched corresponding to the target text data is generated; for each word sequence to be searched, a pre-built dictionary tree is searched for a target word sequence adapted to the word sequence to be searched, the target word sequence being a subsequence of the word sequence to be searched; and the target text data corresponding to at least one target word sequence is clustered according to the at least one target word sequence to obtain a text clustering result.
  • The text clustering apparatus thus calculates the importance score of each word in the text data to be clustered, sorts at least one word based on the importance score to generate the word sequence to be searched, and finds the adapted target word sequence based on the pre-built dictionary tree, so that the text data is clustered based on the target word sequence. This simplifies the process of text clustering, greatly reduces the time complexity of text clustering, and effectively improves the efficiency and accuracy of text clustering.
  • the search word sequence generation module includes:
  • a first word frequency-inverse document frequency calculation unit configured to calculate the first word frequency-inverse document frequency of at least one word in the target text data for each piece of target text data in the target text data set;
  • the first importance score calculation unit is configured to calculate the first importance score of at least one word in the target text data according to at least one first word frequency-inverse document frequency, respectively.
  • the first word frequency-inverse document frequency calculation unit includes:
  • a first frequency determination subunit configured to respectively determine the first word frequency and the first inverse document frequency of each word in the target text data
  • a first word frequency-inverse document frequency calculation subunit configured to calculate the first word frequency-inverse document frequency of the corresponding word according to the first word frequency and the first inverse document frequency; wherein, the first word frequency-inverse document frequency is the product of the first word frequency and the first inverse document frequency.
  • the first frequency determination subunit is configured to: determine the number of occurrences of each word in the target text data, and use the number of occurrences as the first word frequency of the corresponding word;
  • obtain parameter configuration information corresponding to the dictionary tree, wherein the parameter configuration information includes an inverse document frequency list, and the inverse document frequency list includes the inverse document frequency of each word contained in the dictionary tree; and
  • search the inverse document frequency list for the inverse document frequency corresponding to each word in the target text data, as the first inverse document frequency of each word in the target text data.
  • the parameter configuration information further includes a distribution deviation list; wherein, the distribution deviation list includes the distribution deviation of each word contained in the dictionary tree;
  • the device also includes:
  • the distribution deviation determination module is configured to, before the first importance score of at least one word in the target text data is calculated according to at least one first word frequency-inverse document frequency, search the distribution deviation list for the distribution deviation corresponding to each word in the target text data, as the first distribution deviation of each word in the target text data;
  • the first importance score calculation unit is set to:
  • calculate the first importance score of each word in the target text data according to each first word frequency-inverse document frequency and the corresponding first distribution deviation; wherein the first importance score is the product of the first word frequency-inverse document frequency and the first distribution deviation.
  • the target word sequence determination module is set to:
  • a pre-built dictionary tree is searched for a target word sequence adapted to the word sequence to be searched in the order from the root node to the child node.
  • the device further includes:
  • the corpus acquisition module is configured to acquire a general corpus and a target corpus before acquiring the target text data set to be clustered; wherein, the general corpus includes the target corpus, and the target corpus contains at least one piece of sample text data;
  • a distribution deviation calculation module configured to calculate the second distribution deviation of each word contained in the target corpus relative to the total corpus
  • the sample word sequence generation module is configured to, for each piece of sample text data in the target corpus, calculate the second importance score of the corresponding word according to the second distribution deviation of each word in the sample text data, and calculate the second importance score of the corresponding word according to the The second importance score sorts at least one word in each piece of sample text data in descending order to generate a sample word sequence corresponding to the sample text data;
  • a dictionary tree building module configured to build the dictionary tree based on at least one sample word sequence.
  • the sample word sequence generation module includes:
  • a second word frequency-inverse document frequency calculation unit configured to calculate the second word frequency-inverse document frequency of each word in the sample text data for each piece of sample text data in the target corpus;
  • the second importance score calculation unit is configured to calculate the second importance score of each word in the sample text data according to each second word frequency-inverse document frequency and the corresponding second distribution deviation, respectively.
  • the second word frequency-inverse document frequency calculation unit includes:
  • a second frequency determination subunit configured to respectively determine the second word frequency and the second inverse document frequency of each word in the sample text data;
  • the second word frequency-inverse document frequency calculation subunit is configured to calculate the second word frequency-inverse document frequency of the corresponding word in the sample text data according to the second word frequency and the second inverse document frequency.
  • the second frequency determination subunit is configured to compute, in a standard formulation consistent with the definitions below: tf(w) = m and idf(w) = log(N / n);
  • the second word frequency-inverse document frequency calculation subunit is configured to compute: tf-idf(w) = tf(w) × idf(w);
  • w represents any word in the sample text data;
  • tf(w) represents the second word frequency of the word w in the sample text data;
  • idf(w) represents the second inverse document frequency of the word w in the sample text data;
  • tf-idf(w) represents the second word frequency-inverse document frequency of the word w in the sample text data;
  • m represents the number of times the word w appears in the sample text data;
  • n represents the number of pieces of sample text data in the target corpus that contain the word w;
  • N represents the total number of pieces of sample text data contained in the target corpus.
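  • A compact sketch of this computation, assuming the typical formulation tf(w) = m and idf(w) = log(N / n) stated above; the function name is illustrative.

    import math
    from collections import Counter

    def tf_idf_per_text(sample_texts):
        """sample_texts: list of tokenized pieces of sample text data.
        Returns, per piece, a dict mapping each word w to tf-idf(w)."""
        N = len(sample_texts)              # total pieces of sample text data
        doc_freq = Counter()               # n: pieces containing the word w
        for words in sample_texts:
            doc_freq.update(set(words))
        scores = []
        for words in sample_texts:
            counts = Counter(words)        # m: occurrences of w in this piece
            scores.append({w: m * math.log(N / doc_freq[w])
                           for w, m in counts.items()})
        return scores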
  • the second importance score calculation unit is configured to compute: s(w) = tf-idf_a(w) × b;
  • s(w) represents the second importance score of the word w in the sample text data;
  • tf-idf_a(w) represents the second word frequency-inverse document frequency of the word w in the sample text data, and b represents the corresponding second distribution deviation.
  • the distribution deviation calculation module is configured to compute, in a natural form consistent with the definitions below: b = freq_a(w) / freq(w), where freq_a(w) = t / M and freq(w) = t' / M';
  • b represents the second distribution deviation of the word w in the target corpus relative to the total corpus;
  • freq_a(w) represents the frequency of occurrence of the word w in the target corpus;
  • freq(w) represents the frequency of occurrence of the word w in the total corpus;
  • t represents the number of occurrences of the word w in the target corpus;
  • M represents the total number of words contained in the target corpus;
  • t' represents the number of occurrences of the word w in the total corpus;
  • M' represents the total number of words contained in the total corpus.
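  • Under the same reading, a small sketch of the second distribution deviation and of the product rule s(w) = tf-idf_a(w) × b; names are illustrative, and the two corpora are passed as flat word lists.

    from collections import Counter

    def distribution_deviations(target_tokens, total_tokens):
        """b = freq_a(w) / freq(w), with freq_a(w) = t / M and freq(w) = t' / M'."""
        target_counts, M = Counter(target_tokens), len(target_tokens)
        total_counts, M_total = Counter(total_tokens), len(total_tokens)
        return {w: (t / M) / (total_counts[w] / M_total)
                for w, t in target_counts.items() if total_counts[w] > 0}

    def second_importance_score(tf_idf_w, b_w):
        """s(w) = tf-idf_a(w) * b, per the product rule above."""
        return tf_idf_w * b_w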
  • the device further includes:
  • an occurrence number determination module configured to determine, after the dictionary tree is constructed based on at least one sample word sequence, the number of occurrences of the word of each node in the dictionary tree at the same position in all sample word sequences;
  • the dictionary tree pruning module is configured to prune the dictionary tree according to the number of occurrences of the word of each node in the dictionary tree at the same position in all sample word sequences, until the number of nodes contained in the dictionary tree reaches a preset number.
  • the dictionary tree pruning module performs this pruning as described above; one plausible realization is sketched below.
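  • The pruning rule is not fully spelled out in this text; one plausible reading, removing the leaf whose word occurs least often at its position until the node budget is met, is sketched below with illustrative names.

    def prune_trie(root, max_nodes):
        """Repeatedly delete the least frequent leaf until the dictionary
        tree contains at most max_nodes nodes (the root is never pruned)."""
        def collect(node, parent, word, leaves, counter):
            counter[0] += 1                      # count every node in the trie
            if not node.children:
                leaves.append((node.count, parent, word))
            for w, child in node.children.items():
                collect(child, node, w, leaves, counter)

        while True:
            leaves, counter = [], [0]
            collect(root, None, None, leaves, counter)
            if counter[0] <= max_nodes or not leaves:
                return root
            _, parent, word = min(leaves, key=lambda leaf: leaf[0])
            if parent is None:                   # only the bare root remains
                return root
            del parent.children[word]            # drop the rarest leaf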
  • The foregoing apparatus can execute the methods provided by all the foregoing embodiments of the present disclosure, and has functional modules corresponding to those methods.
  • For technical details not described in detail in this embodiment, reference may be made to the methods provided by all the foregoing embodiments of the present disclosure.
  • Referring to FIG. 7, a schematic structural diagram of an electronic device 300 suitable for implementing an embodiment of the present disclosure is shown.
  • The electronic devices in the embodiments of the present disclosure may include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, personal digital assistants (PDAs), tablet computers (PADs), portable multimedia players (PMPs), and in-vehicle terminals (such as in-vehicle navigation terminals); stationary terminals such as digital televisions (TVs) and desktop computers; or various forms of servers, such as independent servers or server clusters.
  • The electronic device shown in FIG. 7 is only an example, and should not impose any limitation on the functions and scope of use of the embodiments of the present disclosure.
  • As shown in FIG. 7, the electronic device 300 may include a processing device (e.g., a central processing unit, a graphics processor, etc.) 301, which may perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 302 or a program loaded from a storage device 308 into a random access memory (RAM) 303.
  • The RAM 303 also stores various programs and data required for the operation of the electronic device 300.
  • The processing device 301, the ROM 302, and the RAM 303 are connected to each other through a bus 304.
  • An input/output (I/O) interface 305 is also connected to the bus 304.
  • Generally, the following devices may be connected to the I/O interface 305: input devices 306 including, for example, a touch screen, a touch pad, a keyboard, a mouse, a camera, a microphone, an accelerometer, a gyroscope, etc.; output devices 307 including, for example, a liquid crystal display (LCD), a speaker, a vibrator, etc.; storage devices 308 including, for example, a magnetic tape, a hard disk, etc.; and a communication device 309.
  • The communication device 309 may allow the electronic device 300 to communicate wirelessly or by wire with other devices to exchange data. While FIG. 7 illustrates the electronic device 300 with various devices, it should be understood that not all of the illustrated devices are required to be implemented or provided; more or fewer devices may alternatively be implemented or provided.
  • Embodiments of the present disclosure include a computer program product comprising a computer program carried on a computer-readable medium, the computer program containing program code for performing the method illustrated in the flowchart.
  • In such an embodiment, the computer program may be downloaded and installed from a network via the communication device 309, or installed from the storage device 308, or installed from the ROM 302.
  • When the computer program is executed by the processing device 301, the above-mentioned functions defined in the methods of the embodiments of the present disclosure are executed.
  • The computer-readable medium mentioned above in the present disclosure may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the two.
  • The computer-readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above.
  • More specific examples of computer-readable storage media may include, but are not limited to: an electrical connection having at least one wire, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
  • In the present disclosure, a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in combination with an instruction execution system, apparatus, or device.
  • In the present disclosure, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with computer-readable program code embodied thereon. Such a propagated data signal may take a variety of forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the foregoing.
  • A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate, or transmit the program for use by or in connection with the instruction execution system, apparatus, or device.
  • Program code embodied on a computer-readable medium may be transmitted using any suitable medium, including but not limited to: electrical wire, optical fiber cable, radio frequency (RF), etc., or any suitable combination of the foregoing.
  • In some embodiments, the client and the server can communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and can be interconnected with any form or medium of digital data communication (e.g., a communication network).
  • Examples of communication networks include a local area network (LAN), a wide area network (WAN), an internetwork (e.g., the Internet), a peer-to-peer network (e.g., an ad hoc peer-to-peer network), and any currently known or future developed network.
  • The above-mentioned computer-readable medium may be included in the above-mentioned electronic device, or may exist alone without being assembled into the electronic device.
  • The computer-readable medium carries at least one program, and when the at least one program is executed by the electronic device, the electronic device is caused to: acquire a target text data set to be clustered, wherein the target text data set includes at least one piece of target text data; for each piece of target text data in the target text data set, calculate the first importance score of each word in the piece of target text data, sort the words in the piece of target text data based on the first importance scores, and generate a word sequence to be searched corresponding to the piece of target text data; for each word sequence to be searched, search a pre-built dictionary tree for a target word sequence adapted to the word sequence to be searched, wherein the target word sequence belongs to a subsequence of the word sequence to be searched; and cluster the corresponding target text data according to each target word sequence to obtain a text clustering result. An end-to-end sketch of this flow is given below.
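  • Putting the pieces together, a minimal end-to-end sketch of this flow, reusing the illustrative helpers above; score_words is assumed to return {word: first importance score} for one tokenized piece of target text data.

    def cluster_texts(target_texts, trie, score_words):
        """Group pieces of target text data by the target word sequence
        matched in the dictionary tree; equal sequences share a cluster."""
        clusters = {}
        for i, words in enumerate(target_texts):
            scores = score_words(words)
            # descending first importance score -> word sequence to be searched
            query = sorted(scores, key=scores.get, reverse=True)
            # matched subsequence in the trie -> target word sequence
            target_seq = tuple(search_target_sequence(trie, query))
            clusters.setdefault(target_seq, []).append(i)
        return clusters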
  • Computer program code for performing the operations of the present disclosure may be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages.
  • The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server.
  • In the case involving a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (e.g., through the Internet using an Internet service provider).
  • In this regard, each block in the flowcharts or block diagrams may represent a module, a program segment, or a portion of code that contains at least one executable instruction for implementing the specified logical function.
  • It should also be noted that the functions noted in the blocks may occur out of the order noted in the figures; for example, two blocks shown in succession may in fact be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
  • Each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented in a dedicated hardware-based system that performs the specified functions or operations, or can be implemented in a combination of dedicated hardware and computer instructions.
  • The units involved in the embodiments of the present disclosure may be implemented in software or in hardware, where the name of a unit does not, under certain circumstances, constitute a limitation on the unit itself.
  • For example, without limitation, exemplary types of hardware logic components that can be used include: field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), and so on.
  • In the context of the present disclosure, a machine-readable medium may be a tangible medium that may contain or store a program for use by or in connection with the instruction execution system, apparatus, or device.
  • The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium.
  • A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing.
  • More specific examples of machine-readable storage media would include an electrical connection based on at least one wire, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
  • the present disclosure provides a text clustering method, including:
  • acquiring a target text data set to be clustered, wherein the target text data set includes at least one piece of target text data;
  • for each piece of target text data in the target text data set, the first importance score of each word in the piece of target text data is calculated, the words are sorted based on the first importance scores, and a word sequence to be searched corresponding to the piece of target text data is generated;
  • a pre-built dictionary tree is searched for a target word sequence adapted to the word sequence to be searched; wherein, the target word sequence belongs to a subsequence of the word sequence to be searched;
  • the target text data corresponding to the at least one target word sequence is clustered according to the at least one target word sequence, respectively, to obtain a text clustering result.
  • Calculating the first importance score of at least one word in the target text data includes:
  • the first importance score of at least one word in the target text data is calculated according to at least one first word frequency-inverse document frequency, respectively.
  • Calculating the first word frequency-inverse document frequency of at least one word in the target text data includes:
  • the first word frequency-inverse document frequency is the product of the first word frequency and the first inverse document frequency.
  • Determining the first word frequency and the first inverse document frequency of each word in the target text data includes:
  • the parameter configuration information includes an inverse document frequency list, and the inverse document frequency list includes the inverse document frequency of each word contained in the dictionary tree;
  • in the inverse document frequency list, the inverse document frequency corresponding to each word in the target text data is respectively looked up as the first inverse document frequency of each word in the target text data.
  • the parameter configuration information further includes a distribution deviation list; wherein, the distribution deviation list includes the distribution deviation of each word contained in the dictionary tree;
  • the method further includes:
  • in the distribution deviation list, the distribution deviation corresponding to each word in the target text data is respectively looked up as the first distribution deviation of each word in the target text data;
  • the first importance score is the product of the first word frequency-inverse document frequency and the first distribution deviation.
  • a pre-built dictionary tree is searched for a target word sequence adapted to the word sequence to be searched, including:
  • a pre-built dictionary tree is searched for a target word sequence adapted to the word sequence to be searched in the order from the root node to the child node.
  • Before the target text data set to be clustered is acquired, the method further includes:
  • acquiring a total corpus and a target corpus, wherein the total corpus includes the target corpus, and the target corpus contains at least one piece of sample text data;
  • the dictionary tree is constructed based on at least one sample word sequence.
  • Calculating the second importance score of the corresponding word according to the second distribution deviation of each word in the sample text data includes:
  • a second importance score of each word in the sample text data is calculated according to each second word frequency-inverse document frequency and the corresponding second distribution deviation, respectively.
  • Separately calculating the second word frequency-inverse document frequency of each word in the sample text data includes:
  • the second word frequency-inverse document frequency of the corresponding word in the sample text data is calculated according to the second word frequency and the second inverse document frequency.
  • Determining the second word frequency and the second inverse document frequency of each word in the sample text data includes:
  • calculating the second word frequency-inverse document frequency of the corresponding word in the sample text data according to the second word frequency and the second inverse document frequency, in a standard formulation consistent with the definitions below: tf(w) = m, idf(w) = log(N / n), and tf-idf(w) = tf(w) × idf(w);
  • w represents any word in the sample text data;
  • tf(w) represents the second word frequency of the word w in the sample text data;
  • idf(w) represents the second inverse document frequency of the word w in the sample text data;
  • tf-idf(w) represents the second word frequency-inverse document frequency of the word w in the sample text data;
  • m represents the number of times the word w appears in the sample text data;
  • n represents the number of pieces of sample text data in the target corpus that contain the word w;
  • N represents the total number of pieces of sample text data contained in the target corpus.
  • Calculating the second importance score of each word in the sample text data according to each second word frequency-inverse document frequency and the corresponding second distribution deviation includes:
  • the second importance score of each word in the sample text data is calculated according to the following formula: s(w) = tf-idf_a(w) × b;
  • s(w) represents the second importance score of the word w in the sample text data;
  • tf-idf_a(w) represents the second word frequency-inverse document frequency of the word w in the sample text data, and b represents the corresponding second distribution deviation.
  • Calculating the second distribution deviation, relative to the total corpus, of each word contained in the target corpus includes computing, in a natural form consistent with the definitions below: b = freq_a(w) / freq(w), where freq_a(w) = t / M and freq(w) = t' / M';
  • b represents the second distribution deviation of the word w in the target corpus relative to the total corpus;
  • freq_a(w) represents the frequency of occurrence of the word w in the target corpus;
  • freq(w) represents the frequency of occurrence of the word w in the total corpus;
  • t represents the number of occurrences of the word w in the target corpus;
  • M represents the total number of words contained in the target corpus;
  • t' represents the number of occurrences of the word w in the total corpus;
  • M' represents the total number of words contained in the total corpus.
  • the method further includes:
  • the dictionary tree is pruned according to the number of occurrences of the word of each node in the dictionary tree at the same position in all sample word sequences, until the number of nodes included in the dictionary tree reaches a preset number.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

A text clustering method and apparatus, an electronic device, and a storage medium. The method comprises: acquiring a target text data set to be clustered (110); for each piece of target text data in the target text data set, calculating a first importance score of one or more words in the piece of target text data, and sorting said words in the piece of target text data on the basis of the first importance score so as to generate a word sequence to be searched corresponding to the piece of target text data (120); for each word sequence to be searched, searching a pre-built dictionary tree for a target word sequence adapted to the word sequence, the target word sequence belonging to a subsequence of the word sequence (130); and, according to one or more target word sequences, respectively clustering the target text data corresponding to said target word sequences so as to obtain a text clustering result (140).
PCT/CN2021/136677 2020-12-31 2021-12-09 Procédé et appareil de regroupement de texte, dispositif électronique et support de stockage WO2022143069A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011630633.2A CN112632285A (zh) 2020-12-31 2020-12-31 一种文本聚类方法、装置、电子设备及存储介质
CN202011630633.2 2020-12-31

Publications (1)

Publication Number Publication Date
WO2022143069A1 true WO2022143069A1 (fr) 2022-07-07

Family

ID=75290541

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/136677 WO2022143069A1 (fr) 2020-12-31 2021-12-09 Procédé et appareil de regroupement de texte, dispositif électronique et support de stockage

Country Status (2)

Country Link
CN (1) CN112632285A (fr)
WO (1) WO2022143069A1 (fr)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117875262A (zh) * 2024-03-12 2024-04-12 青岛天一红旗软控科技有限公司 基于管理平台的数据处理方法
CN117891411A (zh) * 2024-03-14 2024-04-16 济宁蜗牛软件科技有限公司 一种海量档案数据优化存储方法
CN118012979A (zh) * 2024-04-10 2024-05-10 济南宝林信息技术有限公司 一种普通外科手术智能采集存储系统

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112632285A (zh) * 2020-12-31 2021-04-09 北京有竹居网络技术有限公司 一种文本聚类方法、装置、电子设备及存储介质
CN117811851B (zh) * 2024-03-01 2024-05-17 深圳市聚亚科技有限公司 一种4g通信模块数据传输方法

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120166441A1 (en) * 2010-12-23 2012-06-28 Microsoft Corporation Keywords extraction and enrichment via categorization systems
CN109508456A (zh) * 2018-10-22 2019-03-22 网易(杭州)网络有限公司 一种文本处理方法和装置
CN110472043A (zh) * 2019-07-03 2019-11-19 阿里巴巴集团控股有限公司 一种针对评论文本的聚类方法及装置
CN111651596A (zh) * 2020-05-27 2020-09-11 软通动力信息技术有限公司 一种文本聚类的方法、装置、服务器及存储介质
CN112632285A (zh) * 2020-12-31 2021-04-09 北京有竹居网络技术有限公司 一种文本聚类方法、装置、电子设备及存储介质

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106713273B (zh) * 2016-11-23 2019-08-09 中国空间技术研究院 一种基于字典树剪枝搜索的协议关键字识别方法
CN109740165A (zh) * 2019-01-09 2019-05-10 网易(杭州)网络有限公司 字典树构建方法、语句搜索方法、装置、设备及存储介质
CN111090719B (zh) * 2019-10-11 2024-05-03 平安科技(上海)有限公司 文本分类方法、装置、计算机设备及存储介质
CN110826605A (zh) * 2019-10-24 2020-02-21 北京明略软件系统有限公司 一种跨平台识别用户的方法及装置
CN111221968B (zh) * 2019-12-31 2023-07-21 北京航空航天大学 基于学科树聚类的作者消歧方法及装置
CN112115232A (zh) * 2020-09-24 2020-12-22 腾讯科技(深圳)有限公司 一种数据纠错方法、装置及服务器

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120166441A1 (en) * 2010-12-23 2012-06-28 Microsoft Corporation Keywords extraction and enrichment via categorization systems
CN109508456A (zh) * 2018-10-22 2019-03-22 网易(杭州)网络有限公司 一种文本处理方法和装置
CN110472043A (zh) * 2019-07-03 2019-11-19 阿里巴巴集团控股有限公司 一种针对评论文本的聚类方法及装置
CN111651596A (zh) * 2020-05-27 2020-09-11 软通动力信息技术有限公司 一种文本聚类的方法、装置、服务器及存储介质
CN112632285A (zh) * 2020-12-31 2021-04-09 北京有竹居网络技术有限公司 一种文本聚类方法、装置、电子设备及存储介质

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117875262A (zh) * 2024-03-12 2024-04-12 青岛天一红旗软控科技有限公司 基于管理平台的数据处理方法
CN117875262B (zh) * 2024-03-12 2024-06-04 青岛天一红旗软控科技有限公司 基于管理平台的数据处理方法
CN117891411A (zh) * 2024-03-14 2024-04-16 济宁蜗牛软件科技有限公司 一种海量档案数据优化存储方法
CN118012979A (zh) * 2024-04-10 2024-05-10 济南宝林信息技术有限公司 一种普通外科手术智能采集存储系统

Also Published As

Publication number Publication date
CN112632285A (zh) 2021-04-09

Similar Documents

Publication Publication Date Title
WO2022143069A1 (fr) Procédé et appareil de regroupement de texte, dispositif électronique et support de stockage
US10649770B2 (en) κ-selection using parallel processing
CN111221984A (zh) 多模态内容处理方法、装置、设备及存储介质
CN112840336A (zh) 用于对内容项推荐进行排名的技术
US8930342B2 (en) Enabling multidimensional search on non-PC devices
CN107301195B (zh) 生成用于搜索内容的分类模型方法、装置和数据处理系统
JP2022046759A (ja) 検索方法、装置、電子機器及び記憶媒体
WO2023160500A1 (fr) Procédé et appareil d'affichage d'informations encyclopédiques, dispositif et support
JP2022191412A (ja) マルチターゲット画像テキストマッチングモデルのトレーニング方法、画像テキスト検索方法と装置
US20210374344A1 (en) Method for resource sorting, method for training sorting model and corresponding apparatuses
US9407589B2 (en) System and method for following topics in an electronic textual conversation
US11836174B2 (en) Method and apparatus of establishing similarity model for retrieving geographic location
WO2022156730A1 (fr) Procédé et appareil de traitement de texte, dispositif, et support
CN110275962B (zh) 用于输出信息的方法和装置
CN114385780B (zh) 程序接口信息推荐方法、装置、电子设备和可读介质
JP7140913B2 (ja) 映像配信時効の決定方法及び装置
CN113407814B (zh) 文本搜索方法、装置、可读介质及电子设备
CN113204691B (zh) 一种信息展示方法、装置、设备及介质
CN112287206A (zh) 信息处理方法、装置和电子设备
CN110209781B (zh) 一种文本处理方法、装置以及相关设备
CN111400456A (zh) 资讯推荐方法及装置
CN114298007A (zh) 一种文本相似度确定方法、装置、设备及介质
CN117131281B (zh) 舆情事件处理方法、装置、电子设备和计算机可读介质
CN113536763A (zh) 一种信息处理方法、装置、设备及存储介质
US20230085684A1 (en) Method of recommending data, electronic device, and medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21913810

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21913810

Country of ref document: EP

Kind code of ref document: A1