WO2022126944A1 - Method for text clustering, electronic device and storage medium - Google Patents

Method for text clustering, electronic device and storage medium

Info

Publication number
WO2022126944A1
WO2022126944A1 (PCT/CN2021/087169, CN2021087169W)
Authority
WO
WIPO (PCT)
Prior art keywords
text
texts
clusters
density
cluster
Prior art date
Application number
PCT/CN2021/087169
Other languages
English (en)
Chinese (zh)
Inventor
尹扬
郭鹏华
Original Assignee
上海朝阳永续信息技术股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 上海朝阳永续信息技术股份有限公司
Publication of WO2022126944A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/216Parsing using statistical methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates

Definitions

  • Embodiments of the present disclosure generally relate to the field of information processing, and in particular, to a method, an electronic device, and a computer storage medium for text clustering.
  • Because daily news focus topics are unpredictable and not fixed, commonly used clustering models, such as K-Means clustering, are unsuitable: they need the number of clusters K to be specified in advance.
  • For a supervised text classification model, not only do the categories of the text need to be pre-specified, but manually labeled training data is also required for machine learning training. These prerequisites are likewise impossible to satisfy for the huge amount of unknown news that emerges every day.
  • Embodiments of the present disclosure therefore provide a method, an electronic device, and a computer storage medium for text clustering that can realize multi-level text clustering.
  • According to a first aspect, a method for text clustering includes: determining, based on a text library, the term frequency-inverse document frequency (TF-IDF) of each word in a plurality of first texts to be clustered; removing entity identifiers from the plurality of text titles in the plurality of first texts to generate a plurality of entity-free titles; determining a plurality of first feature representations associated with the plurality of entity-free titles based on the TF-IDF of each word in the plurality of entity-free titles and a bag-of-words model; performing density clustering on the plurality of first texts based on the plurality of first feature representations and a first density radius to generate a plurality of first text clusters and a plurality of unclustered second texts; determining a plurality of second feature representations associated with the plurality of second texts based on the TF-IDF of each word in the plurality of second texts and the bag-of-words model; and performing density clustering on the plurality of second texts based on the plurality of second feature representations and a second density radius to generate a plurality of second text clusters, the second density radius being greater than the first density radius.
  • According to a second aspect, an electronic device includes at least one processor and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor, the instructions, when executed by the at least one processor, causing the at least one processor to perform the method according to the first aspect.
  • According to a third aspect, a computer-readable storage medium has stored thereon a computer program which, when executed by a processor, implements the method according to the first aspect of the present disclosure.
  • FIG. 1 is a schematic diagram of an information processing environment 100 according to an embodiment of the present disclosure.
  • FIG. 2 is a flowchart of a method 200 for text clustering according to an embodiment of the present disclosure.
  • FIG. 3 is a flowchart of a method 300 for cluster segmentation according to an embodiment of the present disclosure.
  • FIG. 4 is a flowchart of a method 400 for generating cluster titles according to an embodiment of the present disclosure.
  • FIG. 5 is a block diagram of an electronic device used to implement the method for text clustering according to an embodiment of the present disclosure.
  • the term “including” and variations thereof mean open-ended inclusion, i.e., “including but not limited to”.
  • the term “or” means “and/or” unless specifically stated otherwise.
  • the term “based on” means “based at least in part on”.
  • the terms “one example embodiment” and “one embodiment” mean “at least one example embodiment.”
  • the term “another embodiment” means “at least one additional embodiment.”
  • the terms “first”, “second”, etc. may refer to different or the same objects. Other explicit and implicit definitions may also be included below.
  • example embodiments of the present disclosure propose a scheme for text clustering.
  • In the scheme, based on a text library, the term frequency-inverse document frequency (TF-IDF) of each word in a plurality of first texts to be clustered is determined; entity identifiers are removed from the plurality of text titles in the plurality of first texts to generate a plurality of entity-free titles; a plurality of first feature representations associated with the plurality of entity-free titles are determined based on the TF-IDF of each word in the plurality of entity-free titles and a bag-of-words model; density clustering is performed on the plurality of first texts based on the plurality of first feature representations and a first density radius to generate a plurality of first text clusters and a plurality of unclustered second texts; a plurality of second feature representations associated with the plurality of second texts are determined based on the TF-IDF of each word in the plurality of second texts and the bag-of-words model; and density clustering is performed on the plurality of second texts based on the plurality of second feature representations and a second density radius to generate a plurality of second text clusters, the second density radius being greater than the first density radius.
  • FIG. 1 shows a schematic diagram of an example of an information processing environment 100 according to an embodiment of the present disclosure.
  • the information processing environment 100 may include a computing device 110, a plurality of first texts 120-1 to 120-n (collectively 120) to be clustered, a text library 130, and a clustering result 140 of the plurality of first texts 120.
  • computing device 110 includes, but is not limited to, personal computers, desktop computers, server computers, multiprocessor systems, mainframe computers, distributed computing environments including any of the foregoing systems or devices, and the like, for example.
  • The computing device 110 may have one or more processing units, including special-purpose processing units such as graphics processing units (GPUs), field-programmable gate arrays (FPGAs), and application-specific integrated circuits (ASICs), and general-purpose processing units such as central processing units (CPUs).
  • each of the plurality of first texts 120 includes, for example, a text title and a text body.
  • the plurality of first texts 120 include, but are not limited to, a plurality of news texts, for example.
  • The text library 130 may include a large number of texts, for example millions. The inverse document frequency of each word in the texts of the text library 130 may be determined in advance and stored in the text library 130 for subsequent use.
  • The computing device 110 is configured to: determine, based on the text library 130, the term frequency-inverse document frequency of each word in the plurality of first texts 120 to be clustered; remove entity identifiers from the plurality of text titles in the plurality of first texts 120 to generate a plurality of entity-free titles; determine a plurality of first feature representations associated with the plurality of entity-free titles based on the term frequency-inverse document frequency of each word in the plurality of entity-free titles and a bag-of-words model; perform density clustering on the plurality of first texts 120 based on the plurality of first feature representations and a first density radius to generate a plurality of first text clusters and a plurality of unclustered second texts; determine a plurality of second feature representations associated with the plurality of second texts based on the term frequency-inverse document frequency of each word in the plurality of second texts and the bag-of-words model; and perform density clustering on the plurality of second texts based on the plurality of second feature representations and a second density radius to generate a plurality of second text clusters, the second density radius being greater than the first density radius.
  • FIG. 2 shows a flowchart of a method 200 for text clustering according to an embodiment of the present disclosure.
  • method 200 may be performed by computing device 110 as shown in FIG. 1 . It should be understood that method 200 may also include additional blocks not shown and/or blocks shown may be omitted, and the scope of the present disclosure is not limited in this regard.
  • the computing device 110 determines, based on the text library 130, the term frequency-inverse document frequency for each term in the plurality of first texts 120 to be clustered.
  • the plurality of first texts 120 include, but are not limited to, a plurality of news texts, for example.
  • the first text may include, for example, a text title and a text body, such as a news headline and a news body.
  • For each first text, the computing device 110 may segment the first text into a plurality of words. The computing device 110 then determines the frequency of occurrence of each word in the first text, that is, the word frequency, and determines the inverse document frequency of each word in the text library 130 (the more texts of the text library 130 a word appears in, the lower its inverse document frequency). The word frequency of each word is then multiplied by its inverse document frequency to generate the term frequency-inverse document frequency of each word.
  • The inverse document frequency formula may be, for example, IDF(x) = log(N / N(x)), where N is the total number of texts in the text library 130, and N(x) is the number of texts in the text library 130 that include the word x.
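  • For illustration, a minimal sketch of this computation in Python, assuming the library IDF values are precomputed in a dictionary idf and that tokenize is a word-segmentation function (both hypothetical names; the word frequency here is the occurrence count normalized by text length, one common convention):
```python
import math
from collections import Counter

def tf_idf(text, idf, tokenize, default_idf=0.0):
    """TF-IDF of each word in `text`, following the scheme described above.

    idf: dict mapping word -> precomputed inverse document frequency over the
         text library, e.g. idf[x] = log(N / N(x)).
    tokenize: hypothetical word-segmentation function, text -> list of words.
    default_idf: fallback for words never seen in the library (an assumption).
    """
    words = tokenize(text)
    counts = Counter(words)
    total = len(words) or 1
    # Word frequency in the text multiplied by the library-wide IDF.
    return {w: (c / total) * idf.get(w, default_idf) for w, c in counts.items()}
```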
  • The computing device 110 then removes entity identifiers from the plurality of text titles in the plurality of first texts 120 to generate a plurality of entity-free titles.
  • For example, the computing device 110 may perform entity recognition on the plurality of text titles to determine entity identifiers, such as company names, in the plurality of text titles. Subsequently, the computing device 110 may remove the identified entity identifiers from the plurality of text titles to generate the plurality of entity-free titles.
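  • A sketch of the removal step, assuming an upstream entity recognizer has already been chosen (recognize_entities is a hypothetical stand-in; it could be backed by an NER model or a dictionary lookup):
```python
import re

def remove_entities(title, recognize_entities):
    """Strip recognized entity identifiers (e.g. company names) from a title.

    recognize_entities: hypothetical function, title -> list of entity strings.
    """
    for entity in recognize_entities(title):
        title = title.replace(entity, "")
    # Collapse whitespace left behind by the removals.
    return re.sub(r"\s+", " ", title).strip()
```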
  • The computing device 110 determines a plurality of first feature representations associated with the plurality of entity-free titles based on the term frequency-inverse document frequency of each word in the plurality of entity-free titles and a bag-of-words model.
  • For example, for each entity-free title, the computing device 110 may determine the term frequencies-inverse document frequencies of the words included in the entity-free title. Subsequently, the computing device 110 may generate a vector including these term frequencies-inverse document frequencies based on the bag-of-words model. Next, the computing device 110 may L2-norm normalize the vector to generate the first feature representation associated with the entity-free title.
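  • Continuing the sketch, the first feature representation can be built as a bag-of-words vector of TF-IDF values followed by L2 normalization (vocab is a hypothetical fixed word list defining the vector dimensions):
```python
import numpy as np

def title_feature(title_tfidf, vocab):
    """Bag-of-words vector of TF-IDF values, L2-normalized.

    title_tfidf: dict word -> TF-IDF for one entity-free title.
    vocab: list of words fixing the vector dimensions (an assumption).
    """
    vec = np.array([title_tfidf.get(w, 0.0) for w in vocab], dtype=float)
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec
```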
  • The computing device 110 density-clusters the plurality of first texts 120 based on the plurality of first feature representations and the first density radius to generate a plurality of first text clusters and a plurality of unclustered second texts.
  • the computing device 110 may determine the similarity between pairs of texts in the plurality of first texts 120 based on the plurality of first feature representations.
  • the similarity includes, but is not limited to, cosine similarity, for example.
  • For example, the computing device 110 may perform density clustering (Density-Based Spatial Clustering of Applications with Noise, DBSCAN) on the plurality of first texts 120 based on the similarity and the first density radius to generate the plurality of first text clusters and the plurality of unclustered second texts.
  • The distance between a pair of texts in the plurality of first texts 120 may be represented as 1 - similarity, and the neighborhood of each first text may be determined from these pairwise distances and the first density radius.
  • For a first text A, its neighborhood includes the first texts among the plurality of first texts 120 whose distance from the first text A is smaller than the first density radius.
  • If the neighborhood of the first text A contains at least a predetermined number of texts, the first text A is called a core text.
  • If a first text B is located in the neighborhood of a first text A and the first text A is a core text, the first text B is said to be directly density-reachable from the first text A.
  • If the first text B can be reached from the first text A through a chain of such steps, the first text B is said to be density-reachable from the first text A.
  • An unclustered core text among the plurality of first texts 120 may be selected as a seed, and the set of first texts density-reachable from that core text may be determined as a first text cluster. This selection and determination process is repeated until all core texts are clustered, thereby generating the plurality of first text clusters and the plurality of unclustered second texts.
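  • A sketch of this first round of clustering using scikit-learn's DBSCAN, with cosine distance (1 - similarity) as the metric and the first density radius as eps; min_samples stands in for the predetermined neighborhood size, and the radii in the usage comment are hypothetical values:
```python
import numpy as np
from sklearn.cluster import DBSCAN

def density_cluster(features, density_radius, min_samples=3):
    """Cluster L2-normalized feature vectors; return clusters and noise."""
    labels = DBSCAN(eps=density_radius, min_samples=min_samples,
                    metric="cosine").fit_predict(features)
    clusters = {}      # cluster label -> indices of the texts in that cluster
    unclustered = []   # indices of unclustered texts (DBSCAN noise, label -1)
    for i, lab in enumerate(labels):
        if lab == -1:
            unclustered.append(i)
        else:
            clusters.setdefault(lab, []).append(i)
    return clusters, unclustered

# Usage: first_clusters, second_texts = density_cluster(first_features, 0.4)
# The unclustered "second texts" then go through the second round with a
# larger radius, e.g. density_cluster(second_features, 0.6).
```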
  • the computing device 110 determines a plurality of second feature representations associated with the plurality of second texts based on the term frequency-inverse document frequency and the bag-of-words model for each term in the plurality of second texts.
  • For example, the computing device 110 may determine, based on the term frequency-inverse document frequency of each word in the plurality of second texts and the bag-of-words model, a plurality of third feature representations associated with the text titles of the plurality of second texts and a plurality of fourth feature representations associated with the text bodies of the plurality of second texts.
  • multiple term frequencies-inverse document frequencies of multiple words included in the text title may be vectorized to generate a third feature representation associated with the text title.
  • multiple term frequencies-inverse document frequencies of multiple words included in the text body can be vectorized to generate a fourth feature representation associated with the text body.
  • The computing device 110 may then weight the plurality of third feature representations and the plurality of fourth feature representations based on predetermined weights to generate the plurality of second feature representations associated with the plurality of second texts. For example, for a second text whose title's third feature representation is Vt and whose body's fourth feature representation is Vc, given a predetermined weight wt, the second feature representation may be computed as V = wt * Vt + (1 - wt) * Vc. L2-norm normalization may also be performed on the second feature representation.
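  • A sketch of the weighting step; the combination V = wt * Vt + (1 - wt) * Vc is one natural reading of the "predetermined weights" above, and wt = 0.5 is a placeholder value:
```python
import numpy as np

def second_feature(vt, vc, wt=0.5):
    """Weighted combination of title feature Vt and body feature Vc."""
    v = wt * vt + (1.0 - wt) * vc
    norm = np.linalg.norm(v)   # optional L2 normalization, as noted above
    return v / norm if norm > 0 else v
```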
  • The computing device 110 density-clusters the plurality of second texts based on the plurality of second feature representations and the second density radius to generate a plurality of second text clusters, the second density radius being greater than the first density radius.
  • the computing device 110 may determine the similarity between pairs of texts in the plurality of second texts based on the plurality of second feature representations.
  • the similarity includes, but is not limited to, cosine similarity, for example.
  • the computing device 110 may perform density clustering on the plurality of second texts based on the similarity and the second density radius to generate a plurality of second text clusters.
  • In this way, multi-level text clustering can be realized.
  • The scheme can cluster texts such as news information according to different focuses and dimensions of the articles, and supports more fine-grained, hierarchical clustering of subdivided topics under large topics, which solves the problems of conventional clustering: dimensions and granularity that are too single, lack of hierarchy, and inaccurate clustering.
  • However, because density clustering expands continuously through density-connected samples, it can easily join two different classes of text through individual connection points. For example, suppose there are two clusters A and B that discuss two unrelated events E1 and E2, respectively. If a news item x is a news roundup that mentions both event E1 and event E2, then x may be adjacent to some articles in cluster A and in cluster B at the same time. Density clustering is then likely to merge cluster A and cluster B into one large cluster, so that A and B can no longer be separated. News roundups and other review-style articles therefore have a large impact on density clustering. Moreover, such a mixed large cluster may extend in a chain-like manner, for example concatenating multiple clusters A-B-C-D... into a single class.
  • For this reason, the present disclosure also provides a cluster segmentation method: after the plurality of second text clusters are generated, cluster segmentation is performed on the plurality of second text clusters.
  • FIG. 3 shows a flowchart of a method 300 for cluster segmentation according to an embodiment of the present disclosure.
  • method 300 may be performed by computing device 110 as shown in FIG. 1 . It should be understood that method 300 may also include additional blocks not shown and/or blocks shown may be omitted, and the scope of the present disclosure is not limited in this regard.
  • the method 300 may include, for each second text cluster of the plurality of second text clusters, the computing device 110 performing the following steps.
  • The computing device 110 determines a core text set from the second text cluster, where any core text in the core text set has a distance smaller than the second density radius from at least a predetermined number of texts in the second text cluster.
  • the predetermined number may be, for example, any suitable number.
  • The computing device 110 determines a segmented text set from the core text set, where for any segmented text in the segmented text set, the plurality of core texts remaining after that segmented text is removed from the core text set can be divided into a plurality of first connected subsets.
  • For example, the computing device 110 may traverse each core text in the core text set and, for each one, search for first connected subsets among the plurality of core texts remaining after that core text is removed from the core text set.
  • The computing device 110 may search for first connected subsets by the following steps: select any core text p in the set V composed of the remaining core texts as a seed, and search V for the set P1 composed of all neighbor core texts of p; then search for the set P2 composed of all not-yet-found neighbor core texts of the core texts in P1; then search for the set P3 composed of all not-yet-found neighbor core texts of the core texts in P2; and repeat this cycle until no new neighbor core text can be found. Assuming the sets P1, P2, P3, ..., Pn are finally obtained, all core texts in these n sets together with the seed core text p form one first connected subset.
  • All first connected subsets of V can be found by repeating the above steps for the core texts of V that remain unvisited. If only one set is output, that is, V itself is returned, then V is connected as a whole and cannot be divided into multiple first connected subsets.
  • A first connected subset has the following characteristics: (1) the interior of the subset is connected, that is, between any text x and any other text y in the subset a path x-p1-p2-...-pn-y can be found, where x is adjacent to p1, p1 is adjacent to p2, ..., and pn is adjacent to y; (2) the distance between any text in one subset and any text in another subset must be greater than the second density radius, that is, no such path can be found connecting two texts in different subsets.
  • Here, "neighbor" or "adjacent" means that the distance between two texts does not exceed the second density radius.
  • If the remaining core texts can be divided into multiple first connected subsets, the removed core text is determined to be a segmented text; if the remaining core texts cannot be so divided, the removed core text is not a segmented text.
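  • A sketch of the connected-subset search and the segmented-text test, assuming adj is a hypothetical precomputed structure mapping each core text (by index) to the set of its neighbor core texts, i.e. those within the second density radius:
```python
def connected_subsets(core, adj):
    """Split core texts into connected subsets via the breadth-first search
    described above (seed p, then neighbor sets P1, P2, ... until no new
    neighbor core text is found)."""
    remaining, subsets = set(core), []
    while remaining:
        seed = remaining.pop()
        component = {seed}
        frontier = {seed}
        while frontier:
            frontier = {n for p in frontier for n in adj[p]
                        if n in remaining} - component
            component |= frontier
        remaining -= component
        subsets.append(component)
    return subsets

def segmented_texts(core, adj):
    """A core text is a segmented text if removing it splits the remaining
    core texts into more than one first connected subset."""
    cuts = []
    for c in core:
        rest = [x for x in core if x != c]
        rest_adj = {x: adj[x] - {c} for x in rest}
        if len(connected_subsets(rest, rest_adj)) > 1:
            cuts.append(c)
    return cuts
```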
  • The computing device 110 then removes, from the second text cluster, the segmented text set and the non-core texts whose distance from the segmented text set is less than the second density radius, to generate a remaining text set.
  • the computing device 110 determines whether the remaining text set can be divided into a plurality of second connected subsets.
  • whether the remaining text set can be divided into a plurality of second connected subsets is determined by searching the remaining text set for the second connected subset.
  • the specific search process is similar to that of the first connected subset, which can be referred to above, and will not be repeated here.
  • If the computing device 110 determines at block 308 that the remaining text set can be divided into a plurality of second connected subsets, then at block 310 the plurality of second connected subsets are used as a plurality of third text clusters after segmentation. For example, the texts in the plurality of second connected subsets are marked with corresponding third text cluster labels.
  • The computing device 110 then divides, based on the second density radius, the segmented text set and the non-core texts whose distance from the segmented text set is less than the second density radius (hereinafter, the segmented text set and its non-core neighbor text set) into the plurality of third text clusters.
  • Dividing a text into a third text cluster means labeling it with that cluster's label. For example, to decide whether a text P belongs to a third text cluster A, it is determined whether P is adjacent to a core text Pa in cluster A (the distance between the texts is less than the second density radius); if P is adjacent to Pa, then P belongs to the third text cluster A and is marked with label A. If P is adjacent to core texts in multiple third text clusters, multiple different cluster labels may be applied.
  • For the texts in the segmented text set and its non-core neighbor text set that are not classified into any third text cluster: if these texts include a segmented text and the non-core neighbor texts associated with that segmented text (the non-core texts whose distance from the segmented text is less than the second density radius), then the segmented text and its associated non-core neighbor texts are treated as a new cluster, for example by assigning a new cluster label; otherwise, the remaining texts are set as noise.
  • In some embodiments, the computing device 110 may also determine the maximum distance between two texts in a text cluster, and if it is determined that the maximum distance is greater than a threshold, perform density clustering on the text cluster based on the set of second feature representations and a third density radius to generate a plurality of fourth text clusters, the third density radius being smaller than the second density radius.
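  • A sketch of this optional refinement, assuming a hypothetical spread threshold on the maximum pairwise cosine distance within a cluster:
```python
from sklearn.cluster import DBSCAN
from sklearn.metrics.pairwise import cosine_distances

def maybe_recluster(cluster_features, third_radius, max_spread, min_samples=3):
    """Re-cluster an overly spread-out text cluster with the smaller third
    density radius; returns None if the cluster is compact enough."""
    if cosine_distances(cluster_features).max() <= max_spread:
        return None
    return DBSCAN(eps=third_radius, min_samples=min_samples,
                  metric="cosine").fit_predict(cluster_features)
```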
  • The above ordered three rounds of density clustering and one round of cluster segmentation overcome the problems of conventional clustering, namely dimensions and granularity that are too single, lack of hierarchy, and inaccurate clustering.
  • the present disclosure also provides a method for generating cluster titles.
  • FIG. 4 shows a flowchart of a method 400 for generating cluster titles according to an embodiment of the present disclosure.
  • method 400 may be performed by computing device 110 as shown in FIG. 1 . It should be understood that method 400 may also include additional blocks not shown and/or blocks shown may be omitted, as the scope of the present disclosure is not limited in this regard.
  • The method 400 may include the computing device 110 performing the following steps for at least one of the plurality of first text clusters, the plurality of second text clusters, the plurality of third text clusters, and the plurality of fourth text clusters.
  • The computing device 110 divides the plurality of text titles in the text cluster into a plurality of title fragments based on punctuation, such as commas, exclamation marks, question marks, semicolons, spaces, and the like. In some embodiments, specific rules may also be applied, for example removing commas and spaces at the beginning and end of each fragment.
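  • A sketch of the splitting step; the punctuation set mirrors the examples above (in ASCII and full-width forms) and the trimming rule is an illustrative choice:
```python
import re

def title_fragments(title):
    """Split a title into fragments on commas, exclamation marks, question
    marks, semicolons and spaces, then drop empty pieces."""
    parts = re.split(r"[,，!！?？;；\s]+", title)
    return [p for p in (part.strip(",， ") for part in parts) if p]
```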
  • The computing device 110 determines a plurality of first scores associated with the plurality of title fragments based on the fragment frequency of each word in the plurality of title fragments. A word's fragment frequency is the number of title fragments in which the word appears.
  • The first score of a title fragment may be determined, for example, as the average of the fragment frequencies of the words included in the title fragment.
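  • A sketch of the scoring step, reusing the hypothetical tokenize function from earlier; it returns both the per-fragment first scores and the word frequency counter, which the title-selection sketch below also uses:
```python
from collections import Counter

def first_scores(fragments, tokenize):
    """First score of each fragment = average fragment frequency of its words."""
    frag_words = [set(tokenize(f)) for f in fragments]
    freq = Counter(w for words in frag_words for w in words)
    scores = [sum(freq[w] for w in words) / len(words) if words else 0.0
              for words in frag_words]
    return scores, freq
```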
  • The computing device 110 determines a plurality of feature representations associated with the plurality of title fragments based on the term frequency-inverse document frequency of each word in the plurality of title fragments and the bag-of-words model. For example, for each title fragment, a vector is generated as the feature representation of that title fragment.
  • the computing device 110 density-clusters the plurality of title fragments based on the plurality of feature representations associated with the plurality of title fragments to generate a plurality of fragment clusters.
  • computing device 110 determines a plurality of second scores associated with the plurality of fragment clusters based on the plurality of first scores associated with the plurality of title fragments.
  • the plurality of first scores for the plurality of title fragments included in the fragment cluster may be summed as the second score for the fragment cluster.
  • The computing device 110 determines, from the plurality of fragment clusters, the first fragment cluster, that is, the fragment cluster with the highest second score.
  • The computing device 110 determines, as the cluster title of the text cluster, the shortest title fragment among the title fragments in the first fragment cluster that contain the word with the highest fragment frequency.
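  • A sketch of the final selection, given the fragment clustering labels and the outputs of the scoring sketch above; here the word with the highest fragment frequency is taken over all fragments of the text cluster (the text leaves the exact scope open), and word containment is checked by simple substring matching, both illustrative choices:
```python
def cluster_title(fragments, frag_labels, scores, freq):
    """Pick the shortest fragment, within the best-scoring fragment cluster,
    that contains the word with the highest fragment frequency."""
    second = {}   # fragment cluster label -> sum of its fragments' first scores
    for lab, s in zip(frag_labels, scores):
        if lab != -1:                      # ignore DBSCAN noise fragments
            second[lab] = second.get(lab, 0.0) + s
    if not second:                         # all fragments were noise: fall back
        return min(fragments, key=len)
    best = max(second, key=second.get)
    members = [f for f, lab in zip(fragments, frag_labels) if lab == best]
    top_word = max(freq, key=freq.get)
    candidates = [f for f in members if top_word in f] or members
    return min(candidates, key=len)
```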
  • In this way, a title that summarizes the subject content of each cluster can be generated automatically, so that readers can get a general understanding of the main content of each cluster without viewing the specific articles in it, and can then select a cluster of interest and read the articles in detail, further improving the efficiency with which readers grasp information.
  • In some embodiments, the computing device 110 may also present the plurality of first text clusters, the plurality of second text clusters, the plurality of third text clusters, and the plurality of fourth text clusters, together with the cluster titles associated with each of them.
  • In some embodiments, the computing device 110 may also obtain search results based on search keywords input by a user, the search results including the plurality of first texts to be clustered, and may then present the plurality of first text clusters, the plurality of second text clusters, the plurality of third text clusters, and the plurality of fourth text clusters, together with the cluster titles associated with each of them.
  • The present disclosure has a wide range of uses and application prospects, and can be used in any scenario or product containing news information.
  • For example, daily news information can be clustered with the solution of the present disclosure and displayed to users in the form of clusters, so as to improve users' reading efficiency. The solution can also be used in news search scenarios: a user enters search keywords, the news set returned by the search is clustered using the solution of the present disclosure, and the final result is returned to the user in the form of news clusters.
  • FIG. 5 shows a schematic block diagram of an example device 500 that may be used to implement embodiments of the present disclosure.
  • computing device 110 as shown in FIG. 1 may be implemented by device 500 .
  • The device 500 includes a central processing unit (CPU) 501 that can perform various appropriate actions and processes according to computer program instructions stored in a read-only memory (ROM) 502 or computer program instructions loaded from a storage unit 508 into a random access memory (RAM) 503.
  • In the RAM 503, various programs and data required for the operation of the device 500 can also be stored.
  • the CPU 501, the ROM 502, and the RAM 503 are connected to each other through a bus 504.
  • An input/output (I/O) interface 505 is also connected to the bus 504.
  • Various components in the device 500 are connected to the I/O interface 505, including: an input unit 506, such as a keyboard, mouse, microphone, etc.; an output unit 507, such as various types of displays, speakers, etc.; a storage unit 508, such as a disk, CD, etc.; and a communication unit 509, such as a network card, a modem, a wireless communication transceiver, and the like.
  • the communication unit 509 allows the device 500 to exchange information/data with other devices through a computer network such as the Internet and/or various telecommunication networks.
  • methods 200 - 400 may be performed by central processing unit 501 .
  • methods 200 - 400 may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as storage unit 508 .
  • part or all of the computer program may be loaded and/or installed on device 500 via ROM 502 and/or communication unit 509 .
  • When the computer program is loaded into the RAM 503 and executed by the central processing unit 501, one or more of the actions of the methods 200-400 described above may be performed.
  • a computer program product may include computer readable program instructions for carrying out various aspects of the present disclosure.
  • a computer-readable storage medium may be a tangible device that can hold and store instructions for use by the instruction execution device.
  • the computer-readable storage medium may be, for example, but not limited to, an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing.
  • A non-exhaustive list of computer-readable storage media includes: portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), static random access memory (SRAM), portable compact disc read-only memory (CD-ROM), digital versatile discs (DVD), memory sticks, floppy disks, mechanically encoded devices such as punched cards or raised structures in grooves with instructions stored thereon, and any suitable combination of the above.
  • Computer-readable storage media are not to be construed as transient signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (e.g., light pulses through fiber-optic cables), or electrical signals transmitted through electrical wires.
  • the computer readable program instructions described herein may be downloaded to various computing/processing devices from a computer readable storage medium, or to an external computer or external storage device over a network such as the Internet, a local area network, a wide area network, and/or a wireless network.
  • the network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers.
  • A network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in the respective computing/processing device.
  • The computer program instructions for carrying out the operations of the present disclosure may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, or source or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk and C++, and conventional procedural programming languages such as the "C" language or similar programming languages.
  • The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server.
  • The remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (e.g., connected through the Internet using an Internet service provider).
  • In some embodiments, custom electronic circuits, such as programmable logic circuits, field-programmable gate arrays (FPGAs), or programmable logic arrays (PLAs), can be personalized using state information of the computer-readable program instructions, and these electronic circuits can execute the computer-readable program instructions to implement various aspects of the present disclosure.
  • These computer-readable program instructions may be provided to a processing unit of a general-purpose computer, a special-purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, when executed by the processing unit of the computer or other programmable data processing apparatus, produce means for implementing the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.
  • These computer-readable program instructions may also be stored in a computer-readable storage medium; the instructions cause a computer, programmable data processing apparatus, and/or other equipment to operate in a specific manner, so that the computer-readable medium on which the instructions are stored comprises an article of manufacture including instructions for implementing various aspects of the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.
  • Computer-readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other equipment to cause a series of operational steps to be performed on the computer, other programmable data processing apparatus, or other equipment so as to produce a computer-implemented process, so that the instructions executing on the computer, other programmable data processing apparatus, or other device implement the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.
  • Each block in the flowcharts or block diagrams may represent a module, segment, or portion of instructions that comprises one or more executable instructions for implementing the specified logical function(s).
  • the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
  • Each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented in dedicated hardware-based systems that perform the specified functions or actions, or in a combination of dedicated hardware and computer instructions.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Embodiments of the present disclosure relate to a method for text clustering, a device, and a storage medium, in the field of information processing. The method comprises: determining a term frequency-inverse document frequency (TF-IDF) of each word in a plurality of first texts; removing entity identifiers from a plurality of text titles of the plurality of first texts to generate a plurality of entity-free titles; determining, according to the TF-IDF, a plurality of first feature representations associated with the plurality of entity-free titles; performing, according to the plurality of first feature representations and a first density radius, density clustering on the plurality of first texts to generate a plurality of first text clusters and a plurality of unclustered second texts; determining, according to the TF-IDF, a plurality of second feature representations associated with the plurality of second texts; and performing, according to the plurality of second feature representations and a second density radius, density clustering on the plurality of second texts to generate a plurality of second text clusters, the second density radius being greater than the first density radius. Multi-level text clustering can thus be realized.
PCT/CN2021/087169 2020-12-17 2021-04-14 Method for text clustering, electronic device and storage medium WO2022126944A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011491126.5 2020-12-17
CN202011491126.5A CN112256842B (zh) Method for text clustering, electronic device and storage medium

Publications (1)

Publication Number Publication Date
WO2022126944A1 true WO2022126944A1 (fr) 2022-06-23

Family

ID=74225827

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/087169 WO2022126944A1 (fr) 2020-12-17 2021-04-14 Method for text clustering, electronic device and storage medium

Country Status (2)

Country Link
CN (1) CN112256842B (fr)
WO (1) WO2022126944A1 (fr)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115544213A (zh) * 2022-11-28 2022-12-30 上海朝阳永续信息技术股份有限公司 Method, device and storage medium for obtaining information from text
CN118016225A (zh) * 2024-04-09 2024-05-10 山东第一医科大学附属省立医院(山东省立医院) Intelligent management method for post-kidney-transplant electronic health record data
CN118569254A (zh) * 2024-08-01 2024-08-30 浙江度衍信息技术有限公司 NLP-based official document data collection and analysis method and system

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112256842B (zh) * 2020-12-17 2021-03-26 上海朝阳永续信息技术股份有限公司 Method for text clustering, electronic device and storage medium
CN113011155B (zh) * 2021-03-16 2023-09-05 北京百度网讯科技有限公司 Method, apparatus, device and storage medium for text matching
CN113792784B (zh) * 2021-09-14 2022-06-21 上海任意门科技有限公司 Method for user clustering, electronic device and storage medium
CN115577124B (zh) * 2022-11-10 2023-04-07 上海朝阳永续信息技术股份有限公司 Method, device and medium for interacting with financial data

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170337262A1 (en) * 2016-05-19 2017-11-23 Quid, Inc. Pivoting from a graph of semantic similarity of documents to a derivative graph of relationships between entities mentioned in the documents
CN110209808A (zh) * 2018-08-08 2019-09-06 腾讯科技(深圳)有限公司 Event generation method based on text information and related apparatus
CN110489558A (zh) * 2019-08-23 2019-11-22 网易传媒科技(北京)有限公司 Article aggregation method and apparatus, medium and computing device
CN111143479A (zh) * 2019-12-10 2020-05-12 浙江工业大学 Knowledge graph relation extraction and REST service visualization fusion method based on the DBSCAN clustering algorithm
CN112256842A (zh) * 2020-12-17 2021-01-22 上海朝阳永续信息技术股份有限公司 Method for text clustering, electronic device and storage medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108520009A (zh) * 2018-03-19 2018-09-11 北京工业大学 English text clustering method and system
CN109189934B (zh) * 2018-11-13 2024-07-19 平安科技(深圳)有限公司 Public opinion recommendation method and apparatus, computer device and storage medium
CN111985228B (zh) * 2020-07-28 2023-05-30 招联消费金融有限公司 Text keyword extraction method and apparatus, computer device and storage medium

Also Published As

Publication number Publication date
CN112256842B (zh) 2021-03-26
CN112256842A (zh) 2021-01-22

Similar Documents

Publication Publication Date Title
WO2022126944A1 (fr) Procédé de regroupement de textes, dispositif électronique et support de stockage
US11734329B2 (en) System and method for text categorization and sentiment analysis
CN107609121B (zh) 基于LDA和word2vec算法的新闻文本分类方法
CN108399228B (zh) 文章分类方法、装置、计算机设备及存储介质
Qian et al. Multi-modal event topic model for social event analysis
Li et al. Filtering out the noise in short text topic modeling
Gaikwad et al. Text mining methods and techniques
US9589208B2 (en) Retrieval of similar images to a query image
US8108413B2 (en) Method and apparatus for automatically discovering features in free form heterogeneous data
WO2022095374A1 (fr) Procédé et appareil d'extraction de mots-clés, ainsi que dispositif terminal et support de stockage
TW202009749A (zh) 人機對話方法、裝置、電子設備及電腦可讀媒體
Fang et al. Word-of-mouth understanding: Entity-centric multimodal aspect-opinion mining in social media
US20130060769A1 (en) System and method for identifying social media interactions
CN110162771B (zh) 事件触发词的识别方法、装置、电子设备
Prastyo et al. Indonesian Sentiment Analysis: An Experimental Study of Four Kernel Functions on SVM Algorithm with TF-IDF
CN107357895B (zh) 一种基于词袋模型的文本表示的处理方法
US20190318191A1 (en) Noise mitigation in vector space representations of item collections
WO2023240878A1 (fr) Procédé et appareil de reconnaissance de ressource, et dispositif et support d'enregistrement
CA3131157A1 (fr) Systeme et procede pour categorisation de texte et analyse de sentiments
Patel et al. Dynamic lexicon generation for natural scene images
WO2020131004A1 (fr) Traitement automatisé indépendant du domaine de texte en forme libre
Negara et al. Topic modeling using latent dirichlet allocation (LDA) on twitter data with Indonesia keyword
Maiya et al. Exploratory analysis of highly heterogeneous document collections
Hunegnaw Sentiment analysis model for Afaan Oromoo short message service text: A machine learning approach
Ahmad et al. News article summarization: Analysis and experiments on basic extractive algorithms

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21904883

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21904883

Country of ref document: EP

Kind code of ref document: A1