CN112100374A - Text clustering method and device, electronic equipment and storage medium - Google Patents

Text clustering method and device, electronic equipment and storage medium Download PDF

Info

Publication number
CN112100374A
Authority
CN
China
Prior art keywords
text
event
determining
events
probability
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010883973.XA
Other languages
Chinese (zh)
Inventor
陈涛
黄丽达
苏国锋
苗雨加
史盼盼
李志鹏
刘鑫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Beijing Global Safety Technology Co Ltd
Original Assignee
Tsinghua University
Beijing Global Safety Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University, Beijing Global Safety Technology Co Ltd filed Critical Tsinghua University
Priority to CN202010883973.XA priority Critical patent/CN112100374A/en
Publication of CN112100374A publication Critical patent/CN112100374A/en
Priority to PCT/CN2021/111903 priority patent/WO2022042297A1/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Abstract

The application provides a text clustering method, a text clustering device, electronic equipment, and a storage medium. The method includes: determining, from an acquired text to be processed and its entity features together with the events and corresponding entity features in an event database, the similar event corresponding to the text among those events; determining the event described by the text from the text and its similar event; and adding the text to the text set under that event. Because the similar event is determined from both the text and its entity features, texts to be processed that differ only slightly can be distinguished, the similar event corresponding to the text to be processed is determined more accurately, and the accuracy of text clustering is improved.

Description

Text clustering method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of emergency response technologies, and in particular to a text clustering method and apparatus, an electronic device, and a storage medium.
Background
At present, in the field of emergency response, social-network text information is difficult to integrate and its value density is too low, so emergency response work cannot rapidly mine emergency situations, such as event propagation and popularity trends, from the internet.
In the related art, these problems are mainly addressed by text clustering methods such as K-Means, density-based clustering (DBSCAN), and expectation-maximization (EM) clustering of Gaussian Mixture Models (GMM). However, these are unsupervised clustering models that place little emphasis on the entity features of the texts in the training set, so texts with small differences are hard to distinguish and clustering accuracy is low.
Disclosure of Invention
An object of the present application is to solve, at least to some extent, one of the above-mentioned technical problems.
Therefore, a first objective of the present application is to provide a text clustering method in which the similar event corresponding to a text among the events is determined from the text and its entity features together with the events and corresponding entity features in an event database, so that the event described by the text can be determined and the text added to the text set under that event. This makes it possible to distinguish texts to be processed that differ only slightly, to determine more accurately the similar event corresponding to the text to be processed in the event database, and to improve the accuracy of text clustering.
A second object of the present application is to provide a text clustering apparatus.
A third object of the present application is to provide an electronic device.
A fourth object of the present application is to propose a computer readable storage medium.
In order to achieve the above object, an embodiment of a first aspect of the present application provides a text clustering method, including: acquiring a text to be processed and corresponding entity characteristics; obtaining an event database, wherein the event database comprises: each event and corresponding entity characteristics, and a text set under each event; according to the text and the corresponding entity characteristics, the events and the corresponding entity characteristics, determining similar events corresponding to the text in the events; determining an event described by the text according to the text and the corresponding similar event; and adding the text to a text set under the event described by the text.
The text clustering method of the embodiment of the application acquires a text to be processed and its entity features; acquires an event database including each event, its corresponding entity features, and the text set under each event; determines, from the text and its entity features together with the events and their entity features, the similar event corresponding to the text among the events; determines the event described by the text from the text and its similar event; and adds the text to the text set under that event. Because the similar event is determined from both the text and its entity features, texts to be processed that differ only slightly are distinguished, the similar event corresponding to the text to be processed in the event database is determined more accurately, and the accuracy of text clustering is improved.
In order to achieve the above object, a second embodiment of the present application provides a text clustering device, including: the first acquisition module is used for acquiring a text to be processed and corresponding entity characteristics; a second obtaining module, configured to obtain an event database, where the event database includes: each event, corresponding entity characteristics and a text set under each event; a first determining module, configured to determine, according to the text and the corresponding entity features, the events and the corresponding entity features, similar events corresponding to the text in the events; the second determining module is used for determining the event described by the text according to the text and the corresponding similar event; and the adding module is used for adding the text into the text set under the event described by the text.
The text clustering device of the embodiment of the application acquires a text to be processed and its entity features; acquires an event database including each event, its corresponding entity features, and the text set under each event; determines, from the text and its entity features together with the events and their entity features, the similar event corresponding to the text among the events; determines the event described by the text from the text and its similar event; and adds the text to the text set under that event. The device can thus distinguish texts to be processed that differ only slightly, determine more accurately the similar event corresponding to the text to be processed in the event database, and improve the accuracy of text clustering.
To achieve the above object, a third aspect of the present application provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the text clustering method described above when executing the program.
In order to achieve the above object, a fourth aspect of the present application provides a computer-readable storage medium, on which a computer program is stored, where the computer program is executed by a processor to implement the text clustering method as described above.
Additional aspects and advantages of the present application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the present application.
Drawings
The foregoing and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 is a schematic flow chart diagram of a text clustering method according to an embodiment of the present application;
FIG. 2 is a diagram illustrating a long short-term memory (LSTM) neural network model for determining the membership relationship between a text and its most similar event according to an embodiment of the present application;
FIG. 3 is a diagram illustrating a text clustering effect according to an embodiment of the present application;
FIG. 4 is a schematic flow chart diagram illustrating a text clustering method according to another embodiment of the present application;
fig. 5 is a schematic structural diagram illustrating obtaining of a coefficient of variation between a text and an event through a preset guided aggregation model according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of a text clustering device according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary and intended to be used for explaining the present application and should not be construed as limiting the present application.
The text clustering method, the text clustering device, the electronic device, and the storage medium according to the embodiments of the present application are described below with reference to the drawings.
Fig. 1 is a schematic flow chart of a text clustering method according to an embodiment of the present application. As shown in fig. 1, the method mainly comprises the following steps:
step 101, obtaining a text to be processed and corresponding entity features.
In the embodiment of the application, the text to be processed may be downloaded over a network, or may be obtained by accessing the data of a network platform corresponding to an emergency (e.g., an earthquake-network platform). For example, the text to be processed may be "At 22:29 on June 11, a magnitude-3.6 earthquake occurred in Yecheng County, Kashgar Prefecture, Xinjiang".
It should be noted that the entity feature corresponding to the text to be processed may include at least one of the following features: event time, event location, and event type.
In the embodiment of the present application, if the entity features corresponding to the text to be processed are different, the manner of obtaining the corresponding entity features is also different.
As an example, when the entity feature corresponding to the text to be processed includes an event time, the event time in the text may be extracted by inputting the text to be processed into a preset event time extraction model.
For example, the text to be processed may be input into a transformer model (e.g., BERT model) and the event time in the text may be extracted.
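As a minimal sketch of this extraction interface, the snippet below stands in for the preset event-time extraction model; a regular expression is substituted for a trained transformer such as BERT, and the pattern and function name are illustrative assumptions rather than part of the original disclosure.

```python
import re
from typing import Optional

def extract_event_time(text: str) -> Optional[str]:
    """Stand-in for the preset event-time extraction model.

    A production system would feed `text` into a trained transformer
    (e.g. a BERT token classifier); here a regular expression plays
    that role so the interface can be illustrated end to end.
    """
    # Match patterns such as "22:29 on June 11".
    match = re.search(r"\d{1,2}:\d{2}\s+on\s+\w+\s+\d{1,2}", text)
    return match.group(0) if match else None

time_span = extract_event_time(
    "At 22:29 on June 11, a magnitude-3.6 earthquake occurred in Yecheng County."
)
```

The same call signature (text in, time span or `None` out) would apply regardless of which model backs it.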
As another example, when the entity features corresponding to the text to be processed include an event location, each word in the text may be obtained by segmenting the text; matching each word in the text with a place in a preset place dictionary to obtain a place matched with the word in the text; and determining the matched place as an event place corresponding to the text.
That is to say, when the entity features corresponding to the text to be processed include event locations, the text may be segmented by a preset segmentation algorithm (e.g., a dictionary-based segmentation algorithm) to obtain each word in the text; and then, matching each word in the text with a place in a preset place dictionary, acquiring the place matched with the word in the text, and taking the acquired place matched with the word in the text as an event place corresponding to the text.
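The dictionary-based segmentation and place matching described above can be sketched as follows; the gazetteer, the vocabulary, and the greedy forward-maximum-matching segmenter are illustrative assumptions standing in for the preset word-segmentation algorithm and place dictionary.

```python
# Hypothetical place dictionary and vocabulary; a real system would use a
# preset gazetteer and segmentation dictionary.
PLACE_DICT = {"Yecheng County", "Kashgar Prefecture", "Xinjiang"}
VOCAB = PLACE_DICT | {"At", "22:29", "on", "June", "11", "a", "magnitude-3.6",
                      "earthquake", "occurred", "in"}

def segment(text, vocab, max_len=3):
    """Greedy forward maximum matching: at each position, take the longest
    run of tokens (up to max_len) found in the vocabulary."""
    tokens = text.replace(",", "").rstrip(".").split()
    words, i = [], 0
    while i < len(tokens):
        for size in range(min(max_len, len(tokens) - i), 0, -1):
            cand = " ".join(tokens[i:i + size])
            if cand in vocab or size == 1:  # fall back to a single token
                words.append(cand)
                i += size
                break
    return words

def match_places(text):
    """Return the segmented words that match the place dictionary."""
    return [w for w in segment(text, VOCAB) if w in PLACE_DICT]
```

For Chinese text, a character-level maximum-matching or statistical segmenter would play the same role; only the tokenization step changes.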
As another example, when the entity features corresponding to the text to be processed include an event type, the text is segmented to obtain each word in the text; the words associated with each event type are identified among those words; the weight of each event type is determined from its associated words found in the text and the weights of those associated words; and the event type corresponding to the text is determined from the weights of the event types.
That is to say, when the entity features corresponding to the text to be processed include an event type, the text may be segmented by a preset segmentation algorithm (e.g., a dictionary-based segmentation algorithm) to obtain each word in the text. The weight of each event type is then determined from the words in the text that are associated with that event type and the weights of those associated words; for example, each associated word may be weighted according to its degree of association with the event type, a higher degree of association giving a larger weight. For each event type, the weights of its associated words found in the text are summed, and the sum is taken as the weight of that event type. The event type corresponding to the text is then determined from these weights, for example by comparing them and taking the event type with the largest weight. It should be noted that the event types may include natural disaster events, accident disaster events, public health events, social security events, and the like; event types may be classified according to actual needs, which is not limited in this application.
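The weight-summing step above can be sketched as follows; the associated words and their weights are hypothetical illustrations, since the patent does not specify a lexicon.

```python
# Hypothetical associated words and weights per event type; a larger weight
# means a stronger association between the word and the type.
TYPE_LEXICON = {
    "natural disaster": {"earthquake": 1.0, "magnitude": 0.8, "flood": 1.0},
    "accident disaster": {"collision": 1.0, "fire": 0.7},
    "public health": {"epidemic": 1.0, "outbreak": 0.9},
}

def classify_event_type(words):
    """Sum, per event type, the weights of its associated words that occur
    among the segmented words, then pick the type with the largest total."""
    scores = {etype: sum(w for word, w in lexicon.items() if word in words)
              for etype, lexicon in TYPE_LEXICON.items()}
    return max(scores, key=scores.get)
```

A tie-breaking rule (or a "no type" fallback when every score is zero) would be needed in practice but is omitted here.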
Step 102, obtaining an event database, wherein the event database comprises: each event and corresponding entity characteristics, and a text set under each event.
In the embodiment of the application, the event database may be preset, and in order to ensure that the information of the event database is more comprehensive, the event database may include each event and corresponding entity characteristics. The entity characteristics corresponding to each event can be obtained by extracting the event time, the event location, the event type and the like of the event. In addition, the event database may further include a text set under each event, and the entity feature corresponding to each event may be further selected by selecting the entity feature corresponding to each text in the text set under each event, and using the selected entity feature as the entity feature corresponding to the event.
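The event database structure described above can be sketched with plain data classes; the field and class names are illustrative assumptions, not the patent's own schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class EntityFeatures:
    event_time: Optional[str] = None
    event_place: Optional[str] = None
    event_type: Optional[str] = None

@dataclass
class Event:
    name: str
    features: EntityFeatures
    texts: List[str] = field(default_factory=list)  # the text set under the event

# A one-event database mirroring the earthquake example used earlier.
event_db: List[Event] = [
    Event("Yecheng earthquake",
          EntityFeatures("22:29 on June 11", "Yecheng County", "natural disaster")),
]
```

Selecting an event's entity features from the features of the texts in its text set, as the paragraph describes, would amount to filling `features` from per-text extractions.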
Step 103, determining similar events corresponding to the text in each event according to the text and the corresponding entity characteristics, each event and the corresponding entity characteristics.
Optionally, the text, the corresponding entity features, each event and the corresponding entity features may be input into a preset guidance aggregation model, and the probability that the text belongs to each event is obtained; according to the probability that the text belongs to each event, similar events corresponding to the text in each event are determined, and details are described in the following embodiments.
And step 104, determining the event described by the text according to the text and the corresponding similar event.
Step 105, adding the text to the text set under the event described by the text.
In the embodiment of the application, the text and its similar event may be input into a long short-term memory neural network model to determine the probability that the similar event is the event described by the text. When the probability is greater than a preset probability threshold, the similar event is determined to be the event described by the text, and the text is added to the text set under that event; when the probability is less than or equal to the threshold, a new event is generated from the text and updated into the event database.
That is to say, as shown in fig. 2, the text to be processed and its similar event may be input into a pre-trained long short-term memory (LSTM) neural network model, which determines the probability that the similar event is the event described by the text. When the probability is greater than a preset probability threshold, the model outputs the similar event as the event described by the text, and the text is added to the text set under that event. When the probability is less than or equal to the threshold, a new event is generated from the text, output as the event described by the text, and updated into the event database.
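The threshold decision after the LSTM scoring can be sketched as follows; `membership_prob` stands in for the trained model's output, events are modelled as plain dicts, and the threshold value 0.5 is a hypothetical choice since the patent does not state one.

```python
# Hypothetical threshold; the patent only states that "a preset probability
# threshold" is used, not its value.
PROB_THRESHOLD = 0.5

def assign_text(text, similar_event, membership_prob, event_db):
    """Decision step after the LSTM scores the (text, similar event) pair.

    membership_prob stands in for the output of the trained long short-term
    memory model; events are dicts with a "texts" list (the text set).
    """
    if membership_prob > PROB_THRESHOLD:
        # The similar event is the event described by the text:
        # add the text to the text set under that event.
        similar_event["texts"].append(text)
        return similar_event
    # Otherwise generate a new event from the text and update the database.
    new_event = {"name": text[:30], "texts": [text]}
    event_db.append(new_event)
    return new_event
```

Note that both branches return the event the text now belongs to, which keeps the caller's control flow uniform.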
For better explaining the above embodiments, as shown in fig. 3, fig. 3 is a diagram illustrating an effect of a text clustering method according to an embodiment of the present application. As can be seen from fig. 3, the text clustering method according to the embodiment of the present application can distinguish between texts with small differences, more accurately determine similar events corresponding to texts to be processed in the event database, and improve the accuracy of text clustering.
In summary, the similar event corresponding to the text among the events is determined from the text and its entity features together with the events in the event database and their entity features; the event described by the text is thereby determined and the text is added to the text set under that event. Texts to be processed that differ only slightly are thus distinguished, the similar event corresponding to the text to be processed in the event database is determined more accurately, and the accuracy of text clustering is improved.
In order to improve the accuracy of text clustering, in the embodiment of the present application, as shown in fig. 4, fig. 4 is a schematic flow diagram of another text clustering method provided in the embodiment of the present application. The text, the corresponding entity characteristics, each event and the corresponding entity characteristics can be input into a preset guide aggregation model, and the probability that the text belongs to each event is obtained; according to the probability that the text belongs to each event, determining similar events corresponding to the text in each event, in step 103 of the embodiment shown in fig. 1, the following steps may be included:
step 401, inputting the text and the corresponding entity features, each event and the corresponding entity features into a preset guidance aggregation model, and obtaining a probability that the text belongs to each event.
In the embodiment of the application, the text to be processed and the corresponding entity characteristics, each event in the event database and the corresponding entity characteristics can be input into a preset guide aggregation model, and the model can output the probability that the text to be processed belongs to each event. It should be noted that the preset guidance aggregation model may include a plurality of regression network submodels, and each regression network submodel may be configured to output a probability that the text belongs to each event.
Step 402, according to the probability that the text belongs to each event, determining similar events corresponding to the text in each event.
Optionally, for each event, obtaining a plurality of probabilities that the text belongs to the event; determining the average probability of the text belonging to the event according to the plurality of probabilities of the text belonging to the event; determining a variation coefficient between the text and the event according to a plurality of probabilities of the text belonging to the event and the average probability; and determining the event with the minimum corresponding coefficient of variation as a similar event corresponding to the text.
That is to say, the preset guidance aggregation model may include a plurality of regression network sub-models, each of which outputs a probability that the text belongs to each event. For each event, therefore, every sub-model outputs a probability that the text to be processed belongs to that event, yielding a plurality of probabilities. These probabilities are averaged, and the result is taken as the average probability that the text belongs to the event. Then, from those probabilities and their average, the sum of the squared differences between each probability and the average is determined, and the coefficient of variation between the text and the event is computed from this sum and the number of regression network sub-models in the model. Finally, the event with the smallest coefficient of variation is taken as the similar event corresponding to the text to be processed. In this way, texts with small differences can be distinguished, the similar event corresponding to the text to be processed in the event database is determined more accurately, and the accuracy of text clustering is improved.
For example, as shown in fig. 5, suppose the preset guidance aggregation model (a Bagging ensemble model) includes regression network sub-models A, B, and C. The text to be processed and its entity features, together with the events in the event database and their entity features, are input into the Bagging model. For each event, sub-model A outputs a probability Y1 that the text belongs to the event, sub-model B outputs a probability Y2, and sub-model C outputs a probability Y3. The average probability that the text belongs to the event is

Ȳ = (Y1 + Y2 + Y3) / 3

and the coefficient of variation between the text to be processed and the event is

CV = sqrt( ((Y1 − Ȳ)² + (Y2 − Ȳ)² + (Y3 − Ȳ)²) / 3 ) / Ȳ

Then, the coefficients of variation between the text to be processed and the events are compared, and the event with the smallest coefficient of variation is taken as the similar event corresponding to the text to be processed.
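The three-sub-model example above can be sketched as follows, assuming the standard definition of the coefficient of variation (population standard deviation divided by the mean), which matches the description in terms of the squared-difference sum and the number of sub-models.

```python
from statistics import mean, pstdev

def coefficient_of_variation(probs):
    """CV = population standard deviation / mean over the sub-model outputs.

    pstdev computes sqrt(sum((Yi - avg)**2) / n), i.e. the squared-difference
    sum divided by the number of sub-models, under the root.
    """
    avg = mean(probs)
    return pstdev(probs) / avg

def most_similar_event(prob_table):
    """prob_table maps event -> [Y1, Y2, Y3], one probability per Bagging
    sub-model; the event with the smallest CV is the similar event."""
    return min(prob_table, key=lambda e: coefficient_of_variation(prob_table[e]))
```

Intuitively, an event on which the sub-models agree (small spread relative to the mean) is a more reliable match than one with the same mean but high disagreement.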
In summary, the probability that the text belongs to each event is obtained by inputting the text and the corresponding entity characteristics, each event and the corresponding entity characteristics into a preset guide aggregation model; according to the probability that the text belongs to each event, the similar events corresponding to the text in each event are determined, so that the texts with little difference can be distinguished, the similar events corresponding to the text to be processed in the event database can be determined more accurately, and the accuracy of text clustering is improved.
The text clustering method of the embodiment of the application acquires a text to be processed and its entity features; acquires an event database including each event, its corresponding entity features, and the text set under each event; determines, from the text and its entity features together with the events and their entity features, the similar event corresponding to the text among the events; determines the event described by the text from the text and its similar event; and adds the text to the text set under that event. Because the similar event is determined from both the text and its entity features, texts to be processed that differ only slightly are distinguished, the similar event corresponding to the text to be processed in the event database is determined more accurately, and the accuracy of text clustering is improved.
Fig. 6 is a schematic structural diagram of a text clustering device according to an embodiment of the present application. As shown in fig. 6, the text clustering apparatus 600 includes: a first obtaining module 610, a second obtaining module 620, a first determining module 630, a second determining module 640, and an adding module 650.
The first obtaining module 610 is configured to obtain a text to be processed and a corresponding entity feature; a second obtaining module 620, configured to obtain an event database, where the event database includes: each event, corresponding entity characteristics and a text set under each event; a first determining module 630, configured to determine, according to the text and the corresponding entity features, each event and the corresponding entity features, similar events corresponding to the text in each event; a second determining module 640, configured to determine, according to the text and the corresponding similar event, an event described by the text; and the adding module is used for adding the text to the text set under the event described by the text.
As a possible implementation manner of the embodiment of the present application, the entity feature includes at least one of the following features: event time, event location, and event type; the first obtaining module 610 is specifically configured to input a text into a preset event time extraction model, and extract event time in the text; segmenting words of the text to obtain each word in the text; matching each word in the text with a place in a preset place dictionary to obtain a place matched with the word in the text; determining the matched place as an event place corresponding to the text; segmenting words of the text to obtain each word in the text; acquiring words related to each event type in each word; for each event type, adding the weights of the words related to the event type to obtain the weight of the event type; and determining the event type corresponding to the text according to the weight of each event type.
As a possible implementation manner of the embodiment of the present application, the first determining module 630 is specifically configured to input the text, the corresponding entity features, each event, and the corresponding entity features into a preset guidance aggregation model, and obtain a probability that the text belongs to each event; and determining similar events corresponding to the text in each event according to the probability that the text belongs to each event.
As a possible implementation manner of the embodiment of the present application, the guiding aggregation model includes: a plurality of regression network submodels, each regression network submodel for outputting a probability that the text belongs to each event; the first determining module 630 is further configured to, for each event, obtain a plurality of probabilities that the text belongs to the event; determining the average probability of the text belonging to the event according to the plurality of probabilities of the text belonging to the event; determining a variation coefficient between the text and the event according to a plurality of probabilities of the text belonging to the event and the average probability; and determining the event with the minimum corresponding coefficient of variation as a similar event corresponding to the text.
As a possible implementation of the embodiment of the present application, the first determining module 630 is further configured to determine, according to the plurality of probabilities that the text belongs to the event and the average probability, the sum of the differences between those probabilities and the average probability, and to calculate the coefficient of variation between the text and the event from that sum and the number of regression network submodels.
As a possible implementation of the embodiment of the present application, the second determining module 640 is specifically configured to input the text and the corresponding similar event into a long short-term memory (LSTM) neural network model and determine the probability that the similar event is the event described by the text; when the probability is greater than a preset probability threshold, determine that the similar event is the event described by the text; and when the probability is less than or equal to the preset probability threshold, generate a new event according to the text and add the new event to the event database.
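The threshold decision above can be sketched as follows; the 0.5 threshold, the event-id scheme, and the dictionary-based event database are illustrative assumptions, with `match_probability` standing in for the LSTM model's output:

```python
def assign_or_create(text, similar_event, match_probability, event_db,
                     threshold=0.5):
    # `match_probability` stands in for the output of the LSTM matcher; the
    # 0.5 threshold and the event-id scheme are illustrative assumptions.
    if match_probability > threshold:
        # The similar event is the event described by the text:
        # add the text to the text set under that event.
        event_db[similar_event].append(text)
        return similar_event
    # Otherwise generate a new event from the text and add it to the database.
    new_event_id = f"event_{len(event_db)}"
    event_db[new_event_id] = [text]
    return new_event_id
```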
The text clustering device of the embodiment of the present application acquires a text to be processed and its corresponding entity features; acquires an event database comprising the events, their corresponding entity features, and a text set under each event; determines, according to the text and its entity features and the events and their entity features, the similar event corresponding to the text among the events; determines the event described by the text according to the text and the corresponding similar event; and adds the text to the text set under that event. In this way, the similar event is determined jointly from the text to be processed, the events in the event database, and their entity features, so that texts with only small differences can be distinguished, the similar event corresponding to each text is determined more accurately, and the accuracy of text clustering is improved.
In order to implement the foregoing embodiments, the present application further provides an electronic device, and fig. 7 is a schematic structural diagram of the electronic device provided in the embodiments of the present application. The electronic device includes:
a memory 1001, a processor 1002, and a computer program stored in the memory 1001 and executable on the processor 1002.
The processor 1002, when executing the program, implements the text clustering method provided in the above-described embodiments.
Further, the electronic device includes:
a communication interface 1003 for communication between the memory 1001 and the processor 1002; and
the memory 1001, for storing a computer program that can be run on the processor 1002.
The memory 1001 may include a high-speed RAM and may also include a non-volatile memory, such as at least one magnetic disk storage.
The processor 1002 is configured to implement the text clustering method according to the foregoing embodiment when executing the program.
If the memory 1001, the processor 1002, and the communication interface 1003 are implemented independently, the communication interface 1003, the memory 1001, and the processor 1002 may be connected to each other through a bus and perform communication with each other. The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended ISA (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown in FIG. 7, but this is not intended to represent only one bus or type of bus.
Optionally, in a specific implementation, if the memory 1001, the processor 1002, and the communication interface 1003 are integrated on one chip, the memory 1001, the processor 1002, and the communication interface 1003 may complete communication with each other through an internal interface.
The processor 1002 may be a Central Processing Unit (CPU), an Application Specific Integrated Circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present application.
In order to implement the foregoing embodiments, the present application further provides a computer-readable storage medium, on which a computer program is stored, where the computer program is executed by a processor to implement the text clustering method according to the foregoing embodiments.
In the description herein, reference to the terms "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the application. In this specification, the schematic representations of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, those skilled in the art may combine the different embodiments or examples described in this specification, and the features of the different embodiments or examples, provided there is no contradiction.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present application, "plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code that include one or more executable instructions for implementing the steps of a custom logic function or process. The scope of the preferred embodiments of the present application includes alternative implementations in which functions may be executed out of the order shown or discussed, including substantially concurrently or in reverse order depending on the functionality involved, as would be understood by those skilled in the art to which the present application pertains.
The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Additionally, the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. If implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
It will be understood by those skilled in the art that all or part of the steps carried by the method for implementing the above embodiments may be implemented by hardware related to instructions of a program, which may be stored in a computer readable storage medium, and when the program is executed, the program includes one or a combination of the steps of the method embodiments.
In addition, the functional units in the embodiments of the present application may be integrated into one processing module, or each unit may exist alone physically, or two or more units may be integrated into one module. The integrated module may be implemented in the form of hardware or in the form of a software functional module. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer-readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc. Although embodiments of the present application have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present application, and that variations, modifications, substitutions and alterations may be made to the above embodiments by those of ordinary skill in the art within the scope of the present application.

Claims (14)

1. A text clustering method, comprising:
acquiring a text to be processed and corresponding entity characteristics;
obtaining an event database, wherein the event database comprises: each event, corresponding entity characteristics and a text set under each event;
according to the text and the corresponding entity characteristics, the events and the corresponding entity characteristics, determining similar events corresponding to the text in the events;
determining an event described by the text according to the text and the corresponding similar event;
and adding the text to a text set under the event described by the text.
2. The method of claim 1, wherein the entity characteristics comprise at least one of the following characteristics: event time, event location, and event type;
the acquisition mode of the event time corresponding to the text is that the text is input into a preset event time extraction model, and the event time in the text is extracted;
the acquisition mode of the event place corresponding to the text is that the text is segmented to acquire each word in the text; matching each word in the text with a place in a preset place dictionary to obtain a place matched with the word in the text; determining the matched place as an event place corresponding to the text;
the method for acquiring the event type corresponding to the text comprises the steps of segmenting the text to acquire each word in the text; acquiring words related to each event type in each word; for each event type, adding the weights of the words related to the event type to obtain the weight of the event type; and determining the event type corresponding to the text according to the weight of each event type.
3. The method according to claim 1, wherein the determining similar events corresponding to the text in the events according to the text and the corresponding entity features, the events and the corresponding entity features comprises:
inputting the text and the corresponding entity characteristics, the events and the corresponding entity characteristics into a preset guide aggregation model, and acquiring the probability that the text belongs to each event;
and determining similar events corresponding to the text in the events according to the probability that the text belongs to the events.
4. The method of claim 3, wherein the guided aggregation model comprises: a plurality of regression network submodels, each regression network submodel for outputting a probability that the text belongs to each event;
the determining, according to the probability that the text belongs to each event, a similar event corresponding to the text in each event includes:
for each event, acquiring a plurality of probabilities that the text belongs to the event;
determining an average probability that the text belongs to the event according to a plurality of probabilities that the text belongs to the event;
determining a coefficient of variation between the text and the event according to a plurality of probabilities that the text belongs to the event and the average probability;
and determining the corresponding event with the minimum coefficient of variation as a similar event corresponding to the text.
5. The method of claim 4, wherein determining a coefficient of variation between the text and the event according to the plurality of probabilities that the text belongs to the event and the average probability comprises:
determining a sum of differences between the plurality of probabilities that the text belongs to the event and the average probability, according to the plurality of probabilities that the text belongs to the event and the average probability;
and calculating the coefficient of variation between the text and the event according to the sum of differences and the number of the regression network submodels.
6. The method of claim 1, wherein determining the event described by the text based on the text and the corresponding similar event comprises:
inputting the text and the corresponding similar event into a long short-term memory neural network model, and determining the probability that the similar event is the event described by the text;
when the probability is larger than a preset probability threshold value, determining that the similar event is the event described by the text;
and when the probability is smaller than or equal to a preset probability threshold value, generating a new event according to the text, and updating the new event into the event database.
7. A text clustering apparatus, comprising:
the first acquisition module is used for acquiring a text to be processed and corresponding entity characteristics;
a second obtaining module, configured to obtain an event database, where the event database includes: each event, corresponding entity characteristics and a text set under each event;
a first determining module, configured to determine, according to the text and the corresponding entity features, the events and the corresponding entity features, similar events corresponding to the text in the events;
the second determining module is used for determining the event described by the text according to the text and the corresponding similar event;
and the adding module is used for adding the text into the text set under the event described by the text.
8. The apparatus of claim 7, wherein the entity characteristics comprise at least one of: event time, event location, and event type;
the first obtaining module is specifically configured to,
inputting the text into a preset event time extraction model, and extracting event time in the text;
segmenting the text to obtain each word in the text; matching each word in the text against the locations in a preset location dictionary to obtain the locations matched by words in the text; determining a matched location as the event location corresponding to the text;
segmenting the text to obtain each word in the text; acquiring words related to each event type in each word; for each event type, adding the weights of the words related to the event type to obtain the weight of the event type; and determining the event type corresponding to the text according to the weight of each event type.
9. The apparatus of claim 7, wherein the first determining module is specifically configured to,
inputting the text and the corresponding entity characteristics, the events and the corresponding entity characteristics into a preset guide aggregation model, and acquiring the probability that the text belongs to each event;
and determining similar events corresponding to the text in the events according to the probability that the text belongs to the events.
10. The apparatus of claim 9, wherein the guided aggregation model comprises: a plurality of regression network submodels, each regression network submodel for outputting a probability that the text belongs to each event;
the first determining means is further configured to,
for each event, acquiring a plurality of probabilities that the text belongs to the event;
determining an average probability that the text belongs to the event according to a plurality of probabilities that the text belongs to the event;
determining a coefficient of variation between the text and the event according to a plurality of probabilities that the text belongs to the event and the average probability;
and determining the corresponding event with the minimum coefficient of variation as a similar event corresponding to the text.
11. The apparatus of claim 10, wherein the first determining module is further configured to,
determining a sum of differences between the plurality of probabilities that the text belongs to the event and the average probability, according to the plurality of probabilities that the text belongs to the event and the average probability;
and calculating the coefficient of variation between the text and the event according to the sum of differences and the number of the regression network submodels.
12. The apparatus of claim 7, wherein the second determining module is specifically configured to,
inputting the text and the corresponding similar event into a long short-term memory neural network model, and determining the probability that the similar event is the event described by the text;
when the probability is larger than a preset probability threshold value, determining that the similar event is the event described by the text;
and when the probability is smaller than or equal to a preset probability threshold value, generating a new event according to the text, and updating the new event into the event database.
13. An electronic device, comprising:
memory, processor and computer program stored on the memory and executable on the processor, characterized in that the processor implements the text clustering method according to any one of claims 1 to 6 when executing the program.
14. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out a method for clustering text according to any one of claims 1 to 6.
CN202010883973.XA 2020-08-28 2020-08-28 Text clustering method and device, electronic equipment and storage medium Pending CN112100374A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202010883973.XA CN112100374A (en) 2020-08-28 2020-08-28 Text clustering method and device, electronic equipment and storage medium
PCT/CN2021/111903 WO2022042297A1 (en) 2020-08-28 2021-08-10 Text clustering method, apparatus, electronic device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010883973.XA CN112100374A (en) 2020-08-28 2020-08-28 Text clustering method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN112100374A (en) 2020-12-18

Family

ID=73758188

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010883973.XA Pending CN112100374A (en) 2020-08-28 2020-08-28 Text clustering method and device, electronic equipment and storage medium

Country Status (2)

Country Link
CN (1) CN112100374A (en)
WO (1) WO2022042297A1 (en)


Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114663815B (en) * 2022-03-28 2022-11-08 深圳市实信达科技开发有限公司 Information security method and system based on artificial intelligence and cloud platform

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110106807A1 (en) * 2009-10-30 2011-05-05 Janya, Inc Systems and methods for information integration through context-based entity disambiguation
CN102708096B (en) * 2012-05-29 2014-10-15 代松 Network intelligence public sentiment monitoring system based on semantics and work method thereof
US10269450B2 (en) * 2013-05-22 2019-04-23 Quantros, Inc. Probabilistic event classification systems and methods
CN105404686B (en) * 2015-12-10 2018-08-31 湖南科技大学 A kind of media event place name address matching method based on geographical feature level participle
CN105677873B (en) * 2016-01-11 2019-03-26 中国电子科技集团公司第十研究所 Text Intelligence association cluster based on model of the domain knowledge collects processing method
CN111414754A (en) * 2020-03-19 2020-07-14 中国建设银行股份有限公司 Emotion analysis method and device of event, server and storage medium
CN112100374A (en) * 2020-08-28 2020-12-18 清华大学 Text clustering method and device, electronic equipment and storage medium

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022042297A1 (en) * 2020-08-28 2022-03-03 清华大学 Text clustering method, apparatus, electronic device, and storage medium
CN113221538A (en) * 2021-05-19 2021-08-06 北京百度网讯科技有限公司 Event library construction method and device, electronic equipment and computer readable medium
CN113221538B (en) * 2021-05-19 2023-09-19 北京百度网讯科技有限公司 Event library construction method and device, electronic equipment and computer readable medium
WO2023125589A1 (en) * 2021-12-29 2023-07-06 北京辰安科技股份有限公司 Emergency monitoring method and apparatus

Also Published As

Publication number Publication date
WO2022042297A1 (en) 2022-03-03

Similar Documents

Publication Publication Date Title
CN112100374A (en) Text clustering method and device, electronic equipment and storage medium
CN103400577B (en) The acoustic model method for building up of multilingual speech recognition and device
CN109933656B (en) Public opinion polarity prediction method, public opinion polarity prediction device, computer equipment and storage medium
CN107291684B (en) Word segmentation method and system for language text
CN111382255A (en) Method, apparatus, device and medium for question and answer processing
CN110083832B (en) Article reprint relation identification method, device, equipment and readable storage medium
CN111666761A (en) Fine-grained emotion analysis model training method and device
CN112016553A (en) Optical Character Recognition (OCR) system, automatic OCR correction system, method
EP3726435A1 (en) Deep neural network training method and apparatus, and computer device
CN113657098B (en) Text error correction method, device, equipment and storage medium
CN113094478B (en) Expression reply method, device, equipment and storage medium
CN112395391B (en) Concept graph construction method, device, computer equipment and storage medium
CN115098556A (en) User demand matching method and device, electronic equipment and storage medium
WO2022116438A1 (en) Customer service violation quality inspection method and apparatus, computer device, and storage medium
CN113515593A (en) Topic detection method and device based on clustering model and computer equipment
CN109871540B (en) Text similarity calculation method and related equipment
CN114141235A (en) Voice corpus generation method and device, computer equipment and storage medium
JP6326940B2 (en) Method and apparatus for evaluating phrases in intermediate language, and machine translation method and apparatus
CN113378543B (en) Data analysis method, method for training data analysis model and electronic equipment
CN115859128B (en) Analysis method and system based on interaction similarity of archive data
CN112906386B (en) Method and device for determining text characteristics
CN115329751B (en) Keyword extraction method, device, medium and equipment for network platform text
JP6679391B2 (en) Place name notation determination device
CN115034203A (en) Medical long text information extraction method and device
CN116050380A (en) Log analysis method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination