CN114398534A - Event cluster text retrieval system

Info

Publication number: CN114398534A
Application number: CN202210001964.2A
Authority: CN
Inventors: 刘彬; 王玉娟; 王锐; 王震; 倪晔玮
Original assignee: Shanghai Posts & Telecommunications Designing Consulting Institute Co ltd
Current assignee: Shanghai Posts & Telecommunications Designing Consulting Institute Co ltd
Priority date: 2021-01-05
Filing date: 2022-01-04
Publication date: 2022-04-26
Anticipated expiration: 2042-01-04
Also published as: CN114398534B

Abstract

The invention provides an event clustering text retrieval system comprising a processor, a memory storing a computer program, a crawling database and a display interface. The crawling database stores event text vectors and corresponding event word-segment weights obtained by performing word segmentation on an event text, and associated text vectors and corresponding associated word-segment weights obtained by performing word segmentation on associated texts related to the event text. The processor calculates the similarity between any event text vector and each corresponding associated text vector, choosing among different similarity calculation formulas, and presents the corresponding associated texts on the display interface in descending order of similarity. The invention can improve acquisition efficiency and the pertinence of the presented texts.

Description

Event cluster text retrieval system

Technical Field

The invention relates to the field of physics, in particular to an information processing technology, and specifically relates to an event clustering text retrieval system.

Background

On the internet, an event is presented in the form of one or more texts, and a user may wish to read all the texts relating to that event. At present, a user who wants to learn about an event enters keywords into an internet search, and the relevant texts are presented in chronological order. However, texts presented chronologically are not unified by topic and are weakly targeted, which does not serve the user well. There is therefore a need to cluster such texts so as to provide texts that are topically unified and strongly targeted.

Disclosure of Invention

In view of the above technical problem, the invention provides an event clustering text retrieval system that can provide topically unified and strongly targeted relevant texts for a given event.

The technical scheme adopted by the invention is as follows:

the invention provides an event clustering text retrieval system, which is arranged in the cloud and is used for concurrently processing a plurality of event texts, the system comprising: a processor, a memory storing a computer program, a crawling database and a display interface;

the crawling database stores event text vectors obtained by performing word segmentation on event texts together with the corresponding event word-segment weights, and associated text vectors obtained by performing word segmentation on the M associated texts related to each event text together with the corresponding associated word-segment weights; wherein any event text vector is $E = (e_1, e_2, \ldots, e_m)$ with corresponding event word-segment weights $WE = (we_1, we_2, \ldots, we_m)$, $e_i$ is the i-th word segment in the event text vector E, $we_i$ is the weight of word segment $e_i$, m is the number of word segments in the event text vector E, and i ranges from 1 to m; any associated text vector is $P_j = (p_{j1}, p_{j2}, \ldots, p_{jn})$ with corresponding associated word-segment weights $WP_j = (wp_{j1}, wp_{j2}, \ldots, wp_{jn})$, $p_{jt}$ is the t-th word segment in the associated text vector $P_j$, $wp_{jt}$ is the weight of word segment $p_{jt}$, jn is the number of word segments in the associated text vector $P_j$, j ranges from 1 to M, and t ranges from 1 to n;

for the associated text vector $P_j$ corresponding to any event text vector E, the processor executes the computer program to implement the following steps:

obtaining the intersection $E \cap P_j = (b_1, b_2, \ldots, b_V)$, where V is the number of word segments in $E \cap P_j$;

comparing the number m of word segments in the event text vector E with a preset first threshold D1;

selecting, based on the comparison result, a preset similarity calculation method to calculate the similarity between the event text vector E and the associated text vector $P_j$; the preset similarity calculation methods include a first similarity calculation method

and a second similarity calculation method

where $W1_k$ is the weight of word segment $b_k$ in the event text vector E, and $W2_k$ is the weight of word segment $b_k$ in the associated text vector $P_j$;

and traversing the M associated text vectors, and presenting the associated texts on the display interface in descending order of similarity.

In the event clustering text retrieval system provided by the embodiment of the invention, the similarity calculation method is selected based on the number of word segments in the event text vector, so that the similarity can be obtained more efficiently and server computing resources are saved while the accuracy of the obtained similarity is maintained. In addition, the associated texts are presented in descending order of similarity, so that the presented associated texts are targeted and the user experience can be improved.

Detailed Description

To make the technical problems to be solved, the technical solutions and the advantages of the present invention clearer, a detailed description is given below with reference to specific embodiments.

In some flows described in the specification and claims of this invention, a number of operations are included in a particular order, but it should be clearly understood that these operations may be performed out of order or in parallel as they appear herein, with the order of the operations being indicated by the numbers 101, 102, etc. merely to distinguish between the various operations, which by themselves do not represent any order of execution. Additionally, the flows may include more or fewer operations, and the operations may be performed sequentially or in parallel.

The technical solutions in the embodiments of the present invention are clearly and completely described below, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The embodiment of the invention provides an event clustering text retrieval system which is arranged in the cloud, specifically deployed on a cloud server, and is used for concurrently processing a plurality of event texts. The system comprises a processor, a memory storing a computer program, a crawling database and a display interface.
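The concurrent handling of several event texts can be pictured with a small Python sketch. It is only an illustration under stated assumptions: the description does not say how concurrency is achieved, so the thread pool, the function names `process_event_text` and `process_all`, and the placeholder per-event job are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def process_event_text(event_text: str) -> int:
    """Hypothetical per-event job.

    As a stand-in for the real processing, it only counts
    whitespace-separated tokens in the event text.
    """
    return len(event_text.split())


def process_all(event_texts: list[str], max_workers: int = 4) -> list[int]:
    """Run the per-event job for several event texts concurrently."""
    # Executor.map returns results in the same order as the inputs.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(process_event_text, event_texts))


if __name__ == "__main__":
    print(process_all(["flood hits city", "market rally"]))
```

A thread pool is used here only because it is the shortest way to show independent event texts being handled side by side; a deployment on a cloud server could equally use processes or separate workers.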

In the embodiment of the invention, the crawling database stores event text vectors obtained by performing word segmentation on event texts together with the corresponding event word-segment weights, and associated text vectors obtained by performing word segmentation on the M associated texts related to each event text together with the corresponding associated word-segment weights. Any event text vector is $E = (e_1, e_2, \ldots, e_m)$ with corresponding event word-segment weights $WE = (we_1, we_2, \ldots, we_m)$, where $e_i$ is the i-th word segment in the event text vector E, $we_i$ is the weight of word segment $e_i$, m is the number of word segments in the event text vector E, and i ranges from 1 to m. The invention can use existing word segmentation technology to segment the event text. The weight of each word segment may be determined based on the prior art. Preferably, $we_i$ is the number of occurrences of word segment $e_i$ in the event text.

In addition, any associated text vector is $P_j = (p_{j1}, p_{j2}, \ldots, p_{jn})$ with corresponding associated word-segment weights $WP_j = (wp_{j1}, wp_{j2}, \ldots, wp_{jn})$, where $p_{jt}$ is the t-th word segment in the associated text vector $P_j$, $wp_{jt}$ is the weight of word segment $p_{jt}$, jn is the number of word segments in the associated text vector $P_j$, j ranges from 1 to M, and t ranges from 1 to n. In an embodiment of the present invention, the associated texts are texts obtained according to the event text, and they may be obtained in a manner known in the prior art, for example the specific method used in the Royal Flush software for gathering reports about related listed companies, or any other clustering manner in the prior art. The invention can use existing word segmentation technology to segment the associated texts. The weight of each word segment may be determined based on the prior art. Preferably, $wp_{jt}$ is the number of occurrences of word segment $p_{jt}$ in the associated text.
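As an illustration of the preferred weighting (weight equals number of occurrences), the following Python sketch builds a text vector and its weight vector from an already segmented text. It assumes that a vector holds each distinct word segment once, in order of first appearance, and that segmentation has been done elsewhere; the helper name `build_vector` and the whitespace split in the usage lines are hypothetical stand-ins, not part of the described system.

```python
from collections import Counter


def build_vector(tokens: list[str]) -> tuple[list[str], list[int]]:
    """Return (word segments, weights) for a segmented text.

    Assumption: each distinct segment appears once in the vector,
    and its weight is its number of occurrences in the text.
    """
    counts = Counter(tokens)
    # Counter keeps insertion order, so segments follow first appearance.
    segments = list(counts)
    weights = [counts[segment] for segment in segments]
    return segments, weights


if __name__ == "__main__":
    # Whitespace splitting stands in for a real word segmenter.
    tokens = "flood hits city flood warning city flood".split()
    E, WE = build_vector(tokens)
    print(E)   # word segments e_1 .. e_m
    print(WE)  # weights we_1 .. we_m
```

The same helper could be applied to each associated text to obtain $P_j$ and $WP_j$.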

In the embodiment of the invention, the processor calculates the similarity between any event text vector and each corresponding associated text vector, and presents the corresponding associated texts on the display interface in descending order of similarity. The specific functions executed by the processor are described below by way of embodiments 1 to 5.

(Example 1)

In this embodiment, for the associated text vector $P_j$ corresponding to any event text vector E, the processor executes the computer program to implement the following steps:

S101, obtaining the intersection of the event text vector E and the associated text vector $P_j$, $E \cap P_j = (b_1, b_2, \ldots, b_V)$, where V is the number of word segments in $E \cap P_j$.

S102, obtaining

S103, traversing the M associated text vectors, and presenting the associated texts on the display interface in descending order of similarity.

In this embodiment, the associated texts are presented in descending order of similarity, so that, compared with the prior art, the presented associated texts are more targeted.
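The intersection and ranking steps can be sketched in Python as follows. This is a minimal sketch, not the claimed implementation: the similarity formula of this embodiment is not reproduced in the text, so the scoring function is passed in as a parameter, and the demo scorer `overlap_count` (which simply returns V) is an assumed stand-in. The names `intersection` and `rank_associated` are hypothetical.

```python
from typing import Callable, Sequence


def intersection(E: Sequence[str], P: Sequence[str]) -> list[str]:
    """Word segments b_1 .. b_V shared by E and P, in the order they occur in E."""
    in_p = set(P)
    seen: set[str] = set()
    shared = []
    for segment in E:
        if segment in in_p and segment not in seen:
            seen.add(segment)
            shared.append(segment)
    return shared


def rank_associated(
    E: Sequence[str],
    associated: dict[str, Sequence[str]],
    similarity: Callable[[Sequence[str], Sequence[str]], float],
) -> list[tuple[str, float]]:
    """Score every associated text vector against E and sort by descending score."""
    scored = [(name, similarity(E, P)) for name, P in associated.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


def overlap_count(E: Sequence[str], P: Sequence[str]) -> float:
    """Assumed stand-in scorer: the intersection size V."""
    return float(len(intersection(E, P)))


if __name__ == "__main__":
    E = ["flood", "city", "warning"]
    texts = {
        "a": ["flood", "city", "rescue"],
        "b": ["market", "city"],
        "c": ["flood", "city", "warning", "rain"],
    }
    for name, score in rank_associated(E, texts, overlap_count):
        print(name, score)
```

Keeping the scorer as a parameter means the same ranking loop can be reused with whichever similarity calculation method is selected.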

(Example 2)

S201, obtaining the intersection $E \cap P_j = (b_1, b_2, \ldots, b_V)$, where V is the number of word segments in $E \cap P_j$.

S202, comparing the number m of word segments in the event text vector E with a preset first threshold D1.

S203, selecting, based on the comparison result, a preset similarity calculation method to calculate the similarity between the event text vector E and the associated text vector $P_j$; the preset similarity calculation methods include a first similarity calculation method

and a second similarity calculation method

where $W1_k$ is the weight of word segment $b_k$ in the event text vector E, and $W2_k$ is the weight of word segment $b_k$ in the associated text vector $P_j$; the step specifically comprises:

S2031, if m > D1, selecting the first similarity calculation method to calculate

S2032, if m ≤ D1, selecting the second similarity calculation method to calculate

S204, traversing the M associated text vectors, and presenting the associated texts on the display interface in descending order of similarity.

Compared with embodiment 1, embodiment 2 selects the similarity calculation method according to the number of word segments in the event text. When that number is greater than the preset first threshold, the similarity is calculated using the weights of the word segments in the intersection of the event text vector and the associated text vector. When that number does not exceed the preset first threshold, the similarity is calculated directly from the number of word segments in the intersection, the number of word segments in the event text vector and the number of word segments in the associated text vector. In this way, the similarity can be obtained more efficiently and server computing resources are saved while the accuracy of the obtained similarity is maintained.
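The threshold-based choice between a weight-based and a count-based calculation can be sketched in Python. The actual first and second formulas are not reproduced in the text, so both formulas below are assumptions chosen only to match the verbal description: `weighted_similarity` is a cosine-style score over the intersection weights $W1_k$ and $W2_k$, and `count_similarity` is $V / \sqrt{m \cdot jn}$. All function names are hypothetical, and each vector is represented as a mapping from word segment to weight.

```python
import math

D1 = 50  # preferred value of the preset first threshold given in the description


def weighted_similarity(we: dict[str, float], wp: dict[str, float]) -> float:
    """Assumed weight-based form: cosine-style score using W1_k * W2_k over the intersection."""
    shared = we.keys() & wp.keys()
    numerator = sum(we[b] * wp[b] for b in shared)
    denominator = math.sqrt(sum(w * w for w in we.values())) * math.sqrt(
        sum(w * w for w in wp.values())
    )
    return numerator / denominator if denominator else 0.0


def count_similarity(we: dict[str, float], wp: dict[str, float]) -> float:
    """Assumed count-based form: V / sqrt(m * jn), using only segment counts."""
    v = len(we.keys() & wp.keys())
    m, jn = len(we), len(wp)
    return v / math.sqrt(m * jn) if m and jn else 0.0


def select_similarity(we: dict[str, float], wp: dict[str, float], d1: int = D1) -> float:
    """Use the weight-based method when m > D1, otherwise the count-based one."""
    if len(we) > d1:
        return weighted_similarity(we, wp)
    return count_similarity(we, wp)


if __name__ == "__main__":
    we = {"flood": 3, "city": 2, "warning": 1}
    wp = {"flood": 1, "city": 1, "rescue": 2}
    print(select_similarity(we, wp))        # m = 3 <= 50, count-based branch
    print(select_similarity(we, wp, d1=2))  # m = 3 > 2, weight-based branch
```

Only the branching on m versus D1 is taken from the description; the two formulas should be replaced by the ones given in the original publication.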

(Example 3)

S301, obtaining the intersection $E \cap P_j = (b_1, b_2, \ldots, b_V)$, where V is the number of word segments in $E \cap P_j$.

S302, obtaining the union of the event text vector E and the associated text vector $P_j$, $E \cup P_j = (b_1, b_2, \ldots, b_U)$, where U is the number of word segments in $E \cup P_j$.

S303, comparing the number m of word segments in the event text vector E with a preset first threshold D1;

S304, selecting, based on the comparison result, a preset similarity calculation method to calculate the similarity between the event text vector E and the associated text vector $P_j$; the preset similarity calculation methods include a first similarity calculation method

a second similarity calculation method

and a third similarity calculation method

S3041, if m > D1, selecting the third similarity calculation method to calculate

S3042, if m ≤ D1, selecting the second similarity calculation method to calculate

S305, traversing the M associated text vectors, and presenting the associated texts on the display interface in descending order of similarity.

Like embodiment 2, and in contrast to embodiment 1, embodiment 3 selects the similarity calculation method according to the number of word segments in the event text. When that number is greater than the preset first threshold, the similarity is calculated using the weights of the word segments in the union of the event text vector and the associated text vector. When that number does not exceed the preset first threshold, the similarity is calculated directly from the number of word segments in the intersection of the two vectors, the number of word segments in the event text vector and the number of word segments in the associated text vector. In this way, the similarity can be obtained more efficiently and server computing resources are saved while the accuracy of the obtained similarity is maintained.
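A union-based variant can be sketched in Python in the same spirit. The third similarity formula is not reproduced in the text, so `union_weighted_similarity` below is an assumption: a weighted Jaccard score (sum of minimum weights over sum of maximum weights across the union), with a missing segment counted as weight 0. The count-based fallback $V / \sqrt{m \cdot jn}$ is likewise assumed, and all names are hypothetical.

```python
import math
from typing import Sequence


def union(E: Sequence[str], P: Sequence[str]) -> list[str]:
    """Word segments b_1 .. b_U of the union, in order of first appearance."""
    return list(dict.fromkeys([*E, *P]))


def union_weighted_similarity(we: dict[str, float], wp: dict[str, float]) -> float:
    """Assumed union-based form: weighted Jaccard over all segments of the union."""
    segments = we.keys() | wp.keys()
    numerator = sum(min(we.get(b, 0), wp.get(b, 0)) for b in segments)
    denominator = sum(max(we.get(b, 0), wp.get(b, 0)) for b in segments)
    return numerator / denominator if denominator else 0.0


def count_similarity(we: dict[str, float], wp: dict[str, float]) -> float:
    """Assumed count-based form: V / sqrt(m * jn)."""
    v = len(we.keys() & wp.keys())
    m, jn = len(we), len(wp)
    return v / math.sqrt(m * jn) if m and jn else 0.0


def select_similarity(we: dict[str, float], wp: dict[str, float], d1: int = 50) -> float:
    """Use the union-based method when m > D1, otherwise the count-based one."""
    if len(we) > d1:
        return union_weighted_similarity(we, wp)
    return count_similarity(we, wp)


if __name__ == "__main__":
    we = {"flood": 3, "city": 2, "warning": 1}
    wp = {"flood": 1, "city": 1, "rescue": 2}
    print(union(list(we), list(wp)))
    print(select_similarity(we, wp, d1=2))
```

The branching mirrors steps S3041 and S3042; only the formulas themselves are placeholders.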

(Example 4)

S401, obtaining the intersection $E \cap P_j = (b_1, b_2, \ldots, b_V)$, where V is the number of word segments in $E \cap P_j$.

S402, comparing the number m of word segments in the event text vector E with a preset first threshold D1.

S403, comparing the number jn of word segments in the associated text vector $P_j$ with a preset second threshold D2.

S404, selecting, based on the comparison results, a preset similarity calculation method to calculate the similarity between the event text vector E and the associated text vector $P_j$; the preset similarity calculation methods include a first similarity calculation method

and a second similarity calculation method

S4041, if m > D1, selecting the first similarity calculation method to calculate

S4042, if m ≤ D1 and jn ≤ D2, selecting the second similarity calculation method to calculate

S405, traversing the M associated text vectors, and presenting the associated texts on the display interface in descending order of similarity.

Further, in the embodiment of the present invention, the preset similarity calculation methods further include a fourth similarity calculation method

Step S404 further includes:

S4043, if m ≤ D1 and jn > D2, selecting the fourth similarity calculation method to calculate

Compared with embodiment 1, embodiment 4 selects the similarity calculation method according to the numbers of word segments in both the event text and the associated text. When the number of word segments in the event text is greater than the preset first threshold, the similarity is calculated using the weights of the word segments in the intersection of the event text vector and the associated text vector. When the number of word segments in the event text does not exceed the preset first threshold and the number of word segments in the associated text does not exceed the preset second threshold, the similarity is calculated directly from the number of word segments in the intersection and the numbers of word segments in the event text vector and the associated text vector. When the number of word segments in the event text does not exceed the preset first threshold and the number of word segments in the associated text is greater than the preset second threshold, the similarity is calculated using the weights of the word segments in the intersection together with the number of intersecting word segments of the associated text vector. In this way, the similarity can be obtained more efficiently and server computing resources are saved while the accuracy of the obtained similarity is maintained.

In the above embodiments of the present invention, the preset first threshold may take a value in the range of, for example, 20 to 100; preferably, D1 = 50. Preferably, the preset second threshold is equal to the preset first threshold, i.e., D2 = D1.
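The three-way selection of embodiment 4 depends only on m, jn and the two thresholds, so it can be sketched in Python without reference to the formulas themselves. The enum `Method` and the function `choose_method` are hypothetical names introduced for illustration; the default thresholds use the preferred values D1 = D2 = 50.

```python
from enum import Enum


class Method(Enum):
    """Labels for the similarity calculation methods used in embodiment 4."""

    FIRST = "first"
    SECOND = "second"
    FOURTH = "fourth"


def choose_method(m: int, jn: int, d1: int = 50, d2: int = 50) -> Method:
    """Pick a method from the segment counts m (event text) and jn (associated text)."""
    if m > d1:
        return Method.FIRST   # S4041: m > D1
    if jn <= d2:
        return Method.SECOND  # S4042: m <= D1 and jn <= D2
    return Method.FOURTH      # S4043: m <= D1 and jn > D2


if __name__ == "__main__":
    for m, jn in [(60, 10), (30, 40), (30, 80)]:
        print(m, jn, choose_method(m, jn).value)
```

Note that when m > D1 the value of jn plays no part in the choice, which matches step S4041.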

In the embodiment of the present invention, the memory and the processor may be a general-purpose memory and processor, which are not specifically limited herein. When the processor runs the computer program stored in the memory, the problems in the related art of inefficient associated-text retrieval and of presented texts lacking topical unity and pertinence can be solved.

In summary, when calculating the similarity between the event text and an associated text, the event clustering text retrieval system provided by the embodiments of the present invention selects the similarity calculation method based on the number of word segments in the event text vector, or based on both that number and the number of word segments in the associated text vector. When the number of word segments in the event text vector is greater than the preset first threshold, the similarity is calculated using the weights of the word segments in the intersection or the union of the event text vector and the associated text vector. When the number of word segments in the event text vector does not exceed the preset first threshold, the similarity is calculated using the number of word segments in the intersection or the union, the number of word segments in the event text vector and the number of word segments in the associated text vector. In this way, the similarity can be obtained more efficiently and server computing resources are saved while the accuracy of the obtained similarity is maintained. In addition, the associated texts are presented in descending order of similarity, so that the presented associated texts are targeted and the user experience can be improved.

The above embodiments are only specific embodiments of the present invention, used to illustrate its technical solutions rather than to limit them, and the protection scope of the present invention is not limited thereto. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that, within the technical scope disclosed herein, anyone skilled in the art may still modify the technical solutions described in the foregoing embodiments, readily conceive of changes to them, or make equivalent substitutions for some of their technical features; such modifications, changes or substitutions do not depart from the spirit and scope of the embodiments of the present invention and shall be covered by its protection scope. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. An event clustering text retrieval system, characterized in that the system is arranged in the cloud and is used for concurrently processing a plurality of event texts, the system comprising: a processor, a memory storing a computer program, a crawling database and a display interface;

the crawling database stores event text vectors obtained by performing word segmentation on event texts together with the corresponding event word-segment weights, and associated text vectors obtained by performing word segmentation on the M associated texts related to each event text together with the corresponding associated word-segment weights; wherein any event text vector is $E = (e_1, e_2, \ldots, e_m)$ with corresponding event word-segment weights $WE = (we_1, we_2, \ldots, we_m)$, $e_i$ is the i-th word segment in the event text vector E, $we_i$ is the weight of word segment $e_i$, m is the number of word segments in the event text vector E, and i ranges from 1 to m; any associated text vector is $P_j = (p_{j1}, p_{j2}, \ldots, p_{jn})$ with corresponding associated word-segment weights $WP_j = (wp_{j1}, wp_{j2}, \ldots, wp_{jn})$, $p_{jt}$ is the t-th word segment in the associated text vector $P_j$, $wp_{jt}$ is the weight of word segment $p_{jt}$, jn is the number of word segments in the associated text vector $P_j$, j ranges from 1 to M, and t ranges from 1 to n;

for the associated text vector $P_j$ corresponding to any event text vector E, the processor implements the following steps by executing the computer program:

and a second similarity calculation method

2. The event clustering text retrieval system of claim 1, wherein selecting, based on the comparison result, a preset similarity calculation method to calculate the similarity between the event text vector E and the associated text vector $P_j$ comprises:

if m > D1, selecting the first similarity calculation method to calculate

otherwise, selecting the second similarity calculation method to calculate

3. The event clustering text retrieval system of claim 1, further comprising, before comparing the number m of word segments in the event text vector E with the preset first threshold D1:

obtaining the union $E \cup P_j = (b_1, b_2, \ldots, b_U)$, where U is the number of word segments in $E \cup P_j$;

the preset similarity calculation methods further comprise a third similarity calculation method

4. The event clustering text retrieval system of claim 3, wherein selecting, based on the comparison result, a preset similarity calculation method to calculate the similarity between the event text vector E and the associated text vector $P_j$ comprises:

if m > D1, selecting the third similarity calculation method to calculate

otherwise, selecting the second similarity calculation method to calculate

5. The event clustering text retrieval system of claim 3, further comprising, after comparing the number m of word segments in the event text vector E with the preset first threshold D1:

comparing the number jn of word segments in the associated text vector $P_j$ with a preset second threshold D2.

6. The event clustering text retrieval system of claim 5, wherein selecting, based on the comparison result, a preset similarity calculation method to calculate the similarity between the event text vector E and the associated text vector $P_j$ comprises:

if m > D1, selecting the first similarity calculation method to calculate

if m ≤ D1 and jn ≤ D2, selecting the second similarity calculation method to calculate

7. The event clustering text retrieval system of claim 6, wherein the preset similarity calculation methods further comprise a fourth similarity calculation method

and further comprising:

if m ≤ D1 and jn > D2, selecting the fourth similarity calculation method to calculate

8. The event clustering text retrieval system of claim 5, wherein selecting, based on the comparison result, a preset similarity calculation method to calculate the similarity between the event text vector E and the associated text vector $P_j$ comprises:

if m > D1, selecting the third similarity calculation method to calculate

9. The event clustering text retrieval system of claim 1, wherein the preset first threshold ranges from 20 to 100.

10. The event clustering text retrieval system of claim 5, wherein the preset second threshold is equal to the preset first threshold.