CN117807227A

CN117807227A - Miao event detection method based on large model abstract vector, and electronic device

Info

Publication number: CN117807227A
Application number: CN202311714093.XA
Authority: CN
Inventors: 陈利明; 张宝玉; 李宗倍; 窦康
Original assignee: China Telecom Digital Intelligence Technology Co Ltd
Current assignee: China Telecom Digital Intelligence Technology Co Ltd
Priority date: 2023-12-13
Filing date: 2023-12-13
Publication date: 2024-04-02

Abstract

The invention provides a method for detecting Miao events based on a large model abstract vector, which belongs to the technical field of large models and comprises the following steps: S1, generating a semantic vector for each appeal event by using a large model; S2, clustering by semantic similarity; S3, processing all events in each predefined category obtained in step S2 through the large model to obtain the abstract vector of that predefined category; S4, calculating the cosine similarity between the semantic vector of each appeal event obtained in step S1 and each abstract vector obtained in step S3, and reclassifying according to this cosine similarity to obtain the actual categories; S5, counting the number of appeal events contained in each actual category obtained in step S4, wherein an actual category whose number of appeal events meets specific requirements is a Miao category, and all appeal events contained in a Miao category are Miao events. The method has generalization capability, does not need the events to be classified in advance, and can detect Miao events efficiently and autonomously.

Description

Miao event detection method based on large model abstract vector, and electronic device

Technical Field

The invention belongs to the technical field of large models, and particularly relates to a method for detecting Miao events based on a large model abstract vector, and to an electronic device.

Background

Events are the focus of attention in network information, and Miao events are the precursor events of a serious event (possibly involving social and livelihood problems, administratively sensitive problems, or emergencies): they show signs before the serious event occurs. By discovering Miao events and analyzing their development, the occurrence and development trend of a major event can be grasped in time.

In the field of civil appeals, a work order is defined as an event. Unlike a network event, a civil appeal event has no quantifiable indicators such as view count, click count, read count, comment count, or forward count. This has a great impact on algorithm and system design.

For network events, the prior art is typically implemented using business rule matching, text clustering, or conditional causal-relation extraction algorithms.

(1) Business rule matching typically decides by checking whether a set number of rules are satisfied, for example key business categories, whether a keyword is present or several keywords (including synonyms) are present simultaneously, time-range limits, and so forth.

(2) Text clustering. Text clustering algorithms generally proceed by the following steps:

Data preprocessing: preprocess the original text, for example by removing stop words and punctuation marks, to obtain meaningful features.

Feature extraction: extract features of the text by means of a bag-of-words model, the TF-IDF algorithm, word embeddings, and the like, converting the text into a numerical vector.

TF-IDF (term frequency-inverse document frequency) is a weighting technique commonly used in information retrieval and data mining, often to mine keywords from articles; the algorithm is simple and efficient and is commonly used in industry for the initial cleaning of text data.

Clustering: cluster the text vectors using K-Means, DBSCAN, hierarchical clustering, and the like.

K-Means is a clustering algorithm based on Euclidean distance, which considers that the closer two targets are, the greater their similarity.

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based spatial clustering algorithm that tolerates noise. It divides regions of sufficient density into clusters and can find clusters of arbitrary shape in a noisy spatial database; it defines a cluster as the largest set of density-connected points.

(3) Conditional causal-relation extraction algorithm

This is essentially an algorithm that fuses rules with supervised classification. The specific process is shown in CN110705597A, and its main content is as follows:

The cause event of each Miao cause-effect event pair is taken as a Miao event and stored in a Miao event sample library, and the data of the sample library is used as a training set to train a first, machine-learning-based Miao event classifier. The cause-effect link of each Miao cause-effect event pair is taken as a Miao event judgment rule and stored in a Miao event judgment rule library, which is used to construct a second, rule-based Miao event classifier. Events are extracted from a designated network platform to obtain a plurality of structured events; structured events that refer to the same event are unified into a coreferent event, and the coreferent event is generalized to obtain an abstract event of the network platform. The abstract event is processed by the first and second Miao event classifiers respectively, and their results are integrated to serve as the Miao event detection result for the network platform.

The problems of the above technical schemes are as follows:

(1) Business rules can solve only the part of the problem that is already known; new situations outside the rules cannot be enumerated exhaustively. The present scheme needs to solve the poor generalization capability of business rules.

(2) A clustering algorithm essentially clusters known samples; how many classes will result cannot be predicted in advance, so the number of Miao events cannot be determined. The present scheme needs to solve the problem of having to preset the number of cluster categories.

(3) The conditional causal-relation extraction process involves too much manual labeling; labeling and supervised training reduce efficiency, and generalization capability degrades considerably over time. The present scheme needs to identify Miao events by unsupervised means while preserving generalization capability, thereby improving the ability to respond to Miao events.

Disclosure of Invention

In view of the defects in the prior art, the invention provides a method for detecting Miao events based on a large model abstract vector, and an electronic device, which have generalization capability, do not need the events to be classified in advance, and can detect Miao events efficiently and autonomously without manual labeling or supervised training.

In order to achieve the above purpose, the present invention adopts the following technical scheme:

The method for detecting Miao events based on a large model abstract vector comprises the following steps:

S1, generating a semantic vector for each appeal event by using a large model; performing dimension reduction on the semantic vector to obtain a dimension reduction vector; performing statistical feature extraction on the semantic vector to obtain a statistical feature vector; and splicing the dimension reduction vector and the statistical feature vector to obtain a spliced vector;

S2, clustering by semantic similarity, which specifically comprises the following sub-steps:

S21, selecting one appeal event from all appeal events and calculating the cosine similarity between its spliced vector and the spliced vector of every other appeal event, wherein two appeal events whose cosine similarity is not smaller than a set value belong to the same predefined category, and if the cosine similarity between every other appeal event and the selected appeal event is smaller than the set value, the selected appeal event alone is classified into one predefined category;

S22, removing the classified events from all appeal events, and then repeating step S21;

S23, repeating step S22 until all appeal events are classified;

S3, processing all events in each predefined category obtained in step S2 through the large model to obtain the abstract vector of that predefined category, the abstract vector having the same dimension as the semantic vector;

S4, calculating the cosine similarity between the semantic vector of each appeal event obtained in step S1 and each abstract vector obtained in step S3; if the cosine similarity between the semantic vector of an appeal event and a plurality of abstract vectors is not smaller than a set value, classifying the appeal event into the predefined category corresponding to the abstract vector with the largest cosine similarity, and taking that predefined category as an independent actual category; if the cosine similarity between the semantic vector of an appeal event and every abstract vector is smaller than the set value, classifying the appeal event alone into an independent actual category;

S5, counting the number of appeal events contained in each actual category obtained in step S4, wherein an actual category whose number of appeal events meets specific requirements is a Miao category, and all appeal events contained in a Miao category are Miao events.

Preferably, the semantic vector of the ith appeal event generated using the large model in step S1 is $V_i^o$, where i = 1, 2, 3, ..., N, and N is the total number of appeal events to be analyzed;

the semantic vector $V_i^o$ is reduced in dimension by UMAP to obtain the dimension reduction vector $V_i^p$;

the statistical feature vector of the ith appeal event is $V_i^s$, and the statistical feature vector $V_i^s$ comprises the arithmetic mean and the quantiles of the semantic vector $V_i^o$;

the spliced vector is $V_i$, and the dimension of $V_i$ is the sum of the dimensions of the dimension reduction vector $V_i^p$ and the statistical feature vector $V_i^s$.

Preferably, the cosine similarity of the spliced vectors in step S21 is calculated by the following formula

$$\mathrm{sim}(V_i, V_k) = \frac{\sum_{t=1}^{m} V_i^t V_k^t}{\sqrt{\sum_{t=1}^{m} (V_i^t)^2}\,\sqrt{\sum_{t=1}^{m} (V_k^t)^2}}$$

where $V_k$ is the spliced vector of the kth appeal event, and k ≠ i; m is the number of dimensions of the spliced vector; and $V_i^t$ and $V_k^t$ denote the values of the spliced vectors $V_i$ and $V_k$ in the tth dimension.

Preferably, the set value in step S21 is 0.9, and the predefined categories obtained in step S2 are $C_j$, where j = 1, 2, ..., n, and n is the total number of predefined categories.

Preferably, the abstract vector obtained in step S3 by processing the predefined category $C_j$ through the large model is $S_j$, where j = 1, 2, ..., n.

Preferably, in step S4 the cosine similarity between the semantic vector $V_i^o$ of the appeal event and the abstract vector $S_j$ is calculated by the following formula

$$\mathrm{sim}(V_i^o, S_j) = \frac{\sum_{t=1}^{w} (V_i^o)^t S_j^t}{\sqrt{\sum_{t=1}^{w} \left((V_i^o)^t\right)^2}\,\sqrt{\sum_{t=1}^{w} (S_j^t)^2}}$$

where $(V_i^o)^t$ and $S_j^t$ are the values of the semantic vector $V_i^o$ and the abstract vector $S_j$ in the tth dimension, and w is the number of dimensions of the semantic vector $V_i^o$ and the abstract vector $S_j$.

Preferably, the actual categories obtained in step S4 are $CE_x$, x = 1, 2, ..., u, where u is the total number of actual categories.

Preferably, in step S5 the number of appeal events contained in the xth actual category $CE_x$ is $|CE_x|$; an actual category $CE_x$ satisfying

$$\alpha \le |CE_x| \le \beta$$

is a Miao category, where α and β are calculated by the formula

$$\alpha = \max(2, \mathrm{Ceil}(N \times 0.5\%)), \quad \beta = \max(10, \mathrm{Ceil}(N \times 5\%))$$

where the Ceil() function denotes rounding up to the nearest integer.
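As a worked illustration of the two bounds, the formulas can be evaluated for two totals N. The values N = 1000 and N = 300 are chosen here only as examples and do not come from the source.

```latex
\begin{aligned}
N = 1000:&\quad \alpha = \max(2, \lceil 1000 \times 0.5\% \rceil) = \max(2, 5) = 5,
         &\beta = \max(10, \lceil 1000 \times 5\% \rceil) = \max(10, 50) = 50 \\
N = 300:&\quad \alpha = \max(2, \lceil 300 \times 0.5\% \rceil) = \max(2, \lceil 1.5 \rceil) = 2,
         &\beta = \max(10, \lceil 300 \times 5\% \rceil) = \max(10, 15) = 15
\end{aligned}
```

For small N the constants 2 and 10 act as floors; for larger N both bounds scale with the total number of events.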

A computer-readable storage medium storing a computer program that causes a computer to execute any one of the above methods for detecting Miao events based on a large model abstract vector.

An electronic device, comprising: a memory, a processor, and a computer program stored on the memory and runnable on the processor, wherein the processor implements the above method for detecting Miao events based on a large model abstract vector when executing the computer program.

The beneficial effects of the invention are as follows:

(1) The generation capability of the large model is used to generate a summary of each predefined category space, from which an abstract vector is then generated. The similarity between the semantic vector of each (civil) appeal event and all abstract vectors is calculated, and the similarity values are used to decide, by simple logic, which actual category the (civil) appeal event belongs to or whether it forms a new actual category.

This solves the problem that the number of clustering spaces (i.e., cluster categories) is not known in advance.

The summary generated by the large model, and the abstract vector generated from it, have stronger generalization capability than the semantics of the original events.

(2) The semantic vector (i.e., the original vector) of a civil appeal event is reduced in dimension, and the resulting dimension reduction vector is spliced with the statistical feature vector of the original vector to form a new semantic vector (i.e., the spliced vector).

The spliced vector has stronger generalization capability than the original vector, which helps to form a predefined category space with strong generalization capability when semantic similarity is calculated.

(3) The spliced vector of one civil appeal event is selected, and its similarity to the spliced vectors of the other civil appeal events is calculated. A threshold test on the similarity value determines whether two events should be grouped into one category. The same procedure is then applied to the remaining civil appeal events.

The method embodies an unsupervised idea. Because appeal events that have already been classified are removed, the remaining appeal events are no longer compared with any event in that category (even pairs that have never been compared are not compared). Compared with pairwise comparison over the full set of civil appeal events (where every pair not yet compared must be compared, whether or not the events have been classified), the amount of computation is smaller and the decision logic is simpler; and because categories are aggregated using spliced vectors, this helps to form a predefined category space with better generalization capability.

(4) For the actual categories formed by the method of the invention, Miao events are identified by combining business-defined upper and lower limit values with the total number of civil appeal events, which is more robust and faster than the causal-relation method and also allows more Miao events to be recalled.

Drawings

FIG. 1 is a schematic flow chart of the present invention.

Detailed Description

The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings of the embodiments of the present application, and it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.

Before the method provided by the invention is explained, note that the large model adopted by the invention is an LLM (Large Language Model): a large language model is an artificial intelligence model that aims to understand and generate human language, with parameters on the order of a hundred billion. Such a model is pre-trained on a large amount of text data and, after supervised fine-tuning, reinforcement learning based on human feedback, and similar steps, acquires capabilities such as text generation, text understanding, and interactive dialogue.

As shown in FIG. 1, the present invention provides a method for detecting Miao events based on a large model abstract vector, which includes the following steps (note that this embodiment is directed to civil appeal events; where the text omits "civil" and says only "appeal event", a civil appeal event is still meant):

S1, generating a semantic vector $V_i^o$ for each civil appeal event based on the large model, where i denotes the ith civil appeal event, i = 1, 2, 3, ..., N, and N is the total number of events in the range to be analyzed; here $V_i^o$ is defined to have 128 dimensions.

The semantic vector $V_i^o$ is reduced in dimension by UMAP to obtain the dimension reduction vector $V_i^p$; here $V_i^p$ is defined to have 32 dimensions. Other dimensionalities are also possible, provided they are smaller than that of $V_i^o$.

UMAP (Uniform Manifold Approximation and Projection) is a dimension reduction technique that can be used for general nonlinear dimension reduction; it markedly improves computation speed and better preserves the global structure of the data.

Statistical feature extraction is performed on the semantic vector to obtain the statistical feature vector $V_i^s$, which comprises the arithmetic mean of the vector $V_i^o$ together with its 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, and 90% quantiles, for a total of 10 dimensions.

The dimension reduction vector and the statistical feature vector are spliced to obtain the spliced vector $V_i$, which has 42 dimensions in total.

Compared with the semantic vector $V_i^o$, the spliced vector $V_i$ has stronger generalization capability.
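The construction of the statistical feature vector and the spliced vector in step S1 can be sketched as follows. This is a minimal illustration and not the patented implementation: the function names are hypothetical, the quantile interpolation method is an assumption (the text does not specify one), and the 32-dimensional reduced vector is a placeholder standing in for a real UMAP output.

```python
import statistics

def statistical_features(semantic_vector):
    """Return [mean, 10% quantile, ..., 90% quantile] of the components (10 values)."""
    mean = statistics.fmean(semantic_vector)
    # n=10 yields the 9 cut points at 10%, 20%, ..., 90%.
    # method="inclusive" is an assumption; the source does not name a method.
    deciles = statistics.quantiles(semantic_vector, n=10, method="inclusive")
    return [mean] + deciles

def splice(reduced_vector, semantic_vector):
    """Concatenate the reduced vector V_i^p with the statistical feature vector V_i^s."""
    return list(reduced_vector) + statistical_features(semantic_vector)

# Toy data: a 128-dimensional "semantic vector" with arbitrary values.
v_o = [((7 * t) % 13) / 13.0 for t in range(128)]
# Placeholder only: a real V_i^p would be produced by UMAP from V_i^o.
v_p = [0.0] * 32
v = splice(v_p, v_o)
print(len(statistical_features(v_o)), len(v))  # prints: 10 42
```

The dimension counts match the embodiment: 32 reduced dimensions plus 10 statistical features give a 42-dimensional spliced vector.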

S21, selecting one appeal event (say the kth) from all appeal events, and calculating the cosine similarity between its spliced vector $V_k$ and the spliced vector $V_i$ (i ≠ k) of every remaining appeal event by the following formula

$$\mathrm{sim}(V_i, V_k) = \frac{\sum_{t=1}^{m} V_i^t V_k^t}{\sqrt{\sum_{t=1}^{m} (V_i^t)^2}\,\sqrt{\sum_{t=1}^{m} (V_k^t)^2}}$$

where $V_k$ is the spliced vector of the kth appeal event, and i ≠ k; m is the number of dimensions of the spliced vector, with m = 42 in this embodiment (other values may be used as circumstances require); and $V_i^t$ and $V_k^t$ denote the values of the spliced vectors $V_i$ and $V_k$ in the tth dimension.
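The cosine similarity used in this step can be written directly from its definition. The sketch below uses only the Python standard library; the function name and the handling of zero vectors are assumptions made for illustration, since the source does not discuss that edge case.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors: dot(a, b) / (|a| * |b|)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: a zero vector is treated as dissimilar to everything.
        return 0.0
    return dot / (norm_a * norm_b)

# Orthogonal vectors have similarity 0.
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # prints: 0.0
```

The same function applies unchanged to the 42-dimensional spliced vectors here and to the 128-dimensional semantic and abstract vectors compared later.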

If $\mathrm{sim}(V_i, V_k) \ge 0.90$, $V_i$ is regarded as semantically similar to $V_k$, and $V_i$ and $V_k$ are placed in the same predefined category $C_k$; every appeal event whose spliced vector $V_i$ satisfies $\mathrm{sim}(V_i, V_k) \ge 0.90$ is classified, together with the kth appeal event, into the predefined category $C_k$.

Here 0.90 is used as the threshold (set value); the higher the threshold, the stricter the requirement on semantic similarity, and the threshold can be adjusted according to the actual situation.

If the cosine similarity between the spliced vector of every other appeal event and the spliced vector of the kth appeal event is less than 0.90, the kth appeal event alone is classified as a predefined category $C_k$ (this $C_k$ is the same concept as the $C_k$ above: a label denoting the predefined category to which the kth appeal event belongs).

S22, removing the classified appeal events from all appeal events to be analyzed, and repeating step S21;

S23, repeating step S22 until all appeal events are classified, finally forming n predefined categories $C_j$, where j = 1, 2, ..., n, and n is the total number of predefined categories.
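The loop formed by steps S21 to S23 can be sketched as a greedy pass over the unclassified events. This is an illustrative sketch under stated assumptions: the function names are hypothetical, the seed event is always the first unclassified one (the source only says that one event is selected), and the toy vectors are two-dimensional rather than 42-dimensional.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def greedy_cluster(vectors, threshold=0.90):
    """Return the predefined categories, each as a list of event indices."""
    remaining = list(range(len(vectors)))
    categories = []
    while remaining:                      # S23: repeat until everything is classified
        seed = remaining[0]               # S21: assumption, pick the first unclassified event
        members = [seed] + [
            i for i in remaining[1:]
            if cosine_similarity(vectors[i], vectors[seed]) >= threshold
        ]
        categories.append(members)        # a lone seed forms its own category
        classified = set(members)
        remaining = [i for i in remaining if i not in classified]  # S22: remove classified
    return categories

vectors = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.05, 0.99], [0.7, 0.7]]
print(greedy_cluster(vectors))  # prints: [[0, 1], [2, 3], [4]]
```

The number of categories falls out of the data and the threshold rather than being preset, which is the property the scheme relies on.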

S3, processing all events in each predefined category obtained in step S2 through the large model to obtain the abstract vector of that predefined category, specifically:

for all appeal events within each predefined category $C_j$, the large model is used to generate the summary corresponding to $C_j$, i.e., the main content of the appeal events in the predefined category $C_j$; the large model is then used to generate the semantic vector $S_j$ of that summary, j = 1, 2, ..., n, which is the abstract vector of $C_j$.
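The data flow of step S3 can be sketched with the two large-model operations passed in as callables. Everything named here is hypothetical: `summarize` and `embed` stand for whatever summarization and embedding interfaces the large model exposes, and the toy stand-ins below exist only so the sketch runs without a model. They do not perform real summarization or embedding.

```python
def category_summary_vectors(categories, texts, summarize, embed):
    """Map each predefined category C_j to its abstract vector S_j."""
    vectors = []
    for members in categories:
        summary = summarize([texts[i] for i in members])  # summary of the events in C_j
        vectors.append(embed(summary))                    # vector of that summary
    return vectors

# Toy stand-ins (assumptions for illustration only).
def toy_summarize(event_texts):
    return " ".join(event_texts)

def toy_embed(text):
    return [float(text.count("water")), float(text.count("noise"))]

texts = ["water outage", "no water since monday", "noise at night"]
categories = [[0, 1], [2]]
print(category_summary_vectors(categories, texts, toy_summarize, toy_embed))
# prints: [[2.0, 0.0], [0.0, 1.0]]
```

In a real setting `embed` would need to return vectors of the same dimension as the event semantic vectors, since the two are compared by cosine similarity in step S4.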

S4, calculating the cosine similarity between the semantic vector $V_i^o$ of each appeal event obtained in step S1 and each abstract vector $S_j$ obtained in step S3 by the following formula

$$\mathrm{sim}(V_i^o, S_j) = \frac{\sum_{t=1}^{w} (V_i^o)^t S_j^t}{\sqrt{\sum_{t=1}^{w} \left((V_i^o)^t\right)^2}\,\sqrt{\sum_{t=1}^{w} (S_j^t)^2}}$$

where $(V_i^o)^t$ and $S_j^t$ are the values of the semantic vector $V_i^o$ and the abstract vector $S_j$ in the tth dimension, and w is the number of dimensions of the semantic vector $V_i^o$ and the abstract vector $S_j$, with w = 128 in this embodiment (w may take other values as circumstances require).

If there are multiple similarity values $\mathrm{sim}(V_i^o, S_j) \ge 0.85$ (0.85 is again a set value, which can be modified according to actual requirements), where j = 1, 2, ..., n, the appeal event corresponding to $V_i^o$ is classified into the predefined category with the maximum similarity, and that predefined category becomes an independent actual category. That is, whenever there exists a semantic vector $V_i^o$ whose cosine similarity with the abstract vector $S_j$ is not less than 0.85, the predefined category corresponding to $S_j$ is renamed as a new actual category, which amounts to simply giving the predefined category a new name. The reason for the renaming is that the predefined categories are more general than the actual categories, which are the specific classification made for the appeal events analyzed this time, so the two need to be distinguished.

If $\mathrm{sim}(V_i^o, S_j) < 0.85$ for all j = 1, 2, ..., n, the appeal event alone is classified into one actual category.

After the cosine similarity between every semantic vector $V_i^o$ and every abstract vector $S_j$ has been calculated and the events classified accordingly, the final actual categories $CE_x$ are obtained, x = 1, 2, ..., u, where u is the total number of actual categories; note that u and n may be the same or different, depending on the actual situation.
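The reclassification in step S4 can be sketched as follows. The function names and the label format are hypothetical, and the toy vectors are two-dimensional. One further assumption is labeled in the code: a predefined category that attracts no event at or above the threshold simply does not appear among the actual categories.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def reclassify(semantic_vectors, abstract_vectors, threshold=0.85):
    """Return the actual categories as a dict: label -> list of event indices."""
    actual = {}
    for i, v in enumerate(semantic_vectors):
        sims = [cosine_similarity(v, s) for s in abstract_vectors]
        best_j = max(range(len(sims)), key=lambda j: sims[j], default=None)
        if best_j is not None and sims[best_j] >= threshold:
            label = ("predefined", best_j)  # category j is carried over as an actual category
        else:
            label = ("singleton", i)        # no abstract vector is close enough
        # Assumption: predefined categories that attract no event are dropped.
        actual.setdefault(label, []).append(i)
    return actual

semantic_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.6, 0.6]]
abstract_vectors = [[1.0, 0.05], [0.0, 1.0]]
result = reclassify(semantic_vectors, abstract_vectors)
print(len(result))  # prints: 3
```

In this toy run there are n = 2 predefined categories but u = 3 actual categories, because the last event is not close enough to either abstract vector and forms a category of its own.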

S5, traversing the number of appeal events in all actual categories: the number of appeal events contained in the xth actual category $CE_x$ is $|CE_x|$, and an actual category $CE_x$ satisfying

$$\alpha \le |CE_x| \le \beta$$

is a Miao category, where α and β are calculated by the formula

$$\alpha = \max(2, \mathrm{Ceil}(N \times 0.5\%)), \quad \beta = \max(10, \mathrm{Ceil}(N \times 5\%))$$

where the Ceil() function denotes rounding up to the nearest integer, max(x, y) denotes the larger of x and y, and N is the total number of events to be analyzed. All appeal events contained in a Miao category are Miao events.
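The counting rule of step S5 can be sketched as below. The function names are hypothetical, and the sketch assumes that the requirement is a count lying between α and β with both ends included; how the boundaries are handled is an assumption of this example. Integer arithmetic is used for the ceiling so that no floating-point rounding affects the bounds.

```python
def ceil_div(a, b):
    """Ceiling of a / b for positive integers, without floating point."""
    return -(-a // b)

def miao_bounds(n_events):
    alpha = max(2, ceil_div(n_events * 5, 1000))  # max(2, Ceil(N * 0.5%))
    beta = max(10, ceil_div(n_events * 5, 100))   # max(10, Ceil(N * 5%))
    return alpha, beta

def detect_miao_categories(actual_categories, n_events):
    """Keep the actual categories whose size lies in [alpha, beta] (inclusive, by assumption)."""
    alpha, beta = miao_bounds(n_events)
    return {
        label: members
        for label, members in actual_categories.items()
        if alpha <= len(members) <= beta
    }

# Toy data: 120 events split into four actual categories.
actual_categories = {
    "a": list(range(0, 100)),    # 100 events: too large, an established topic
    "b": list(range(100, 112)),  # 12 events: above beta
    "c": list(range(112, 119)),  # 7 events: within [2, 10]
    "d": [119],                  # 1 event: below alpha
}
print(miao_bounds(120))                                      # prints: (2, 10)
print(list(detect_miao_categories(actual_categories, 120)))  # prints: ['c']
```

Every event in a retained category (here the seven events of "c") would be reported as a Miao event.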

The present invention also provides a computer-readable storage medium storing a computer program for causing a computer to execute the above-described method for detecting Miao events based on a large model abstract vector.

The present invention also provides an electronic device including: a memory, a processor, and a computer program stored on the memory and runnable on the processor, wherein the processor implements the above method for detecting Miao events based on a large model abstract vector when executing the computer program.

In the embodiments disclosed herein, a computer storage medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The computer storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a computer storage medium would include one or more wire-based electrical connections, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.

The above is only a preferred embodiment of the present invention, and the protection scope of the present invention is not limited to the above examples, and all technical solutions belonging to the concept of the present invention belong to the protection scope of the present invention. It should be noted that modifications and adaptations to the invention without departing from the principles thereof are intended to be within the scope of the invention as set forth in the following claims.

Claims

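1. A method for detecting Miao events based on a large model abstract vector, characterized by comprising the following steps:

S1, generating a semantic vector for each appeal event by using a large model; performing dimension reduction on the semantic vector to obtain a dimension reduction vector; performing statistical feature extraction on the semantic vector to obtain a statistical feature vector; and splicing the dimension reduction vector and the statistical feature vector to obtain a spliced vector;

S2, clustering by semantic similarity, which specifically comprises the following sub-steps:

S21, selecting one appeal event from all appeal events and calculating the cosine similarity between its spliced vector and the spliced vector of every other appeal event, wherein two appeal events whose cosine similarity is not smaller than a set value belong to the same predefined category, and if the cosine similarity between every other appeal event and the selected appeal event is smaller than the set value, the selected appeal event alone is classified into one predefined category;

S22, removing the classified events from all appeal events, and then repeating step S21;

S23, repeating step S22 until all appeal events are classified;

S3, processing all events in each predefined category obtained in step S2 through the large model to obtain the abstract vector of that predefined category, the abstract vector having the same dimension as the semantic vector;

S4, calculating the cosine similarity between the semantic vector of each appeal event obtained in step S1 and each abstract vector obtained in step S3; if the cosine similarity between the semantic vector of an appeal event and a plurality of abstract vectors is not smaller than a set value, classifying the appeal event into the predefined category corresponding to the abstract vector with the largest cosine similarity, and taking that predefined category as an independent actual category; if the cosine similarity between the semantic vector of an appeal event and every abstract vector is smaller than the set value, classifying the appeal event alone into an independent actual category;

S5, counting the number of appeal events contained in each actual category obtained in step S4, wherein an actual category whose number of appeal events meets specific requirements is a Miao category, and all appeal events contained in a Miao category are Miao events.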

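2. The method for detecting Miao events based on a large model abstract vector according to claim 1, wherein: the semantic vector of the ith appeal event generated in step S1 using the large model is $V_i^o$, where i = 1, 2, 3, ..., N, and N is the total number of appeal events to be analyzed;

the semantic vector $V_i^o$ is reduced in dimension by UMAP to obtain the dimension reduction vector $V_i^p$;

the statistical feature vector of the ith appeal event is $V_i^s$, and the statistical feature vector $V_i^s$ comprises the arithmetic mean and the quantiles of the semantic vector $V_i^o$;

the spliced vector is $V_i$, and the dimension of $V_i$ is the sum of the dimensions of the dimension reduction vector $V_i^p$ and the statistical feature vector $V_i^s$.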

3. The method for detecting Miao events based on a large model abstract vector according to claim 2, wherein: the cosine similarity of the spliced vectors in step S21 is calculated by the following formula

$$\mathrm{sim}(V_i, V_k) = \frac{\sum_{t=1}^{m} V_i^t V_k^t}{\sqrt{\sum_{t=1}^{m} (V_i^t)^2}\,\sqrt{\sum_{t=1}^{m} (V_k^t)^2}}$$

where $V_k$ is the spliced vector of the kth appeal event, and k ≠ i; m is the number of dimensions of the spliced vector; and $V_i^t$ and $V_k^t$ denote the values of the spliced vectors $V_i$ and $V_k$ in the tth dimension.

4. The method for detecting Miao events based on a large model abstract vector according to claim 3, wherein: the set value in step S21 is 0.9, and the predefined categories obtained in step S2 are $C_j$, where j = 1, 2, ..., n, and n is the total number of predefined categories.

5. The method for detecting Miao events based on a large model abstract vector according to claim 4, wherein: the abstract vector obtained in step S3 by processing the predefined category $C_j$ through the large model is $S_j$, where j = 1, 2, ..., n.

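6. The method for detecting Miao events based on a large model abstract vector according to claim 5, wherein: in step S4, the cosine similarity between the semantic vector $V_i^o$ of the appeal event and the abstract vector $S_j$ is calculated by the following formula

$$\mathrm{sim}(V_i^o, S_j) = \frac{\sum_{t=1}^{w} (V_i^o)^t S_j^t}{\sqrt{\sum_{t=1}^{w} \left((V_i^o)^t\right)^2}\,\sqrt{\sum_{t=1}^{w} (S_j^t)^2}}$$

where $(V_i^o)^t$ and $S_j^t$ are the values of the semantic vector $V_i^o$ and the abstract vector $S_j$ in the tth dimension, and w is the number of dimensions of the semantic vector $V_i^o$ and the abstract vector $S_j$.

7. The method for detecting Miao events based on a large model abstract vector according to claim 6, wherein: the actual categories obtained in step S4 are $CE_x$, x = 1, 2, ..., u, where u is the total number of actual categories.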

8. The method for detecting Miao events based on a large model abstract vector according to claim 7, wherein: in step S5 the number of appeal events contained in the xth actual category $CE_x$ is $|CE_x|$; an actual category $CE_x$ satisfying

$$\alpha \le |CE_x| \le \beta$$

is a Miao category, where α and β are calculated by the formula

$$\alpha = \max(2, \mathrm{Ceil}(N \times 0.5\%)), \quad \beta = \max(10, \mathrm{Ceil}(N \times 5\%))$$

where the Ceil() function denotes rounding up to the nearest integer.

9. A computer-readable storage medium storing a computer program, wherein the computer program causes a computer to execute the method for detecting Miao events based on a large model abstract vector according to any one of claims 1 to 8.

10. An electronic device, comprising: a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the method for detecting Miao events based on a large model abstract vector according to any one of claims 1 to 8 when executing the computer program.