CN117149457A - Information self-adaptive distribution strategy and flow automatic arrangement system of message middleware - Google Patents
- Publication number
- CN117149457A (application CN202311096337.2A)
- Authority
- CN
- China
- Prior art keywords
- word
- information
- words
- user
- text
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/54—Interprogram communication
- G06F9/546—Message passing systems or structures, e.g. queues
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
- G06F16/367—Ontology
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9535—Search customisation based on user profiles and personalisation
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/02—Standardisation; Integration
- H04L41/0246—Exchanging or transporting network management information using the Internet; Embedding network management web servers in network elements; Web-services-based protocols
- H04L41/026—Exchanging or transporting network management information using the Internet; Embedding network management web servers in network elements; Web-services-based protocols using e-messaging for transporting management information, e.g. email, instant messaging or chat
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2209/00—Indexing scheme relating to G06F9/00
- G06F2209/54—Indexing scheme relating to G06F9/54
- G06F2209/547—Messaging middleware
Abstract
The invention discloses an information self-adaptive distribution strategy and a flow automatic arrangement method for message middleware, belonging to the field of data processing. Guided by user demand, the method designs an information-distribution self-adaptive strategy that aims to change the traditional passive message-subscription mode so that users actively receive the information they are interested in. User interest labels are acquired based on LDA topic modeling, and a user interest knowledge graph is constructed based on Neo4j; based on an attention mechanism, a mapping relation is constructed between the global semantic information of the user interest knowledge graph and that of the message text to be distributed, realizing semantic self-adaptive distribution of information content; self-adaptive distribution of information modalities is constructed with an immune optimization algorithm so as to improve the channel utilization rate of information distribution; and finally, an information self-adaptive distribution system and automatic flow arrangement are built on flow-engine technology, realizing modularized customization of the distribution data nodes and automatic generation and arrangement of the distribution flow.
Description
Technical Field
The invention relates to the technical field of data processing, and in particular to an information self-adaptive distribution strategy and flow automatic arrangement system for message middleware.
Background
Data sharing and data distribution among the nodes of a system can be realized in several technical ways, of which the three most important are federated database systems, data warehouses, and middleware technology. Middleware is currently the most popular: it can shield the differences between heterogeneous data sources, realize interconnection, intercommunication, and interoperation between nodes, and provide users a unified interface for accessing heterogeneous data. However, as the volume of information and data on the network keeps growing, sending information in the conventional subscription/distribution manner raises two problems: passive reception at the subscribing end leads to an ever-increasing amount of junk information; and point-to-point topic matching in the data link domain between publishing and subscribing users makes inefficient use of channel resources. The invention therefore provides an information self-adaptive distribution strategy and flow automatic arrangement system for message middleware, aiming to solve the problems that, under the traditional subscription/distribution mechanism, users passively receive message content that cannot adapt to their demands, making on-demand information acquisition difficult, and that the information distribution strategy is inflexible.
Disclosure of Invention
The invention aims to solve two problems in the prior art: passive reception at the subscribing end causes a continual increase of junk information, and point-to-point topic matching in the data link domain between publishing and subscribing users makes inefficient use of channel resources.
To solve these problems, the invention provides an information self-adaptive distribution strategy and flow automatic arrangement system for message middleware, which can realize self-adaptive distribution of messages based on user interests and demands, adaptively distribute different message modalities according to network conditions, and realize automatic flow arrangement of data distribution. The specific steps of the self-adaptive information distribution strategy are as follows:
step S10: constructing a topic content extraction model based on the TF-IDF, TextRank, and LDA algorithms, used to acquire the topic content of interest from a user's received historical message texts, and further performing feature screening based on the AdaBoost algorithm so as to acquire the user's topic words of interest;
step S20: constructing a triplet <user, relation, interest tag phrase> from the user and the interest topic group, and constructing a user interest knowledge graph based on the Neo4j tool;
step S30: based on the BERT model and the attention mechanism, constructing a semantic mapping relation between the user interest knowledge graph and the messages to be received, realizing self-adaptive distribution of information content to interested users;
step S40: in scenarios of network change and limitation, based on user priority and network link state, realizing an information modality (text, picture, video, voice, etc.) self-adaptive distribution strategy via an immune optimization algorithm, and selecting the optimal combination of information distribution modalities so as to improve channel utilization efficiency;
step S50: designing an automatic arrangement system that meets the requirements of the data self-adaptive distribution module based on ETL and a flow engine, and constructing the information distribution flow.
In the above technical solution, the specific steps of step S10 are as follows:
step S101, the TF-IDF algorithm uses a word segmentation tool to segment the input document, removes stop words and low-frequency words, and reserves only words with noun part of speech as the candidate word set. The TF-IDF value of each candidate word is calculated, where the TF value (Term Frequency) is:
TF(w) = m / n
where m is the number of times the word w appears in the text and n is the total number of words in the text. The IDF value (Inverse Document Frequency) of the candidate word is calculated by dividing the total number of documents N by the number of documents N_w containing the word and taking the logarithm of the quotient:
IDF(w) = log(N / N_w)
and the TF-IDF value is TF(w) × IDF(w).
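As an illustrative sketch (not the patent's implementation) of the step S101 scoring, assuming the documents have already been segmented and filtered down to noun candidates, e.g. by a Chinese segmenter such as jieba:

```python
import math
from collections import Counter

def tf_idf(docs):
    """Score every word of every document by TF-IDF.

    TF(w)  = m / n        (m: count of w in the doc, n: words in the doc)
    IDF(w) = log(N / N_w) (N: total docs, N_w: docs containing w)
    """
    n_docs = len(docs)
    doc_freq = Counter()                 # N_w for each word
    for doc in docs:
        doc_freq.update(set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        scores.append({
            w: (m / total) * math.log(n_docs / doc_freq[w])
            for w, m in counts.items()
        })
    return scores

# Toy "already segmented, stop words removed" documents.
docs = [
    ["message", "middleware", "queue"],
    ["message", "topic", "queue"],
    ["graph", "interest", "topic"],
]
scores = tf_idf(docs)
# "middleware" occurs in only one document, so it outscores "message" there.
assert scores[0]["middleware"] > scores[0]["message"]
```

Words that occur in every document receive IDF = 0 and are thereby excluded from the candidate topic words.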
step S102, the TextRank algorithm uses a word segmentation tool to segment the input document, removes stop words and low-frequency words, and reserves only words with noun part of speech as the candidate word set. The reserved words are built into an undirected semantic-relation graph, and the TextRank value of each candidate word is calculated:
TR(V_i) = (1 − d) + d × Σ_{V_j ∈ In(V_i)} ( w_ji / Σ_{V_k ∈ Out(V_j)} w_jk ) × TR(V_j)
where d is the damping coefficient, typically set to 0.85, In(V_i) is the set of words pointing to word i, Out(V_i) is the set of words that word i points to, and w_ij is the weight of the edge between word nodes V_i and V_j.
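The TextRank iteration described above can be sketched as follows; an unweighted co-occurrence graph (all edge weights w_ij = 1) is assumed for brevity:

```python
def textrank(neighbors, d=0.85, iterations=50):
    """Iteratively compute TR(V_i) = (1 - d) + d * sum over in-neighbors
    of TR(V_j) / deg(V_j), on an unweighted undirected word graph."""
    scores = {v: 1.0 for v in neighbors}
    for _ in range(iterations):
        new = {}
        for v in neighbors:
            new[v] = (1 - d) + d * sum(
                scores[u] / len(neighbors[u]) for u in neighbors[v]
            )
        scores = new
    return scores

# Toy co-occurrence graph: "topic" co-occurs with every other word.
graph = {
    "topic":    ["message", "queue", "interest"],
    "message":  ["topic", "queue"],
    "queue":    ["topic", "message"],
    "interest": ["topic"],
}
scores = textrank(graph)
# The most connected word receives the highest TextRank value.
assert max(scores, key=scores.get) == "topic"
```

In practice the graph would be built from a sliding co-occurrence window over the segmented text, and iteration would stop once the scores converge.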
step S103, the LDA algorithm uses a word segmentation tool to segment the input document, removes stop words and low-frequency words, and reserves only words with noun part of speech as the candidate word set. P(Word|Topic) is calculated from the probability of a word appearing in a topic, P(Topic|Text) from the probability of a topic appearing in the document, and the probability of a word appearing in the text from the two:
P(Word|Text) = Σ_Topic P(Word|Topic) × P(Topic|Text)
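A worked instance of the mixture formula above; the toy P(Word|Topic) and P(Topic|Text) tables below are hand-picked illustrations standing in for the distributions a real LDA library (e.g. gensim) would estimate:

```python
# P(Word|Topic) for two toy topics over a small vocabulary (assumed values).
p_word_topic = {
    "tech": {"queue": 0.5, "message": 0.4, "graph": 0.1, "user": 0.0},
    "user": {"queue": 0.0, "message": 0.1, "graph": 0.3, "user": 0.6},
}
# P(Topic|Text): this document is mostly about the "tech" topic.
p_topic_text = {"tech": 0.8, "user": 0.2}

def p_word_text(word):
    # P(Word|Text) = sum over topics of P(Word|Topic) * P(Topic|Text)
    return sum(p_word_topic[t][word] * p_topic_text[t] for t in p_topic_text)

assert abs(p_word_text("queue") - 0.40) < 1e-9  # 0.5*0.8 + 0.0*0.2
assert abs(p_word_text("user") - 0.12) < 1e-9   # 0.0*0.8 + 0.6*0.2
```

Words with high P(Word|Text) for a user's message history become that user's candidate topic words.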
step S104, the AdaBoost algorithm combines the weak classifiers (the TF-IDF, TextRank, and LDA classifiers) into a strong classifier; the idea is to adjust the weights of the training samples in the data set to learn several classifiers and integrate them according to a certain rule so as to improve classification performance. Finally, the obtained subject terms are taken as the user's interest tag words.
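AdaBoost proper learns its classifier weights by iteratively re-weighting the training samples; as a simplified, hypothetical stand-in for the step S104 combination, a fixed weighted vote over the three extractors' normalised scores illustrates the ensemble idea:

```python
def combine_keyword_rankings(rankings, weights):
    """Weighted vote over the candidate keywords of several weak extractors.

    rankings: list of dicts word -> score (e.g. from TF-IDF, TextRank, LDA);
    weights:  one weight per extractor, playing the role of the classifier
              weights AdaBoost would learn (assumed values here).
    """
    combined = {}
    for ranking, weight in zip(rankings, weights):
        top = max(ranking.values())  # normalise so the weights are comparable
        for word, score in ranking.items():
            combined[word] = combined.get(word, 0.0) + weight * score / top
    return sorted(combined, key=combined.get, reverse=True)

tfidf    = {"queue": 0.9, "message": 0.5, "user": 0.1}
textrank = {"queue": 0.7, "message": 0.8, "user": 0.2}
lda      = {"queue": 0.6, "message": 0.4, "interest": 0.5}

tags = combine_keyword_rankings([tfidf, textrank, lda], [0.5, 0.3, 0.2])
assert tags[0] == "queue"
```

The top-ranked words of the combined list are the interest tag words attached to the user.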
In the above technical solution, the specific steps of step S20 are as follows:
step S201, extracting named entity words (users and the users' subject words of interest) and relation words, adding them into the knowledge graph database, and importing them into Neo4j to realize visualized construction of the knowledge graph.
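One way step S201 could be sketched is by rendering each <user, relation, interest tag> triplet as a Cypher MERGE statement. The node labels, relation name, and user names below are illustrative assumptions; a real pipeline would execute the statements through the Neo4j Python driver (session.run) rather than collect them as strings:

```python
def triple_to_cypher(user, relation, tag):
    """Render one <user, relation, interest tag> triple as an idempotent
    Cypher MERGE statement (labels and relation name are illustrative)."""
    return (
        f"MERGE (u:User {{name: '{user}'}}) "
        f"MERGE (t:InterestTag {{name: '{tag}'}}) "
        f"MERGE (u)-[:{relation}]->(t)"
    )

triples = [
    ("alice", "INTERESTED_IN", "message middleware"),
    ("alice", "INTERESTED_IN", "knowledge graph"),
]
statements = [triple_to_cypher(*t) for t in triples]
assert "MERGE (u:User {name: 'alice'})" in statements[0]
assert "[:INTERESTED_IN]" in statements[0]
```

MERGE (rather than CREATE) keeps the import idempotent when a user's interest tags are refreshed.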
In the above technical solution, the specific steps of step S30 are as follows:
step S301, the message text is represented as I = {w_1, …, w_n}, where n is the number of words in the text, and the interest-tag entity words of the knowledge graph are represented as U = {w_1, …, w_m}. The pre-trained language model BERT is used to obtain the sentence context feature vector representations I_h = {h_CLS, h_1, …, h_n} and U_h = {h_CLS, h_1, …, h_m}.
step S302, based on the attention mechanism, the cosine similarity between the message text and the interest-tag entity words of the knowledge graph is calculated as:
sim(U_i, I_j) = (U_i · I_j) / (‖U_i‖ × ‖I_j‖)
where U_i denotes the user interest feature representation and I_j denotes the message text feature representation.
step S303, based on the semantic mapping model between user interest words and message text, a cosine similarity threshold β is set, and the user–message pairs whose similarity exceeds β form the to-be-distributed list (U_i, I_j).
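A minimal sketch of steps S302–S303, with short hand-made vectors standing in for the BERT [CLS] sentence representations (the threshold value and vectors are illustrative):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def to_distribute(user_vecs, msg_vecs, beta=0.8):
    """Return all (user, message) pairs whose similarity exceeds beta.
    The vectors stand in for BERT [CLS] sentence embeddings."""
    return [
        (u, m)
        for u, uv in user_vecs.items()
        for m, mv in msg_vecs.items()
        if cosine(uv, mv) > beta
    ]

users = {"U1": [1.0, 0.0, 0.2], "U2": [0.0, 1.0, 0.0]}
msgs  = {"I1": [0.9, 0.1, 0.2], "I2": [0.1, 0.9, 0.1]}
pairs = to_distribute(users, msgs, beta=0.8)
assert ("U1", "I1") in pairs and ("U2", "I2") in pairs
assert ("U1", "I2") not in pairs
```

Each message is thus pushed only to the users whose interest representation it matches, instead of to every subscriber of a topic.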
In the above technical solution, the specific steps of step S40 are as follows:
step S401, the user channel communication status is acquired with a network probe, and, according to the to-be-distributed list (U_i, I_j), the modality (text, picture, video, etc.) in which the information to be distributed carries its data is acquired, together with its size.
step S402, the above users and data to be distributed are defined as [user, text size, picture size, video size, voice size] = [U_i, T_i, I_i, V_i].
step S403, the optimal transmit-data list [U_j, T_j, I_j, V_j] is selected based on the immune algorithm.
In the above technical solution, the specific steps of step S50 are as follows:
step S501, an ETL tool is used to process the data conversion flow, including data extraction, cleaning and conversion, and loading.
step S502, the flow definition file is designed to contain the information required to visually display each node (node position, size, shape, etc.).
step S503, the processing nodes and the circulation mode are defined based on the workflow engine.
step S504, the required flow definition file is stored and deployed into the runtime environment of the workflow engine, and the execution flow decides the subsequent data processing flow according to the identification information parsed from the currently converted data.
The beneficial effects of the invention are as follows:
firstly, the topic content of interest in the user's historical messages is acquired based on LDA topic modeling, and a user interest knowledge graph is constructed based on Neo4j; secondly, a mapping relation between the global semantic information of the user interest knowledge graph and that of the message text to be distributed is constructed based on an attention mechanism, realizing semantic self-adaptive distribution of information content; furthermore, considering the scenario of a limited network state, self-adaptive distribution of information modalities (text, picture, video, voice, etc.) is constructed with an immune optimization algorithm so as to improve the channel utilization rate of information distribution; and finally, an information self-adaptive distribution system and automatic flow arrangement are constructed based on flow-engine technology, realizing modularized customization of the distribution data nodes and automatic generation and arrangement of the distribution flow.
Drawings
FIG. 1 is a flow chart of the overall technology of the invention.
Fig. 2 is a technical roadmap for implementing the information adaptive distribution strategy of the present invention.
FIG. 3 is a technical roadmap for implementing the automatic flow layout of information distribution according to the invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Referring to fig. 1, the present invention provides an overall technical solution:
The information source text is input into the information self-adaptive distribution strategy; the strategy performs content-adaptive selection according to user interests and modality-adaptive selection according to network conditions; finally, the automatic information distribution system is constructed by flow arrangement technology, and the system is ultimately used for the message distribution management of message middleware.
Referring to fig. 2, the detailed steps of the information adaptation strategy are as follows:
Step 1: constructing a topic content extraction model based on the TF-IDF, TextRank, and LDA algorithms, acquiring the topic content of interest from the user's received historical message texts, and further performing feature screening based on the AdaBoost algorithm so as to acquire the user's topic words of interest;
Step 1-1, the TF-IDF algorithm performs word segmentation on the input document with a word segmentation tool, removes stop words and low-frequency words, and reserves only words with noun part of speech as the candidate word set. The TF-IDF value of each candidate word is calculated, where the TF value (Term Frequency) is:
TF(w) = m / n
where m is the number of times the word w appears in the text and n is the total number of words in the text. The IDF value (Inverse Document Frequency) of the candidate word is calculated by dividing the total number of documents N by the number of documents N_w containing the word and taking the logarithm of the quotient:
IDF(w) = log(N / N_w)
and the TF-IDF value is TF(w) × IDF(w).
Step 1-2, the TextRank algorithm performs word segmentation on the input document with a word segmentation tool, removes stop words and low-frequency words, and reserves only words with noun part of speech as the candidate word set. The reserved words are built into an undirected semantic-relation graph, and the TextRank value of each candidate word is calculated:
TR(V_i) = (1 − d) + d × Σ_{V_j ∈ In(V_i)} ( w_ji / Σ_{V_k ∈ Out(V_j)} w_jk ) × TR(V_j)
where d is the damping coefficient, typically set to 0.85, In(V_i) is the set of words pointing to word i, Out(V_i) is the set of words that word i points to, and w_ij is the weight of the edge between word nodes V_i and V_j.
Step 1-3, the LDA algorithm performs word segmentation on the input document with a word segmentation tool, removes stop words and low-frequency words, and reserves only words with noun part of speech as the candidate word set. P(Word|Topic) is calculated from the probability of a word appearing in a topic, P(Topic|Text) from the probability of a topic appearing in the document, and the probability of a word appearing in the text from the two:
P(Word|Text) = Σ_Topic P(Word|Topic) × P(Topic|Text)
Step 1-4, the AdaBoost algorithm combines the weak classifiers (the TF-IDF, TextRank, and LDA classifiers) into a strong classifier; the idea is to adjust the weights of the training samples in the data set to learn several classifiers and integrate them according to a certain rule so as to improve classification performance. Finally, the obtained subject terms are taken as the user's interest tag words.
Step 2: constructing the triplet <user, relation, interest tag phrase> from the user and the interest topic group, and constructing the user interest knowledge graph based on the Neo4j tool;
Step 2-1, extracting named entity words (users and the users' subject words of interest) and relation words, adding them into the knowledge graph database, and importing them into Neo4j to realize visualized construction of the knowledge graph.
Step 3: based on the BERT model and the attention mechanism, constructing a semantic mapping relation between the user interest knowledge graph and the messages to be received, realizing self-adaptive distribution of information content to interested users;
Step 3-1, the message text is represented as I = {w_1, …, w_n}, where n is the number of words in the text, and the interest-tag entity words of the knowledge graph are represented as U = {w_1, …, w_m}. The pre-trained language model BERT is used to obtain the sentence context feature vector representations I_h = {h_CLS, h_1, …, h_n} and U_h = {h_CLS, h_1, …, h_m}.
Step 3-2, based on the attention mechanism, the cosine similarity between the message text and the interest-tag entity words of the knowledge graph is calculated as:
sim(U_i, I_j) = (U_i · I_j) / (‖U_i‖ × ‖I_j‖)
where U_i denotes the user interest feature representation and I_j denotes the message text feature representation.
Step 3-3, based on the semantic mapping model between user interest words and message text, a cosine similarity threshold β is set, and the user–message pairs whose similarity exceeds β form the to-be-distributed list (U_i, I_j).
Step 4: in scenarios of network change and limitation, based on user priority and network link state, an information modality (text, picture, video, voice, etc.) self-adaptive distribution strategy is realized with an immune optimization algorithm, selecting the optimal combination of information distribution modalities so as to improve channel utilization efficiency;
Step 4-1, the user channel communication state is acquired with a network probe, and, according to the to-be-distributed list (U_i, I_j), the modality (text, picture, video, etc.) in which the information to be distributed carries its data is acquired, together with its size.
Step 4-2, the above users and data to be distributed are defined as [user, text size, picture size, video size, voice size] = [U_i, T_i, I_i, V_i].
Step 4-3, the optimal transmit-data list [U_j, T_j, I_j, V_j] is selected based on the immune algorithm, with the following detailed steps:
(1) Antigen recognition: the objective function and the various constraints are input to the immune algorithm as the antigen.
(2) Initial antibody production: the initial antibody population is generated randomly.
(3) Affinity calculation: the fitness value of each antibody is calculated.
(4) Immune treatment: includes immune selection, cloning, mutation, and suppression.
(5) Immune selection: antibodies with higher affinity are selected according to their affinity.
(6) Cloning: the selected higher-affinity antibodies are replicated.
(7) Mutation: crossover and mutation operations are applied to the cloned individuals so as to change their affinity.
(8) Suppression: the mutated antibodies are screened and those with higher affinity are retained.
(9) Population refresh: the immune-selected and immune-suppressed antibodies form a collection from which the higher-affinity antibodies are retained and enter the new population; any shortfall in the new population is filled with randomly generated antibodies to increase diversity.
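The nine steps above can be sketched as a small clonal-selection loop. The modality sizes, values, and channel capacity below are illustrative assumptions; affinity is simply the total value of a selection that fits the channel, and infeasible antibodies score zero:

```python
import random

def immune_select(items, capacity, pop=30, generations=60, seed=7):
    """Clonal-selection sketch: choose a subset of (size, value) items whose
    total size fits the channel capacity while maximising total value.
    Antibodies are bit strings over the items."""
    rng = random.Random(seed)
    n = len(items)

    def affinity(ab):
        size = sum(items[i][0] for i in range(n) if ab[i])
        value = sum(items[i][1] for i in range(n) if ab[i])
        return value if size <= capacity else 0.0

    population = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop)]
    for _ in range(generations):
        population.sort(key=affinity, reverse=True)
        elite = population[: pop // 3]                 # immune selection
        clones = []
        for ab in elite:
            for _ in range(3):                         # cloning
                clone = ab[:]
                clone[rng.randrange(n)] ^= 1           # mutation
                clones.append(clone)
        merged = sorted(elite + clones, key=affinity, reverse=True)
        population = merged[: pop - 5]                 # suppression
        population += [[rng.randint(0, 1) for _ in range(n)]
                       for _ in range(5)]              # population refresh
    return max(population, key=affinity)

# Modalities of one message: (size in KB, value to the user) - assumed data.
modalities = [(5, 3), (40, 6), (900, 10), (60, 5)]  # text, picture, video, voice
best = immune_select(modalities, capacity=120)
chosen_size = sum(s for (s, _), b in zip(modalities, best) if b)
assert chosen_size <= 120   # the selected combination fits the channel
```

Because the elite always survives the merge, the best feasible combination found so far is never lost between generations.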
Referring to fig. 3, the detailed steps for constructing the information adaptive distribution system based on the process engine technology are as follows:
step 5: and designing an automatic arrangement system meeting the requirement of the data self-adaptive distribution module based on the ETL and the flow engine, and constructing an information distribution flow.
Step 5-1, an ETL tool is used to process the data conversion flow, including data extraction, cleaning and conversion, and loading.
Step 5-2, the flow definition file is designed to contain the information required to visually display each node (node position, size, shape, etc.).
Step 5-3, defining a processing node and a circulation mode based on a workflow engine, and specifically comprising the following steps:
(1) Start node: indicates the start of a process flow; there can be only one start node.
(2) End node: indicates the end of a process flow; there can be several, or even none (a flow without an end node can still run, but it is not reasonable in practice).
(3) Task node: the core node, representing an approval step; the process can pause at task nodes, and the API must be called to push the process forward. Task nodes include automatic tasks and manual tasks; since we mainly perform data processing, automatic tasks are primarily used.
(4) Gateway node: a node for flow control; for example, a gateway node can hold the flow until all of its incoming branches have been approved, and only when every branch is complete does it push the flow forward; a gateway node can be designed with multiple outlets.
Step 5-4, the required flow definition file is stored and deployed into the runtime environment of the workflow engine, and the execution flow decides the subsequent data processing flow according to the identification information parsed from the currently converted data.
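The node types of step 5-3 can be sketched as a toy flow walker. A real engine would read a BPMN-style flow definition file; the node names and the 'kind' routing field here are hypothetical:

```python
def run_flow(nodes, data):
    """Walk a tiny flow definition: task nodes transform the record, the
    gateway picks a branch from the identification field in the data."""
    current = "start"
    while True:
        node = nodes[current]
        if node["type"] == "end":
            return data
        if node["type"] == "task":
            data = node["handler"](data)
            current = node["next"]
        elif node["type"] == "gateway":
            # Route on the identification information in the record.
            current = node["routes"][data["kind"]]
        else:  # start node
            current = node["next"]

flow = {
    "start":   {"type": "start", "next": "clean"},
    "clean":   {"type": "task", "next": "route",
                "handler": lambda d: {**d, "clean": True}},
    "route":   {"type": "gateway",
                "routes": {"text": "to_text", "video": "end"}},
    "to_text": {"type": "task", "next": "end",
                "handler": lambda d: {**d, "channel": "text"}},
    "end":     {"type": "end"},
}

out = run_flow(flow, {"kind": "text"})
assert out["clean"] and out["channel"] == "text"
```

Deploying a new distribution flow then amounts to editing the flow definition rather than changing code, which is what makes the data nodes modular and the arrangement automatic.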
An information self-adaptive distribution strategy and flow automatic arrangement system of message middleware can be used for message distribution management of the message middleware.
Although the present invention has been described with reference to the foregoing embodiments, it will be apparent to those skilled in the art that modifications may be made to the described embodiments or equivalents substituted for elements thereof; any modifications, equivalents, improvements, and changes that do not depart from the spirit and principles of the present invention are intended to fall within its scope.
Claims (6)
1. An information self-adaptive distribution strategy and flow automatic arrangement system of message middleware is characterized in that the system comprises the following steps:
step S10: constructing a topic content extraction model based on the TF-IDF, TextRank, and LDA algorithms, used to acquire the topic content of interest from a user's received historical message texts, and further performing feature screening based on the AdaBoost algorithm so as to acquire the user's topic words of interest;
step S20: constructing a triplet <user, relation, interest tag phrase> from the user and the interest topic group, and constructing a user interest knowledge graph based on the Neo4j tool;
step S30: based on the BERT model and the attention mechanism, constructing a semantic mapping relation between the user interest knowledge graph and the messages to be received, realizing self-adaptive distribution of information content to interested users;
step S40: in a scene of network change and limitation, based on the user priority and the network link state, an information modal self-adaptive distribution strategy is realized based on an immune optimization algorithm, an information distribution modal optimal combination is selected, and the utilization efficiency of a channel is improved;
step S50: designing an automatic arrangement system that meets the requirements of the data self-adaptive distribution module based on ETL and a flow engine, and constructing the information distribution flow.
2. The information self-adaptive distribution strategy and flow automatic arrangement system of message middleware according to claim 1, wherein: the specific steps of the step S10 are as follows:
step S101, the TF-IDF algorithm uses a word segmentation tool to segment the input document, stop words and low-frequency words are removed, only words with noun part of speech are reserved as the candidate word set, and the TF-IDF value of each candidate word is calculated, the TF value being:
TF(w) = m / n
wherein m is the number of times the word w appears in the text and n is the total number of words in the text; the IDF value of the candidate word, namely the inverse document frequency, is calculated by dividing the total number of documents N by the number of documents N_w containing the word and taking the logarithm of the quotient:
IDF(w) = log(N / N_w)
step S102, the TextRank algorithm uses a word segmentation tool to segment the input document, removes stop words and low-frequency words, reserves only words with noun part of speech as the candidate word set, builds the reserved words into an undirected semantic-relation graph, and calculates the TextRank value of each candidate word:
TR(V_i) = (1 − d) + d × Σ_{V_j ∈ In(V_i)} ( w_ji / Σ_{V_k ∈ Out(V_j)} w_jk ) × TR(V_j)
where d is the damping coefficient, set to 0.85, In(V_i) is the set of words pointing to word i, Out(V_i) is the set of words that word i points to, and w_ij is the weight of the edge between word nodes V_i and V_j;
step S103, the LDA algorithm uses a word segmentation tool to segment the input document, removes stop words and low-frequency words, reserves only words with noun part of speech as the candidate word set, calculates P(Word|Topic) from the probability of a word appearing in a topic, calculates P(Topic|Text) from the probability of a topic appearing in the document, and calculates the probability of a word appearing in the text from the two:
P(Word|Text) = Σ_Topic P(Word|Topic) × P(Topic|Text)
step S104, the AdaBoost algorithm combines the weak classifiers into a strong classifier, adjusting the weights of the training samples in the data set to learn several classifiers and integrating them according to a certain rule so as to improve classification performance; finally, the obtained subject terms are taken as the user's interest tag words.
3. The information self-adaptive distribution strategy and flow automatic arrangement system of message middleware according to claim 1, wherein: the specific steps of the step S20 are as follows:
step S201, extracting named entity words and related words, adding them to the knowledge graph database, and importing into Neo4j to realize visual construction of the knowledge graph, wherein the named entity words include the users and the subject words of interest to each user.
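A hypothetical sketch of step S201: turning user-to-interest-word mappings into (subject, relation, object) triples and the Cypher statements that could load them into Neo4j (the labels `User`/`Word` and relation name `INTERESTED_IN` are assumptions, not from the patent):

```python
def build_triples(users_interests):
    """users_interests: dict user -> list of interest tag words.
    Returns the triples and one Cypher MERGE statement per triple."""
    triples = [(user, "INTERESTED_IN", word)
               for user, words in users_interests.items()
               for word in words]
    cypher = [
        f"MERGE (u:User {{name: '{s}'}}) "
        f"MERGE (w:Word {{text: '{o}'}}) "
        f"MERGE (u)-[:{r}]->(w)"
        for s, r, o in triples
    ]
    return triples, cypher
```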
4. The information self-adaptive distribution strategy and flow automatic arrangement system of message middleware according to claim 1, wherein: the specific steps of the step S30 are as follows:
step S301, the information text is initially represented as I = {w_1, …, w_n}, where n is the number of words in the text, and the interest tag entity words of the knowledge graph are represented as U = {w_1, …, w_m}; the pre-trained language model BERT is used to obtain the sentence context feature vector representations I_h = {h_CLS, h_1, …, h_n} and U_h = {h_CLS, h_1, …, h_m};
step S302, based on the attention mechanism, the cosine similarity between the message text and the interest tag entity words of the knowledge graph is calculated according to the following formula:
sim(U_i, I_j) = (U_i · I_j) / (|U_i| × |I_j|)
wherein U_i represents the user interest feature representation and I_j represents the message text feature representation;
step S303, based on the semantic mapping model of user interest words and message text, a cosine similarity threshold β is set, and the user-message pairs whose similarity is greater than β are composed into the to-be-distributed list (U_i, I_j).
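Steps S302 and S303 amount to a cosine-similarity filter; a minimal sketch, with small vectors standing in for the BERT feature representations of step S301:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def to_distribute(user_vecs, msg_vecs, beta):
    """Keep (user, message) pairs whose similarity exceeds threshold beta."""
    return [(ui, mj)
            for ui, u in user_vecs.items()
            for mj, v in msg_vecs.items()
            if cosine(u, v) > beta]
```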
5. The information self-adaptive distribution strategy and flow automatic arrangement system of message middleware according to claim 4, wherein: the specific steps of the step S40 are as follows:
step S401, acquiring the user channel communication status based on the network probe, and according to the to-be-distributed list (U_i, I_j), acquiring the schema of the information to be distributed, which contains the data, and the specification of the schema;
step S402, defining the user and the data to be distributed as [user, text size, picture size, video size, voice size] = [U_i, T_i, I_i, V_i];
step S403, selecting the optimal transmit data list [U_j, T_j, I_j, V_j] based on an immune algorithm.
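The patent does not specify its immune algorithm's affinity function; as one assumed formulation, a minimal clonal-selection sketch that picks a subset of (size, value) items to send under a channel capacity:

```python
import random

def immune_select(items, capacity, generations=30, pop_size=12, seed=0):
    """Clonal-selection sketch: antibodies are 0/1 inclusion vectors over
    items; affinity rewards total value and rejects over-capacity sets."""
    rng = random.Random(seed)
    n = len(items)

    def affinity(antibody):
        size = sum(s for (s, _), keep in zip(items, antibody) if keep)
        value = sum(v for (_, v), keep in zip(items, antibody) if keep)
        return value if size <= capacity else 0.0   # infeasible -> worst

    pop = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=affinity, reverse=True)
        clones = []
        for ab in pop[:pop_size // 2]:              # clone the fittest half
            for _ in range(2):
                clone = ab[:]
                clone[rng.randrange(n)] ^= 1        # hypermutation: flip a bit
                clones.append(clone)
        pop = sorted(pop + clones, key=affinity, reverse=True)[:pop_size]
    return pop[0]
```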
6. The information self-adaptive distribution strategy and flow automatic arrangement system of message middleware according to claim 1, wherein: the specific steps of the step S50 are as follows:
step S501, the data conversion flow is processed using ETL: data extraction, cleaning and conversion, and loading;
step S502, the flow definition file is designed to include the information required for visual display of each node, including the node position, size, and shape;
step S503, the processing nodes and the circulation mode are defined based on the workflow engine;
step S504, the required flow definition file is stored and deployed into the runtime environment of the workflow engine, and the execution flow decides the subsequent data processing flow according to the identification information parsed from the currently converted data.
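The dispatch described in step S504 can be sketched as a lookup from a record's identification field to a processing node (the field name `type` and the handler shapes are assumptions for illustration):

```python
def route(record, handlers, default=None):
    """Hand a converted record to the processing node matching its
    identification field, falling back to a default handler if any."""
    handler = handlers.get(record.get("type"), default)
    if handler is None:
        raise ValueError(f"no processing node for {record.get('type')!r}")
    return handler(record)
```

A workflow engine would typically build `handlers` from the deployed flow definition file, so adding a node means editing the definition rather than the dispatch code.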
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311096337.2A CN117149457A (en) | 2023-08-29 | 2023-08-29 | Information self-adaptive distribution strategy and flow automatic arrangement system of message middleware |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117149457A true CN117149457A (en) | 2023-12-01 |
Family
ID=88886162
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117149457A (en) |
Legal Events
Date | Code | Title | Description
---|---|---|---
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||