CN112948579A - Method, device and system for processing message text information and computer equipment - Google Patents

Method, device and system for processing message text information and computer equipment

Info

Publication number
CN112948579A
Authority
CN
China
Prior art keywords
message
data
text information
clustering
cluster
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110128033.4A
Other languages
Chinese (zh)
Inventor
李镇涛
杨培浩
谢仕义
李升�
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Ocean University
Original Assignee
Guangdong Ocean University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Ocean University
Priority to CN202110128033.4A
Publication of CN112948579A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35: Clustering; Classification

Abstract

The application provides a method, a device and a system for processing message text information, and computer equipment. The message text information processing method comprises: acquiring first message data and second message data according to message text information; performing a merging and classifying operation on the first message data and the second message data to obtain multiple classes of merged word segmentation data; performing a clustering operation on each class of merged word segmentation data to obtain cluster values of a plurality of clusters; and adjusting the sorting priority of each cluster according to its cluster value. The message subjects and message details are merged and classified, and the merged word segmentation data are then clustered, which reduces the number of kinds of word segmentation data. Once the cluster values are obtained, the messages are sorted by priority, so that message text information can be ranked by importance and its content classified quickly, which improves message data processing efficiency.

Description

Method, device and system for processing message text information and computer equipment
Technical Field
The invention relates to the technical field of message text mining, and in particular to a method, a device and a system for processing message text information, computer equipment and a storage medium.
Background
In recent years, online inquiry platforms such as WeChat, Weibo and mayor's mailboxes have continued to develop, and the volume of text data reflecting social conditions and public opinion has grown accordingly. Text mining is one branch of natural language processing. Faced with an ever larger amount of information, there is an urgent need to apply artificial intelligence techniques to analyze the data in depth and to study the patterns and relations among the various kinds of information, so as to improve the management and processing efficiency of message text information. Text mining research and applied exploration on message content are also significant for the technical development of the data mining field. Effectively applying data mining techniques to the analysis of message text information is therefore an urgent data processing need.
However, conventional message processing methods still rely on manual work to categorize messages and collate hot topics, which increases the workload of processing message data and lowers the efficiency of processing message text information.
Disclosure of Invention
The invention aims to overcome the defects in the prior art and provides a method, a device, a system, computer equipment and a storage medium for processing message text information, which can improve the message data processing efficiency.
The purpose of the invention is realized by the following technical scheme:
A method for processing message text information, the method comprising:
acquiring first message data and second message data according to message text information, wherein the first message data corresponds to the message details in the message text information, and the second message data corresponds to the message subject in the message text information;
performing a merging and classifying operation on the first message data and the second message data to obtain multiple classes of merged word segmentation data;
performing a clustering operation on each class of merged word segmentation data to obtain cluster values of a plurality of clusters;
and adjusting the sorting priority of each cluster according to its cluster value.
In one embodiment, performing the clustering operation on each class of merged word segmentation data comprises: performing a preliminary clustering operation on the merged word segmentation data to obtain a plurality of first clusters; and obtaining a plurality of second clusters according to the plurality of first clusters.
In one embodiment, performing the preliminary clustering operation on the merged word segmentation data comprises: performing hierarchical clustering on the merged word segmentation data.
In one embodiment, obtaining the plurality of second clusters according to the plurality of first clusters comprises: performing secondary clustering on the plurality of first clusters to obtain the plurality of second clusters.
In one embodiment, performing the secondary clustering on the plurality of first clusters to obtain the plurality of second clusters comprises: acquiring an initial center point of each first cluster; and outputting the plurality of second clusters according to the initial center points.
In one embodiment, performing the merging and classifying operation on the first message data and the second message data to obtain multiple classes of merged word segmentation data comprises: performing a first word segmentation operation on the first message data to obtain message detail attribute data; and performing a secondary word segmentation operation on the second message data according to the message detail attribute data to obtain the merged word segmentation data.
A message text information processing device, the device comprising:
an acquisition module, configured to acquire first message data and second message data according to message text information, wherein the first message data corresponds to the message details in the message text information, and the second message data corresponds to the message subject in the message text information;
a first processing module, configured to perform a merging and classifying operation on the first message data and the second message data to obtain multiple classes of merged word segmentation data;
a second processing module, configured to perform a clustering operation on each class of merged word segmentation data to obtain cluster values of a plurality of clusters;
and a sorting module, configured to adjust the sorting priority of each cluster according to its cluster value.
A message text information processing system, comprising: a text information storage device, a reply text information processing device, and the message text information processing device of the above embodiment. A first input end of the text information storage device is used for receiving message text information; the output end of the text information storage device is connected to the input end of the message text information processing device; the output end of the message text information processing device is connected to the input end of the reply text information processing device; and the output end of the reply text information processing device is connected to a second input end of the text information storage device. The reply text information processing device is used for sending reply text information corresponding to the message text information to the text information storage device.
A computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the following steps when executing the computer program:
acquiring first message data and second message data according to message text information, wherein the first message data corresponds to the message details in the message text information, and the second message data corresponds to the message subject in the message text information;
performing a merging and classifying operation on the first message data and the second message data to obtain multiple classes of merged word segmentation data;
performing a clustering operation on each class of merged word segmentation data to obtain cluster values of a plurality of clusters;
and adjusting the sorting priority of each cluster according to its cluster value.
A computer-readable storage medium on which a computer program is stored, the computer program, when executed by a processor, carrying out the following steps:
acquiring first message data and second message data according to message text information, wherein the first message data corresponds to the message details in the message text information, and the second message data corresponds to the message subject in the message text information;
performing a merging and classifying operation on the first message data and the second message data to obtain multiple classes of merged word segmentation data;
performing a clustering operation on each class of merged word segmentation data to obtain cluster values of a plurality of clusters;
and adjusting the sorting priority of each cluster according to its cluster value.
Compared with the prior art, the invention has at least the following advantages:
The message subjects and message details are merged and classified, and the merged word segmentation data are then clustered, which reduces the number of kinds of word segmentation data and therefore the amount of data to be processed. Once the cluster values are obtained, the messages are sorted by priority, so that message text information can be ranked by importance and its content classified quickly, which improves message data processing efficiency.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained according to the drawings without inventive efforts.
Fig. 1 is a flowchart of a message text information processing method in an embodiment;
Fig. 2 is a diagram illustrating an internal structure of a computer device according to an embodiment.
Detailed Description
To facilitate an understanding of the invention, the invention will now be described more fully with reference to the accompanying drawings. Preferred embodiments of the present invention are shown in the drawings. This invention may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete.
It will be understood that when an element is referred to as being "secured to" another element, it can be directly on the other element or intervening elements may also be present. When an element is referred to as being "connected" to another element, it can be directly connected to the other element or intervening elements may also be present. The terms "vertical," "horizontal," "left," "right," and the like as used herein are for illustrative purposes only and do not represent the only embodiments.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used herein in the description of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
The invention relates to a method for processing message text information. In one embodiment, the message text information processing method includes: acquiring first message data and second message data according to message text information, wherein the first message data corresponds to the message details in the message text information, and the second message data corresponds to the message subject in the message text information; performing a merging and classifying operation on the first message data and the second message data to obtain multiple classes of merged word segmentation data; performing a clustering operation on each class of merged word segmentation data to obtain cluster values of a plurality of clusters; and adjusting the sorting priority of each cluster according to its cluster value. The message subjects and message details are merged and classified, and the merged word segmentation data are then clustered, which reduces the number of kinds of word segmentation data and therefore the amount of data to be processed. Once the cluster values are obtained, the messages are sorted by priority, so that message text information can be ranked by importance and its content classified quickly, which improves message data processing efficiency.
Please refer to Fig. 1, which is a flowchart of a method for processing message text information according to an embodiment of the present invention. The method for processing message text information comprises some or all of the following steps.
S100: acquiring first message data and second message data according to message text information, wherein the first message data corresponds to the message details in the message text information, and the second message data corresponds to the message subject in the message text information.
In this embodiment, the message text information is the text data corresponding to the content of a message. It includes the message details and the message subject, and the message subject corresponds to the message details. Within the message text under each message subject, the message details describe the content of the message subject in detail; that is, the message details of each message text correspond one to one with the message subjects. This binds each message subject to its message details, so that the corresponding message details can be retrieved for a given message subject, and the message subject and message details in the message text information remain independent of each other yet correlated. In another embodiment, in order to bind the message details and the message subject further, the message text information also includes a message time, a message number and a message user identity code. The message user identity code is the identity credential of the user who left the message and is used to distinguish the message from other message text information. The message time corresponds to the message details and the message subject; it fixes the time at which the message text information was generated and can also serve as a condition for distinguishing different message text information. The message numbers correspond one to one with the message subjects, so that each message subject has a unique message number. This makes it possible to distinguish message subjects with the same content, and to distinguish message details that share the same message subject.
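The record structure described above can be sketched as follows. This is only an illustrative sketch: the class name `Message`, its field names and the helper `split_message_data` are hypothetical and do not come from the application; the sample messages are invented for the example.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Message:
    """One message record; all field names are hypothetical."""
    number: int      # message number, unique per message subject
    user_id: str     # message user identity code
    time: datetime   # message time
    subject: str     # message subject
    details: str     # message details


def split_message_data(messages):
    """Return (first_data, second_data), both keyed by message number.

    first_data holds the message details and second_data the message
    subjects, so the two stay independent yet linked by the shared number.
    """
    first_data = {m.number: m.details for m in messages}
    second_data = {m.number: m.subject for m in messages}
    return first_data, second_data


messages = [
    Message(1, "U001", datetime(2020, 1, 2, 9, 30), "Street lamp broken",
            "The street lamp on Renmin Road has been broken for a week."),
    Message(2, "U002", datetime(2020, 1, 3, 14, 0), "Street lamp broken",
            "No lamp works near the school gate at night."),
]
first_data, second_data = split_message_data(messages)
```

In this sketch two messages share the same subject text but carry different message numbers, which is how identical subjects remain distinguishable.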
S200: performing a merging and classifying operation on the first message data and the second message data to obtain multiple classes of merged word segmentation data.
In this embodiment, the first message data and the second message data each contain a plurality of message words; that is, the first message data corresponds to the message words in the message details, and the second message data corresponds to the message words in the message subject. Because the message subject describes the content of the message details, the message words in the message subject may all appear in the message details, so the message words of the first message data partially overlap with those of the second message data, and the two sets of data are partly identical. To avoid missing important information in the message text information, the first message data and the second message data are merged, that is, they are segmented together: the jieba word segmenter is used to segment the message words in the first message data and the second message data and to remove stop words. This reduces the number of stop words in the message text information while retaining the main information, reduces the amount of data to be analyzed for the message subject and message details, and improves the processing efficiency of the message text information. The merging and classifying operation also classifies the data produced by the jieba segmenter from the first message data and the second message data. For example, TF-IDF-based SVM classification is performed on that data; the classification model used in the merging and classifying operation is an SVM (Support Vector Machine) model based on TF-IDF (Term Frequency-Inverse Document Frequency).
The message texts are classified with a support vector machine on the basis of TF-IDF. During text classification, the SVM model treats each input training text as a point in a geometric space, learns from the training samples to construct a hyperplane that separates samples of different classes, and assigns a test sample to a class according to which side of the hyperplane it falls on. The SVM model classifies small-scale data well and, thanks to kernel functions, is also suitable for certain high-dimensional scenarios.
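As a minimal sketch of the TF-IDF weighting that feeds the classifier, the following pure-Python example computes weights for already segmented texts. The particular formula (raw term frequency times an unsmoothed logarithmic inverse document frequency) is an assumption, since the application does not state which TF-IDF variant it uses, and the SVM training step itself is not shown. The function name `tf_idf` and the sample tokens are hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Compute TF-IDF weights for a list of tokenized documents.

    tf(t, d) = count of t in d / number of tokens in d
    idf(t)   = log(N / df(t)), df(t) = number of documents containing t
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each term once per document

    weights = []
    for tokens in documents:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


docs = [
    ["road", "repair", "road"],
    ["school", "fee", "road"],
    ["school", "teacher"],
]
weights = tf_idf(docs)
```

In the first document, "repair" receives a higher weight than "road" even though "road" occurs twice, because "road" also appears in another document and is therefore less distinctive.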
S300: performing a clustering operation on each class of merged word segmentation data to obtain cluster values of a plurality of clusters.
In this embodiment, the first message data and the second message data are divided into multiple classes of merged word segmentation data. Each class contains a plurality of word segmentation data items, and each item is formed by merging and classifying message words in the message text information, so that message words of the same class fall into the same category. This classifies the first message data and the second message data by word segmentation attribute, that is, the classification is performed with the TF-IDF-based SVM classification model, and each message word is classified according to its own attributes. For example, the classes of the TF-IDF-based SVM classification model are first-level categories, and the 15 first-level categories are labeled "urban and rural construction": 1, "environmental protection": 2, "transportation": 3, "education, culture and sports": 4, "labor and social security": 5, "commerce and tourism": 6, "health and family planning": 7, "party and government affairs": 8, "land and resources": 9, "discipline inspection and supervision": 10, "economic management": 11, "science, technology and information industry": 12, "civil affairs": 13, "agriculture and rural affairs": 14, "politics and law": 15. The merged word segmentation data are assigned to categories according to these first-level categories.
In this way, when the clustering operation is performed on each class of merged word segmentation data, the data within each class are divided a second time: data with similar attributes within the class are clustered together. In other words, attribute-based cluster analysis is performed on the word segmentation data produced by the merging and classifying operation, which reduces the data volume of the message text information a second time. The clusters within each class can then be sorted, the amount of data to be arranged afterwards is reduced, and message data processing efficiency is improved.
S400: adjusting the sorting priority of each cluster according to its cluster value.
In this embodiment, the cluster value is the number of word segmentation data items in each cluster, which shows how many items with the same or similar data attributes the cluster contains. Because the clusters are formed by clustering the merged word segmentation data, each cluster contains at least one word segmentation data item, and the items in a cluster have the same or similar data attributes. The cluster value therefore indicates how many word segmentation items with the same or similar data attributes are in the current cluster, which reflects how often such words occur in the message text information. This makes it convenient to sort the priority of each cluster within each class of merged word segmentation data, for example to sort the clusters in each class by popularity. In another embodiment, the cluster value is the sum of the numbers of likes and dislikes of the messages corresponding to the word segmentation data in the cluster. For example, the clusters in each class of merged word segmentation data are sorted by this sum, and the 5 clusters with the largest values are selected as hot-spot issues.
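The two kinds of cluster value described above can be sketched as a single ranking function. This is an illustrative sketch only; the function name `rank_clusters`, the dictionary keys `likes` and `dislikes`, and the sample data are hypothetical.

```python
def rank_clusters(clusters, top_n=5, use_votes=False):
    """Sort clusters by cluster value, largest first, and keep the top_n.

    clusters maps a cluster id to a list of items; each item is a dict with
    the hypothetical keys "likes" and "dislikes".
    """
    def cluster_value(items):
        if use_votes:
            # Alternative embodiment: sum of likes and dislikes.
            return sum(item["likes"] + item["dislikes"] for item in items)
        # Default embodiment: number of items in the cluster.
        return len(items)

    ranked = sorted(clusters.items(),
                    key=lambda kv: cluster_value(kv[1]),
                    reverse=True)
    return [(cid, cluster_value(items)) for cid, items in ranked[:top_n]]


clusters = {
    "noise": [{"likes": 3, "dislikes": 0}, {"likes": 1, "dislikes": 1}],
    "traffic": [{"likes": 40, "dislikes": 2}],
    "water": [{"likes": 0, "dislikes": 0}] * 3,
}
by_size = rank_clusters(clusters)
by_votes = rank_clusters(clusters, use_votes=True)
```

With this sample data the two cluster values give opposite orderings: "water" ranks first by item count, while "traffic" ranks first by likes plus dislikes.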
In the above method for processing message text information, the message subjects and message details are merged and classified, and the merged word segmentation data are then clustered, which reduces the number of kinds of word segmentation data and therefore the amount of data to be processed. Once the cluster values are obtained, the messages are sorted by priority, so that message text information can be ranked by importance and its content classified quickly, which improves message data processing efficiency.
In one embodiment, performing the clustering operation on each class of merged word segmentation data includes: performing a preliminary clustering operation on the merged word segmentation data to obtain a plurality of first clusters; and obtaining a plurality of second clusters according to the plurality of first clusters. In this embodiment, the preliminary clustering operation clusters the merged word segmentation data once, producing the corresponding clusters, namely the first clusters. Clustering is then performed according to the centers of the first clusters, so that the clustering that yields the second clusters is based on the first clusters. Because the first clusters already provide cluster center points, the clustering quality of the second clusters is effectively improved.
Further, performing the preliminary clustering operation on the merged word segmentation data includes: performing hierarchical clustering on the merged word segmentation data. In this embodiment, before the second clusters are obtained, hierarchical clustering is performed on the merged word segmentation data. Hierarchical clustering builds a nested, hierarchical cluster tree by computing the similarity between data points of different classes. In the cluster tree, the original data points of the different classes form the lowest layer, and the top layer is the root node of a single cluster; that is, a bottom-up clustering strategy is adopted. Specifically, each sample in the data set is treated as an initial cluster, the two closest clusters are found and merged, and this step is repeated until a preset number of clusters or some other condition is reached. The distance between two clusters is calculated as the average distance. After the hierarchical clustering, the merged word segmentation data therefore form a plurality of first clusters, and these first clusters provide a plurality of center points that serve as centroids for obtaining the second clusters. In other words, the hierarchical clustering supplies the cluster center points for the second clusters, so that the center points of the first clusters can be used as the input points for forming the second clusters, which improves the global convergence while ensuring that the second clusters converge.
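The bottom-up strategy with average distance can be sketched in pure Python as below. This is a minimal sketch under stated assumptions: the points are small numeric tuples standing in for vectorized word segmentation data, Euclidean distance is assumed as the point-to-point distance, and the function names are hypothetical.

```python
import math


def average_linkage(cluster_a, cluster_b):
    """Average Euclidean distance over all cross-cluster point pairs."""
    total = sum(math.dist(p, q) for p in cluster_a for q in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))


def agglomerative(points, n_clusters):
    """Bottom-up clustering: start from singletons, merge the closest pair."""
    if n_clusters < 1:
        raise ValueError("n_clusters must be at least 1")
    clusters = [[p] for p in points]  # every sample starts as its own cluster
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = average_linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge the two closest clusters
        del clusters[j]
    return clusters


def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    return tuple(sum(axis) / len(cluster) for axis in zip(*cluster))


points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
first_clusters = agglomerative(points, n_clusters=3)
centers = [centroid(c) for c in first_clusters]
```

The list `centers` plays the role of the center points that the first clusters hand over to the second clustering stage.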
Further, obtaining the plurality of second clusters according to the plurality of first clusters includes: performing secondary clustering on the plurality of first clusters to obtain the plurality of second clusters. In this embodiment, the secondary clustering clusters the first clusters, that is, it clusters the merged word segmentation data again on the basis of the hierarchical clustering, which improves the convergence of the merged word segmentation data. Moreover, after the two different clustering passes, the number of clusters formed from the merged word segmentation data is reduced, which reduces the number of clusters whose sorting priority must be determined and further improves the clustering quality of the second clusters.
Further, performing the secondary clustering on the plurality of first clusters to obtain the plurality of second clusters includes: acquiring an initial center point of each first cluster; and outputting the plurality of second clusters according to the initial center points. In this embodiment, the secondary clustering is K-Means clustering, which proceeds as follows: (1) randomly determine k initial points as centroids; (2) for each point in the data set, find the closest centroid and assign the point to the cluster of that centroid; (3) update the centroid of each cluster to the mean of all points in the cluster; (4) repeat steps (2) and (3) until the cluster assignment no longer changes. After the merged word segmentation data have undergone hierarchical clustering, the initial center points of the first clusters are used as the input center points of the K-Means clustering, and the number of first clusters formed by the hierarchical clustering equals the K value of the K-Means clustering. That is, K-Means is trained several times with randomly selected center points to determine a suitable K value, so that the best-performing clustering result can be selected.
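One reading of the paragraph above is that step (1) is replaced by the center points of the first clusters, while steps (2) to (4) are unchanged. The sketch below follows that reading; it is an assumption rather than the application's stated implementation. The function name `k_means`, the sample points and the seed centers are hypothetical, and Euclidean distance is assumed.

```python
import math


def k_means(points, initial_centers, max_iter=100):
    """K-Means whose initial centroids are supplied by the caller."""
    centers = [tuple(c) for c in initial_centers]
    assignment = None
    for _ in range(max_iter):
        # Step (2): assign every point to its nearest centroid.
        new_assignment = [
            min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
            for p in points
        ]
        # Step (4): stop once the assignment no longer changes.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step (3): move each centroid to the mean of its assigned points.
        for k in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:  # an empty cluster keeps its previous centroid
                centers[k] = tuple(sum(axis) / len(members)
                                   for axis in zip(*members))
    return assignment, centers


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
# Hypothetical seeds standing in for the center points of the first clusters.
seed_centers = [(0.0, 0.5), (8.5, 8.0)]
labels, centers = k_means(points, seed_centers)
```

Because the seeds already lie close to the true groups, the assignment stabilizes after the first update, which illustrates why good initial center points help convergence.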
In one embodiment, performing the merging and classifying operation on the first message data and the second message data to obtain multiple classes of merged word segmentation data includes: performing a first word segmentation operation on the first message data to obtain message detail attribute data; and performing a secondary word segmentation operation on the second message data according to the message detail attribute data to obtain the merged word segmentation data. In this embodiment, the first message data corresponds to the message details and the second message data corresponds to the message subject. If the second message data is segmented after it has been classified, the classified message subject data can be lost: when the first message data and the second message data are segmented together, word segmentation data that are repeated between them are deleted so that only one copy remains, and the word segments belonging to the message subject may be the ones lost. To avoid this, the message detail attribute data of the first message data is acquired first, and the secondary word segmentation operation is then performed on the second message data on the basis of the message detail attribute data, so that the repeated word segments produced by segmenting the second message data are retained. This ensures that the merged word segmentation data still contain the words that express the message subject, improves the integrity of the word segmentation data of the message subject, and reduces its loss rate.
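The two-pass segmentation can be sketched as follows. This is only one possible interpretation: the regular-expression tokenizer is a stand-in for the jieba segmenter named in the description, and the stop-word list, the function names and the sample text are all hypothetical.

```python
import re

# Hypothetical stop-word list; a real one would be far larger.
STOP_WORDS = {"the", "a", "is", "on", "of", "has", "been", "for"}


def segment(text):
    """Stand-in tokenizer: lowercase word split plus stop-word removal."""
    return [w for w in re.findall(r"\w+", text.lower())
            if w not in STOP_WORDS]


def merge_segment(details, subject):
    """Segment the details first, then the subject, keeping every subject word."""
    # First pass: detail tokens, de-duplicated with order preserved.
    detail_tokens = list(dict.fromkeys(segment(details)))
    # Second pass: subject tokens.
    subject_tokens = segment(subject)
    # Subject words are always retained; a detail word is dropped only when
    # the subject already covers it, so de-duplication never removes a
    # subject word.
    subject_set = set(subject_tokens)
    extra = [t for t in detail_tokens if t not in subject_set]
    return subject_tokens + extra


merged = merge_segment(
    "The street lamp on Renmin Road has been broken for a week",
    "Street lamp broken",
)
```

In this sketch the duplicates are resolved in favour of the subject, so the words that express the message subject always survive in the merged result.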
The message text information is a summary of message information, and includes a message number, a message user identity code, message time, message topics and message details, where the message topics correspond to the message details one by one, each message topic further corresponds to a message number, a message user identity code and a message time, the message number is used to store the message text information, the message user identity code is used to display identity information of an uploader of the message text information, and the message time is used to display upload submission time of the message text information. When the sorting priority of each cluster is determined, the sorting priority is determined according to the corresponding cluster value of each cluster, namely, the determination is based on the number of one cluster of each type of merged participle data. However, when a message-leaving person swipes a message maliciously, a plurality of messages with the same message subject and message details appear, so that the heat of the messages rises rapidly, and the messages actually have invalid messages, so that the heat of the messages is prompted by mistake, and the display accuracy of the message problem is reduced.
To improve the accuracy with which message issues are displayed, merging and classifying the first message data and the second message data to obtain multi-class merged word segmentation data further comprises the following steps:
performing a preliminary word segmentation operation on each piece of first message data to obtain corresponding first word segmentation data;
detecting whether any two pieces of first word segmentation data are the same;
when two pieces of first word segmentation data are the same, acquiring the message user identity codes corresponding to the two identical pieces of first word segmentation data;
detecting whether the two message user identity codes are the same;
and when the two message user identity codes are the same, removing the message information corresponding to one of the message user identity codes.
In this embodiment, the first message data is the specific content, that is, the message details, of the message information uploaded by each user. The preliminary word segmentation operation extracts the important information from the first message data, which makes it easy to judge how similar the contents of two pieces of first message data are. Two identical pieces of first word segmentation data indicate that the same content appears in more than one set of message details, and message information with identical details needs to be reduced so that repeated messages from some users do not lower processing efficiency. Therefore, when two pieces of first word segmentation data are identical, the message user identity codes corresponding to them are obtained in order to determine who left the identical details. If the two identity codes are the same, the identical details were uploaded by the same user, that is, the same user has left the message repeatedly, and one of the two messages is removed. The removed message information includes the message topic and the message user identity code that correspond to those identical message details.
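The same-user check described above can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the `tokenize` helper is a hypothetical stand-in for the preliminary word segmentation (a lowercase whitespace split), the dictionary field names are assumptions, and keeping the first of two repeated messages is also an assumption, since the text only says that one of them is removed.

```python
from typing import Dict, List, Tuple


def tokenize(text: str) -> Tuple[str, ...]:
    """Hypothetical stand-in for the preliminary word segmentation."""
    return tuple(text.lower().split())


def drop_same_user_repeats(messages: List[Dict]) -> List[Dict]:
    """Keep one message per (user identity code, segmented details) pair."""
    seen = set()
    kept = []
    for msg in messages:
        key = (msg["user_id"], tokenize(msg["details"]))
        if key in seen:
            continue  # same user, same details: treated as a repeated message
        seen.add(key)
        kept.append(msg)
    return kept


messages = [
    {"id": 1, "user_id": "u1", "topic": "Road repair", "details": "The road is broken"},
    {"id": 2, "user_id": "u1", "topic": "Road repair", "details": "the road is broken"},
    {"id": 3, "user_id": "u2", "topic": "Road repair", "details": "The road is broken"},
]
deduped = drop_same_user_repeats(messages)
kept_ids = [m["id"] for m in deduped]
```

Message 3 survives this pass because it comes from a different identity code; identical details under different identity codes are handled by a separate check.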
Further, to avoid an increase in the amount of message information to be processed when the same uploader leaves repeated messages under several message user identity codes, the step of detecting whether the two message user identity codes are the same is followed by the following steps:
when the two message user identity codes are different, performing a preliminary word segmentation operation on the second message data corresponding to the two message user identity codes to obtain two pieces of second word segmentation data;
detecting whether the matching degree of the two pieces of second word segmentation data is greater than or equal to a preset matching value;
and when the matching degree of the two pieces of second word segmentation data is greater than or equal to the preset matching value, removing the message information with the longer message time.
In this embodiment, two pieces of first message data are identical, that is, the two sets of message details are the same, but the corresponding message user identity codes differ, so the identical details appear to have been uploaded by two different users. A message user identity code is only a virtual account, however, and the same uploader can leave messages under different virtual accounts. This again produces repeated messages and increases the amount of message information to be processed. Whether the matching degree of the two pieces of second word segmentation data is greater than or equal to a preset matching value is therefore detected, to determine whether messages uploaded under different identity codes are repeats. Because the second message data is the message topic, the segments produced by segmenting it are more distinctive and fewer in number, so analysing the second message data allows the matching degree between the two pieces of second word segmentation data to be determined quickly. A matching degree greater than or equal to the preset matching value indicates that the message topics corresponding to the two identical sets of message details are highly similar, that is, they share many segments. With identical details, the topics of the two users can then be treated as the same, and the two users are considered to have left a repeated message. Removing the message information with the longer message time deletes the message of the user who uploaded later among the users who left the repeated message, which reduces the amount of data passed to the subsequent merging and classifying operation and clustering operation and further improves the efficiency of processing the message text information.
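A rough sketch of the cross-account check follows. Several things here are assumptions rather than details from the text: the matching degree is computed as Jaccard overlap of topic tokens (the text does not name a measure), the preset matching value of 0.6 is arbitrary, the whitespace `segment` helper is a hypothetical stand-in for word segmentation, and "longer message time" is read as "uploaded later", following the explanation above.

```python
from typing import Dict, List


def segment(text: str) -> List[str]:
    """Hypothetical stand-in for the preliminary word segmentation."""
    return text.lower().split()


def matching_degree(topic_a: str, topic_b: str) -> float:
    """Assumed measure: Jaccard overlap of the two topics' segments."""
    a, b = set(segment(topic_a)), set(segment(topic_b))
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def drop_cross_account_repeats(messages: List[Dict], threshold: float = 0.6) -> List[Dict]:
    """Process messages oldest first; drop a later message whose details equal
    those of a kept message from another identity code with a matching topic."""
    kept: List[Dict] = []
    for msg in sorted(messages, key=lambda m: m["time"]):
        is_repeat = any(
            segment(msg["details"]) == segment(k["details"])
            and msg["user_id"] != k["user_id"]
            and matching_degree(msg["topic"], k["topic"]) >= threshold
            for k in kept
        )
        if not is_repeat:
            kept.append(msg)
    return kept


messages = [
    {"id": 1, "user_id": "u1", "topic": "road repair request",
     "details": "the road is broken", "time": "2021-01-01T10:00"},
    {"id": 2, "user_id": "u2", "topic": "request road repair",
     "details": "the road is broken", "time": "2021-01-01T11:00"},
    {"id": 3, "user_id": "u3", "topic": "street lamp",
     "details": "the road is broken", "time": "2021-01-01T12:00"},
]
remaining_ids = [m["id"] for m in drop_cross_account_repeats(messages)]
```

Message 2 is dropped because its topic shares every segment with message 1, while message 3 is kept because its topic does not reach the threshold even though the details are identical.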
It should be understood that, although the steps in the flowchart of fig. 1 are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise herein, the steps are not strictly limited to the order shown and may be performed in other orders. Moreover, at least some of the steps in fig. 1 may include multiple sub-steps or stages, which are not necessarily performed at the same moment but may be performed at different times, and which are not necessarily performed in sequence but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
The present application further provides a message text information processing apparatus, which is implemented by using the message text information processing method described in any of the above embodiments. In one embodiment, the message text information processing device has a functional module corresponding to each step for implementing the message text information processing method. The message text information processing device comprises an acquisition module, a first processing module, a second processing module and a sequencing module, wherein:
the acquisition module is used for acquiring first message data and second message data according to message text information, wherein the first message data corresponds to the message details in the message text information, and the second message data corresponds to the message topic in the message text information;
the first processing module is used for carrying out merging and classifying operation on the first message leaving data and the second message leaving data to obtain multi-class merging word segmentation data;
the second processing module is used for respectively carrying out clustering operation on each type of merged participle data to obtain clustering values of a plurality of clustering clusters;
and the sequencing module is used for adjusting the sequencing priority of each cluster according to the cluster value.
In the message text information processing device, the first processing module combines and classifies the message subjects and message details, and then the second processing module performs clustering operation on the combined word segmentation data, so that the types of the word segmentation data are reduced, namely the number of data processing is reduced.
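As a small illustration of the last stage, the sketch below ranks clusters by their cluster value. It assumes, following the description of the method, that the cluster value is simply the number of messages assigned to a cluster; the cluster labels are invented sample data, and breaking ties alphabetically is an arbitrary choice.

```python
from collections import Counter

# Hypothetical cluster label assigned to each message by the clustering step.
labels = ["water", "road", "water", "noise", "water", "road"]

# Cluster value: number of messages in each cluster (assumed definition).
cluster_values = Counter(labels)

# Larger clusters get higher sorting priority; ties are broken alphabetically.
ranking = [
    label
    for label, _count in sorted(cluster_values.items(), key=lambda kv: (-kv[1], kv[0]))
]
```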
In one embodiment, the second processing module is further configured to perform a preliminary clustering operation on the merged segmented word data to obtain a plurality of first clustering clusters; and acquiring a plurality of second cluster clusters according to the plurality of first cluster clusters. In this embodiment, the second processing module performs a primary clustering on the merged segmented word data to generate a corresponding cluster, i.e., the first cluster, so that the merged segmented word data is clustered according to the center of the first cluster, thereby the clustering operation of the second cluster is based on the first cluster, and the clustering effect of the second cluster is effectively improved on the premise that the first cluster has a cluster center point.
Further, the second processing module is further configured to perform hierarchical clustering on the merged participle data. In this embodiment, before obtaining the second clusters, the second processing module performs hierarchical clustering on the merged participle data. Hierarchical clustering builds a nested cluster tree by calculating the similarity between data points of different classes. In the cluster tree, the original data points are the lowest layer and the root node of a cluster is the top layer, that is, a bottom-up (agglomerative) strategy is adopted. Each sample in the data set is first treated as its own cluster; the two closest clusters are then found and merged, and this step is repeated until a preset number of clusters or some other stopping condition is reached. The distance between two clusters is calculated as the average distance. After hierarchical clustering, the second processing module has formed a plurality of first clusters from the merged participle data. The center points of the first clusters serve as centroids for obtaining the second clusters, that is, they are used as the input points for forming the second clusters, which improves global convergence while ensuring that the second clustering converges.
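The bottom-up procedure with average distance can be sketched in plain Python as below. This is only an illustrative sketch: the points stand in for vectorised merged participle data (how the text is vectorised is not described here and is left out), Euclidean distance is an assumption, and the function names are hypothetical.

```python
from itertools import combinations
from math import dist
from typing import List, Tuple

Point = Tuple[float, ...]


def average_distance(a: List[Point], b: List[Point]) -> float:
    """Average pairwise Euclidean distance between two clusters."""
    return sum(dist(p, q) for p in a for q in b) / (len(a) * len(b))


def agglomerate(points: List[Point], n_clusters: int) -> List[List[Point]]:
    """Start with one cluster per sample and repeatedly merge the two
    closest clusters until n_clusters remain."""
    if n_clusters < 1:
        raise ValueError("n_clusters must be at least 1")
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: average_distance(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]  # i < j, so deleting j is safe
        del clusters[j]
    return clusters


def centroid(cluster: List[Point]) -> Point:
    """Center point of a cluster: coordinate-wise mean."""
    return tuple(sum(c) / len(cluster) for c in zip(*cluster))


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
first_clusters = agglomerate(points, 2)
centers = [centroid(c) for c in first_clusters]
```

The resulting `centers` are the kind of center points that the text describes as input for forming the second clusters.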
Further, the second processing module is further configured to perform secondary clustering on the plurality of first clustering clusters to obtain a plurality of second clustering clusters. In this embodiment, the secondary clustering is to cluster the first clustering cluster, that is, the secondary clustering is to cluster the merged participle data again, that is, the secondary clustering is based on the hierarchical clustering, so that the convergence effect of the merged participle data is improved. Moreover, after the merged participle data are clustered by the second processing module, the number of cluster clusters formed by the merged participle data is reduced, the number of sequencing priorities of each cluster is reduced, and the clustering effect of the second cluster is further improved.
Furthermore, the second processing module is further configured to obtain an initial center point of each first cluster, and to output a plurality of second clusters according to the initial center points. In this embodiment, the secondary clustering is K-Means clustering, which proceeds as follows: (1) randomly determine k initial points as centroids; (2) for each point in the data set, find the closest centroid and assign the point to the cluster corresponding to that centroid; (3) update the centroid of each cluster to the mean of all points in the cluster; (4) repeat steps (2) and (3) until the cluster assignments no longer change. After the merged participle data has been hierarchically clustered, the initial center points of the first clusters are used as the center-point input of the K-Means clustering. The second processing module also trains K-Means several times with randomly selected center points to determine a suitable K value, so that it can select the clustering result with the best effect.
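Steps (2) to (4) can be sketched as follows, with step (1) replaced by center points passed in from outside, as the text describes for the centers of the first clusters. The sample points, Euclidean distance and the function name `kmeans` are assumptions for illustration; the repeated runs used to choose K are not shown.

```python
from math import dist
from typing import List, Tuple

Point = Tuple[float, ...]


def mean(points: List[Point]) -> Point:
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(c) / len(points) for c in zip(*points))


def kmeans(points: List[Point], init_centers: List[Point],
           max_iter: int = 100) -> Tuple[List[int], List[Point]]:
    """K-Means seeded with given center points instead of random ones."""
    centers = list(init_centers)
    labels: List[int] = []
    for _ in range(max_iter):
        # Step (2): assign each point to its closest centroid.
        new_labels = [
            min(range(len(centers)), key=lambda k: dist(p, centers[k]))
            for p in points
        ]
        # Step (4): stop once the assignment no longer changes.
        if new_labels == labels:
            break
        labels = new_labels
        # Step (3): move each centroid to the mean of its members.
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # leave an empty cluster's centroid where it is
                centers[k] = mean(members)
    return labels, centers


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0), (1.0, 0.0)]
seed_centers = [(0.0, 0.5), (10.0, 10.5)]  # e.g. centers of the first clusters
labels, final_centers = kmeans(points, seed_centers)
```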
In one embodiment, the first processing module is further configured to perform a first word segmentation operation on the first message data to obtain message detail attribute data, and to perform a second word segmentation operation on the second message data according to the message detail attribute data to obtain merged word segmentation data. In this embodiment, the first message data corresponds to the message details and the second message data corresponds to the message topic. If the first processing module simply segmented the second message data on its own, part of the classified message topic data could be lost: when the first message data and the second message data are both segmented, segments that occur in both are de-duplicated so that only one copy is retained, and segments belonging to the message topic can disappear as a result. To avoid this, the first processing module obtains the message detail attribute data of the first message data and performs the second word segmentation operation on the second message data on the basis of that attribute data. The repeated segments produced by segmenting the second message data are thereby retained, the segments that express the message topic remain present in the merged word segmentation data, the integrity of the topic segmentation data is improved, and its loss rate is reduced.
The present application further provides a message text information processing system, including: a text message storage device, a reply text message processing device and the message text message processing device of any of the above embodiments; the first input end of the text information storage device is used for receiving message text information, the output end of the text information storage device is connected with the input end of the message text information processing device, the output end of the message text information processing device is connected with the input end of the reply text information processing device, the output end of the reply text information processing device is connected with the second input end of the text information storage device, and the reply text information processing device is used for sending reply text information corresponding to the message text information to the text information storage device.
In this embodiment, the first processing module in the message text information processing device merges and classifies the message topics and the message details, and then the second processing module in the message text information processing device performs clustering operation on the merged participle data, so that the types of the participle data are reduced, that is, the number of data processing is reduced.
In one embodiment, data is transmitted among the text information storage device, the reply text information processing device and the message text information processing device through communication devices. For example, the message text information processing system comprises the text information storage device, the reply text information processing device, the message text information processing device, a first communication device, a second communication device and a third communication device. The text information storage device holds the message information of a plurality of users, such as message numbers, message users, message topics, message details, message times, reply messages and reply times, together with the corresponding reply information. A user seeks help through the text information storage device by filling in a message topic and message details and posting the message. The text information storage device provides information, including each user's message topic, message details and message time, to the message text information processing device through the first communication device. The message text information processing device comprises a data processing module, a model identification module and a heat sorting module, and exchanges information with the reply text information processing device through the second communication device. The reply text information processing device comprises a plurality of reply modules; it collects, from the second communication device, the message information that has been classified by the natural language processing system and sorted in descending order of heat, and delivers it to the corresponding reply module. The reply text information processing device then transmits the reply opinion and the reply time to the corresponding position in the text information storage device through the third communication device. In short, the message text information processing system obtains data from users' message information, processes it into labelled data sorted in descending order of heat, and delivers the messages to the reply modules according to their labels; each reply module replies after receiving the information. The first communication device is a database, and the message text information processing device interacts with it by querying the content of the message table; the second communication device is a database, and the reply text information processing device interacts with it by querying the model and the heat; the third communication device is a database, and the reply text information processing device submits the reply opinions and reply times to its database table.
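The data flow through the three database-backed communication devices can be sketched with an in-memory SQLite database. Everything concrete here is an assumption for illustration: the table and column names, the sample rows, the use of a single database for all three communication devices, and the placeholder processing step that uses the message topic as the label and the message count as the heat in place of the real classification and clustering.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE message (id INTEGER PRIMARY KEY, user_id TEXT, topic TEXT, "
    "details TEXT, msg_time TEXT, reply TEXT, reply_time TEXT)"
)
conn.executemany(
    "INSERT INTO message (user_id, topic, details, msg_time) VALUES (?, ?, ?, ?)",
    [
        ("u1", "water", "No water since Monday", "2021-01-01"),
        ("u2", "road", "Pothole on Main Street", "2021-01-01"),
        ("u3", "water", "Low water pressure", "2021-01-02"),
    ],
)

# First communication device: the processing device queries the message table.
rows = conn.execute("SELECT id, topic FROM message").fetchall()

# Placeholder for the processing device: label = topic, heat = message count.
heat = {}
for _msg_id, topic in rows:
    heat[topic] = heat.get(topic, 0) + 1
conn.execute("CREATE TABLE ranked (label TEXT PRIMARY KEY, heat INTEGER)")
conn.executemany("INSERT INTO ranked VALUES (?, ?)", list(heat.items()))

# Second communication device: the reply device reads labels by descending heat.
queue = [
    label
    for (label,) in conn.execute("SELECT label FROM ranked ORDER BY heat DESC, label")
]

# Third communication device: reply opinion and reply time are written back.
for label in queue:
    conn.execute(
        "UPDATE message SET reply = ?, reply_time = ? WHERE topic = ?",
        (f"Handled by the {label} reply module", "2021-01-03", label),
    )

unanswered = conn.execute(
    "SELECT COUNT(*) FROM message WHERE reply IS NULL"
).fetchone()[0]
```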
For specific limitations of the message text information processing device, reference may be made to the limitations of the message text information processing method above, which are not repeated here. Each module in the message text information processing device may be implemented wholly or partly in software, hardware or a combination thereof. The modules may be embedded in, or independent of, a processor of the computer device in hardware form, or may be stored in a memory of the computer device in software form, so that the processor can call them and execute the operations corresponding to each module.
In one embodiment, a computer device is provided, which may be a server, the internal structure of which may be as shown in fig. 2. The computer device includes a processor, a memory, and a network interface connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer equipment is used for storing data such as message subject, message details, message number, message user identity code, message time, reply opinion, reply time and the like in the message text information. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a message text information processing method.
Those skilled in the art will appreciate that the architecture shown in fig. 2 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
In one embodiment, the present application further provides a computer device, which includes a memory and a processor, where the memory stores a computer program, and the processor implements the steps in the above method embodiments when executing the computer program.
In one embodiment, the present application further provides a computer-readable storage medium on which a computer program is stored, which, when executed by a processor, implements the steps in the above-described method embodiments.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the method embodiments described above. Any reference to memory, storage, database or other medium used in the embodiments provided herein can include at least one of non-volatile and volatile memory. Non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical storage, or the like. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM), among others.
The above embodiments express only several implementations of the present invention. Their description is relatively specific and detailed, but it should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the inventive concept, all of which fall within the scope of protection of the present invention. Therefore, the scope of protection of this patent shall be subject to the appended claims.

Claims (10)

1. A method for processing message text information is characterized by comprising the following steps:
acquiring first message leaving data and second message leaving data according to the message leaving text information, wherein the first message leaving data corresponds to message leaving details in the message leaving text information, and the second message leaving data corresponds to message leaving subjects in the message leaving text information;
merging and classifying the first message leaving data and the second message leaving data to obtain multi-class merged word segmentation data;
clustering operation is respectively carried out on each type of merged participle data to obtain clustering values of a plurality of clustering clusters;
and adjusting the sequencing priority of each cluster according to the cluster value.
2. The method for processing the message text information according to claim 1, wherein the clustering operation is performed on each type of merged participle data, respectively, and comprises:
performing primary clustering operation on the merged participle data to obtain a plurality of first clustering clusters;
and acquiring a plurality of second cluster clusters according to the plurality of first cluster clusters.
3. The method according to claim 2, wherein the performing of the primary clustering operation on the merged participle data comprises:
and performing hierarchical clustering on the merged participle data.
4. The method of claim 2, wherein the obtaining a plurality of second cluster clusters from the plurality of first cluster clusters comprises:
and performing secondary clustering on the first clustering clusters to obtain a plurality of second clustering clusters.
5. The method according to claim 4, wherein the performing secondary clustering on the plurality of first cluster clusters to obtain a plurality of second cluster clusters comprises:
acquiring an initial central point of each first cluster;
and outputting a plurality of second cluster clusters according to the initial central points.
6. The method for processing the message text information according to any one of claims 1 to 5, wherein the merging and classifying the first message data and the second message data to obtain multi-class merged participle data comprises:
performing word segmentation operation on the first message data once to obtain message detail attribute data;
and performing secondary word segmentation operation on the second message data according to the message detail attribute data to obtain merged word segmentation data.
7. A message text information processing apparatus, characterized in that the apparatus comprises:
the acquisition module is used for acquiring first message data and second message data according to message text information, wherein the first message data corresponds to the message details in the message text information, and the second message data corresponds to the message topic in the message text information;
the first processing module is used for carrying out merging and classifying operation on the first message leaving data and the second message leaving data to obtain multi-class merging word segmentation data;
the second processing module is used for respectively carrying out clustering operation on each type of merged participle data to obtain clustering values of a plurality of clustering clusters;
and the sequencing module is used for adjusting the sequencing priority of each cluster according to the cluster value.
8. A message text information processing system, comprising: a text information storage device, a reply text information processing device, and a message text information processing device according to claim 7; the first input end of the text information storage device is used for receiving message text information, the output end of the text information storage device is connected with the input end of the message text information processing device, the output end of the message text information processing device is connected with the input end of the reply text information processing device, the output end of the reply text information processing device is connected with the second input end of the text information storage device, and the reply text information processing device is used for sending reply text information corresponding to the message text information to the text information storage device.
9. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method of any one of claims 1 to 6 when executing the computer program.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 6.
CN202110128033.4A 2021-01-29 2021-01-29 Method, device and system for processing message text information and computer equipment Pending CN112948579A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110128033.4A CN112948579A (en) 2021-01-29 2021-01-29 Method, device and system for processing message text information and computer equipment

Publications (1)

Publication Number Publication Date
CN112948579A true CN112948579A (en) 2021-06-11

Family

ID=76239959

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110128033.4A Pending CN112948579A (en) 2021-01-29 2021-01-29 Method, device and system for processing message text information and computer equipment

Country Status (1)

Country Link
CN (1) CN112948579A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107133238A (en) * 2016-02-29 2017-09-05 阿里巴巴集团控股有限公司 A kind of text message clustering method and text message clustering system
CN111930936A (en) * 2020-06-28 2020-11-13 山东师范大学 Method and system for excavating platform message text
CN112015891A (en) * 2020-07-17 2020-12-01 山东师范大学 Method and system for classifying messages of network inquiry platform based on deep neural network
CN112148880A (en) * 2020-09-28 2020-12-29 深圳壹账通智能科技有限公司 Customer service dialogue corpus clustering method, system, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination